Contents
7. Improving Inversion and Generation Diversity in StyleGAN using a Gaussianized Latent Space [PDF] Abstract
17. P-DIFF: Learning Classifier with Noisy Labels based on Probability Difference Distributions [PDF] Abstract
19. PRAFlow_RVC: Pyramid Recurrent All-Pairs Field Transforms for Optical Flow Estimation in Robust Vision Challenge 2020 [PDF] Abstract
21. Deep intrinsic decomposition trained on surreal scenes yet with realistic light effects [PDF] Abstract
23. Unsupervised learning for vascular heterogeneity assessment of glioblastoma based on magnetic resonance imaging: The Hemodynamic Tissue Signature [PDF] Abstract
24. Accurate and Lightweight Image Super-Resolution with Model-Guided Deep Unfolding Network [PDF] Abstract
25. Prior Knowledge about Attributes: Learning a More Effective Potential Space for Zero-Shot Recognition [PDF] Abstract
28. Learning from Multimodal and Multitemporal Earth Observation Data for Building Damage Mapping [PDF] Abstract
34. Accelerating COVID-19 Differential Diagnosis with Explainable Ultrasound Image Analysis [PDF] Abstract
35. Multi-channel MRI Embedding: An Effective Strategy for Enhancement of Human Brain Whole Tumor Segmentation [PDF] Abstract
38. A Review of Visual Descriptors and Classification Techniques Used in Leaf Species Identification [PDF] Abstract
40. Calibration Venus: An Interactive Camera Calibration Method Based on Search Algorithm and Pose Decomposition [PDF] Abstract
42. SSKD: Self-Supervised Knowledge Distillation for Cross Domain Adaptive Person Re-Identification [PDF] Abstract
44. Interpretation of smartphone-captured radiographs utilizing a deep learning-based approach [PDF] Abstract
45. Synthesizing brain tumor images and annotations by combining progressive growing GAN and SPADE [PDF] Abstract
46. PolSAR Image Classification Based on Robust Low-Rank Feature Extraction and Markov Random Field [PDF] Abstract
49. An approach to human iris recognition using quantitative analysis of image features and machine learning [PDF] Abstract
53. Learning semantic Image attributes using Image recognition and knowledge graph embeddings [PDF] Abstract
57. Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning [PDF] Abstract
58. Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion [PDF] Abstract
60. Smoothness Sensor: Adaptive Smoothness-Transition Graph Convolutions for Attributed Graph Clustering [PDF] Abstract
61. Monitoring Spatial Sustainable Development: semi-automated analysis of Satellite and Aerial Images for Energy Transition and Sustainability Indicators [PDF] Abstract
62. Abstractive Information Extraction from Scanned Invoices (AIESI) using End-to-end Sequential Approach [PDF] Abstract
65. YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design [PDF] Abstract
71. Deep Hiearchical Multi-Label Classification Applied to Chest X-Ray Abnormality Taxonomies [PDF] Abstract
72. 3D Reconstruction and Segmentation of Dissection Photographs for MRI-free Neuropathology [PDF] Abstract
74. Automatic elimination of the pectoral muscle in mammograms based on anatomical features [PDF] Abstract
76. VC-Net: Deep Volume-Composition Networks for Segmentation and Visualization of Highly Sparse and Noisy Image Data [PDF] Abstract
81. How Much Can We Really Trust You? Towards Simple, Interpretable Trust Quantification Metrics for Deep Neural Networks [PDF] Abstract
Abstracts
1. High-Resolution Deep Image Matting [PDF] Back to Contents
Haichao Yu, Ning Xu, Zilong Huang, Yuqian Zhou, Humphrey Shi
Abstract: Image matting is a key technique for image and video editing and composition. Conventionally, deep learning approaches take the whole input image and an associated trimap to infer the alpha matte using convolutional neural networks. Such approaches set the state of the art in image matting; however, they may fail in real-world matting applications due to hardware limitations, since real-world input images for matting are mostly of very high resolution. In this paper, we propose HDMatt, the first deep-learning-based image matting approach for high-resolution inputs. More concretely, HDMatt runs matting in a patch-based crop-and-stitch manner for high-resolution inputs, with a novel module design to address the contextual dependency and consistency issues between different patches. Compared with vanilla patch-based inference, which computes each patch independently, we explicitly model the cross-patch contextual dependency with a newly proposed Cross-Patch Contextual module (CPC) guided by the given trimap. Extensive experiments demonstrate the effectiveness of the proposed method and its necessity for high-resolution inputs. Our HDMatt approach also sets new state-of-the-art performance on the Adobe Image Matting and AlphaMatting benchmarks and produces impressive visual results on more real-world high-resolution images.
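To make the crop-and-stitch setting concrete, below is a minimal sketch of vanilla patch-based matting inference with overlap averaging, the baseline that HDMatt improves on. The function names, patch/stride values, and blending scheme are illustrative assumptions; the paper's Cross-Patch Contextual (CPC) module, which couples the patches, is deliberately not reproduced here.

```python
# A hedged sketch of vanilla crop-and-stitch inference, assuming `model(image_patch,
# trimap_patch)` returns an alpha patch of the same spatial size; HDMatt's CPC
# module is omitted, so patches here are computed independently.
import torch

def _starts(size, patch, stride):
    last = max(size - patch, 0)
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)                      # ensure the border is covered
    return starts

def patch_matting(model, image, trimap, patch=512, stride=384):
    _, _, H, W = image.shape                     # image, trimap: (1, C, H, W)
    alpha = torch.zeros(1, 1, H, W)
    hits = torch.zeros(1, 1, H, W)
    for top in _starts(H, patch, stride):
        for left in _starts(W, patch, stride):
            b, r = min(top + patch, H), min(left + patch, W)
            alpha[:, :, top:b, left:r] += model(image[:, :, top:b, left:r],
                                                trimap[:, :, top:b, left:r])
            hits[:, :, top:b, left:r] += 1.0
    return alpha / hits                          # average predictions where patches overlap

# Demo with a stand-in model that returns zeros.
out = patch_matting(lambda img, tri: torch.zeros(1, 1, *img.shape[-2:]),
                    torch.randn(1, 3, 700, 900), torch.randn(1, 1, 700, 900))
```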
2. Adaptive Text Recognition through Visual Matching [PDF] Back to Contents
Chuhan Zhang, Ankush Gupta, Andrew Zisserman
Abstract: In this work, our objective is to address the problems of generalization and flexibility for text recognition in documents. We introduce a new model that exploits the repetitive nature of characters in languages, and decouples the visual representation learning and linguistic modelling stages. By doing this, we turn text recognition into a shape matching problem, and thereby achieve generalization in appearance and flexibility in classes. We evaluate the new model on both synthetic and real datasets across different alphabets and show that it can handle challenges that traditional architectures are not able to solve without expensive retraining, including: (i) it can generalize to unseen fonts without new exemplars from them; (ii) it can flexibly change the number of classes, simply by changing the exemplars provided; and (iii) it can generalize to new languages and new characters that it has not been trained for by providing a new glyph set. We show significant improvements over state-of-the-art models for all these cases.
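The shape-matching formulation lends itself to a compact sketch: recognition becomes nearest-exemplar lookup in an embedding space, so changing the alphabet only means swapping the exemplar matrix. The sketch below, with made-up feature dimensions and no language-modelling stage, illustrates that idea rather than the paper's architecture.

```python
# Toy recognition-as-matching, assuming a (T, D) column-wise feature map of a text
# line and a (K, D) matrix with one embedding per exemplar glyph; all shapes are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def match_decode(line_feats, exemplar_feats):
    line = F.normalize(line_feats, dim=1)           # (T, D)
    exemplars = F.normalize(exemplar_feats, dim=1)  # (K, D); swap to change classes
    sim = line @ exemplars.t()                      # (T, K) cosine similarity map
    return sim.argmax(dim=1)                        # per-position class index

decoded = match_decode(torch.randn(40, 128), torch.randn(26, 128))
```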
3. GIA-Net: Global Information Aware Network for Low-light Imaging [PDF] Back to Contents
Zibo Meng, Runsheng Xu, Chiu Man Ho
Abstract: It is extremely challenging to acquire perceptually plausible images under low-light conditions due to low SNR. Most recently, U-Nets have shown promising results for low-light imaging. However, vanilla U-Nets generate images with artifacts such as color inconsistency due to the lack of global color information. In this paper, we propose a global information aware (GIA) module, which is capable of extracting and integrating the global information into the network to improve the performance of low-light imaging. The GIA module can be inserted into a vanilla U-Net with negligible extra learnable parameters or computational cost. Moreover, a GIA-Net is constructed, trained and evaluated on a large scale real-world low-light imaging dataset. Experimental results show that the proposed GIA-Net outperforms the state-of-the-art methods in terms of four metrics, including deep metrics that measure perceptual similarities. Extensive ablation studies have been conducted to verify the effectiveness of the proposed GIA-Net for low-light imaging by utilizing global information.
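A hedged sketch of what a global-information-aware block can look like when inserted into a vanilla U-Net is given below: pool a global descriptor, transform it, broadcast it back over the spatial map, and fuse. The layer sizes and the concatenation-based fusion are assumptions for illustration, not the paper's exact GIA design.

```python
# Minimal global-context block: global average pooling -> MLP -> broadcast -> 1x1 fuse.
# Hidden width and the fusion choice are illustrative assumptions.
import torch
import torch.nn as nn

class GlobalInfoBlock(nn.Module):
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                                 nn.Linear(hidden, channels))
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):                        # x: (N, C, H, W)
        g = x.mean(dim=(2, 3))                   # global descriptor (N, C)
        g = self.mlp(g)[:, :, None, None].expand_as(x)
        return self.fuse(torch.cat([x, g], dim=1))

out = GlobalInfoBlock(32)(torch.randn(2, 32, 16, 16))
```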
4. Collaborative Attention Mechanism for Multi-View Action Recognition [PDF] Back to Contents
Yue Bai, Zhiqiang Tao, Lichen Wang, Sheng Li, Yu Yin, Yun Fu
Abstract: Multi-view action recognition (MVAR) leverages complementary temporal information from different views to enhance the learning process. Attention is an effective mechanism which has been extensively adopted for modeling temporal data. However, most existing MVAR methods only utilize attention to extract view-specific patterns. They ignore the potential to dig out latent mutual-support information in attention space. To take full advantage of multi-view cooperation, we propose a collaborative attention mechanism (CAM). It detects the attention differences among multi-view inputs and adaptively integrates complementary frame-level information so the views benefit each other. Specifically, we utilize a recurrent neural network (RNN), expanding the long short-term memory (LSTM) into a Mutual-Aid RNN (MAR). CAM takes advantage of view-specific attention patterns to guide another view and unlock potential information which is hard to explore by that view alone. Extensive experiments on three action datasets illustrate that our CAM achieves better results for each single view and also boosts the multi-view performance.
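The cross-view exchange can be caricatured in a few lines: each view scores its own frames, and each view is then pooled with the other view's attention weights, so complementary moments receive mutual emphasis. This toy version only conveys the guidance idea; it does not reproduce the paper's Mutual-Aid RNN, which modifies the LSTM cell itself.

```python
# Toy two-view attention exchange; the linear scorers and the pooling rule are
# illustrative assumptions, not the CAM/MAR architecture.
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score_a = nn.Linear(dim, 1)
        self.score_b = nn.Linear(dim, 1)

    def forward(self, feats_a, feats_b):         # each: (N, T, D) frame features
        attn_a = torch.softmax(self.score_a(feats_a), dim=1)   # (N, T, 1)
        attn_b = torch.softmax(self.score_b(feats_b), dim=1)
        pooled_a = (feats_a * attn_b).sum(dim=1)  # pool each view with the
        pooled_b = (feats_b * attn_a).sum(dim=1)  # other view's attention
        return pooled_a, pooled_b

a, b = CrossViewAttention(64)(torch.randn(2, 8, 64), torch.randn(2, 8, 64))
```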
5. Zero-shot Synthesis with Group-Supervised Learning [PDF] Back to Contents
Yunhao Ge, Sami Abu-El-Haija, Gan Xin, Laurent Itti
Abstract: Visual cognition of primates is superior to that of artificial neural networks in its ability to 'envision' a visual object, even a newly-introduced one, in different attributes including pose, position, color, texture, etc. To aid neural networks in envisioning objects with different attributes, we propose a family of objective functions, expressed on groups of examples, as a novel learning framework that we term Group-Supervised Learning (GSL). GSL decomposes inputs into a disentangled representation with swappable components that can be recombined to synthesize new samples, trained through similarity mining within groups of exemplars. For instance, images of red boats & blue cars can be decomposed and recombined to synthesize novel images of red cars. We describe a general class of datasets admissible by GSL. We propose an implementation based on an auto-encoder, termed group-supervised zero-shot synthesis network (GZS-Net), trained with our learning framework, that can produce a high-quality red car even if no such example is witnessed during training. We test our model and learning framework on existing benchmarks, in addition to a new dataset that we open-source. We qualitatively and quantitatively demonstrate that GZS-Net trained with GSL outperforms state-of-the-art methods.
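The swap-and-recombine step at the heart of GSL is easy to picture. Below is a toy version in which a disentangled latent is split into fixed attribute slots and one slot is exchanged between two examples before decoding; the tiny linear encoder/decoder and the slot layout are placeholders, not the GZS-Net architecture.

```python
# Toy latent-slot swap under the assumption of a disentangled (N, D) latent;
# encoder, decoder, and slot boundaries are made-up placeholders.
import torch
import torch.nn as nn

def swap_and_decode(encoder, decoder, x1, x2, slot=slice(0, 16)):
    z1, z2 = encoder(x1), encoder(x2)
    z = z1.clone()
    z[:, slot] = z2[:, slot]                     # e.g. take the "color" slot from x2
    return decoder(z)                            # synthesize the recombination

enc, dec = nn.Linear(32, 32), nn.Linear(32, 32)
out = swap_and_decode(enc, dec, torch.randn(4, 32), torch.randn(4, 32))
```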
6. Beyond Weak Perspective for Monocular 3D Human Pose Estimation [PDF] Back to Contents
Imry Kissos, Lior Fritz, Matan Goldman, Omer Meir, Eduard Oks, Mark Kliger
Abstract: We consider the task of 3D joint location and orientation prediction from a monocular video with the skinned multi-person linear (SMPL) model. We first infer 2D joint locations with an off-the-shelf pose estimation algorithm. We use the SPIN algorithm and estimate initial predictions of body pose, shape and camera parameters from a deep regression neural network. We then adhere to the SMPLify algorithm, which receives those initial parameters and optimizes them so that the 3D joints inferred from the SMPL model fit the 2D joint locations. This algorithm involves a projection step of 3D joints to the 2D image plane. The conventional approach is to follow weak perspective assumptions which use an ad-hoc focal length. Through experimentation on the 3D Poses in the Wild (3DPW) dataset, we show that using full perspective projection, with the correct camera center and an approximated focal length, provides favorable results. Our algorithm has resulted in a winning entry for the 3DPW Challenge, reaching first place in joint orientation accuracy.
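The difference the paper exploits is in the projection model itself. A minimal full-perspective projection of camera-frame 3D joints, given an approximated focal length f and camera center (cx, cy), looks like the sketch below; the numbers in the example call are made up.

```python
# Full perspective pinhole projection of 3D joints; a weak-perspective baseline
# would instead apply a single global scale and ignore per-joint depth.
import numpy as np

def project_full_perspective(joints_3d, f, cx, cy):
    # joints_3d: (J, 3) array of (X, Y, Z) in camera coordinates, Z > 0
    X, Y, Z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    return np.stack([f * X / Z + cx, f * Y / Z + cy], axis=1)  # (J, 2) pixels

pts = project_full_perspective(np.array([[0.1, -0.2, 2.5]]), f=1500.0, cx=960.0, cy=540.0)
```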
7. Improving Inversion and Generation Diversity in StyleGAN using a Gaussianized Latent Space [PDF] Back to Contents
Jonas Wulff, Antonio Torralba
Abstract: Modern Generative Adversarial Networks are capable of creating artificial, photorealistic images from latent vectors living in a low-dimensional learned latent space. It has been shown that a wide range of images can be projected into this space, including images outside of the domain that the generator was trained on. However, while in this case the generator reproduces the pixels and textures of the images, the reconstructed latent vectors are unstable and small perturbations result in significant image distortions. In this work, we propose to explicitly model the data distribution in latent space. We show that, under a simple nonlinear operation, the data distribution can be modeled as Gaussian and therefore expressed using sufficient statistics. This yields a simple Gaussian prior, which we use to regularize the projection of images into the latent space. The resulting projections lie in smoother and better behaved regions of the latent space, as shown using interpolation performance for both real and generated images. Furthermore, the Gaussian model of the distribution in latent space allows us to investigate the origins of artifacts in the generator output, and provides a method for reducing these artifacts while maintaining diversity of the generated images.
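One way to picture the resulting prior: fit a mean and covariance to a sample of (transformed) latents once, then penalize the Mahalanobis distance of the latent being optimized during projection. The sketch below omits the specific nonlinear operation the paper applies before the distribution becomes Gaussian, so treat it as the general recipe under that assumption rather than the authors' exact procedure.

```python
# Gaussian prior on latents: fit sufficient statistics once, then use the
# Mahalanobis distance as a regularizer while optimizing a latent code.
import torch

def fit_gaussian(latents):                       # latents: (N, D) sampled codes
    mu = latents.mean(dim=0)
    centered = latents - mu
    cov = centered.t() @ centered / (latents.shape[0] - 1)
    jitter = 1e-4 * torch.eye(latents.shape[1])  # keep the inverse well-posed
    return mu, torch.linalg.inv(cov + jitter)

def gaussian_prior_loss(w, mu, cov_inv):         # w: (D,) latent being optimized
    d = (w - mu).unsqueeze(0)
    return 0.5 * (d @ cov_inv @ d.t()).squeeze()

mu, cov_inv = fit_gaussian(torch.randn(1000, 8))
loss = gaussian_prior_loss(torch.randn(8, requires_grad=True), mu, cov_inv)
```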
8. A Study of Human Gaze Behavior During Visual Crowd Counting [PDF] Back to Contents
Raji Annadi, Yupei Chen, Viresh Ranjan, Dimitris Samaras, Gregory Zelinsky, Minh Hoai
Abstract: In this paper, we describe our study on how humans allocate their attention during visual crowd counting. Using an eye tracker, we collect the gaze behavior of human participants who are tasked with counting the number of people in crowd images. Analyzing the collected gaze behavior of ten human participants on thirty crowd images, we observe some common approaches for visual counting. For an image of a small crowd, the approach is to enumerate over all people or groups of people in the crowd, and this explains the high level of similarity between the fixation density maps of different human participants. For an image of a large crowd, our participants tend to focus on one section of the image, count the number of people in that section, and then extrapolate to the other sections. In terms of count accuracy, our human participants are not as good at the counting task as the current state-of-the-art computer algorithms. Interestingly, there is a tendency to undercount the number of people in all crowd images. Gaze behavior data and images can be downloaded from this https URL
9. Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices [PDF] Back to Contents
Anton Trusov, Elena Limonova, Dmitry Slugin, Dmitry Nikolaev, Vladimir V. Arlazarov
Abstract: Quantized low-precision neural networks are very popular because they require fewer computational resources for inference and can provide high performance, which is vital for real-time and embedded recognition systems. However, their advantages are apparent for FPGA and ASIC devices, while general-purpose processor architectures are not always able to perform low-bit integer computations efficiently. The most frequently used low-precision neural network model for mobile central processors is an 8-bit quantized network. However, in a number of cases, it is possible to use fewer bits for weights and activations, and the only problem is the difficulty of efficient implementation. We introduce an efficient implementation of 4-bit matrix multiplication for quantized neural networks and perform time measurements on a mobile ARM processor. It shows a 2.9x speedup compared to standard floating-point multiplication and is 1.5 times faster than the 8-bit quantized one. We also demonstrate a 4-bit quantized neural network for OCR on the MIDV-500 dataset. 4-bit quantization gives 95.0% accuracy and a 48% overall inference speedup, while an 8-bit quantized network gives 95.4% accuracy and a 39% speedup. The results show that 4-bit quantization perfectly suits mobile devices, yielding good enough accuracy and low inference time.
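To see what 4-bit storage means concretely, here is the packing arithmetic in NumPy: two unsigned 4-bit values per byte, with integer accumulation in the product. Real kernels keep the values packed inside SIMD registers on the ARM CPU, which plain NumPy cannot express, so this only illustrates the data layout, not the paper's optimized routine.

```python
# Pack/unpack two 4-bit values per byte and multiply with integer accumulation;
# layout and shapes are illustrative.
import numpy as np

def pack4(w):                                    # w: ints in [0, 15], even count
    w = w.astype(np.uint8).reshape(-1, 2)
    return w[:, 0] | (w[:, 1] << 4)              # low nibble | high nibble

def unpack4(p):
    return np.stack([p & 0x0F, p >> 4], axis=1).reshape(-1)

w = np.random.randint(0, 16, size=(4, 8))
packed = pack4(w)                                # 16 bytes instead of 32 values
assert np.array_equal(unpack4(packed).reshape(4, 8), w)
x = np.random.randint(0, 16, size=(8, 3))
y = unpack4(packed).reshape(4, 8).astype(np.int32) @ x.astype(np.int32)
```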
10. Unsupervised Domain Adaptation by Uncertain Feature Alignment [PDF] Back to Contents
Tobias Ringwald, Rainer Stiefelhagen
Abstract: Unsupervised domain adaptation (UDA) deals with the adaptation of models from a given source domain with labeled data to an unlabeled target domain. In this paper, we utilize the inherent prediction uncertainty of a model to accomplish the domain adaptation task. The uncertainty is measured by Monte-Carlo dropout and used for our proposed Uncertainty-based Filtering and Feature Alignment (UFAL) that combines an Uncertain Feature Loss (UFL) function and an Uncertainty-Based Filtering (UBF) approach for alignment of features in Euclidean space. Our method surpasses recently proposed architectures and achieves state-of-the-art results on multiple challenging datasets. Code is available on the project website.
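Monte-Carlo dropout itself is a standard recipe and easy to sketch: keep dropout sampling at inference time and read the uncertainty off the spread of several stochastic forward passes. The toy model and the filtering threshold below are illustrative; UFAL's UFL loss and UBF filtering build on top of estimates like these.

```python
# MC-dropout uncertainty: average several stochastic passes; the std across
# passes serves as the uncertainty estimate. The threshold is an illustrative choice.
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, passes=10):
    model.train()                                # keeps nn.Dropout sampling (fine here: no batch norm)
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(passes)])
    return probs.mean(dim=0), probs.std(dim=0)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 4))
mean_p, std_p = mc_dropout_predict(model, torch.randn(8, 16))
keep = std_p.max(dim=1).values < 0.15            # e.g. filter out high-variance samples
```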
11. EfficientSeg: An Efficient Semantic Segmentation Network [PDF] Back to Contents
Vahit Bugra Yesilkaynak, Yusuf H. Sahin, Gozde Unal
Abstract: Deep neural network training without pre-trained weights and with little data is shown to need more training iterations. It is also known that deeper models are more successful than their shallow counterparts for the semantic segmentation task. Thus, we introduce the EfficientSeg architecture, a modified and scalable version of U-Net, which can be efficiently trained despite its depth. We evaluated the EfficientSeg architecture on the Minicity dataset and outperformed the U-Net baseline score (40% mIoU) using the same parameter count (51.5% mIoU). Our most successful model obtained a 58.1% mIoU score and took fourth place in the semantic segmentation track of the ECCV 2020 VIPriors challenge.
12. Scene-Graph Augmented Data-Driven Risk Assessment of Autonomous Vehicle Decisions [PDF] Back to Contents
Shih-Yuan Yu, Arnav V. Malawade, Deepan Muthirayan, Pramod P. Khargonekar, Mohammad A. Al Faruque
Abstract: Despite impressive advancements in Autonomous Driving Systems (ADS), navigation in complex road conditions remains a challenging problem. There is considerable evidence that evaluating the subjective risk level of various decisions can improve ADS' safety in both normal and complex driving scenarios. However, existing deep learning-based methods often fail to model the relationships between traffic participants and can suffer when faced with complex real-world scenarios. Besides, these methods lack transferability and explainability. To address these limitations, we propose a novel data-driven approach that uses scene-graphs as intermediate representations. Our approach includes a Multi-Relation Graph Convolution Network, a Long-Short Term Memory Network, and attention layers for modeling the subjective risk of driving maneuvers. To train our model, we formulate this task as a supervised scene classification problem. We consider a typical use case to demonstrate our model's capabilities: lane changes. We show that our approach achieves a higher classification accuracy than the state-of-the-art approach on both large (96.4% vs. 91.2%) and small (91.8% vs. 71.2%) synthesized datasets, also illustrating that our approach can learn effectively even from smaller datasets. We also show that our model trained on a synthesized dataset achieves an average accuracy of 87.8% when tested on a real-world dataset compared to the 70.3% accuracy achieved by the state-of-the-art model trained on the same synthesized dataset, showing that our approach can more effectively transfer knowledge. Finally, we demonstrate that the use of spatial and temporal attention layers improves our model's performance by 2.7% and 0.7% respectively, and increases its explainability.
13. Adaptive Label Smoothing [PDF] Back to Contents
Ujwal Krothapalli, A. Lynn Abbott
Abstract: This paper concerns the use of objectness measures to improve the calibration performance of Convolutional Neural Networks (CNNs). Objectness is a measure of likelihood of an object from any class being present in a given image. CNNs have proven to be very good classifiers and generally localize objects well; however, the loss functions typically used to train classification CNNs do not penalize inability to localize an object, nor do they take into account an object's relative size in the given image. We present a novel approach to object localization that combines the ideas of objectness and label smoothing during training. Unlike previous methods, we compute a smoothing factor that is adaptive based on relative object size within an image. We present extensive results using ImageNet and OpenImages to demonstrate that CNNs trained using adaptive label smoothing are much less likely to be overconfident in their predictions, as compared to CNNs trained using hard targets. We also show qualitative results using class activation maps to illustrate the improvements.
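The mechanics of making the smoothing factor adaptive are simple to sketch: derive a per-sample factor from how much of the image the object occupies, then mix the one-hot target with a uniform distribution accordingly. The specific area-to-factor mapping below is an assumption for illustration, not the paper's formula.

```python
# Size-adaptive label smoothing: smaller objects (more background context) get
# softer targets. The linear alpha = 1 - area mapping is an illustrative choice.
import torch

def adaptive_smooth_targets(labels, num_classes, object_area_frac):
    # labels: (N,) class indices; object_area_frac: (N,) fraction of image in [0, 1]
    alpha = (1.0 - object_area_frac).unsqueeze(1)
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    uniform = torch.full_like(one_hot, 1.0 / num_classes)
    return (1 - alpha) * one_hot + alpha * uniform

targets = adaptive_smooth_targets(torch.tensor([2, 0]), 5, torch.tensor([0.9, 0.2]))
```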
14. Completely Self-Supervised Crowd Counting via Distribution Matching [PDF] Back to Contents
Deepak Babu Sam, Abhinav Agarwalla, Jimmy Joseph, Vishwanath A. Sindagi, R. Venkatesh Babu, Vishal M. Patel
Abstract: Dense crowd counting is a challenging task that demands millions of head annotations for training models. Though existing self-supervised approaches could learn good representations, they require some labeled data to map these features to the end task of density estimation. We mitigate this issue with the proposed paradigm of complete self-supervision, which does not need even a single labeled image. The only input required to train, apart from a large set of unlabeled crowd images, is the approximate upper limit of the crowd count for the given dataset. Our method dwells on the idea that natural crowds follow a power law distribution, which could be leveraged to yield error signals for backpropagation. A density regressor is first pretrained with self-supervision, and then the distribution of predictions is matched to the prior by optimizing the Sinkhorn distance between the two. Experiments show that this results in effective learning of crowd features and delivers significant counting performance. Furthermore, we establish the superiority of our method in low-data settings as well. The code and models for our approach are available at this https URL.
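For readers unfamiliar with the matching step, a compact (and numerically naive) Sinkhorn distance between a predicted count histogram and a discretized power-law prior is sketched below. The bin edges, the exponent, and the entropic regularization are illustrative choices, not the paper's settings.

```python
# Entropy-regularized optimal transport (Sinkhorn) between two histograms.
import numpy as np

def sinkhorn(a, b, cost, eps=0.1, iters=200):
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]              # transport plan
    return (P * cost).sum()                      # approximate transport cost

bins = np.arange(1, 11, dtype=float)
prior = bins ** -2.0; prior /= prior.sum()       # power-law prior over count bins
pred = np.random.dirichlet(np.ones(10))          # stand-in predicted histogram
d = sinkhorn(pred, prior, np.abs(bins[:, None] - bins[None, :]))
```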
15. Synbols: Probing Learning Algorithms with Synthetic Datasets [PDF] Back to Contents
Alexandre Lacoste, Pau Rodríguez, Frédéric Branchaud-Charron, Parmida Atighehchian, Massimo Caccia, Issam Laradji, Alexandre Drouin, Matt Craddock, Laurent Charlin, David Vázquez
Abstract: Progress in the field of machine learning has been fueled by the introduction of benchmark datasets pushing the limits of existing algorithms. Enabling the design of datasets to test specific properties and failure modes of learning algorithms is thus a problem of high interest, as it has a direct impact on innovation in the field. In this sense, we introduce Synbols -- Synthetic Symbols -- a tool for rapidly generating new datasets with a rich composition of latent features rendered in low resolution images. Synbols leverages the large amount of symbols available in the Unicode standard and the wide range of artistic font provided by the open font community. Our tool's high-level interface provides a language for rapidly generating new distributions on the latent features, including various types of textures and occlusions. To showcase the versatility of Synbols, we use it to dissect the limitations and flaws in standard learning algorithms in various learning setups including supervised learning, active learning, out of distribution generalization, unsupervised representation learning, and object counting.
16. Adaptive Convolution Kernel for Artificial Neural Networks [PDF] Back to Contents
F. Boray Tek, İlker Çam, Deniz Karlı
Abstract: Many deep neural networks are built by using stacked convolutional layers of fixed and single size (often 3×3) kernels. This paper describes a method for training the size of convolutional kernels to provide varying size kernels in a single layer. The method utilizes a differentiable, and therefore backpropagation-trainable, Gaussian envelope which can grow or shrink in a base grid. Our experiments compared the proposed adaptive layers to ordinary convolution layers in a simple two-layer network, a deeper residual network, and a U-Net architecture. Results on popular image classification datasets such as MNIST, MNIST-CLUTTERED, CIFAR-10, Fashion, and "Faces in the Wild" showed that the adaptive kernels can provide statistically significant improvements over ordinary convolution kernels. A segmentation experiment on the Oxford-Pets dataset demonstrated that replacing a single ordinary convolution layer in a U-shaped network with a single 7×7 adaptive layer can improve its learning performance and ability to generalize.
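The differentiable envelope can be shown in a few lines: multiply a fixed-size weight grid by a Gaussian whose width is a learnable parameter, so backpropagation can effectively grow or shrink the kernel's support. The sketch below collapses the paper's envelope to a single isotropic sigma per layer, which is a simplification.

```python
# Convolution with a trainable Gaussian envelope over the kernel grid; gradient
# descent on log_sigma effectively resizes the kernel. Isotropic sigma is assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEnvelopeConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=7):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.05)
        self.log_sigma = nn.Parameter(torch.zeros(1))        # trainable width
        r = (k - 1) / 2
        ys, xs = torch.meshgrid(torch.arange(k) - r, torch.arange(k) - r, indexing="ij")
        self.register_buffer("r2", xs ** 2 + ys ** 2)        # squared grid radius

    def forward(self, x):
        sigma2 = torch.exp(self.log_sigma) ** 2
        envelope = torch.exp(-self.r2 / (2 * sigma2))        # differentiable in sigma
        return F.conv2d(x, self.weight * envelope, padding=self.weight.shape[-1] // 2)

y = GaussianEnvelopeConv(3, 8)(torch.randn(1, 3, 32, 32))
```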
17. P-DIFF: Learning Classifier with Noisy Labels based on Probability Difference Distributions [PDF] Back to Contents
Wei Hu, QiHao Zhao, Yangyu Huang, Fan Zhang
Abstract: Learning a deep neural network (DNN) classifier with noisy labels is a challenging task because the DNN can easily overfit these noisy labels due to its high capacity. In this paper, we present a very simple but effective training paradigm called P-DIFF, which can train DNN classifiers while markedly alleviating the adverse impact of noisy labels. Our proposed probability-difference distribution implicitly reflects the probability that a training sample is clean; this probability is then employed to re-weight the corresponding sample during the training process. P-DIFF can achieve good performance even without prior knowledge of the noise rate of the training samples. Experiments on benchmark datasets also demonstrate that P-DIFF is superior to state-of-the-art sample selection methods.
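A loose sketch of the probability-difference idea follows: for each sample, measure the gap between the labeled class's probability and the strongest competing class, then down-weight samples with small gaps as likely noisy. The global quantile cutoff is a stand-in for the paper's distribution-based weighting, so treat this as the flavor of the method rather than P-DIFF itself.

```python
# Probability-difference re-weighting sketch; the 0/1 quantile rule is an
# illustrative simplification of distribution-based weighting.
import torch

def prob_diff_weights(logits, labels, drop_quantile=0.3):
    probs = torch.softmax(logits, dim=1)
    p_label = probs.gather(1, labels[:, None]).squeeze(1)
    others = probs.scatter(1, labels[:, None], -1.0)   # hide the labeled class
    delta = p_label - others.max(dim=1).values         # in [-1, 1]; low => likely noisy
    cutoff = torch.quantile(delta, drop_quantile)
    return (delta > cutoff).float()                    # per-sample weights

w = prob_diff_weights(torch.randn(16, 10), torch.randint(0, 10, (16,)))
```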
18. 4Seasons: A Cross-Season Dataset for Multi-Weather SLAM in Autonomous Driving [PDF] Back to Contents
Patrick Wenzel, Rui Wang, Nan Yang, Qing Cheng, Qadeer Khan, Lukas von Stumberg, Niclas Zeller, Daniel Cremers
Abstract: We present a novel dataset covering seasonal and challenging perceptual conditions for autonomous driving. Among others, it enables research on visual odometry, global place recognition, and map-based re-localization tracking. The data was collected in different scenarios and under a wide variety of weather conditions and illuminations, including day and night. This resulted in more than 350 km of recordings in nine different environments, ranging from a multi-level parking garage and urban areas (including tunnels) to countryside and highway. We provide globally consistent reference poses with up-to-centimeter accuracy obtained from the fusion of direct stereo visual-inertial odometry with RTK-GNSS. The full dataset is available at this http URL.
19. PRAFlow_RVC: Pyramid Recurrent All-Pairs Field Transforms for Optical Flow Estimation in Robust Vision Challenge 2020 [PDF] 返回目录
Zhexiong Wan, Yuxin Mao, Yuchao Dai
Abstract: Optical flow estimation is an important computer vision task that aims at estimating the dense correspondences between two frames. RAFT (Recurrent All-Pairs Field Transforms) currently represents the state-of-the-art in optical flow estimation. It has excellent generalization ability and has obtained outstanding results across several benchmarks. To further improve robustness and achieve accurate optical flow estimation, we present PRAFlow (Pyramid Recurrent All-Pairs Flow), which builds upon the pyramid network structure. Due to computational limitations, our proposed network structure uses only two pyramid layers. At each layer, the RAFT unit is used to estimate the optical flow at the current resolution. Our model was trained on several simulated and real-image datasets, submitted to multiple leaderboards using the same model and parameters, and won 2nd place in the optical flow task of the ECCV 2020 workshop: Robust Vision Challenge.
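A coarse-to-fine skeleton of the two-level pyramid can be sketched as follows; the real RAFT unit (all-pairs correlation volume plus GRU updates) is replaced here by a toy convolutional refiner, so this illustrates only the control flow:

```python
# Coarse-to-fine sketch of a two-level flow pyramid. The real RAFT unit is
# replaced by a toy conv refiner that predicts a residual flow update.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFlowUnit(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(8, 2, 3, padding=1)    # 2 frames * 3 channels + 2 flow channels

    def forward(self, f1, f2, flow):
        return flow + self.net(torch.cat([f1, f2, flow], dim=1))

def praflow_sketch(frame1, frame2, unit, levels=2):
    b, _, h, w = frame1.shape
    flow = torch.zeros(b, 2, h // 2 ** (levels - 1), w // 2 ** (levels - 1))
    for lvl in reversed(range(levels)):             # coarsest level first
        s = 2 ** lvl
        f1 = F.avg_pool2d(frame1, s) if s > 1 else frame1
        f2 = F.avg_pool2d(frame2, s) if s > 1 else frame2
        flow = unit(f1, f2, flow)                   # refine at the current resolution
        if lvl > 0:                                 # upsample and rescale for the finer level
            flow = 2.0 * F.interpolate(flow, scale_factor=2, mode="bilinear",
                                       align_corners=False)
    return flow

flow = praflow_sketch(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64), ToyFlowUnit())
```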
20. DeepWriteSYN: On-Line Handwriting Synthesis via Deep Short-Term Representations [PDF] 返回目录
Ruben Tolosana, Paula Delgado-Santos, Andres Perez-Uribe, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales
Abstract: This study proposes DeepWriteSYN, a novel on-line handwriting synthesis approach via deep short-term representations. It comprises two modules: i) an optional and interchangeable temporal segmentation, which divides the handwriting into short-time segments consisting of individual or multiple concatenated strokes; and ii) the on-line synthesis of those short-time handwriting segments, which is based on a sequence-to-sequence Variational Autoencoder (VAE). The main advantages of the proposed approach are that the synthesis is carried out in short-time segments (that can run from a character fraction to full characters) and that the VAE can be trained on a configurable handwriting dataset. These two properties give a lot of flexibility to our synthesiser, e.g., as shown in our experiments, DeepWriteSYN can generate realistic handwriting variations of a given handwritten structure corresponding to the natural variation within a given population or a given subject. These two cases are developed experimentally for individual digits and handwriting signatures, respectively, achieving remarkable results in both cases. Also, we provide experimental results for the task of on-line signature verification, showing the high potential of DeepWriteSYN to significantly improve one-shot learning scenarios. To the best of our knowledge, this is the first synthesis approach capable of generating realistic on-line handwriting in the short term (including handwritten signatures) via deep learning. This can be very useful as a module toward long-term realistic handwriting generation, either as completely synthetic handwriting or as natural variation of given handwriting samples.
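The core of module ii) can be sketched as a small sequence-to-sequence VAE over stroke segments; the architecture below (GRU encoder/decoder, each time step a hypothetical (dx, dy, pen) triple) is an illustrative assumption, not the authors' network:

```python
# Minimal sketch of a sequence-to-sequence VAE over short stroke segments,
# each time step being (dx, dy, pen-state).
import torch
import torch.nn as nn

class StrokeVAE(nn.Module):
    def __init__(self, d_in=3, d_hid=64, d_z=16):
        super().__init__()
        self.enc = nn.GRU(d_in, d_hid, batch_first=True)
        self.to_mu, self.to_logvar = nn.Linear(d_hid, d_z), nn.Linear(d_hid, d_z)
        self.z_to_h = nn.Linear(d_z, d_hid)
        self.dec = nn.GRU(d_in, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, d_in)

    def forward(self, seg):                      # seg: (B, T, 3)
        _, h = self.enc(seg)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)
        dec_in = torch.cat([torch.zeros_like(seg[:, :1]), seg[:, :-1]], dim=1)  # teacher forcing
        y, _ = self.dec(dec_in, h0)
        recon = self.out(y)
        kl = -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp())
        return ((recon - seg) ** 2).mean() + kl  # ELBO-style training loss

loss = StrokeVAE()(torch.randn(4, 20, 3))
```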
21. Deep intrinsic decomposition trained on surreal scenes yet with realistic light effects [PDF] 返回目录
Hassan Sial, Ramon Baldrich, Maria Vanrell
Abstract: Estimation of intrinsic images still remains a challenging task due to weaknesses of ground-truth datasets, which are either too small or present unrealistic content. On the other hand, end-to-end deep learning architectures are starting to achieve interesting results that we believe could be improved if important physical hints were not ignored. In this work, we present a twofold framework: (a) a flexible generation of images that overcomes some classical dataset problems, offering larger size together with coherent lighting appearance; and (b) a flexible architecture that ties physical properties together through intrinsic losses. Our proposal is versatile, presents low computation time, and achieves state-of-the-art results.
22. AIM 2020 Challenge on Video Extreme Super-Resolution: Methods and Results [PDF] 返回目录
Dario Fuoli, Zhiwu Huang, Shuhang Gu, Radu Timofte, Arnau Raventos, Aryan Esfandiari, Salah Karout, Xuan Xu, Xin Li, Xin Xiong, Jinge Wang, Pablo Navarrete Michelini, Wenhao Zhang, Dongyang Zhang, Hanwei Zhu, Dan Xia, Haoyu Chen, Jinjin Gu, Zhi Zhang, Tongtong Zhao, Shanshan Zhao, Kazutoshi Akita, Norimichi Ukita, Hrishikesh P S, Densen Puthussery, Jiji C V
Abstract: This paper reviews the video extreme super-resolution challenge associated with the AIM 2020 workshop at ECCV 2020. Common scaling factors for learned video super-resolution (VSR) do not go beyond a factor of 4. Missing information can be restored well in this regime, especially in HR videos, where the high-frequency content mostly consists of texture details. The task in this challenge is to upscale videos by an extreme factor of 16, which results in more serious degradations that also affect the structural integrity of the videos. Since each spatial dimension is reduced 16-fold, a single pixel in the low-resolution (LR) domain corresponds to 16$\times$16 = 256 pixels in the high-resolution (HR) domain. Due to this massive information loss, it is hard to accurately restore the missing information. Track 1 is set up to gauge the state-of-the-art for such a demanding task, where fidelity to the ground truth is measured by PSNR and SSIM. Perceptually higher quality can be achieved by trading off fidelity and generating plausible high-frequency content. Track 2 therefore aims at generating visually pleasing results, which are ranked according to human perception, as evaluated by a user study. In contrast to single image super-resolution (SISR), VSR can benefit from additional information in the temporal domain. However, this also imposes an additional requirement, as the generated frames need to be consistent over time.
23. Unsupervised learning for vascular heterogeneity assessment of glioblastoma based on magnetic resonance imaging: The Hemodynamic Tissue Signature [PDF] 返回目录
Javier Juan-Albarracín
Abstract: This thesis focuses on the research and development of the Hemodynamic Tissue Signature (HTS) method: an unsupervised machine learning approach to describe the vascular heterogeneity of glioblastomas by means of perfusion MRI analysis. The HTS builds on the concept of habitats. A habitat is defined as a sub-region of the lesion with a particular MRI profile describing a specific physiological behavior. The HTS method delineates four habitats within the glioblastoma: the High Angiogenic Tumor (HAT) habitat, as the most perfused region of the enhancing tumor; the Low Angiogenic Tumor (LAT) habitat, as the region of the enhancing tumor with a lower angiogenic profile; the potentially Infiltrated Peripheral Edema (IPE) habitat, as the non-enhancing region adjacent to the tumor with elevated perfusion indexes; and the Vasogenic Peripheral Edema (VPE) habitat, as the remaining edema of the lesion with the lowest perfusion profile. The results of this thesis have been published in ten scientific contributions, including top-ranked journals and conferences in the areas of Medical Informatics, Statistics and Probability, Radiology & Nuclear Medicine, Machine Learning and Data Mining, and Biomedical Engineering. An industrial patent registered in Spain (ES201431289A), Europe (EP3190542A1), and the USA (US20170287133A1) was also issued, summarizing the efforts of the thesis to generate tangible assets besides the academic revenue obtained from research publications. Finally, the methods, technologies, and original ideas conceived in this thesis led to the foundation of ONCOANALYTICS CDX, a company framed into the business model of companion diagnostics for pharmaceutical compounds, conceived as a vehicle to facilitate the industrialization of the ONCOhabitats technology.
24. Accurate and Lightweight Image Super-Resolution with Model-Guided Deep Unfolding Network [PDF] 返回目录
Qian Ning, Weisheng Dong, Guangming Shi, Leida Li, Xin Li
Abstract: Deep neural network (DNN) based methods have achieved great success in single image super-resolution (SISR). However, existing state-of-the-art SISR techniques are designed like black boxes, lacking transparency and interpretability. Moreover, the improvement in visual quality often comes at the price of increased model complexity due to black-box design. In this paper, we present and advocate an explainable approach toward SISR, named model-guided deep unfolding network (MoG-DUN). Aiming to break the coherence barrier, we opt to work with a well-established image prior, the nonlocal auto-regressive model, and use it to guide our DNN design. By integrating deep denoising and nonlocal regularization as trainable modules within a deep learning framework, we can unfold the iterative process of model-based SISR into a multi-stage concatenation of building blocks with three interconnected modules (denoising, nonlocal-AR, and reconstruction). The design of all three modules leverages the latest advances, including dense/skip connections as well as a fast nonlocal implementation. In addition to explainability, MoG-DUN is accurate (producing fewer aliasing artifacts), computationally efficient (with reduced model parameters), and versatile (capable of handling multiple degradations). The superiority of the proposed MoG-DUN method over existing state-of-the-art image SR methods, including RCAN, SRMDNF, and SRFBN, is substantiated by extensive experiments on several popular datasets and various degradation scenarios.
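The unfolding structure can be illustrated generically. The sketch below chains a learned denoiser, a placeholder for the nonlocal-AR prior, and a gradient step on the data term; MoG-DUN's actual modules are far richer:

```python
# Generic deep-unfolding sketch: each stage chains a learned denoiser, a
# stand-in "nonlocal-AR" regularizer, and a gradient step on the SR data term
# ||D(x) - y||^2 with a bilinear downsampling D. Not MoG-DUN itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

def down(x, s=4):  # the degradation operator D (assumed known)
    return F.interpolate(x, scale_factor=1 / s, mode="bilinear", align_corners=False)

class UnfoldingSR(nn.Module):
    def __init__(self, stages=4, ch=3):
        super().__init__()
        self.denoise = nn.ModuleList(nn.Conv2d(ch, ch, 3, padding=1) for _ in range(stages))
        self.regular = nn.ModuleList(nn.Conv2d(ch, ch, 3, padding=1) for _ in range(stages))
        self.step = nn.Parameter(torch.full((stages,), 0.5))

    def forward(self, y, s=4):
        x = F.interpolate(y, scale_factor=s, mode="bilinear", align_corners=False)  # init HR
        for d, r, eta in zip(self.denoise, self.regular, self.step):
            x = x + d(x)                              # denoising module (residual form)
            x = x + r(x)                              # stand-in for the nonlocal-AR prior
            grad = F.interpolate(down(x, s) - y, scale_factor=s, mode="bilinear",
                                 align_corners=False)  # approx. D^T(Dx - y)
            x = x - eta * grad                        # data-consistency (reconstruction)
        return x

sr = UnfoldingSR()(torch.randn(1, 3, 16, 16))         # 16x16 LR -> 64x64 HR
```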
25. Prior Knowledge about Attributes: Learning a More Effective Potential Space for Zero-Shot Recognition [PDF] 返回目录
Chunlai Chai, Yukuan Lou, Shijin Zhang
Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes accurately by learning from seen classes and known attributes, but correlations among attributes were ignored by previous studies, which leads to confused classification results. To solve this problem, we build an Attribute Correlation Potential Space Generation (ACPSG) model, which uses a graph convolution network and attribute correlation to generate a more discriminative potential space. Combining the discriminative potential space with the user-defined attribute space, we can better classify unseen classes. Our approach outperforms some existing state-of-the-art methods on several benchmark datasets, whether in conventional ZSL or generalized ZSL.
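The graph-convolution ingredient can be sketched in a few lines; here the graph is assumed to be over classes, with edges given by correlations of their attribute vectors, which is one plausible reading of the abstract:

```python
# Sketch: one graph-convolution step propagating class attribute vectors over a
# correlation-based adjacency to produce class embeddings ("potential space").
import torch

def gcn_layer(X, A, W):
    # X: (C, d) class attribute vectors, A: (C, C) correlation-based adjacency
    A_hat = A + torch.eye(A.shape[0])                 # add self-loops
    deg = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(deg.clamp(min=1e-6).rsqrt())
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)  # normalized propagation

C, d, d_out = 50, 85, 64                              # e.g. 50 classes, 85 attributes
X = torch.rand(C, d)
A = torch.relu(torch.corrcoef(X))                     # keep positive correlations only
W = torch.randn(d, d_out) * 0.1
class_embed = gcn_layer(X, A, W)                      # (50, 64) potential-space embeddings
```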
26. Cascade Network for Self-Supervised Monocular Depth Estimation [PDF] 返回目录
Chunlai Chai, Yukuan Lou, Shijin Zhang
Abstract: Obtaining real scene depth maps from a monocular camera is a classical computer vision problem that has received wide attention in recent years. However, training such a model usually requires a large number of manually labeled samples. To solve this problem, some researchers use a self-supervised learning model to overcome this difficulty and reduce the dependence on manually labeled data. Nevertheless, the accuracy and reliability of these methods have not reached the expected standard. In this paper, we propose a new self-supervised learning method based on cascade networks. Compared with previous self-supervised methods, our method has improved accuracy and reliability, which we demonstrate by experiments. We present a cascaded neural network that divides the target scene into regions of different sight distances and processes them separately to generate a better depth map. Our approach is divided into the following four steps. In the first step, we use the self-supervised model to roughly estimate the depth of the scene. In the second step, the depth of the scene generated in the first step is used as a label to divide the scene into parts of different depths. The third step uses models with different parameters to generate depth maps for the different depth parts of the target scene, and the fourth step fuses the depth maps. Through an ablation study, we demonstrate the effectiveness of each component individually and show high-quality, state-of-the-art results on the KITTI benchmark.
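Steps two through four admit a compact sketch: partition pixels by the rough depth into range bins, refine each bin with its own (toy) network, and fuse through the bin masks. The bin boundaries below are illustrative assumptions:

```python
# Sketch of steps 2-4: partition pixels by a rough depth estimate into range
# bins, refine each bin with its own network, and fuse via the bin masks.
import torch
import torch.nn as nn

def cascade_fuse(image, rough_depth, range_nets, bounds=(0.0, 10.0, 50.0, float("inf"))):
    fused = torch.zeros_like(rough_depth)
    for i, net in enumerate(range_nets):                   # one net per depth range
        mask = (rough_depth >= bounds[i]) & (rough_depth < bounds[i + 1])
        fused = fused + mask.float() * net(image)          # refine, keep only this range
    return fused

nets = nn.ModuleList(nn.Conv2d(3, 1, 3, padding=1) for _ in range(3))
depth = cascade_fuse(torch.randn(1, 3, 64, 64), torch.rand(1, 1, 64, 64) * 100, nets)
```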
27. Residual Learning for Effective joint Demosaicing-Denoising [PDF] 返回目录
Yu Guo, Qiyu Jin, Gabriele Facciolo, Tieyong Zeng, Jean-Michel Morel
Abstract: Image demosaicing and denoising are key steps in the color image production pipeline. The classical processing sequence consists of applying denoising first, and then demosaicing. However, this sequence leads to oversmoothing and an unpleasant checkerboard effect. Moreover, it is very difficult to change this order, because once the image is demosaiced, the statistical properties of the noise change dramatically. This is extremely challenging for traditional denoising models that strongly rely on statistical assumptions. In this paper, we attempt to tackle this thorny problem. Indeed, we invert the traditional CFA processing pipeline by first applying demosaicing and then using an adapted denoiser. In order to obtain high-quality demosaicing of noiseless images, we combine the advantages of traditional algorithms with deep learning. This is achieved by training convolutional neural networks (CNNs) to learn the residuals of traditional algorithms. To improve the performance in image demosaicing, we propose a modified Inception architecture. Given the trained demosaicer as a basic component, we apply it to noisy images and use another CNN to learn the residual noise (including artifacts) of the demosaiced images, which allows us to reconstruct full-color images. Experimental results clearly show that this method outperforms several state-of-the-art methods both quantitatively and in terms of visual quality.
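The demosaic-then-residual pipeline can be sketched with a deliberately naive "traditional" demosaicer (mask-normalized 3x3 averaging on an assumed RGGB Bayer layout); the paper's baseline algorithm and Inception-style CNN are much stronger:

```python
# Sketch of the demosaic-then-residual idea: a naive traditional demosaicer
# plus a CNN that predicts the residual correction.
import torch
import torch.nn as nn
import torch.nn.functional as F

def bayer_masks(h, w):                       # RGGB: R at (0,0), G at (0,1)/(1,0), B at (1,1)
    m = torch.zeros(3, h, w)
    m[0, 0::2, 0::2] = 1
    m[1, 0::2, 1::2] = 1
    m[1, 1::2, 0::2] = 1
    m[2, 1::2, 1::2] = 1
    return m

def naive_demosaic(mosaic):                  # mosaic: (B, 1, H, W) raw Bayer image
    b, _, h, w = mosaic.shape
    m = bayer_masks(h, w).to(mosaic).unsqueeze(0)
    sparse = mosaic * m                      # scatter samples into 3 color planes
    k = torch.ones(3, 1, 3, 3, device=mosaic.device)
    num = F.conv2d(sparse, k, padding=1, groups=3)
    den = F.conv2d(m.expand(b, -1, -1, -1), k, padding=1, groups=3)
    return num / den.clamp(min=1.0)          # average the available neighbors

refiner = nn.Conv2d(3, 3, 3, padding=1)      # stand-in for the modified Inception CNN
mosaic = torch.rand(2, 1, 64, 64)
rgb = naive_demosaic(mosaic)
out = rgb + refiner(rgb)                     # residual learning on top of the baseline
```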
28. Learning from Multimodal and Multitemporal Earth Observation Data for Building Damage Mapping [PDF] 返回目录
Bruno Adriano, Naoto Yokoya, Junshi Xia, Hiroyuki Miura, Wen Liu, Masashi Matsuoka, Shunichi Koshimura
Abstract: Earth observation technologies, such as optical imaging and synthetic aperture radar (SAR), provide excellent means to monitor ever-growing urban environments continuously. Notably, in the case of large-scale disasters (e.g., tsunamis and earthquakes), in which a response is highly time-critical, images from both data modalities can complement each other to accurately convey the full damage condition in the disaster's aftermath. However, due to several factors, such as weather and satellite coverage, it is often uncertain which data modality will be the first available for rapid disaster response efforts. Hence, novel methodologies that can utilize all accessible EO datasets are essential for disaster management. In this study, we have developed a global multisensor and multitemporal dataset for building damage mapping. We included building damage characteristics from three disaster types, namely, earthquakes, tsunamis, and typhoons, and considered three building damage categories. The global dataset contains high-resolution optical imagery and high-to-moderate-resolution multiband SAR data acquired before and after each disaster. Using this comprehensive dataset, we analyzed five data modality scenarios for damage mapping: single-mode (optical and SAR datasets), cross-modal (pre-disaster optical and post-disaster SAR datasets), and mode fusion scenarios. We defined a damage mapping framework for the semantic segmentation of damaged buildings based on a deep convolutional neural network algorithm. We compare our approach to another state-of-the-art baseline model for damage mapping. The results indicated that our dataset, together with a deep learning network, enabled acceptable predictions for all the data modality scenarios.
29. RelativeNAS: Relative Neural Architecture Search via Slow-Fast Learning [PDF] 返回目录
Hao Tan, Ran Cheng, Shihua Huang, Cheng He, Changxiao Qiu, Fan Yang, Ping Luo
Abstract: Despite the remarkable successes of convolutional neural networks (CNNs) in computer vision, it is time-consuming and error-prone to manually design a CNN. Among the various neural architecture search (NAS) methods that aim to automate the design of high-performance CNNs, differentiable NAS and population-based NAS are attracting increasing interest due to their unique characteristics. To benefit from the merits while overcoming the deficiencies of both, this work proposes a novel NAS method, RelativeNAS. As the key to efficient search, RelativeNAS performs joint learning between fast-learners (i.e., networks with relatively higher accuracy) and slow-learners in a pairwise manner. Moreover, since RelativeNAS only requires low-fidelity performance estimation to distinguish the fast-learner and slow-learner within each pair, it saves considerable computation costs for training the candidate architectures. The proposed RelativeNAS brings several unique advantages: (1) it achieves state-of-the-art performance on ImageNet with a top-1 error rate of 24.88%, outperforming DARTS and AmoebaNet-B by 1.82% and 1.12%, respectively; (2) it spends only nine hours with a single 1080Ti GPU to obtain the discovered cells, i.e., 3.75x and 7875x faster than DARTS and AmoebaNet, respectively; (3) it shows that the discovered cells obtained on CIFAR-10 can be directly transferred to object detection, semantic segmentation, and keypoint detection, yielding competitive results of 73.1% mAP on PASCAL VOC, 78.7% mIoU on Cityscapes, and 68.5% AP on MSCOCO, respectively. The code is available at this https URL
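The pairwise slow-fast update can be caricatured on continuous architecture encodings; RelativeNAS's actual encoding and update rule differ, and the scores here stand in for the low-fidelity accuracy estimates:

```python
# Toy sketch of pairwise slow-fast learning over continuous architecture
# vectors: in each pair, the lower-accuracy network (slow-learner) moves toward
# the higher-accuracy one (fast-learner), with random exploration noise.
import random
import torch

def slow_fast_step(population, scores, lr=0.5, noise=0.1):
    idx = list(range(len(population)))
    random.shuffle(idx)
    for a, b in zip(idx[::2], idx[1::2]):                  # random pairing
        slow, fast = (a, b) if scores[a] < scores[b] else (b, a)
        population[slow] += lr * (population[fast] - population[slow]) \
                            + noise * torch.randn_like(population[slow])
    return population

pop = [torch.rand(12) for _ in range(8)]                   # 8 candidate encodings
scores = [float(torch.rand(())) for _ in pop]              # stand-in for low-fidelity accuracy
pop = slow_fast_step(pop, scores)
```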
30. 3D Object Detection and Tracking Based on Streaming Data [PDF] 返回目录
Xusen Guo, Jiangfeng Gu, Silu Guo, Zixiao Xu, Chengzhang Yang, Shanghua Liu, Long Cheng, Kai Huang
Abstract: Recent approaches for 3D object detection have made tremendous progress due to the development of deep learning. However, previous research is mostly based on individual frames, leading to limited exploitation of information between frames. In this paper, we attempt to leverage the temporal information in streaming data and explore 3D streaming-based object detection as well as tracking. Toward this goal, we set up a dual-way network for 3D object detection based on keyframes, and then propagate predictions to non-key frames through a motion-based interpolation algorithm guided by temporal information. Our framework is not only shown to yield significant improvements in object detection compared with the frame-by-frame paradigm, but is also proven to produce competitive results on the KITTI Object Tracking Benchmark, with 76.68% in MOTA and 81.65% in MOTP, respectively.
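The propagation step can be approximated by a constant-velocity interpolation of 3D box parameters between keyframe detections; the paper's learned, motion-based interpolation is replaced by this simple rule:

```python
# Sketch of propagating keyframe detections to in-between frames by linearly
# interpolating 3D box parameters (x, y, z, l, w, h, yaw) for a tracked object.
import numpy as np

def interpolate_boxes(box_k0, box_k1, t0, t1, t):
    """box_*: (7,) arrays for the same object at keyframe times t0 < t1."""
    alpha = (t - t0) / (t1 - t0)
    box = (1 - alpha) * box_k0 + alpha * box_k1
    dyaw = (box_k1[6] - box_k0[6] + np.pi) % (2 * np.pi) - np.pi   # shortest yaw path
    box[6] = box_k0[6] + alpha * dyaw
    return box

b0 = np.array([10.0, 2.0, 0.0, 4.5, 1.8, 1.5, 0.1])
b1 = np.array([14.0, 2.2, 0.0, 4.5, 1.8, 1.5, 0.3])
mid = interpolate_boxes(b0, b1, t0=0, t1=4, t=2)   # prediction for non-key frame 2
```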
31. One-bit Supervision for Image Classification [PDF] 返回目录
Hengtong Hu, Lingxi Xie, Zewei Du, Richang Hong, Qi Tian
Abstract: This paper presents one-bit supervision, a novel setting for learning from incomplete annotations, in the scenario of image classification. Instead of training a model on the accurate label of each sample, our setting requires the model to query with a predicted label for each sample and learn from the answer whether the guess is correct. This provides one bit (yes or no) of information, and more importantly, annotating each sample becomes much easier than finding the accurate label from many candidate classes. There are two keys to training a model under one-bit supervision: improving the guess accuracy and making use of incorrect guesses. For these purposes, we propose a multi-stage training paradigm which incorporates negative label suppression into an off-the-shelf semi-supervised learning algorithm. On three popular image classification benchmarks, our approach demonstrates higher efficiency in utilizing the limited amount of annotations.
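One training step under one-bit feedback might look as follows: a "yes" turns the guess into an ordinary positive label, while a "no" only rules the guessed class out. The -log(1 - p) form of negative label suppression is an assumption for illustration:

```python
# Sketch of learning from one-bit answers: a "yes" yields a positive label; a
# "no" suppresses the wrongly guessed class. The paper's multi-stage
# semi-supervised pipeline is omitted.
import torch
import torch.nn.functional as F

def one_bit_loss(logits, guesses, answers):
    # guesses: (B,) queried labels; answers: (B,) bool, True if the guess was correct
    probs = F.softmax(logits, dim=1)
    p_guess = probs.gather(1, guesses.unsqueeze(1)).squeeze(1).clamp(1e-6, 1 - 1e-6)
    pos = F.cross_entropy(logits, guesses, reduction="none")   # confirmed samples
    neg = -torch.log(1 - p_guess)                              # suppress the wrong class
    return torch.where(answers, pos, neg).mean()

logits = torch.randn(8, 10, requires_grad=True)
guesses = logits.argmax(dim=1)                                 # model queries its prediction
answers = torch.rand(8) > 0.5                                  # simulated 1-bit annotator
loss = one_bit_loss(logits, guesses, answers)
```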
32. GINet: Graph Interaction Network for Scene Parsing [PDF] 返回目录
Tianyi Wu, Yu Lu, Yu Zhu, Chuang Zhang, Ming Wu, Zhanyu Ma, Guodong Guo
Abstract: Recently, context reasoning using image regions beyond local convolution has shown great potential for scene parsing. In this work, we explore how to incorporate linguistic knowledge to promote context reasoning over image regions by proposing a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss). The GI unit is capable of enhancing the feature representations of convolution networks with high-level semantics and of learning semantic coherency adaptively for each sample. Specifically, dataset-based linguistic knowledge is first incorporated in the GI unit to promote context reasoning over the visual graph; the evolved representations of the visual graph are then mapped back to each local representation to enhance the discriminative capability for scene parsing. The GI unit is further improved by the SC-loss, which enhances the semantic representations over the exemplar-based semantic graph. We perform full ablation studies to demonstrate the effectiveness of each component in our approach. In particular, the proposed GINet outperforms state-of-the-art approaches on popular benchmarks, including Pascal-Context and COCO Stuff.
33. SCOUTER: Slot Attention-based Classifier for Explainable Image Recognition [PDF] 返回目录
Liangzhi Li, Bowen Wang, Manisha Verma, Yuta Nakashima, Ryo Kawasaki, Hajime Nagahara
Abstract: Explainable artificial intelligence is gaining attention. However, most existing methods are based on gradients or intermediate features, which are not directly involved in the decision-making process of the classifier. In this paper, we propose a slot attention-based lightweight classifier called SCOUTER for transparent yet accurate classification. Two major differences from other attention-based methods include: (a) SCOUTER's explanation involves the final confidence for each category, offering more intuitive interpretation, and (b) all the categories have their corresponding positive or negative explanation, which tells "why the image is of a certain category" or "why the image is not of a certain category." We design a new loss tailored for SCOUTER that controls the model's behavior to switch between positive and negative explanations, as well as the size of explanatory regions. Experimental results show that SCOUTER can give better visual explanations while keeping good accuracy on a large dataset.
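A drastically simplified slot-attention classifier conveys point (a): one learnable slot per category attends over spatial features, and the attended evidence is the class logit, so the attention map doubles as that category's explanation. SCOUTER's recurrent slot updates and switching loss are omitted:

```python
# Simplified class-slot attention: the summed attention-weighted evidence is
# the logit, so each attention map serves as that category's explanation.
import torch
import torch.nn as nn

class SlotClassifier(nn.Module):
    def __init__(self, dim=64, n_classes=10):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(n_classes, dim) * 0.1)
        self.score = nn.Linear(dim, 1)                      # per-position evidence

    def forward(self, feats):                               # feats: (B, dim, H, W)
        b, d, h, w = feats.shape
        x = feats.flatten(2).transpose(1, 2)                # (B, HW, d)
        attn = torch.softmax(x @ self.slots.t() / d ** 0.5, dim=1)   # (B, HW, C)
        evidence = self.score(x)                            # (B, HW, 1)
        logits = (attn * evidence).sum(dim=1)               # (B, C)
        maps = attn.transpose(1, 2).reshape(b, -1, h, w)    # per-class explanation maps
        return logits, maps

logits, maps = SlotClassifier()(torch.randn(2, 64, 8, 8))
```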
34. Accelerating COVID-19 Differential Diagnosis with Explainable Ultrasound Image Analysis [PDF] 返回目录
Jannis Born, Nina Wiedemann, Gabriel Brändle, Charlotte Buhre, Bastian Rieck, Karsten Borgwardt
Abstract: Controlling the COVID-19 pandemic largely hinges upon the existence of fast, safe, and highly available diagnostic tools. Ultrasound, in contrast to CT or X-ray, has many practical advantages and can serve as a globally applicable first-line examination technique. We provide the largest publicly available lung ultrasound (US) dataset for COVID-19, consisting of 106 videos from three classes (COVID-19, bacterial pneumonia, and healthy controls), curated and approved by medical experts. On this dataset, we perform an in-depth study of the value of deep learning methods for the differential diagnosis of COVID-19. We propose a frame-based convolutional neural network that correctly classifies COVID-19 US videos with a sensitivity of 0.98+-0.04 and a specificity of 0.91+-0.08 (frame-based sensitivity 0.93+-0.05, specificity 0.87+-0.07). We further employ class activation maps for the spatio-temporal localization of pulmonary biomarkers, which we subsequently validate for human-in-the-loop scenarios in a blindfolded study with medical experts. Aiming for scalability and robustness, we perform ablation studies comparing mobile-friendly, frame- and video-based architectures and demonstrate the reliability of the best model via aleatoric and epistemic uncertainty estimates. We hope to pave the road for a community effort toward an accessible, efficient, and interpretable screening method, and we have started to work on a clinical validation of the proposed method. Data and code are publicly available.
35. Multi-channel MRI Embedding: An EffectiveStrategy for Enhancement of Human Brain WholeTumor Segmentation [PDF] 返回目录
Apurva Pandya, Catherine Samuel, Nisargkumar Patel, Vaibhavkumar Patel, Thangarajah Akilan
Abstract: One of the most important tasks in medical image processing is whole brain tumor segmentation. It assists in quicker clinical assessment and early detection of brain tumors, which is crucial for lifesaving treatment, since brain tumors, whether malignant or benign, are far more treatable when detected at an early stage. A brain tumor is a collection or a mass of abnormal cells in the brain. The human skull encloses the brain very rigidly, and any growth inside this restricted space can cause severe health issues. The detection of brain tumors requires careful and intricate analysis for surgical planning and treatment. Most physicians employ Magnetic Resonance Imaging (MRI) to diagnose such tumors. A manual diagnosis of the tumors using MRI is known to be time-consuming, taking up to approximately eighteen hours per sample. Thus, the automatic segmentation of tumors has become an optimal solution for this problem. Studies have shown that this technique provides better accuracy and is faster than manual analysis, allowing patients to receive treatment at the right time. Our research introduces an efficient strategy called Multi-channel MRI embedding to improve the results of deep learning-based tumor segmentation. Experimental analysis on the BraTS-2019 dataset with the U-Net encoder-decoder (EnDec) model shows significant improvement. The embedding strategy surpasses state-of-the-art approaches by 2% without any timing overhead.
36. Cosine meets Softmax: A tough-to-beat baseline for visual grounding [PDF] 返回目录
Nivedita Rufus, Unni Krishnan R Nair, K. Madhava Krishna, Vineet Gandhi
Abstract: In this paper, we present a simple baseline for visual grounding for autonomous driving which outperforms state-of-the-art methods while retaining minimal design choices. Our framework minimizes the cross-entropy loss over the cosine distances between multiple image ROI features and a text embedding (representing the given sentence/phrase). We use pre-trained networks for obtaining the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. By showing promise in simpler alternatives, our investigation suggests reconsidering approaches that employ sophisticated attention mechanisms, multi-stage reasoning, or complex metric-learning loss functions.
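For readers who want the core objective in code: a minimal PyTorch sketch of a cross-entropy over cosine similarities between candidate ROI features and a sentence embedding, in the spirit of the baseline described above. The temperature scaling and all shapes are illustrative assumptions, and the learned transformation layer on the text embedding is omitted; this is not the authors' code.

```python
import torch
import torch.nn.functional as F

def cosine_softmax_loss(roi_feats, text_emb, target_idx, temperature=0.1):
    """Cross-entropy over cosine similarities between N candidate ROI
    features (N, D) and one sentence embedding (D,): the referred ROI
    (target_idx) should score highest."""
    sims = F.cosine_similarity(roi_feats, text_emb.unsqueeze(0), dim=1)  # (N,)
    logits = (sims / temperature).unsqueeze(0)                           # (1, N)
    return F.cross_entropy(logits, torch.tensor([target_idx]))

roi = torch.randn(8, 256, requires_grad=True)   # 8 candidate region features
loss = cosine_softmax_loss(roi, torch.randn(256), target_idx=3)
loss.backward()
```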
37. Pairwise-GAN: Pose-based View Synthesis through Pair-Wise Training [PDF] 返回目录
Xuyang Shen, Jo Plested, Yue Yao, Tom Gedeon
Abstract: Three-dimensional face reconstruction is one of the popular applications in computer vision. However, even state-of-the-art models still require a frontal face as input, which restricts their usage scenarios in the wild. A similar dilemma also arises in face recognition. New research designed to recover the frontal face from a single side-pose facial image has emerged. The state-of-the-art in this area is the Face-Transformation generative adversarial network, which is based on CycleGAN. This inspired our research, which explores the performance of two pixel-transformation models, Pix2Pix and CycleGAN, in frontal facial synthesis. We conducted experiments with five different loss functions on Pix2Pix to improve its performance, and then proposed a new network, Pairwise-GAN, for frontal facial synthesis. Pairwise-GAN uses two parallel U-Nets as the generator and PatchGAN as the discriminator. The detailed hyper-parameters are also discussed. Based on quantitative measurement by face-similarity comparison, our results showed that Pix2Pix with L1 loss, gradient difference loss, and identity loss yields a 2.72% improvement in average similarity over the default Pix2Pix model. Additionally, Pairwise-GAN performs 5.4% better than CycleGAN and 9.1% better than Pix2Pix in average similarity.
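As a rough illustration of the three reconstruction terms reported to help Pix2Pix here (L1, gradient difference, and identity loss), the following PyTorch sketch combines them. The loss weights and the stand-in generator are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def gradient_difference_loss(pred, target):
    """Penalize mismatched horizontal/vertical image gradients, (B, C, H, W)."""
    dx = lambda t: t[..., :, 1:] - t[..., :, :-1]
    dy = lambda t: t[..., 1:, :] - t[..., :-1, :]
    return (dx(pred) - dx(target)).abs().mean() + (dy(pred) - dy(target)).abs().mean()

def reconstruction_loss(G, side, frontal, w_l1=100.0, w_gdl=10.0, w_id=5.0):
    fake = G(side)
    l1 = F.l1_loss(fake, frontal)
    gdl = gradient_difference_loss(fake, frontal)
    idt = F.l1_loss(G(frontal), frontal)  # identity: G should not alter a real frontal face
    return w_l1 * l1 + w_gdl * gdl + w_id * idt

G = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in generator
side, frontal = torch.randn(2, 4, 3, 64, 64)
loss = reconstruction_loss(G, side, frontal)
loss.backward()
```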
38. A Review of Visual Descriptors and Classification Techniques Used in Leaf Species Identification [PDF] 返回目录
K. K. Thyagharajan, I. Kiruba Raji
Abstract: Plants are fundamentally important to life. Key research areas in plant science include plant species identification, weed classification using hyperspectral images, monitoring plant health and tracing leaf growth, and the semantic interpretation of leaf information. Botanists easily identify plant species by discriminating between the shape of the leaf, tip, base, leaf margin and leaf vein, as well as the texture of the leaf and the arrangement of leaflets of compound leaves. Because of the increasing demand for experts and calls for biodiversity, there is a need for intelligent systems that recognize and characterize leaves so as to scrutinize a particular species, the diseases that affect it, the pattern of leaf growth, and so on. We review several image processing methods for the feature extraction of leaves, given that feature extraction is a crucial technique in computer vision. As computers cannot comprehend images, images must be converted into features by individually analysing their shapes, colours, textures and moments. Images that look the same may differ in their geometric and photometric variations. In our study, we also discuss certain machine learning classifiers for the analysis of different species of leaves.
39. Semantic Segmentation of Surface from Lidar Point Cloud [PDF] 返回目录
Aritra Mukherjee, Sourya Dipta Das, Jasorsi Ghosh, Ananda S. Chowdhury, Sanjoy Kumar Saha
Abstract: In the field of SLAM (Simultaneous Localization And Mapping) for robot navigation, mapping the environment is an important task. In this regard, the Lidar sensor can produce a near-accurate 3D map of the environment as a point cloud, in real time. Though the data is adequate for extracting information related to SLAM, processing millions of points in the point cloud is computationally quite expensive. The presented methodology proposes a fast algorithm that can be used to extract semantically labelled surface segments from the cloud, in real time, for direct navigational use or higher-level contextual scene reconstruction. First, a single scan from a spinning Lidar is used to generate a mesh of subsampled cloud points online. The generated mesh is further used for surface normal computation of those points, on the basis of which surface segments are estimated. A novel descriptor to represent the surface segments is proposed and utilized to determine the surface class of the segments (semantic label) with the help of a classifier. These semantic surface segments can be further utilized for geometric reconstruction of objects in the scene, or can be used for optimized trajectory planning by a robot. The proposed methodology is compared with a number of point-cloud segmentation methods and state-of-the-art semantic segmentation methods to emphasize its efficacy in terms of speed and accuracy.
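Since the pipeline hinges on per-point surface normals, here is a minimal NumPy sketch of the standard local-PCA estimator (brute-force neighbour search, fine for small clouds). The paper computes normals from its online mesh, so treat this only as the generic technique, not the authors' algorithm.

```python
import numpy as np

def estimate_normals(points, k=16):
    """Normal of each point in an (N, 3) cloud: the direction of least
    variance (smallest singular vector) among its k nearest neighbours."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (N, N), O(N^2)
    knn = np.argsort(d2, axis=1)[:, :k]
    normals = np.empty_like(points)
    for i, idx in enumerate(knn):
        nb = points[idx] - points[idx].mean(axis=0)
        _, _, vt = np.linalg.svd(nb, full_matrices=False)
        normals[i] = vt[-1]            # smallest right singular vector
    return normals

cloud = np.random.rand(200, 3)
print(estimate_normals(cloud).shape)   # (200, 3)
```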
40. Calibration Venus: An Interactive Camera Calibration Method Based on Search Algorithm and Pose Decomposition [PDF] 返回目录
Wentai Lei, Mengdi Xu, Feifei Hou, Wensi Jiang
Abstract: In many scenarios where cameras are applied, such as robot positioning and unmanned driving, camera calibration is one of the most important pieces of preparatory work. The interactive calibration method based on a plane board is becoming popular in the camera calibration field due to its repeatability and operational advantages. However, existing methods select suggestions from a fixed dataset of pre-defined poses based on subjective experience, which leads to a certain degree of one-sidedness. Moreover, they do not give users clear instructions on how to place the board in the specified pose.
41. Improving Deep Video Compression by Resolution-adaptive Flow Coding [PDF] 返回目录
Zhihao Hu, Zhenghao Chen, Dong Xu, Guo Lu, Wanli Ouyang, Shuhang Gu
Abstract: In learning-based video compression approaches, compressing pixel-level optical flow maps by developing new motion vector (MV) encoders is an essential issue. In this work, we propose a new framework called Resolution-adaptive Flow Coding (RaFC) to effectively compress the flow maps globally and locally, in which we use multi-resolution representations instead of single-resolution representations for both the input flow maps and the output motion features of the MV encoder. To handle complex or simple motion patterns globally, our frame-level scheme RaFC-frame automatically decides the optimal flow map resolution for each video frame. To cope with different types of motion patterns locally, our block-level scheme called RaFC-block can also select the optimal resolution for each local block of motion features. In addition, the rate-distortion criterion is applied to both RaFC-frame and RaFC-block to select the optimal motion coding mode for effective flow coding. Comprehensive experiments on four benchmark datasets, HEVC, VTL, UVG and MCL-JCV, clearly demonstrate the effectiveness of our overall RaFC framework after combining RaFC-frame and RaFC-block for video compression.
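The rate-distortion selection at the heart of RaFC can be summarized in a few lines: per frame or per block, pick the flow resolution minimizing J = D + λR. A toy sketch; λ and the candidate numbers are made up for illustration.

```python
def select_flow_resolution(candidates, lam=0.01):
    """candidates maps a resolution tag to (distortion, rate_in_bits);
    return the tag minimizing the RD cost J = D + lam * R."""
    return min(candidates, key=lambda r: candidates[r][0] + lam * candidates[r][1])

# coarser flow costs fewer bits but distorts more (toy numbers)
costs = {"1/1": (2.10, 9000.0), "1/2": (2.35, 4200.0), "1/4": (2.90, 1900.0)}
print(select_flow_resolution(costs))   # -> "1/4" at this lambda
```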
42. SSKD: Self-Supervised Knowledge Distillation for Cross Domain Adaptive Person Re-Identification [PDF] 返回目录
Junhui Yin, Jiayan Qiu, Siqing Zhang, Zhanyu Ma, Jun Guo
Abstract: Domain adaptive person re-identification (re-ID) is a challenging task due to the large discrepancy between the source domain and the target domain. To reduce the domain discrepancy, existing methods mainly attempt to generate pseudo labels for unlabeled target images by clustering algorithms. However, clustering methods tend to bring noisy labels, and the rich fine-grained details in unlabeled images are not sufficiently exploited. In this paper, we seek to improve the quality of labels by capturing feature representations from multiple augmented views of unlabeled images. To this end, we propose a Self-Supervised Knowledge Distillation (SSKD) technique containing two modules, identity learning and soft label learning. Identity learning explores the relationship between unlabeled samples and predicts their one-hot labels by clustering to give exact information for confidently distinguished images. Soft label learning regards labels as a distribution and induces an image to be associated with several related classes to train a peer network in a self-supervised manner, where a slowly evolving network is the core that produces soft labels as a gentle constraint for reliable images. Finally, the two modules can resist label noise for re-ID by enhancing each other and systematically integrating label information from unlabeled images. Extensive experiments on several adaptation tasks demonstrate that the proposed method outperforms the current state-of-the-art approaches by large margins.
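A "slowly evolving network" of this kind is typically maintained as an exponential moving average (EMA) of the student's weights; the PyTorch sketch below shows that generic mechanism under the assumption that SSKD follows it. The momentum value and the toy 751-way head are illustrative, not the paper's configuration.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Slowly evolving teacher: EMA of the student's weights, later used to
    produce soft labels as a gentle constraint for reliable images."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

student = torch.nn.Linear(2048, 751)   # e.g. one logit per identity
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
ema_update(teacher, student)           # call after every optimizer step
```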
43. Semi-supervised dictionary learning with graph regularization and active points [PDF] 返回目录
Khanh-Hung Tran, Fred-Maurice Ngole-Mboula, Jean-Luc Starck, Vincent Prost
Abstract: Supervised Dictionary Learning has gained much interest in the past decade and has shown significant performance improvements in image classification. However, in general, supervised learning needs a large number of labelled samples per class to achieve an acceptable result. To deal with databases that have just a few labelled samples per class, semi-supervised learning, which also exploits unlabelled samples in the training phase, is used. Indeed, unlabelled samples can help to regularize the learning model, yielding an improvement in classification accuracy. In this paper, we propose a new semi-supervised dictionary learning method based on two pillars: on one hand, we enforce manifold structure preservation from the original data into the sparse code space using Locally Linear Embedding, which can be considered a regularization of the sparse code; on the other hand, we train a semi-supervised classifier in the sparse code space. We show that our approach provides an improvement over state-of-the-art semi-supervised dictionary learning methods.
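The LLE regularizer rests on reconstruction weights that express each sample as an affine combination of its neighbours; the sparse codes are then asked to satisfy the same local reconstructions. A NumPy sketch of the standard weight computation (k and the conditioning constant are illustrative choices):

```python
import numpy as np

def lle_weights(X, k=5, reg=1e-3):
    """W[i] reconstructs X[i] from its k nearest neighbours, sum(W[i]) = 1."""
    n = X.shape[0]
    W = np.zeros((n, n))
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)               # exclude self-neighbours
    for i in range(n):
        idx = np.argsort(d2[i])[:k]
        Z = X[idx] - X[i]                      # neighbours shifted to the origin
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(k)     # conditioning, as in standard LLE
        w = np.linalg.solve(C, np.ones(k))
        W[i, idx] = w / w.sum()
    return W

W = lle_weights(np.random.rand(50, 8))
print(np.allclose(W.sum(axis=1), 1.0))         # True: rows sum to one
```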
44. Interpretation of smartphone-captured radiographs utilizing a deep learning-based approach [PDF] 返回目录
Hieu X. Le, Phuong D. Nguyen, Thang H. Nguyen, Khanh N.Q. Le, Thanh T. Nguyen
Abstract: Recently, computer-aided diagnostic systems (CADs) that can automatically and effectively interpret medical images have become an emerging subject of academic attention. For radiographs, several deep learning-based systems or models have been developed to study the multi-label disease recognition task. However, none of them have been trained to work on smartphone-captured chest radiographs. In this study, we propose a system that comprises a sequence of deep learning-based neural networks trained on the newly released CheXphoto dataset to tackle this issue. The proposed approach achieved promising results of 0.684 in AUC and 0.699 in average F1 score. To the best of our knowledge, this is the first published study shown to be capable of processing smartphone-captured radiographs.
45. Synthesizing brain tumor images and annotations by combining progressive growing GAN and SPADE [PDF] 返回目录
Mehdi Foroozandeh, Anders Eklund
Abstract: Training segmentation networks requires large annotated datasets, but manual annotation is time consuming and costly. We here investigate if the combination of a noise-to-image GAN and an image-to-image GAN can be used to synthesize realistic brain tumor images as well as the corresponding tumor annotations (labels), to substantially increase the number of training images. The noise-to-image GAN is used to synthesize new label images, while the image-to-image GAN generates the corresponding MR image from the label image. Our results indicate that the two GANs can synthesize label images and MR images that look realistic, and that adding synthetic images improves the segmentation performance, although the effect is small.
46. PolSAR Image Classification Based on Robust Low-Rank Feature Extraction and Markov Random Field [PDF] 返回目录
Haixia Bi, Jing Yao, Zhiqiang Wei, Danfeng Hong, Jocelyn Chanussot
Abstract: Polarimetric synthetic aperture radar (PolSAR) image classification has been investigated vigorously in various remote sensing applications. However, it is still a challenging task nowadays. One significant barrier lies in the speckle effect embedded in the PolSAR imaging process, which greatly degrades the quality of the images and further complicates the classification. To this end, we present a novel PolSAR image classification method, which removes speckle noise via low-rank (LR) feature extraction and enforces smoothness priors via Markov random field (MRF). Specifically, we employ the mixture of Gaussian-based robust LR matrix factorization to simultaneously extract discriminative features and remove complex noises. Then, a classification map is obtained by applying convolutional neural network with data augmentation on the extracted features, where local consistency is implicitly involved, and the insufficient label issue is alleviated. Finally, we refine the classification map by MRF to enforce contextual smoothness. We conduct experiments on two benchmark PolSAR datasets. Experimental results indicate that the proposed method achieves promising classification performance and preferable spatial consistency.
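The abstract does not say which MRF solver refines the CNN's classification map; as a generic stand-in, here is a NumPy sketch of Iterated Conditional Modes over a Potts smoothness prior, which captures the idea of trading the classifier's per-pixel confidence against label agreement with neighbours. ICM is an illustrative choice here, not necessarily the paper's inference method.

```python
import numpy as np

def icm_smooth(probs, beta=1.0, iters=5):
    """probs: (H, W, K) per-pixel class probabilities. Minimize
    -log p(label) + beta * (#4-neighbours with a different label)."""
    unary = -np.log(probs + 1e-12)
    labels = probs.argmax(-1)
    H, W, K = probs.shape
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                nbrs = [labels[yy, xx] for yy, xx in
                        ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                        if 0 <= yy < H and 0 <= xx < W]
                cost = unary[y, x] + beta * np.array(
                    [sum(n != k for n in nbrs) for k in range(K)])
                labels[y, x] = cost.argmin()
    return labels

smoothed = icm_smooth(np.random.dirichlet(np.ones(4), size=(32, 32)))
```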
47. Coding Facial Expressions with Gabor Wavelets (IVC Special Issue) [PDF] 返回目录
Michael J. Lyons, Miyuki Kamachi, Jiro Gyoba
Abstract: We present a method for extracting information about facial expressions from digital images. The method codes facial expression images using a multi-orientation, multi-resolution set of Gabor filters that are topographically ordered and approximately aligned with the face. A similarity space derived from this code is compared with one derived from semantic ratings of the images by human observers. Interestingly the low-dimensional structure of the image-derived similarity space shares organizational features with the circumplex model of affect, suggesting a bridge between categorical and dimensional representations of facial expression. Our results also indicate that it would be possible to construct a facial expression classifier based on a topographically-linked multi-orientation, multi-resolution Gabor coding of the facial images at the input stage. The significant degree of psychological plausibility exhibited by the proposed code may also be useful in the design of human-computer interfaces.
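The multi-orientation, multi-resolution Gabor code lends itself to a compact sketch; the kernel below is the textbook real Gabor filter, and the bank's 8 orientations and 3 wavelengths are illustrative choices rather than the paper's exact parameters.

```python
import numpy as np

def gabor_kernel(size=31, sigma=4.0, theta=0.0, lambd=10.0, gamma=0.5, psi=0.0):
    """Real part of a 2-D Gabor filter: Gaussian envelope times a sinusoid
    of wavelength lambd, oriented at angle theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr ** 2 + gamma ** 2 * yr ** 2) / (2 * sigma ** 2)) \
        * np.cos(2 * np.pi * xr / lambd + psi)

bank = [gabor_kernel(theta=t, lambd=l)
        for t in np.linspace(0, np.pi, 8, endpoint=False)
        for l in (6.0, 12.0, 24.0)]
print(len(bank), bank[0].shape)   # 24 filters of shape (31, 31)
```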
48. Deep Detection for Face Manipulation [PDF] 返回目录
Disheng Feng, Xuequan Lu, Xufeng Lin
Abstract: It has become increasingly challenging to distinguish real faces from their visually realistic fake counterparts, due to the great advances of deep learning based face manipulation techniques in recent years. In this paper, we introduce a deep learning method to detect face manipulation. It consists of two stages: feature extraction and binary classification. To better distinguish fake faces from real faces, we resort to the triplet loss function in the first stage. We then design a simple linear classification network to bridge the learned contrastive features with the real/fake faces. Experimental results on public benchmark datasets demonstrate the effectiveness of this method, and show that it generates better performance than state-of-the-art techniques in most cases.
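Stage one of the detector learns an embedding under the triplet loss; PyTorch ships this loss directly, so a minimal sketch is short. The toy backbone and batch shapes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)

# anchor and positive share a class (e.g. real faces); negative is the other
# class (fake faces); `embed` stands in for the real feature extractor
embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
a, p, n = (torch.randn(16, 3, 64, 64) for _ in range(3))
loss = triplet(embed(a), embed(p), embed(n))
loss.backward()
```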
49. An approach to human iris recognition using quantitative analysis of image features and machine learning [PDF] 返回目录
Abolfazl Zargari Khuzani, Najmeh Mashhadi, Morteza Heidari, Donya Khaledyan
Abstract: The iris pattern is a unique biological feature for each individual, making it a valuable and powerful tool for human identification. In this paper, an efficient framework for iris recognition is proposed in four steps: (1) iris segmentation (using a relative total variation combined with Coarse Iris Localization), (2) feature extraction (using Shape&density, FFT, GLCM, GLDM, and Wavelet), (3) feature reduction (employing Kernel-PCA), and (4) classification (applying a multi-layer neural network) to classify 2000 iris images of the CASIA-Iris-Interval dataset obtained from 200 volunteers. The results confirm that the proposed scheme can provide a reliable prediction with an accuracy of up to 99.64%.
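Steps (3) and (4) map directly onto a scikit-learn pipeline; the component count, hidden size, and stand-in feature matrix below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA
from sklearn.neural_network import MLPClassifier

# Kernel-PCA feature reduction followed by a multi-layer neural network
clf = make_pipeline(
    StandardScaler(),
    KernelPCA(n_components=50, kernel="rbf"),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
)
X = np.random.rand(200, 120)            # stand-in for the handcrafted features
y = np.random.randint(0, 10, size=200)  # stand-in subject labels
clf.fit(X, y)
print(clf.score(X, y))
```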
50. A Unified Approach to Kinship Verification [PDF] 返回目录
Eran Dahan, Yosi Keller
Abstract: In this work, we propose a deep learning-based approach for kin verification using a unified multi-task learning scheme where all kinship classes are jointly learned. This allows us to better utilize small training sets that are typical of kin verification. We introduce a novel approach for fusing the embeddings of kin images to avoid overfitting, which is a common issue in training such networks. An adaptive sampling scheme is derived for the training set images to resolve the inherent imbalance in kin verification datasets. A thorough ablation study exemplifies the effectiveness of our approach, which is experimentally shown to outperform contemporary state-of-the-art kin verification results when applied to the Families In the Wild, FG2018, and FG2020 datasets.
51. Exploring the Hierarchy in Relation Labels for Scene Graph Generation [PDF] 返回目录
Yi Zhou, Shuyang Sun, Chao Zhang, Yikang Li, Wanli Ouyang
Abstract: By assigning each relationship a single label, current approaches formulate relationship detection as a classification problem. Under this formulation, predicate categories are treated as completely different classes. However, different from object labels, where different classes have explicit boundaries, predicates usually overlap in their semantic meanings. For example, sit_on and stand_on share a common meaning of vertical relationship but differ in the details of how the two objects are vertically placed. In order to leverage the inherent structure of the predicate categories, we propose to first build a language hierarchy and then utilize the Hierarchy Guided Feature Learning (HGFL) strategy to learn better region features at both the coarse-grained level and the fine-grained level. Besides, we also propose the Hierarchy Guided Module (HGM) to utilize the coarse-grained level to guide the learning of fine-grained level features. Experiments show that the proposed simple yet effective method can improve several state-of-the-art baselines by a large margin (up to 33% relative gain) in terms of Recall@50 on the task of Scene Graph Generation on different datasets.
52. Map-merging Algorithms for Visual SLAM: Feasibility Study and Empirical Evaluation [PDF] 返回目录
Andrey Bokovoy, Kirill Muraviev, Konstantin Yakovlev
Abstract: Simultaneous localization and mapping, especially when relying solely on video data (vSLAM), is a challenging problem that has been extensively studied in robotics and computer vision. State-of-the-art vSLAM algorithms are capable of constructing accurate-enough maps that enable a mobile robot to autonomously navigate an unknown environment. In this work, we are interested in an important problem related to vSLAM, namely map merging, which might appear in various practically important scenarios, e.g. in a multi-robot coverage scenario. This problem asks whether different vSLAM maps can be merged into a consistent single representation. We examine the existing 2D and 3D map-merging algorithms and conduct an extensive empirical evaluation in a realistic simulated environment (Habitat). Both qualitative and quantitative comparisons are carried out, and the obtained results are reported and analyzed.
53. Learning semantic Image attributes using Image recognition and knowledge graph embeddings [PDF] 返回目录
Ashutosh Tiwari, Sandeep Varma
Abstract: Extracting structured knowledge from texts has traditionally been used for knowledge base generation. However, other sources of information, such as images, can be leveraged in this process to build more complete and richer knowledge bases. Structured semantic representations of the content of an image and knowledge graph embeddings can provide a unique representation of the semantic relationships between image entities. Linking known entities in knowledge graphs and learning open-world images using language models has attracted much interest over the years. In this paper, we propose a shared learning approach to learn semantic attributes of images by combining a knowledge graph embedding model with the recognized attributes of images. The proposed model promises to help us understand the semantic relationships between the entities of an image and implicitly provides links for the extracted entities through a knowledge graph embedding model. Under the limitation of using a custom, user-defined knowledge base with limited data, the proposed model presents significant accuracy and provides a new alternative to earlier approaches. The proposed approach is a step towards bridging the gap between frameworks which learn from large amounts of data and frameworks which use a limited set of predicates to infer new knowledge.
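The abstract leaves the embedding model unspecified; TransE is one common choice and conveys the idea that a triple (head, relation, tail) is plausible when head + relation lands near tail in embedding space. A hypothetical sketch, with table sizes and relation names invented for illustration:

```python
import torch

def transe_score(h, r, t):
    """TransE score of a triple: ||h + r - t||_1, lower is more plausible."""
    return (h + r - t).norm(p=1, dim=-1)

E = torch.nn.Embedding(1000, 64)   # entity table (e.g. detected image entities)
R = torch.nn.Embedding(50, 64)     # relation table (e.g. "depicts", "part_of")
s = transe_score(E(torch.tensor([3])), R(torch.tensor([7])), E(torch.tensor([42])))
print(s)
```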
摘要:从文本中提取结构化的知识一向被用来对知识产生碱。然而,其他信息源,如图像可以利用这个过程来构建更完整,更丰富的知识基础。图像和知识图嵌入的内容的结构化语义表示可以提供图像实体之间的语义关系的唯一代表。链接在知识图已知实体和学习使用语言模型的开放世界的图像已经吸引了大量的历年的兴趣。在本文中,我们提出了一个共同学习方法,通过知识图嵌入模型与影像的认可属性相结合,学习图像的语义属性。该模型处所,帮助我们理解图像的实体和含蓄之间的语义关系,通过知识图嵌入模型提供了一个链接,所提取的实体。下使用具有有限的数据,所提出的模型呈现显著精度的用户自定义的知识基础的限制,并提供到早期的方法一个新的替代。建议的做法是在弥合从大量其中使用一组有限的谓词来推断新的知识数据和框架的学习框架之间的差距的一个步骤。
54. Revisiting the Threat Space for Vision-based Keystroke Inference Attacks [PDF] 返回目录
John Lim, True Price, Fabian Monrose, Jan-Michael Frahm
Abstract: A vision-based keystroke inference attack is a side-channel attack in which an attacker uses an optical device to record users on their mobile devices and infer their keystrokes. The threat space for these attacks has been studied in the past, but we argue that the defining characteristics of this threat space, namely the strength of the attacker, are outdated. Previous works do not study adversaries with vision systems trained with deep neural networks, because these models require large amounts of training data and curating such a dataset is expensive. To address this, we create a large-scale synthetic dataset to simulate the attack scenario for a keystroke inference attack. We show that pre-training on synthetic data first, followed by transfer learning on real-life data, increases the performance of our deep learning models. This indicates that these models are able to learn rich, meaningful representations from our synthetic data and that training on the synthetic data can help overcome the issue of having small real-life datasets for vision-based keystroke inference attacks. In this work, we focus on single-keypress classification, where the input is a frame of a keypress and the output is a predicted key. We achieve an accuracy of 95.6% after pre-training a CNN on our synthetic data and training on a small set of real-life data in an adversarial domain adaptation framework. Source Code for Simulator: this https URL
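To make the training recipe concrete, here is a minimal PyTorch sketch of the pre-train-then-transfer pipeline, under the assumption of a ResNet-18 backbone and randomly generated stand-in data; the paper's simulator and its adversarial domain-adaptation objective are not reproduced here.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

# Sketch: the same CNN is first trained on large synthetic data, then
# fine-tuned on a small real-life set. Datasets here are random stand-ins.

NUM_KEYS = 26  # assumption: one class per letter key

def make_loader(n):
    frames = torch.rand(n, 3, 64, 64)            # cropped keypress frames
    keys = torch.randint(0, NUM_KEYS, (n,))
    return DataLoader(TensorDataset(frames, keys), batch_size=8)

def train(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for frames, keys in loader:
            opt.zero_grad()
            ce(model(frames), keys).backward()
            opt.step()

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_KEYS)

train(model, make_loader(64), epochs=1, lr=1e-3)  # 1) pre-train on synthetic data
train(model, make_loader(16), epochs=1, lr=1e-4)  # 2) fine-tune on scarce real data
```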
55. A CNN Based Approach for the Near-Field Photometric Stereo Problem [PDF] 返回目录
Fotios Logothetis, Ignas Budvytis, Roberto Mecca, Roberto Cipolla
Abstract: Reconstructing the 3D shape of an object from several images under different light sources is a very challenging task, especially when realistic assumptions such as light propagation and attenuation, perspective viewing geometry, and specular light reflection are considered. Many works tackling Photometric Stereo (PS) problems relax most of the aforementioned assumptions; in particular, they ignore specular reflection and global illumination effects. In this work, we propose the first CNN-based approach capable of handling these realistic assumptions in Photometric Stereo. We leverage recent improvements of deep neural networks for far-field Photometric Stereo and adapt them to the near-field setup. We achieve this by employing an iterative procedure for shape estimation that has two main steps. First, we train a per-pixel CNN to predict surface normals from reflectance samples. Second, we compute the depth by integrating the normal field in order to iteratively estimate the light directions and attenuation, which are used to compensate the input images and compute the reflectance samples for the next iteration. To the best of our knowledge, this is the first near-field framework able to accurately predict 3D shape from highly specular objects. Our method outperforms competing state-of-the-art near-field Photometric Stereo approaches in both synthetic and real experiments.
56. Micro-Facial Expression Recognition Based on Deep-Rooted Learning Algorithm [PDF] 返回目录
S. D. Lalitha, K. K. Thyagharajan
Abstract: Facial expressions are important cues for observing human emotions. Facial expression recognition has attracted many researchers for years, but it remains a challenging topic since expression features vary greatly with head poses, environments, and variations across the different persons involved. In this work, three major steps are involved in improving the performance of micro-facial expression recognition. First, Adaptive Homomorphic Filtering is used for face detection and rotation rectification. Second, micro-facial features are used to extract the appearance variations of a test image via spatial analysis, and motion-information features are used for expression recognition over a sequence of facial images. An effective Micro-Facial Expression Based Deep-Rooted Learning (MFEDRL) classifier is proposed in this paper to better recognize spontaneous micro-expressions by learning parameters on the optimal features. The proposed method includes two loss functions: a cross-entropy loss and a centre loss. The performance of the algorithm is then evaluated using recognition rate and error measures. Simulation results show that the predictive performance of the proposed method outperforms that of existing classifiers such as Convolutional Neural Networks (CNN), Deep Neural Networks (DNN), Artificial Neural Networks (ANN), Support Vector Machines (SVM), and k-Nearest Neighbours (KNN) in terms of accuracy and Mean Absolute Error (MAE).
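As a concrete illustration of the two-term objective, the following PyTorch sketch combines a cross-entropy loss with a standard centre loss; the feature dimension, number of classes, and weighting factor `lam` are assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Penalises the distance between each feature and its learnable class centre."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

num_classes, feat_dim, lam = 7, 128, 0.01         # lam weights the centre term
center_loss = CenterLoss(num_classes, feat_dim)
ce_loss = nn.CrossEntropyLoss()

# Stand-ins for the expression network's features and classifier outputs.
feats = torch.randn(32, feat_dim, requires_grad=True)
logits = torch.randn(32, num_classes, requires_grad=True)
labels = torch.randint(0, num_classes, (32,))

total = ce_loss(logits, labels) + lam * center_loss(feats, labels)
total.backward()
```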
57. Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning [PDF] 返回目录
Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy J. Ma, Xing Sun
Abstract: Self-supervised learning has shown great potential in improving the video representation ability of deep neural networks by constructing surrogate supervision signals from unlabeled data. However, some current methods tend to suffer from a background cheating problem, i.e., the prediction is highly dependent on the video background instead of the motion, making the model vulnerable to background changes. To alleviate the problem, we propose to remove the background impact by adding the background. That is, given a video, we randomly select a static frame and add it to every other frame to construct a distracting video sample. Then we force the model to pull the feature of the distracting video and the feature of the original video closer, so that the model is explicitly restricted to resist the background influence and focus more on the motion changes. In addition, in order to prevent the static frame from disturbing the motion area too much, we restrict the feature to be consistent with the temporally flipped feature of the reversed video, forcing the model to concentrate more on the motion. We term our method Temporal-sensitive Background Erasing (TBE). Experiments show that TBE brings about 6.4% and 4.8% improvements over the state-of-the-art method on the HMDB51 and UCF101 datasets, respectively. It is worth noting that the implementation of our method is simple and neat, and it can be added as an additional regularization term to most SOTA methods without much effort.
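The distracting-sample construction lends itself to a few lines of code. Below is a minimal PyTorch sketch in which a randomly selected static frame is blended into every frame of the clip; the blending weight `alpha` is an assumption, as the exact mixing used in the paper is not specified here.

```python
import torch

def make_distracting_clip(clip: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """clip: (T, C, H, W) video tensor with values in [0, 1]."""
    t = torch.randint(0, clip.shape[0], (1,)).item()  # random static frame index
    static = clip[t : t + 1]                          # (1, C, H, W), broadcast over time
    return (1 - alpha) * clip + alpha * static

clip = torch.rand(16, 3, 112, 112)
distracting = make_distracting_clip(clip)
# Training then pulls the features of `distracting` and `clip` together so the
# model learns to ignore the injected static background.
```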
58. Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion [PDF] 返回目录
Jinpeng Wang, Yuting Gao, Ke Li, Xinyang Jiang, Xiaowei Guo, Rongrong Ji, Xing Sun
Abstract: One significant factor we expect video representation learning to capture, especially in contrast to image representation learning, is object motion. However, we found that in current mainstream video datasets, some action categories are highly related to the scene where the action happens, causing the model to degrade to a solution that encodes only the scene information. For example, a trained model may predict a video as playing football simply because it sees the field, neglecting that the subject is dancing as a cheerleader on the field. This goes against our original intention for video representation learning and may introduce a scene bias on different datasets that cannot be ignored. In order to tackle this problem, we propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays closer attention to the motion information. Specifically, we construct a positive clip and a negative clip for each video. Compared to the original video, the positive is motion-untouched but scene-broken, and the negative is motion-broken but scene-untouched, via Spatial Local Disturbance and Temporal Local Disturbance. Our objective is to pull the positive closer to the original clip in the latent space while pushing the negative farther away. In this way, the impact of the scene is weakened while the temporal sensitivity of the network is further enhanced. We conduct experiments on two tasks with various backbones and different pre-training datasets, and find that our method surpasses the SOTA methods with remarkable 8.1% and 8.8% improvements on the action recognition task on the UCF101 and HMDB51 datasets, respectively, using the same backbone.
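A minimal sketch of the clip construction and the pull/push objective is given below; the simple noise and frame-shuffling perturbations are illustrative stand-ins for the paper's Spatial Local Disturbance and Temporal Local Disturbance, and the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def scene_broken_positive(clip: torch.Tensor) -> torch.Tensor:
    """Perturb appearance identically on every frame; frame order is untouched."""
    noise = 0.2 * torch.randn_like(clip[:1])          # one noise map, shared over time
    return (clip + noise).clamp(0, 1)

def motion_broken_negative(clip: torch.Tensor) -> torch.Tensor:
    """Shuffle frames: scene content preserved, temporal structure destroyed."""
    return clip[torch.randperm(clip.shape[0])]

def pull_push_loss(f_anchor, f_pos, f_neg, margin: float = 0.5) -> torch.Tensor:
    d_pos = 1 - F.cosine_similarity(f_anchor, f_pos, dim=-1)
    d_neg = 1 - F.cosine_similarity(f_anchor, f_neg, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

clip = torch.rand(16, 3, 112, 112)                    # (T, C, H, W)
pos, neg = scene_broken_positive(clip), motion_broken_negative(clip)
# Features would come from a video encoder applied to clip/pos/neg;
# random vectors keep the sketch runnable.
f_a, f_p, f_n = (torch.randn(1, 256) for _ in range(3))
loss = pull_push_loss(f_a, f_p, f_n)
```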
59. Multi-Spectral Image Synthesis for Crop/Weed Segmentation in Precision Farming [PDF] 返回目录
Mulham Fawakherji, Ciro Potena, Alberto Pretto, Domenico D. Bloisi, Daniele Nardi
Abstract: An effective perception system is a fundamental component for farming robots, as it enables them to properly perceive the surrounding environment and to carry out targeted operations. The most recent approaches make use of state-of-the-art machine learning techniques to learn an effective model for the target task. However, those methods need a large amount of labelled data for training. A recent approach to deal with this issue is data augmentation through Generative Adversarial Networks (GANs), where entire synthetic scenes are added to the training data, thus enlarging and diversifying their informative content. In this work, we propose an alternative to the common data augmentation techniques, applying it to the fundamental problem of crop/weed segmentation in precision farming. Starting from real images, we create semi-artificial samples by replacing the most relevant object classes (i.e., crop and weeds) with their synthesized counterparts. To do that, we employ a conditional GAN (cGAN), where the generative model is trained by conditioning on the shape of the generated object. Moreover, in addition to RGB data, we take into account near-infrared (NIR) information, generating four-channel multi-spectral synthetic images. Quantitative experiments, carried out on three publicly available datasets, show that (i) our model is capable of generating realistic multi-spectral images of plants and (ii) the usage of such synthetic images in the training process improves the segmentation performance of state-of-the-art semantic segmentation Convolutional Networks.
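The semi-artificial sample construction amounts to mask-guided compositing. The NumPy sketch below pastes a synthesized four-channel (RGB + NIR) patch into a real image wherever the crop/weed mask is set; the cGAN that produces `synth` is not shown, and the arrays here are random stand-ins.

```python
import numpy as np

def composite(real: np.ndarray, synth: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """real, synth: (H, W, C) images; mask: (H, W) binary map of crop/weed pixels."""
    m = mask[..., None].astype(real.dtype)            # broadcast mask over channels
    return real * (1 - m) + synth * m

H, W = 128, 128
real = np.random.rand(H, W, 4)                        # RGB + NIR channels
synth = np.random.rand(H, W, 4)                       # stand-in for cGAN output
mask = np.zeros((H, W)); mask[40:80, 40:80] = 1       # crop/weed region
semi_artificial = composite(real, synth, mask)
```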
60. Smoothness Sensor: Adaptive Smoothness-Transition Graph Convolutions for Attributed Graph Clustering [PDF] 返回目录
Chaojie Ji, Hongwei Chen, Ruxin Wang, Yunpeng Cai, Hongyan Wu
Abstract: Clustering techniques attempt to group objects with similar properties into a cluster. Clustering the nodes of an attributed graph, in which each node is associated with a set of feature attributes, has attracted significant attention. Graph convolutional networks (GCNs) represent an effective approach for integrating the two complementary factors of node attributes and structural information for attributed graph clustering. However, oversmoothing in GCNs produces indistinguishable representations of nodes, such that the nodes in a graph tend to be grouped into fewer clusters, and poses a challenge due to the resulting performance drop. In this study, we propose a smoothness sensor for attributed graph clustering based on adaptive smoothness-transition graph convolutions, which senses the smoothness of a graph and adaptively terminates the current convolution once the smoothness is saturated, to prevent oversmoothing. Furthermore, as an alternative to graph-level smoothness, a novel fine-grained node-wise assessment of smoothness is proposed, in which smoothness is computed according to the neighborhood conditions of a given node at a certain order of graph convolution. In addition, a self-supervision criterion is designed, considering both the tightness within clusters and the separation between clusters, to guide the whole neural network training process. Experiments show that the proposed methods significantly outperform 12 other state-of-the-art baselines in terms of three different metrics across four benchmark datasets. In addition, an extensive study reveals the reasons for their effectiveness and efficiency.
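The core control flow, sensing smoothness and stopping propagation once it saturates, can be illustrated with a toy NumPy sketch; the mean neighbour-distance smoothness measure and the tolerance below are illustrative choices, not necessarily the paper's exact criterion.

```python
import numpy as np

def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def smoothness(A: np.ndarray, X: np.ndarray) -> float:
    """Average feature distance between connected nodes (smaller = smoother)."""
    src, dst = np.nonzero(A)
    return float(np.linalg.norm(X[src] - X[dst], axis=1).mean())

def adaptive_propagate(A, X, max_hops=10, tol=1e-3):
    P = normalized_adjacency(A)
    prev = smoothness(A, X)
    for hop in range(max_hops):
        X = P @ X                                     # one more graph convolution
        cur = smoothness(A, X)
        if prev - cur < tol:                          # smoothness saturated: stop
            return X, hop + 1
        prev = cur
    return X, max_hops

A = (np.random.rand(30, 30) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T                        # symmetric, no self-loops
X, hops = adaptive_propagate(A, np.random.randn(30, 16))
```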
61. Monitoring Spatial Sustainable Development: semi-automated analysis of Satellite and Aerial Images for Energy Transition and Sustainability Indicators [PDF] 返回目录
Tim De Jong, Stefano Bromuri, Xi Chang, Marc Debusschere, Natalie Rosenski, Clara Schartner, Katharina Strauch, Marion Boehmer, Lyana Curier
Abstract: This report presents the results of the DeepSolaris project that was carried out under the ESS action 'Merging Geostatistics and Geospatial Information in Member States'. During the project, several deep learning algorithms were evaluated for detecting solar panels in remote sensing data. The aim of the project was to evaluate whether deep learning models could be developed that worked across different member states in the European Union. Two remote sensing data sources were considered: aerial images on the one hand, and satellite images on the other. Two flavours of deep learning models were evaluated: classification models and object detection models. For the evaluation of the deep learning models we used a cross-site evaluation approach: the deep learning models were trained in one geographical area and then evaluated on a different geographical area, previously unseen by the algorithm. The cross-site evaluation was furthermore carried out twice: deep learning models trained on the Netherlands were evaluated on Germany, and vice versa. While the deep learning models were able to detect solar panels successfully, false detections remained a problem. Moreover, model performance decreased dramatically when evaluated in a cross-border fashion. Hence, training a model that performs reliably across different countries in the European Union is a challenging task. That being said, the models detected quite a share of solar panels not present in current solar panel registers and can therefore already be used as-is to help reduce manual labor in checking these registers.
62. Abstractive Information Extraction from Scanned Invoices (AIESI) using End-to-end Sequential Approach [PDF] 返回目录
Shreeshiv Patel, Dvijesh Bhatt
Abstract: Recent proliferation in the fields of Machine Learning and Deep Learning allows us to generate OCR models with higher accuracy. Optical Character Recognition (OCR) is the process of extracting text from documents and scanned images. For document data streamlining, we are interested in data such as payee name, total amount, and address. The extracted information helps to obtain complete insight into the data, which can be useful for fast document searching, efficient indexing in databases, data analytics, and more. Using AIESI, we can eliminate human effort in extracting key parameters from scanned documents. Abstractive Information Extraction from Scanned Invoices (AIESI) is a process of extracting information such as date, total amount, and payee name from scanned receipts. In this paper, we propose an improved method that ensembles all visual and textual features from invoices to extract key invoice parameters using a word-wise BiLSTM.
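A word-wise BiLSTM tagger of the kind described can be sketched in a few lines of PyTorch; the vocabulary size, hidden dimensions, and the tag set below are assumptions for illustration.

```python
import torch
import torch.nn as nn

TAGS = ["O", "DATE", "TOTAL", "PAYEE"]                # hypothetical field tags

class InvoiceTagger(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, len(TAGS))   # 2x for the two directions

    def forward(self, tokens):                        # tokens: (B, T) word ids
        h, _ = self.lstm(self.emb(tokens))            # (B, T, 2*hidden)
        return self.out(h)                            # per-word tag logits

model = InvoiceTagger()
tokens = torch.randint(0, 5000, (4, 40))              # a batch of OCR'd word sequences
logits = model(tokens)                                # (4, 40, len(TAGS))
```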
63. Generator Versus Segmentor: Pseudo-healthy Synthesis [PDF] 返回目录
Zhang Yunlong, Lin Xin, Sun Liyan, Zhuang Yihong, Huang Yue, Ding Xinghao, Liu Xiaoqing, Yu Yizhou
Abstract: Pseudo-healthy synthesis is defined as synthesizing a subject-specific 'healthy' image from a pathological one, with applications ranging from segmentation to anomaly detection. In recent years, existing GAN-based methods proposed for pseudo-healthy synthesis have aimed to eliminate the global differences between synthetic and healthy images. In this paper, we discuss the problems of these approaches, namely style transfer and artifacts. To address these problems, we consider the local differences between the lesions and normal tissue. To achieve this, we propose an adversarial training regime that alternately trains a generator and a segmentor. The segmentor is trained to distinguish the synthetic lesions (i.e., the regions in synthetic images corresponding to the lesions in the pathological ones) from the normal tissue, while the generator is trained to deceive the segmentor by transforming lesion regions into lesion-free-looking ones while preserving the normal tissue. Qualitative and quantitative experimental results on the public BraTS and LiTS datasets demonstrate that the proposed method outperforms state-of-the-art methods by preserving style and removing artifacts. Our implementation is publicly available at this https URL
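The alternating generator/segmentor game can be sketched as follows, with tiny stand-in networks; the architectures, loss weights, and the L1 tissue-preservation term are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())   # pseudo-healthy generator
S = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(8, 1, 3, padding=1))                 # lesion segmentor

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_s = torch.optim.Adam(S.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

x = torch.rand(2, 1, 64, 64)                          # pathological images
mask = (torch.rand(2, 1, 64, 64) > 0.9).float()       # lesion masks

# 1) Segmentor step: find the (synthetic) lesions in the generated image.
opt_s.zero_grad()
loss_s = bce(S(G(x).detach()), mask)
loss_s.backward(); opt_s.step()

# 2) Generator step: fool the segmentor (no lesion found) while preserving
#    normal tissue outside the lesion mask.
opt_g.zero_grad()
fake = G(x)
loss_g = bce(S(fake), torch.zeros_like(mask)) + \
         ((fake - x) * (1 - mask)).abs().mean()
loss_g.backward(); opt_g.step()
```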
64. Short-Term and Long-Term Context Aggregation Network for Video Inpainting [PDF] 返回目录
Ang Li, Shanshan Zhao, Xingjun Ma, Mingming Gong, Jianzhong Qi, Rui Zhang, Dacheng Tao, Ramamohanarao Kotagiri
Abstract: Video inpainting aims to restore missing regions of a video and has many applications such as video editing and object removal. However, existing methods either suffer from inaccurate short-term context aggregation or rarely explore long-term frame information. In this work, we present a novel context aggregation network to effectively exploit both short-term and long-term frame information for video inpainting. In the encoding stage, we propose boundary-aware short-term context aggregation, which aligns and aggregates, from neighbor frames, local regions that are closely related to the boundary context of missing regions into the target frame. Furthermore, we propose dynamic long-term context aggregation to globally refine the feature map generated in the encoding stage using long-term frame features, which are dynamically updated throughout the inpainting process. Experiments show that it outperforms state-of-the-art methods with better inpainting results and fast inpainting speed.
65. YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design [PDF] 返回目录
Yuxuan Cai, Hongjia Li, Geng Yuan, Wei Niu, Yanyu Li, Xulong Tang, Bin Ren, Yanzhi Wang
Abstract: The rapid development and wide utilization of object detection techniques have drawn attention to both the accuracy and speed of object detectors. However, current state-of-the-art object detection works are either accuracy-oriented, using a large model that leads to high latency, or speed-oriented, using a lightweight model but sacrificing accuracy. In this work, we propose the YOLObile framework, which enables real-time object detection on mobile devices via compression-compilation co-design. A novel block-punched pruning scheme is proposed for any kernel size. To improve computational efficiency on mobile devices, a GPU-CPU collaborative scheme is adopted along with advanced compiler-assisted optimizations. Experimental results indicate that our pruning scheme achieves a 14$\times$ compression rate on YOLOv4 with 49.0 mAP. Under our YOLObile framework, we achieve a 17 FPS inference speed using the GPU on a Samsung Galaxy S20. By incorporating our proposed GPU-CPU collaborative scheme, the inference speed increases to 19.1 FPS, outperforming the original YOLOv4 with a 5$\times$ speedup.
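To illustrate the flavour of block-punched pruning, the sketch below partitions a convolution weight into blocks of output channels and zeroes the lowest-magnitude positions consistently across each block; the block size and sparsity ratio are assumptions, and the compiler-side co-design is not represented.

```python
import torch

def block_punch_prune(w: torch.Tensor, block: int = 4, sparsity: float = 0.5):
    """w: (out_ch, in_ch, kH, kW) conv weight, punched in blocks along out_ch."""
    out_ch, in_ch, kh, kw = w.shape
    assert out_ch % block == 0
    blocks = w.reshape(out_ch // block, block, in_ch, kh, kw)
    scores = blocks.abs().sum(dim=1)                  # importance per punch position
    k = int(scores.numel() * sparsity)
    thresh = scores.flatten().kthvalue(k).values
    mask = (scores > thresh).unsqueeze(1).float()     # same punch across the whole block
    return (blocks * mask).reshape_as(w)

w = torch.randn(16, 8, 3, 3)
w_pruned = block_punch_prune(w)                       # ~50% of positions zeroed blockwise
```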
66. RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization [PDF] 返回目录
Niluthpol Chowdhury Mithun, Karan Sikka, Han-Pang Chiu, Supun Samarasekera, Rakesh Kumar
Abstract: We study an important, yet largely unexplored problem of large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km^2 area) of RGB and aerial LIDAR depth images. We propose a novel joint embedding based method that effectively combines the appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a strong result of a median rank of 5 in matching across a large test set of 50K location pairs collected from a 14 km^2 area. This represents a significant advancement over prior works in performance and scale. We conclude with qualitative results to highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization.
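A two-branch joint embedding of this kind can be sketched with a ranking loss over matched ground/aerial pairs; the tiny encoders, embedding size, and margin below are assumptions, and the paper's semantic cues are not modelled.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def encoder(in_ch: int) -> nn.Module:
    return nn.Sequential(nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))

enc_rgb, enc_depth = encoder(3), encoder(1)

rgb = torch.rand(8, 3, 64, 64)                        # ground RGB images
depth = torch.rand(8, 1, 64, 64)                      # rendered aerial-LIDAR depth

z_rgb = F.normalize(enc_rgb(rgb), dim=1)              # shared embedding space
z_dep = F.normalize(enc_depth(depth), dim=1)

sim = z_rgb @ z_dep.t()                               # pairwise cosine similarities
pos = sim.diag()                                      # matched ground/aerial pairs
margin = 0.2
off_diag = 1.0 - torch.eye(sim.shape[0])
# Push every mismatched pair below its matched pair by the margin.
loss = (F.relu(sim - pos.unsqueeze(1) + margin) * off_diag).mean()
loss.backward()
```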
67. AttnGrounder: Talking to Cars with Attention [PDF] 返回目录
Vivek Mittal
Abstract: We propose Attention Grounder (AttnGrounder), a single-stage end-to-end trainable model for the task of visual grounding. Visual grounding aims to localize a specific object in an image based on a given natural language text query. Unlike previous methods that use the same text representation for every image region, we use a visual-text attention module that relates each word in the given query to every region in the corresponding image, constructing a region-dependent text representation. Furthermore, to improve the localization ability of our model, we use our visual-text attention module to generate an attention mask around the referred object. The attention mask is trained as an auxiliary task using a rectangular mask generated from the provided ground-truth coordinates. We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over existing methods.
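The word-to-region attention at the heart of the model can be illustrated in a few lines; all dimensions below are assumptions, and the sketch shows only how a region-dependent text representation could be formed from word and region features.

```python
import torch
import torch.nn.functional as F

d = 256
words = torch.randn(1, 12, d)                         # (B, n_words, d) query embeddings
regions = torch.randn(1, 49, d)                       # (B, n_regions, d) visual features

scores = words @ regions.transpose(1, 2) / d ** 0.5   # (B, n_words, n_regions)
attn = F.softmax(scores, dim=1)                       # per region: weight over words
# Region-dependent text representation: for each region, aggregate the words
# weighted by how strongly they attend to it.
region_text = attn.transpose(1, 2) @ words            # (B, n_regions, d)
```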
68. A Progressive Sub-Network Searching Framework for Dynamic Inference [PDF] 返回目录
Li Yang, Zhezhi He, Yu Cao, Deliang Fan
Abstract: Many techniques, such as model compression, have been developed to make Deep Neural Network (DNN) inference more efficient. Nevertheless, DNNs still lack good run-time dynamic inference capability that would let users trade off accuracy and computation complexity (i.e., latency on target hardware) after model deployment, based on dynamic requirements and environments. Such a research direction has recently drawn great attention, where one realization is to train the target DNN through a multiple-term objective function consisting of cross-entropy terms from multiple sub-nets. Our investigation in this work shows that the performance of dynamic inference relies heavily on the quality of sub-net sampling. With the objective of constructing a dynamic DNN and searching multiple high-quality sub-nets with minimal searching cost, we propose a progressive sub-net searching framework, which is embedded with several effective techniques, including trainable noise ranking, channel-group and fine-tuning threshold setting, and sub-net re-selection. The proposed framework empowers the target DNN with better dynamic inference capability, outperforming prior works on both the CIFAR-10 and ImageNet datasets in comprehensive experiments on different network structures. Taking ResNet18 as an example, our proposed method achieves much better dynamic inference accuracy than the popular prior Universally-Slimmable-Network, by up to 4.4% and on average 2.3%, on the ImageNet dataset with the same model size.
69. Inverse mapping of face GANs [PDF] 返回目录
Nicky Bayat, Vahid Reza Khazaie, Yalda Mohsenzadeh
Abstract: Generative adversarial networks (GANs) synthesize realistic images from a random latent vector. While many studies have explored various training configurations and architectures for GANs, the problem of inverting a generative model to extract the latent vector of a given input image has been inadequately investigated. Although there is exactly one generated image per given random vector, the mapping from an image to its recovered latent vector can have more than one solution. We train a ResNet architecture to recover, for a given face, a latent vector that can be used to generate a face nearly identical to the target. We use a perceptual loss to embed face details in the recovered latent vector while maintaining visual quality using a pixel loss. While the vast majority of studies on latent vector recovery perform well only on generated images, we argue that our method can be used to determine a mapping between real human faces and latent-space vectors that contain most of the important face style details. In addition, our proposed method projects generated faces into their latent space with high fidelity and speed. Finally, we demonstrate the performance of our approach on both real and generated faces.
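A sketch of the kind of encoder training objective the abstract describes, combining a pixel loss with a perceptual (feature-space) loss; the stand-in networks and the weight `lam` are assumptions, not the authors' settings:

```python
import torch
import torch.nn.functional as F

def inversion_loss(generator, feat_extractor, encoder, target, lam=0.8):
    """Pixel + perceptual loss for latent recovery (a sketch; any
    pretrained feature network could play the role of feat_extractor)."""
    z = encoder(target)                  # predicted latent for the target face
    recon = generator(z)                 # re-generated face
    pixel = F.mse_loss(recon, target)    # keeps overall visual quality
    percep = F.mse_loss(feat_extractor(recon),
                        feat_extractor(target))  # preserves face details
    return pixel + lam * percep

# toy check with stand-in networks
g = lambda z: z.view(-1, 1, 8, 8)
e = lambda img: img.flatten(1)
f = lambda img: img.mean(dim=(2, 3))
loss = inversion_loss(g, f, e, torch.randn(4, 1, 8, 8))
```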
70. KSM: Fast Multiple Task Adaption via Kernel-wise Soft Mask Learning [PDF] 返回目录
Li Yang, Zhezhi He, Junshan Zhang, Deliang Fan
Abstract: Deep Neural Networks (DNN) can forget the knowledge about earlier tasks when learning new tasks, which is known as \textit{catastrophic forgetting}. While recent continual learning methods are capable of alleviating catastrophic forgetting on toy-sized datasets, some issues remain to be tackled when applying them to real-world problems. Recently, fast mask-based learning methods (e.g. piggyback \cite{mallya2018piggyback}) have been proposed to address these issues by learning only a binary element-wise mask in a fast manner, while keeping the backbone model fixed. However, the binary mask has limited modeling capacity for new tasks. A more recent work \cite{hung2019compacting} proposes a compress-grow-based method (CPG) to achieve better accuracy for new tasks by partially training the backbone model, but with an order-of-magnitude higher training cost, which makes it infeasible to deploy in popular state-of-the-art edge/mobile learning. The primary goal of this work is to simultaneously achieve fast and high-accuracy multi-task adaptation in the continual learning setting. Thus motivated, we propose a new training method called \textit{kernel-wise Soft Mask} (KSM), which learns a kernel-wise hybrid binary and real-value soft mask for each task while using the same backbone model. Such a soft mask can be viewed as the superposition of a binary mask and a properly scaled real-value tensor, which offers richer representation capability without low-level kernel support, meeting the objective of low hardware overhead. We validate KSM on multiple benchmark datasets against recent state-of-the-art methods (e.g. Piggyback, Packnet, CPG, etc.), showing good improvement in both accuracy and training cost.
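A minimal sketch of a kernel-wise soft mask built as (binary mask + scaled real values); the threshold and the straight-through estimator used to keep the binarization trainable are assumptions for illustration:

```python
import torch

def kernel_wise_soft_mask(scores, scale, threshold=0.5):
    """Per-kernel soft mask = binary mask + scaled real-value tensor.
    A straight-through trick keeps the binarization differentiable
    (an assumption here, in the spirit of mask-based methods)."""
    hard = (scores > threshold).float()
    hard = hard + scores - scores.detach()   # straight-through estimator
    return hard + scale * scores             # binary part + real-valued part

# toy usage: mask a conv weight of shape (out_ch, in_ch, k, k) kernel-wise
w = torch.randn(16, 8, 3, 3)
scores = torch.rand(16, 8, 1, 1, requires_grad=True)  # one score per kernel
masked_w = w * kernel_wise_soft_mask(scores, scale=0.1)
```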
71. Deep Hiearchical Multi-Label Classification Applied to Chest X-Ray Abnormality Taxonomies [PDF] 返回目录
Haomin Chen, Shun Miao, Daguang Xu, Gregory D. Hager, Adam P. Harrison
Abstract: CXRs are a crucial and extraordinarily common diagnostic tool, leading to heavy research into CAD solutions. However, both high classification accuracy and meaningful model predictions that respect and incorporate clinical taxonomies are crucial for CAD usability. To this end, we present a deep HMLC approach for CXR CAD. Unlike other hierarchical systems, we show that first training the network to model conditional probabilities directly and then refining it with unconditional probabilities is key to boosting performance. In addition, we formulate a numerically stable cross-entropy loss function for unconditional probabilities that provides concrete performance improvements. Finally, we demonstrate that HMLC can be an effective means to manage missing or incomplete labels. To the best of our knowledge, we are the first to apply HMLC to medical imaging CAD. We extensively evaluate our approach on detecting abnormality labels from the CXR arm of the PLCO dataset, which comprises over $198,000$ manually annotated CXRs. When using complete labels, we report a mean AUC of 0.887, the highest yet reported for this dataset. These results are supported by ancillary experiments on the PadChest dataset, where we also report significant improvements of 1.2% in AUC and 4.1% in AP over strong "flat" classifiers. Finally, we demonstrate that our HMLC approach can much better handle incompletely labelled data. These performance improvements, combined with the inherent usefulness of taxonomic predictions, indicate that our approach represents a useful step forward for CXR CAD.
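For the conditional-then-unconditional idea, a small sketch of how per-node conditional probabilities can be turned into unconditional ones in log space, which is what keeps the subsequent cross-entropy numerically stable; the tree encoding via a `parents` list is illustrative:

```python
import torch
import torch.nn.functional as F

def unconditional_log_probs(cond_logits, parents):
    """Turn per-node conditional logits p(node | parent) into unconditional
    log-probabilities by summing logsigmoid terms along each root path
    (working in log space keeps the product numerically stable)."""
    logp = F.logsigmoid(cond_logits)          # log p(node | parent)
    out = logp.clone()
    for node, parent in enumerate(parents):   # parents[i] = index, or -1 for roots
        p = parent
        while p != -1:
            out[:, node] = out[:, node] + logp[:, p]
            p = parents[p]
    return out                                # log of unconditional probabilities

# toy taxonomy: node 0 is the root, nodes 1 and 2 are its children
logits = torch.randn(4, 3)
log_u = unconditional_log_probs(logits, parents=[-1, 0, 0])
```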
72. 3D Reconstruction and Segmentation of Dissection Photographs for MRI-free Neuropathology [PDF] 返回目录
Henry Tregidgo, Adria Casamitjana, Caitlin Latimer, Mitchell Kilgore, Eleanor Robinson, Emily Blackburn, Koen Van Leemput, Bruce Fischl, Adrian Dalca, Christine Mac Donald, Dirk Keene, Juan Eugenio Iglesias
Abstract: Neuroimaging to neuropathology correlation (NTNC) promises to enable the transfer of microscopic signatures of pathology to in vivo imaging with MRI, ultimately enhancing clinical care. NTNC traditionally requires a volumetric MRI scan, acquired either ex vivo or a short time prior to death. Unfortunately, ex vivo MRI is difficult and costly, and recent premortem scans of sufficient quality are seldom available. To bridge this gap, we present methodology to 3D reconstruct and segment full brain image volumes from brain dissection photographs, which are routinely acquired at many brain banks and neuropathology departments. The 3D reconstruction is achieved via a joint registration framework, which uses a reference volume other than MRI. This volume may represent either the sample at hand (e.g., a surface 3D scan) or the general population (a probabilistic atlas). In addition, we present a Bayesian method to segment the 3D reconstructed photographic volumes into 36 neuroanatomical structures, which is robust to nonuniform brightness within and across photographs. We evaluate our methods on a dataset with 24 brains, using Dice scores and volume correlations. The results show that dissection photography is a valid replacement for ex vivo MRI in many volumetric analyses, opening an avenue for MRI-free NTNC, including retrospective data. The code is available at this https URL.
73. Label-Free Segmentation of COVID-19 Lesions in Lung CT [PDF] 返回目录
Qingsong Yao, Li Xiao, Peihang Liu, S. Kevin Zhou
Abstract: Scarcity of annotated images hampers the building of automated solutions for reliable COVID-19 diagnosis and evaluation from CT. To alleviate the burden of data annotation, we herein present a label-free approach for segmenting COVID-19 lesions in CT via pixel-level anomaly modeling that mines the relevant knowledge from normal CT lung scans. Our modeling is inspired by the observation that the parts of tracheae and vessels, which lie in the high-intensity range to which lesions belong, exhibit strong patterns. To facilitate the learning of such patterns at a pixel level, we synthesize `lesions' using a set of surprisingly simple operations and insert the synthesized `lesions' into normal CT lung scans to form training pairs, from which we learn a normalcy-converting network (NormNet) that turns an 'abnormal' image back to normal. Our experiments on three different datasets validate the effectiveness of NormNet, which conspicuously outperforms a variety of unsupervised anomaly detection (UAD) methods.
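A deliberately simple stand-in for the lesion-synthesis step: painting a random bright blob into a normal slice yields a (corrupted image, mask) training pair. The blob model and intensity range below are assumptions, not the paper's operations:

```python
import numpy as np

def insert_synthetic_lesion(scan, rng=None):
    """Paint a random bright ellipse into a normal CT slice to form a
    (corrupted, mask) training pair for a normalcy-converting network."""
    rng = rng or np.random.default_rng()
    h, w = scan.shape
    cy, cx = rng.integers(0, h), rng.integers(0, w)   # lesion center
    ry, rx = rng.integers(5, 20), rng.integers(5, 20) # lesion radii
    yy, xx = np.ogrid[:h, :w]
    mask = ((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2 <= 1.0
    corrupted = scan.copy()
    corrupted[mask] = np.clip(corrupted[mask] + rng.uniform(0.2, 0.6), 0, 1)
    return corrupted, mask.astype(np.float32)

normal = np.random.rand(128, 128).astype(np.float32)  # placeholder normal slice
x, y = insert_synthetic_lesion(normal)                # train a segmenter on (x, y)
```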
74. Automatic elimination of the pectoral muscle in mammograms based on anatomical features [PDF] 返回目录
Jairo A. Ayala-Godoy, Rosa E. Lillo, Juan Romo
Abstract: Digital mammogram inspection is the most popular technique for early detection of abnormalities in human breast tissue. When mammograms are analyzed through a computational method, the presence of the pectoral muscle might affect the results of breast lesion detection. This problem is particularly evident in the mediolateral oblique view (MLO), where the pectoral muscle occupies a large part of the mammogram. Therefore, identifying and eliminating the pectoral muscle are essential steps for improving the automatic discrimination of breast tissue. In this paper, we propose an approach based on anatomical features to tackle this problem. Our method consists of two steps: (1) a process to remove noisy elements such as labels, markers, scratches and wedges, and (2) application of an intensity transformation based on the Beta distribution. The novel methodology is tested with 322 digital mammograms from the Mammographic Image Analysis Society (mini-MIAS) database and with a set of 84 mammograms for which the area normalized error was previously calculated. The results show very good performance of the method.
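One plausible reading of step (2), sketched below, maps normalized intensities through a Beta CDF; the shape parameters are illustrative, not the paper's values:

```python
import numpy as np
from scipy.stats import beta

def beta_intensity_transform(img, a=2.0, b=5.0):
    """Map normalized pixel intensities through a Beta CDF -- one plausible
    reading of 'an intensity transformation based on the Beta distribution';
    the shape parameters a, b are assumptions for illustration."""
    x = (img - img.min()) / (img.max() - img.min() + 1e-8)  # normalize to [0, 1]
    return beta.cdf(x, a, b)

mammogram = np.random.rand(256, 256)   # placeholder image
enhanced = beta_intensity_transform(mammogram)
```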
75. A Multisensory Learning Architecture for Rotation-invariant Object Recognition [PDF] 返回目录
Murat Kirtay, Guido Schillaci, Verena V. Hafner
Abstract: This study presents a multisensory machine learning architecture for object recognition, employing a novel dataset constructed with the iCub robot, which is equipped with three cameras and a depth sensor. The proposed architecture combines convolutional neural networks to form representations (i.e., features) for grayscaled color images and a multi-layer perceptron algorithm to process depth data. To this end, we aim to learn joint representations of different modalities (e.g., color and depth) and employ them for recognizing objects. We evaluate the performance of the proposed architecture by benchmarking it against models trained separately on the inputs of different sensors and against a state-of-the-art data fusion technique, namely decision-level fusion. The results show that our architecture improves recognition accuracy compared with models that use inputs from a single modality and with the decision-level multimodal fusion method.
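A toy sketch of such a two-stream design in PyTorch: a small CNN for the image stream, an MLP for depth, concatenated into a joint representation. All layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class TwoStreamRecognizer(nn.Module):
    """CNN stream for (grayscaled) images + MLP stream for depth, fused
    into a joint representation before classification (a toy sketch)."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())       # -> 16*4*4 = 256 features
        self.mlp = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                                 nn.Linear(128, 64))     # depth features
        self.head = nn.Linear(256 + 64, n_classes)       # joint representation

    def forward(self, image, depth):
        joint = torch.cat([self.cnn(image), self.mlp(depth)], dim=1)
        return self.head(joint)

model = TwoStreamRecognizer()
logits = model(torch.randn(2, 1, 64, 64), torch.randn(2, 64))
```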
76. VC-Net: Deep Volume-Composition Networks for Segmentation and Visualization of Highly Sparse and Noisy Image Data [PDF] 返回目录
Yifan Wang, Guoli Yan, Haikuan Zhu, Sagar Buch, Ying Wang, Ewart Mark Haacke, Jing Hua, Zichun Zhong
Abstract: The motivation of our work is to present a new visualization-guided computing paradigm that combines direct 3D volume processing and volume-rendered clues for effective 3D exploration, such as extracting and visualizing microstructures in vivo. However, it is still challenging to extract and visualize high-fidelity 3D vessel structure due to its high sparseness, noisiness, and complex topology variations. In this paper, we present an end-to-end deep learning method, VC-Net, for robust extraction of 3D microvasculature by embedding the image composition, generated by maximum intensity projection (MIP), into 3D volume image learning to enhance performance. The core novelty is to automatically leverage the volume visualization technique (MIP) to enhance 3D data exploration at the deep learning level. The MIP embedding features can enhance the local vessel signal and are adaptive to the geometric variability and scalability of vessels, which is crucial in microvascular tracking. A multi-stream convolutional neural network is proposed to learn the 3D volume and 2D MIP features respectively and then explore their inter-dependencies in a joint volume-composition embedding space by unprojecting the MIP features into the 3D volume embedding space. The proposed framework can better capture small / micro vessels and improve vessel connectivity. To our knowledge, this is the first deep learning framework to construct a joint convolutional embedding space, where the vessel probabilities computed from the volume-rendering-based 2D projection and the 3D volume can be explored and integrated synergistically. Experimental results are compared with traditional 3D vessel segmentation methods and the deep learning state of the art on public and real patient (micro-)cerebrovascular image datasets. Our method demonstrates its potential for powerful MR arteriogram and venogram diagnosis of vascular diseases.
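The MIP embedding idea can be sketched in a few lines: project the volume to 2D by a per-ray maximum, then broadcast ("unproject") the 2D map back along the projection axis so it can be fused with 3D features. This is a numpy illustration of the concept, not the VC-Net architecture:

```python
import numpy as np

def mip_and_unproject(volume, axis=0):
    """Compute a maximum intensity projection (MIP) of a 3D volume and
    'unproject' it by broadcasting the 2D map back along the projection
    axis, so 2D composition features can be fused with 3D voxel features."""
    mip = volume.max(axis=axis)                     # 2D composition image
    unprojected = np.expand_dims(mip, axis=axis)    # back to a 3D shape
    unprojected = np.broadcast_to(unprojected, volume.shape)
    return mip, unprojected

vol = np.random.rand(32, 64, 64)      # placeholder vessel volume
mip, mip3d = mip_and_unproject(vol)   # fuse e.g. via np.stack([vol, mip3d])
```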
77. Mathematical Morphology via Category Theory [PDF] 返回目录
Hossein Memarzadeh Sharifipour, Bardia Yousefi
Abstract: Mathematical morphology contributes many profitable tools to the image processing area. Some of these methods are considered basic, yet they are the most important fundamentals of data processing in many applications. In this paper, we reformulate fundamental morphological operations such as dilation and erosion, making use of limit- and colimit-preserving functors within category theory. Adopting the well-known matrix representation of images, an image can be represented as an object in the category of matrices, called Mat. By enriching Mat over various semirings, such as the Boolean and (max,+) semirings, one arrives at the classical definitions of binary and gray-scale images using the categorical tensor product in Mat. With the dilation operation in hand, erosion can be obtained using the famous tensor-hom adjunction. This approach enables us to define new types of dilation and erosion between two images represented by matrices, using semirings other than the Boolean and (max,+) semirings. The viewpoint of morphological operations from category theory also sheds light on the claim that mathematical morphology is a model for linear logic.
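The (max,+) semiring view is easy to make concrete: replacing (sum, product) with (max, +) in a sliding "dot product" yields gray-scale dilation, as in this 1D numpy sketch:

```python
import numpy as np

def maxplus_dilation(signal, se):
    """Gray-scale dilation of a 1D signal by a structuring element,
    written as a 'matrix product' over the (max, +) semiring:
    ordinary (sum, *) is replaced by (max, +)."""
    n, m = len(signal), len(se)
    pad = np.full(m - 1, -np.inf)            # -inf is the semiring zero
    s = np.concatenate([pad, signal, pad])
    out = np.empty(n + m - 1)
    for i in range(len(out)):
        window = s[i:i + m]
        out[i] = np.max(window + se[::-1])   # semiring "dot product"
    return out

f = np.array([0., 1., 3., 2., 0.])
b = np.array([0., 1., 0.])                   # small structuring element
print(maxplus_dilation(f, b))
```

Erosion, which the abstract obtains via the tensor-hom adjunction, would analogously use a (min, −) pairing.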
78. Towards the Quantification of Safety Risks in Deep Neural Networks [PDF] 返回目录
Peipei Xu, Wenjie Ruan, Xiaowei Huang
Abstract: Safety concerns about deep neural networks (DNNs) have been raised as they are applied to critical sectors. In this paper, we define safety risks by requesting the alignment of the network's decision with human perception. To enable a general methodology for quantifying safety risks, we define a generic safety property and instantiate it to express various safety risks. For the quantification of risks, we take the maximum radius of safe norm balls, in which no safety risk exists. The computation of the maximum safe radius is reduced to the computation of the respective Lipschitz metrics - the quantities to be computed. In addition to the known adversarial example, reachability example, and invariant example, in this paper we identify a new class of risk - the uncertainty example - on which humans can tell easily but the network is unsure. We develop an algorithm, inspired by derivative-free optimization techniques and accelerated by tensor-based parallelization on GPUs, to support efficient computation of the metrics. We perform evaluations on several benchmark neural networks, including ACSC-Xu, MNIST, CIFAR-10, and ImageNet networks. The experiments show that our method can achieve competitive performance on safety quantification in terms of the tightness and the efficiency of computation. Importantly, as a generic approach, our method can work with a broad class of safety risks without restrictions on the structure of neural networks.
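A derivative-free flavor of the radius computation can be sketched as a randomized bisection on the norm-ball radius; this is an illustrative lower-bound estimate, not the paper's Lipschitz-metric algorithm:

```python
import numpy as np

def estimate_safe_radius(classify, x, r_max=1.0, iters=12, samples=200, seed=0):
    """Randomized estimate of the largest L-inf ball around x in which
    `classify` keeps its decision -- a derivative-free sketch in the
    spirit of the paper, not its exact computation."""
    rng = np.random.default_rng(seed)
    label, lo, hi = classify(x), 0.0, r_max
    for _ in range(iters):                  # bisection on the radius
        r = 0.5 * (lo + hi)
        deltas = rng.uniform(-r, r, size=(samples,) + x.shape)
        safe = all(classify(x + d) == label for d in deltas)
        lo, hi = (r, hi) if safe else (lo, r)
    return lo

# toy classifier: sign of the first coordinate
radius = estimate_safe_radius(lambda v: int(v[0] > 0), np.array([0.3, 0.1]))
```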
79. Extracting Optimal Solution Manifolds using Constrained Neural Optimization [PDF] 返回目录
Gurpreet Singh, Soumyajit Gupta, Matthew Lease
Abstract: Constrained optimization solution algorithms are restricted to point-based solutions. In practice, single or multiple objectives must be satisfied, wherein both the objective function and constraints can be non-convex, resulting in multiple optimal solutions. Real-world scenarios include intersecting surfaces as implicit functions, hyperspectral unmixing, and Pareto optimal fronts. Local or global convexification is a common workaround when faced with non-convex forms. However, such an approach is often restricted to a strict class of functions, and deviation from it results in sub-optimal solutions to the original problem. We present neural solutions for extracting optimal sets as approximate manifolds, where unmodified, non-convex objectives and constraints are encoded as a modeler-guided, domain-informed $L_2$ loss function. This promotes interpretability, since modelers can confirm the results against known analytical forms in their specific domains. We present synthetic and realistic cases to validate our approach and compare against known solvers for benchmarking in terms of accuracy and computational efficiency.
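A minimal sketch of folding objectives and constraints into a single differentiable loss, assuming a quadratic penalty with weight `mu` (the paper's exact $L_2$ formulation may differ):

```python
import torch

def constrained_loss(model, x, objective, constraints, mu=10.0):
    """Fold non-convex objectives and constraints into one differentiable
    loss; each constraint g is encoded as g(y) <= 0 and violated amounts
    are penalized quadratically. `mu` is an assumed penalty weight."""
    y = model(x)
    loss = objective(y)
    for g in constraints:
        loss = loss + mu * torch.clamp(g(y), min=0.0).pow(2).mean()
    return loss

# toy: minimize ||y||^2 subject to y_0 >= 1  (i.e., 1 - y_0 <= 0)
model = torch.nn.Linear(4, 2)
x = torch.randn(16, 4)
loss = constrained_loss(model, x,
                        objective=lambda y: y.pow(2).sum(dim=1).mean(),
                        constraints=[lambda y: 1.0 - y[:, 0]])
loss.backward()
```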
80. Attention Cube Network for Image Restoration [PDF] 返回目录
Yucheng Hang, Qingmin Liao, Wenming Yang, Yupeng Chen, Jie Zhou
Abstract: Recently, deep convolutional neural networks (CNNs) have been widely used in image restoration and have achieved great success. However, most existing methods are limited to a local receptive field and treat different types of information equally. Besides, existing methods often use a multi-supervised method to aggregate different feature maps, which cannot effectively aggregate hierarchical feature information. To address these issues, we propose an attention cube network (A-CubeNet) for image restoration, for more powerful feature expression and feature correlation learning. Specifically, we design a novel attention mechanism from three dimensions, namely the spatial dimension, the channel-wise dimension and the hierarchical dimension. The adaptive spatial attention branch (ASAB) and the adaptive channel attention branch (ACAB) constitute the adaptive dual attention module (ADAM), which can capture long-range spatial and channel-wise contextual information to expand the receptive field and distinguish different types of information for more effective feature representations. Furthermore, the adaptive hierarchical attention module (AHAM) can capture long-range hierarchical contextual information to flexibly aggregate different feature maps by weights depending on the global context. ADAM and AHAM cooperate to form an "attention in attention" structure, which means AHAM's inputs are enhanced by ASAB and ACAB. Experiments demonstrate the superiority of our method over state-of-the-art image restoration methods in both quantitative comparison and visual analysis.
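For the channel-attention branch, a squeeze-and-excitation style block captures the idea of turning global context into channel weights; this is a generic stand-in, not ACAB's exact design:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: global average
    pooling produces a context vector, a small MLP turns it into
    per-channel weights (a minimal stand-in for an ACAB-like branch)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                   # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))     # global context -> channel weights
        return x * w[:, :, None, None]      # reweight channels

out = ChannelAttention(32)(torch.randn(2, 32, 16, 16))
```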
81. How Much Can We Really Trust You? Towards Simple, Interpretable Trust Quantification Metrics for Deep Neural Networks [PDF] 返回目录
Alexander Wong, Xiao Yu Wang, Andrew Hryniowski
Abstract: A critical step to building trustworthy deep neural networks is trust quantification, where we ask the question: how much can we trust a deep neural network? In this study, we take a step towards simple, interpretable metrics for trust quantification by introducing a suite of metrics for assessing the overall trustworthiness of deep neural networks based on their behaviour when answering a set of questions. We conduct a thought experiment and explore two key questions about trust in relation to confidence: 1) how much trust do we have in actors who give wrong answers with great confidence? and 2) how much trust do we have in actors who give right answers hesitantly? Based on insights gained, we introduce the concept of question-answer trust to quantify the trustworthiness of an individual answer based on confident behaviour under correct and incorrect answer scenarios, and the concept of trust density to characterize the distribution of overall trust for an individual answer scenario. We further introduce the concept of the trust spectrum to represent overall trust with respect to the spectrum of possible answer scenarios across correctly and incorrectly answered questions. Finally, we introduce NetTrustScore, a scalar metric summarizing overall trustworthiness. The suite of metrics aligns with past social psychology studies that examine the relationship between trust and confidence. Leveraging these metrics, we quantify the trustworthiness of several well-known deep neural network architectures for image recognition to get a deeper understanding of where trust breaks down. The proposed metrics are by no means perfect, but the hope is to push the conversation towards better metrics to help guide practitioners and regulators in producing, deploying, and certifying deep learning solutions that can be trusted to operate in real-world, mission-critical scenarios.
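A toy rendition of question-answer trust, rewarding confidence on correct answers and penalizing it on wrong ones; the functional form and exponents are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def question_answer_trust(confidence, correct, alpha=1.0, beta=1.0):
    """Toy per-question trust: confident-and-right scores high,
    confident-and-wrong scores low. The exponents alpha/beta and the
    exact form are assumptions; the paper defines its own metrics."""
    c = np.asarray(confidence, dtype=float)
    right = np.asarray(correct, dtype=bool)
    return np.where(right, c ** alpha, (1.0 - c) ** beta)

conf = [0.95, 0.95, 0.55]
hit = [True, False, True]
qa = question_answer_trust(conf, hit)
net_trust_score = qa.mean()   # scalar summary in the spirit of NetTrustScore
```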
82. Multi-Channel Potts-Based Reconstruction for Multi-Spectral Computed Tomography [PDF] 返回目录
Lukas Kiefer, Stefania Petra, Martin Storath, Andreas Weinmann
Abstract: We consider reconstructing multi-channel images from measurements performed by photon-counting and energy-discriminating detectors in the setting of multi-spectral X-ray computed tomography (CT). Our aim is to exploit the strong structural correlation that is known to exist between the channels of multi-spectral CT images. To that end, we adopt the multi-channel Potts prior to jointly reconstruct all channels. This prior produces piecewise constant solutions with strongly correlated channels. In particular, edges are enforced to have the same spatial position across channels which is a benefit over TV-based methods. We consider the Potts prior in two frameworks: (a) in the context of a variational Potts model, and (b) in a Potts-superiorization approach that perturbs the iterates of a basic iterative least squares solver. We identify an alternating direction method of multipliers (ADMM) approach as well as a Potts-superiorized conjugate gradient method as particularly suitable. In numerical experiments, we compare the Potts prior based approaches to existing TV-type approaches on realistically simulated multi-spectral CT data and obtain improved reconstruction for compound solid bodies.
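The coupled-channel Potts prior can be made concrete with a small energy computation: a jump is counted once if any channel changes at a position, which is what enforces shared edge locations across channels (1D numpy sketch; the quadratic data term is an illustrative choice):

```python
import numpy as np

def multichannel_potts_energy(u, data, gamma=1.0):
    """Energy of a piecewise-constant multi-channel signal under a coupled
    Potts prior: a jump counts once if ANY channel changes there, so all
    channels share the same edge positions."""
    jumps = np.any(np.diff(u, axis=1) != 0, axis=0)   # shared jump set
    return gamma * np.count_nonzero(jumps) + np.sum((u - data) ** 2)

data = np.vstack([np.r_[np.zeros(5), np.ones(5)] + 0.05 * np.random.randn(10)
                  for _ in range(3)])                 # 3 correlated channels
u = np.vstack([np.r_[np.zeros(5), np.ones(5)] for _ in range(3)])
print(multichannel_potts_energy(u, data))             # one shared jump
```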
83. Segmentation of Lungs in Chest X-Ray Image Using Generative Adversarial Networks [PDF] 返回目录
Faizan Munawar, Shoaib Azmat, Talha Iqbal, Christer Grönlund, Hazrat Ali
Abstract: Chest X-ray (CXR) is a low-cost medical imaging technique. Compared to MRI, CT, and PET scans, it is a common procedure for the identification of many respiratory diseases. This paper presents the use of generative adversarial networks (GANs) to perform the task of lung segmentation on a given CXR. GANs are popular for generating realistic data by learning the mapping from one domain to another. In our work, the generator of the GAN is trained to generate a segmentation mask for a given input CXR. The discriminator distinguishes between a ground truth and the generated mask, and updates the generator through the adversarial loss measure. The objective is to generate masks for the input CXR that are as close as possible to the ground truth masks. The model is trained and evaluated using four different discriminators, referred to as D1, D2, D3, and D4, respectively. Experimental results on three different CXR datasets reveal that the proposed model is able to achieve a Dice score of 0.9740 and an IoU score of 0.943, which are better than other reported state-of-the-art results.
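The two reported metrics are straightforward to compute from binary masks; a reference numpy implementation:

```python
import numpy as np

def dice_and_iou(pred, truth, eps=1e-8):
    """Dice score and IoU between a predicted and a ground-truth binary
    lung mask -- the two metrics the abstract reports (0.9740 / 0.943)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    dice = 2.0 * inter / (pred.sum() + truth.sum() + eps)
    iou = inter / (np.logical_or(pred, truth).sum() + eps)
    return dice, iou

p = np.zeros((64, 64)); p[10:40, 10:40] = 1
t = np.zeros((64, 64)); t[12:42, 10:40] = 1
print(dice_and_iou(p, t))
```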
摘要:胸部X-射线(CXR)是一种低成本的医学成像技术。它是许多呼吸道疾病的识别相比,MRI,CT和PET扫描的公共过程。本文介绍了使用生成对抗网络(GAN)的在给定的CXR进行肺分割的任务。甘斯是流行通过学习映射从一个域到另一个,以产生真实的数据。在我们的工作中,GaN的发电机被训练为产生一个给定的输入CXR的分段式面具。鉴别一个地面实况和所产生的掩模之间进行区分,并通过对抗措施损失更新发生器。目的是产生用于输入CXR口罩,其是尽可能的真实比地面实况掩模。该模型训练和使用被称为D1,D2,D3和D4分别四种不同的鉴别器进行评价。在三个不同的数据集CXR实验结果表明,该模型能够达到0.9740骰子得分,并且欠条得分0.943,这比其它报道的国家的艺术效果更好。
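The training loop described here is the standard conditional-GAN recipe for segmentation. Below is a minimal PyTorch sketch under our own assumptions: the abstract does not specify the generator architecture or the four discriminators D1–D4, so the tiny networks here are illustrative stand-ins, and the added pixel-wise BCE term is a common choice rather than a confirmed detail of the paper.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a 1-channel CXR to a 1-channel mask in [0, 1] (toy stand-in)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Scores (CXR, mask) pairs as real (ground truth) or fake (generated)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1),
        )
    def forward(self, x, mask):
        return self.net(torch.cat([x, mask], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(cxr, gt_mask):
    # Discriminator: push real (CXR, GT mask) pairs to 1, generated to 0.
    fake = G(cxr).detach()
    loss_d = bce(D(cxr, gt_mask), torch.ones(cxr.size(0), 1)) + \
             bce(D(cxr, fake), torch.zeros(cxr.size(0), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: fool D, plus a pixel-wise BCE term toward the GT mask
    # (the pixel-wise term is an assumed, commonly used addition).
    pred = G(cxr)
    loss_g = bce(D(cxr, pred), torch.ones(cxr.size(0), 1)) + \
             nn.functional.binary_cross_entropy(pred, gt_mask)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# e.g. train_step(torch.randn(4, 1, 64, 64), torch.rand(4, 1, 64, 64).round())
```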
84. Efficient Folded Attention for 3D Medical Image Reconstruction and Segmentation [PDF] 返回目录
Hang Zhang, Jinwei Zhang, Rongguang Wang, Qihao Zhang, Pascal Spincemaille, Thanh D. Nguyen, Yi Wang
Abstract: Recently, 3D medical image reconstruction (MIR) and segmentation (MIS) based on deep neural networks have been developed with promising results, and attention mechanisms have been further designed to capture global contextual information for performance enhancement. However, the large size of 3D volume images poses a great computational challenge to traditional attention methods. In this paper, we propose a folded attention (FA) approach to improve the computational efficiency of traditional attention methods on 3D medical images. The main idea is to apply tensor folding and unfolding operations with four permutations to build four small sub-affinity matrices that approximate the original affinity matrix. Through four consecutive sub-attention modules of FA, each element in the feature tensor can aggregate spatial-channel information from all other elements. Compared to traditional attention methods, FA substantially reduces computational complexity and GPU memory consumption while moderately improving accuracy. We demonstrate the superiority of our method on two challenging tasks for 3D MIR and MIS, namely quantitative susceptibility mapping and multiple sclerosis lesion segmentation.
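One way to read the folding idea: instead of a single (DHW)×(DHW) affinity matrix, attention is applied along one tensor axis at a time, with the remaining axes permuted into the batch dimension, so each step only builds a small L×L sub-affinity; chaining the four axes lets information propagate between all elements. The sketch below reflects that reading under our own assumptions (no learned query/key/value projections, a plain outer-product affinity), so it illustrates the folding mechanics rather than reproducing the authors' FA module.

```python
import torch

def axis_attention(x, dim):
    """One sub-attention step: softmax affinity along a single axis,
    with all other axes folded into the batch dimension.

    x: tensor of shape (B, C, D, H, W); dim: axis index in {1, 2, 3, 4}.
    Illustrative only; the paper's module also uses learned projections.
    """
    # Move the attended axis last, flatten everything else into the batch.
    perm = [d for d in range(x.dim()) if d != dim] + [dim]
    xp = x.permute(perm)                        # (..., L) with L = x.size(dim)
    flat = xp.reshape(-1, xp.size(-1))          # (B', L)
    # Small L x L sub-affinity instead of a (DHW) x (DHW) one.
    affinity = torch.softmax(
        flat.unsqueeze(2) * flat.unsqueeze(1), dim=-1)   # (B', L, L)
    out = torch.einsum('bij,bj->bi', affinity, flat)     # aggregate along axis
    # Undo the permutation to restore (B, C, D, H, W).
    return out.reshape(xp.shape).permute(
        [perm.index(d) for d in range(x.dim())])

x = torch.randn(2, 8, 4, 16, 16)                # (B, C, D, H, W)
y = x
for axis in (1, 2, 3, 4):                       # four consecutive sub-attentions
    y = axis_attention(y, axis)                 # after all four, every element
                                                # has mixed with every other one
print(y.shape)                                  # torch.Size([2, 8, 4, 16, 16])
```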