Abstracts

1. Flow-edge Guided Video Completion
Chen Gao, Ayush Saraf, Jia-Bin Huang, Johannes Kopf
Abstract: We present a new flow-based video completion algorithm. Previous flow completion methods are often unable to retain the sharpness of motion boundaries. Our method first extracts and completes motion edges, and then uses them to guide piecewise-smooth flow completion with sharp edges. Existing methods propagate colors among local flow connections between adjacent frames. However, not all missing regions in a video can be reached in this way because the motion boundaries form impenetrable barriers. Our method alleviates this problem by introducing non-local flow connections to temporally distant frames, enabling propagating video content over motion boundaries. We validate our approach on the DAVIS dataset. Both visual and quantitative results show that our method compares favorably against the state-of-the-art algorithms.
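The basic step that local flow connections enable, copying colors into missing pixels from an adjacent frame along a flow field, can be sketched as follows. This is a minimal Python illustration under stated assumptions (nearest-neighbour lookup, a fully valid source frame, and the hypothetical helper name `propagate_along_flow`); it is not the authors' flow-edge guided algorithm. Pixels that the local connection cannot reach are reported back, which is the kind of leftover region the non-local connections to distant frames are meant to address.

```python
def propagate_along_flow(target, mask, source, flow):
    """Fill missing pixels of `target` by copying colors from `source` along `flow`.

    target, source: H x W nested lists of pixel values (source assumed fully valid).
    mask: H x W nested lists of bools, True where `target` is missing.
    flow: H x W nested lists of (dx, dy) offsets from a target pixel to a source pixel.
    Returns the filled frame and a mask of pixels that could not be reached.
    """
    h, w = len(target), len(target[0])
    filled = [row[:] for row in target]
    unreached = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue  # known pixel, keep as-is
            dx, dy = flow[y][x]
            # Nearest-neighbour lookup of the corresponding source pixel.
            sx, sy = int(round(x + dx)), int(round(y + dy))
            if 0 <= sx < w and 0 <= sy < h:
                filled[y][x] = source[sy][sx]
            else:
                unreached[y][x] = True  # flow leaves the frame: no local connection
    return filled, unreached
```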

2. Computational Analysis of Deformable Manifolds: from Geometric Modelling to Deep Learning
Stefan C Schonsheck
Abstract: Leo Tolstoy opened his monumental novel Anna Karenina with the now famous words: "Happy families are all alike; every unhappy family is unhappy in its own way." A similar notion also applies to mathematical spaces: every flat space is alike; every unflat space is unflat in its own way. However, far from being a source of unhappiness, the diversity of non-flat spaces provides a rich area of study, as we will show. The genesis of the so-called big data era and the proliferation of social and scientific databases of increasing size have led to a need for algorithms that can efficiently process, analyze, and even generate high-dimensional data. However, because of the curse of dimensionality, many classical approaches do not scale well with the size of these problems. One technique to avoid some of these ill effects is to exploit the geometric structure of coherent data. In this thesis, we will explore geometric methods for shape processing and data analysis. More specifically, we will study techniques for representing manifolds and signals supported on them through a variety of mathematical tools including, but not limited to, computational differential geometry, variational PDE modeling, and deep learning. First, we will explore non-isometric shape matching through variational modeling. Next, we will use ideas from parallel transport on manifolds to generalize convolution and convolutional neural networks to deformable manifolds. Finally, we conclude by proposing a novel auto-regressive model for capturing the intrinsic geometry and topology of data. Throughout this work, we will use the idea of computing correspondences as a through-line to both motivate our work and analyze our results.

3. Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild
Weijia Wu, Ning Lu, Enze Xie
Abstract: Deep learning-based scene text detection can achieve strong performance when sufficient labeled training data are available. However, manual labeling is time-consuming and laborious, and in extreme cases the corresponding annotated data are unavailable. Exploiting synthetic data is a very promising solution, but it suffers from domain distribution mismatch between synthetic and real datasets. To address this severe mismatch, we propose a synthetic-to-real domain adaptation method for scene text detection, which transfers knowledge from synthetic data (source domain) to real data (target domain). In this paper, we introduce a text self-training (TST) method and adversarial text instance alignment (ATA) for domain adaptive scene text detection. ATA helps the network learn domain-invariant features by training a domain classifier in an adversarial manner. TST diminishes the adverse effects of false positives (FPs) and false negatives (FNs) caused by inaccurate pseudo-labels. Both components improve the performance of scene text detectors when adapting from synthetic to real scenes. We evaluate the proposed method by transferring from SynthText and VISD to ICDAR2015 and ICDAR2013. The results demonstrate the effectiveness of the proposed method, with up to 10% improvement, which is significant for further exploration of domain adaptive scene text detection. Code is available at this https URL
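Pseudo-label self-training is typically made robust to FPs and FNs by not trusting every prediction equally. The sketch below shows one common recipe with two confidence thresholds: confident detections become pseudo-positives, low-scoring ones are treated as background, and the ambiguous band in between is ignored by the loss. The thresholds (0.8 and 0.3) and the function name are hypothetical, and this is not claimed to be the exact TST procedure.

```python
def split_pseudo_labels(detections, high=0.8, low=0.3):
    """Partition (box, score) detections into pseudo-positives, ignored boxes, and negatives.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    positives, ignored, negatives = [], [], []
    for box, score in detections:
        if score >= high:
            positives.append(box)   # confident: train as text
        elif score >= low:
            ignored.append(box)     # ambiguous: excluded from the loss
        else:
            negatives.append(box)   # unlikely text: treat as background
    return positives, ignored, negatives
```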

4. MIPGAN -- Generating Robust and High Quality Morph Attacks Using Identity Prior Driven GAN
Haoyu Zhang, Sushma Venkatesh, Raghavendra Ramachandra, Kiran Raja, Naser Damer, Christoph Busch
Abstract: Face morphing attacks aim to circumvent Face Recognition Systems (FRS) by employing face images derived from multiple data subjects (e.g., accomplices and malicious actors). Given their high degree of identity resemblance, morphed images can be successfully verified against the contributing data subjects with a reasonable success rate. The success of a morphing attack therefore depends directly on the quality of the generated morph images. We present a new approach for generating robust attacks that extends our earlier framework for generating face morphs, using an Identity Prior Driven Generative Adversarial Network, which we refer to as MIPGAN (Morphing through Identity Prior driven GAN). The proposed MIPGAN is derived from StyleGAN with a newly formulated loss function exploiting perceptual quality and identity factors to generate high-quality morphed face images with minimal artifacts and higher resolution. We demonstrate the applicability of the proposed approach for generating robust morph attacks by evaluating it against a commercial FRS and reporting the success rate of the attacks. Extensive experiments are carried out to assess the FRS's vulnerability to the proposed morphed face generation technique on three types of data from the newly generated MIPGAN Face Morph Dataset: digital images, re-digitized (printed and scanned) images, and compressed images after re-digitization. The obtained results demonstrate that the proposed approach to morph generation poses a profound threat to FRS.
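An identity-driven loss of the kind the abstract mentions can be illustrated with face embeddings: the morph should be close to both contributing subjects, and equally close to each. The sketch below is an assumption-labeled illustration; `identity_losses`, the plain-list embeddings, and the exact form of both terms are hypothetical and stand in for whatever face recognition embeddings and weighting MIPGAN actually uses.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def identity_losses(emb_a, emb_b, emb_morph):
    """Return (identity loss, balance loss) for a morph of subjects a and b.

    identity: mean cosine distance from the morph to both subjects (lower is closer).
    balance: gap between the two similarities (lower means the morph favours neither).
    """
    sim_a = cosine(emb_a, emb_morph)
    sim_b = cosine(emb_b, emb_morph)
    identity = ((1 - sim_a) + (1 - sim_b)) / 2
    balance = abs(sim_a - sim_b)
    return identity, balance
```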

5. Multi-Loss Weighting with Coefficient of Variations
Rick Groenendijk, Sezer Karaoglu, Theo Gevers, Thomas Mensink
Abstract: Many interesting tasks in machine learning and computer vision are learned by optimising an objective function defined as a weighted linear combination of multiple losses. The final performance is sensitive to the choice of (relative) weights for these losses. A good set of weights is often found by treating them as hyper-parameters and tuning them with an extensive grid search, which is computationally expensive. In this paper, the weights are instead defined based on properties observed while training the model, including the specific batch loss, the average loss, and the variance of each loss. An additional advantage is that the weights evolve during training instead of remaining static. In the literature, loss weighting is mostly used in a multi-task learning setting, where the different tasks obtain different weights. However, there is a plethora of single-task multi-loss problems that can benefit from automatic loss weighting. In this paper, it is shown that these multi-task approaches do not work well on single tasks. Instead, a method is proposed that automatically and dynamically tunes loss weights throughout training, specifically for single-task multi-loss problems. The method incorporates a measure of uncertainty to balance the losses. The validity of the approach is shown empirically for different tasks on multiple datasets.
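One way to turn "batch loss, average loss, and variance" into weights is sketched below, assuming positive losses, that each loss is first expressed as the ratio of its current value to its running mean, and that a loss receives a weight proportional to the coefficient of variation (standard deviation over mean) of that ratio, normalized to sum to one. Running statistics use Welford's online algorithm. The class name and these modelling choices are assumptions for illustration, not a verified reproduction of the paper.

```python
import math


class CoVWeighting:
    """Dynamic loss weights from the coefficient of variation of loss ratios (sketch)."""

    def __init__(self, num_losses):
        self.n = num_losses
        self.step = 0
        self.loss_mean = [0.0] * num_losses   # running mean of raw losses
        self.ratio_mean = [0.0] * num_losses  # running mean of loss ratios
        self.ratio_m2 = [0.0] * num_losses    # Welford sum of squared deviations of ratios

    def update(self, losses):
        """Record one batch of (positive) losses and return weights that sum to one."""
        self.step += 1
        t = self.step
        for i, loss in enumerate(losses):
            # Ratio of the current loss to its running mean from previous steps.
            prev_mean = self.loss_mean[i]
            ratio = loss / prev_mean if t > 1 and prev_mean > 0 else 1.0
            self.loss_mean[i] += (loss - prev_mean) / t
            # Welford update of the ratio mean and variance accumulator.
            delta = ratio - self.ratio_mean[i]
            self.ratio_mean[i] += delta / t
            self.ratio_m2[i] += delta * (ratio - self.ratio_mean[i])
        if t < 2:
            return [1.0 / self.n] * self.n  # no variance estimate yet
        cov = [
            math.sqrt(self.ratio_m2[i] / (t - 1)) / self.ratio_mean[i]
            for i in range(self.n)
        ]
        total = sum(cov)
        if total == 0:
            return [1.0 / self.n] * self.n
        return [c / total for c in cov]
```

Under these assumptions, a loss that stays constant contributes no variation and receives no weight, while a loss that still fluctuates relative to its own history receives more of the total weight.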

6. Future Frame Prediction of a Video Sequence
Jasmeen Kaur, Sukhendu Das
Abstract: Predicting future frames of a video sequence has been a problem of high interest in the field of Computer Vision, as it caters to a multitude of applications. The ability to predict, anticipate and reason about future events is the essence of intelligence and one of the main goals of decision-making systems such as human-machine interaction, robot navigation and autonomous driving. However, the challenge lies in the ambiguous nature of the problem, as there may be multiple plausible future sequences for the same input video shot. A naively designed model averages multiple possible futures into a single blurry prediction. Recently, two distinct approaches have attempted to address this problem: (a) latent variable models that represent the underlying stochasticity and (b) adversarially trained models that aim to produce sharper images. A latent variable model often struggles to produce realistic results, while an adversarially trained model underutilizes latent variables and thus fails to produce diverse predictions. These methods have complementary strengths and weaknesses. Combining the two approaches produces predictions that appear more realistic and better cover the range of plausible futures. This forms the basis and objective of study in this project work. In this paper, we propose a novel multi-scale architecture combining both approaches. We validate our proposed model through a series of experiments and empirical evaluations on the Moving MNIST, UCF101, and Penn Action datasets. Our method outperforms the baseline methods.

7. Multi-domain semantic segmentation with pyramidal fusion
Marin Oršić, Petra Bevandić, Ivan Grubišić, Josip Šarić, Siniša Šegvić
Abstract: We present our submission to the semantic segmentation contest of the Robust Vision Challenge held at ECCV 2020. The contest requires submitting the same model to seven benchmarks from three different domains. Our approach is based on the SwiftNet architecture with pyramidal fusion. We address inconsistent taxonomies with a single-level 193-dimensional softmax output. We strive to train with large batches in order to stabilize optimization of a hard recognition problem, and to favour smooth evolution of batchnorm statistics. We achieve this by implementing a custom backward step through log-sum-prob loss, and by using small crops before freezing the population statistics. Our model ranks first on the RVC semantic segmentation challenge as well as on the WildDash 2 leaderboard. This suggests that pyramidal fusion is competitive not only for efficient inference with lightweight backbones, but also in large-scale setups for multi-domain application.
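With a single universal softmax, a label from one dataset may correspond to several universal classes (for example, a coarse class in one taxonomy that is split into finer classes in another). A natural loss then maximizes the total probability of the allowed set S: -log(sum of p_c over c in S) = logsumexp(z) - logsumexp(z_S). The sketch below computes this forward value stably in plain Python; the class-to-set mapping is hypothetical, and the custom backward step mentioned in the abstract is not reproduced.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_sum_prob_loss(logits, allowed):
    """-log of the total softmax probability assigned to the universal classes in `allowed`."""
    return logsumexp(logits) - logsumexp([logits[c] for c in allowed])
```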

8. Modification method for single-stage object detectors that allows to exploit the temporal behaviour of a scene to improve detection accuracy
Menua Gevorgyan
Abstract: A simple modification method for single-stage generic object detection neural networks, such as YOLO and SSD, is proposed. It improves detection accuracy on video data by exploiting the temporal behavior of the scene in the detection pipeline. It is shown that, using this method, the detection accuracy of the base network can be considerably improved, especially for occluded and hidden objects, and that a modified network detects hidden objects with higher confidence than an unmodified one. A weakly supervised training method is also proposed, which allows a modified network to be trained without requiring any additional annotated data.

9. Few-shot Object Detection with Feature Attention Highlight Module in Remote Sensing Images
Zixuan Xiao, Ping Zhong, Yuan Quan, Xuping Yin, Wei Xue
Abstract: In recent years, object detection has found many applications in the remote sensing field, and these applications demand a large amount of labeled data. However, in many cases, such data are extremely scarce. In this paper, we propose a few-shot object detector designed to detect novel objects based on only a few examples. By fully leveraging labeled base classes, our model, which is composed of a feature extractor, a feature attention highlight module, and a two-stage detection backend, can quickly adapt to novel classes. The pre-trained feature extractor, whose parameters are shared, produces general features, while the feature attention highlight module is designed to be lightweight and simple in order to fit the few-shot cases. Although the module is simple, the information it provides in a serial way helps make the general features specific to few-shot objects. The object-specific features are then delivered to the two-stage detection backend to produce the detection results. The experiments demonstrate the effectiveness of the proposed method for few-shot cases.

10. SCG-Net: Self-Constructing Graph Neural Networks for Semantic Segmentation
Qinghui Liu, Michael Kampffmeyer, Robert Jenssen, Arnt-Børre Salberg
Abstract: Capturing global contextual representations by exploiting long-range pixel-pixel dependencies has been shown to improve semantic segmentation performance. However, how to do this efficiently is an open question, as current approaches that use attention schemes or very deep models to increase a model's field of view result in complex models with large memory consumption. Inspired by recent work on graph neural networks, we propose the Self-Constructing Graph (SCG) module, which learns a long-range dependency graph directly from the image and uses it to propagate contextual information efficiently to improve semantic segmentation. The module is optimised via a novel adaptive diagonal enhancement method and a variational lower bound that consists of a customized graph reconstruction term and a Kullback-Leibler divergence regularization term. When the module is incorporated into a neural network (SCG-Net), semantic segmentation is performed in an end-to-end manner, and competitive performance (mean F1-scores of 92.0% and 89.8%, respectively) is achieved on the publicly available ISPRS Potsdam and Vaihingen datasets, with far fewer parameters and at a lower computational cost than related pure convolutional neural network (CNN) based models.
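The general pattern behind a self-constructed graph, deriving an adjacency matrix from learned node embeddings and then propagating features over it, can be sketched as below. This assumes an adjacency of ReLU(Z Z^T) and the standard GCN normalization D^{-1/2}(A + I)D^{-1/2} from Kipf and Welling; it does not implement the paper's adaptive diagonal enhancement or its variational training objective, and the function names are hypothetical.

```python
import math


def learned_adjacency(embeddings):
    """Adjacency A = ReLU(Z Z^T) built from per-node embedding vectors."""
    n = len(embeddings)
    return [
        [max(0.0, sum(a * b for a, b in zip(embeddings[i], embeddings[j]))) for j in range(n)]
        for i in range(n)
    ]


def normalized_propagation(adj, features):
    """One GCN-style step: D^{-1/2} (A + I) D^{-1/2} X."""
    n = len(adj)
    # Add self-loops so every node keeps part of its own features (and degree >= 1).
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    dim = len(features[0])
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in range(n):
            weight = inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j]
            for k in range(dim):
                row[k] += weight * features[j][k]
        out.append(row)
    return out
```

Nodes with orthogonal embeddings stay disconnected and keep their own features, while nodes with similar embeddings exchange context.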

11. Layer-specific Optimization for Mixed Data Flow with Mixed Precision in FPGA Design for CNN-based Object Detectors
Duy Thanh Nguyen, Hyun Kim, Hyuk-Jae Lee
Abstract: Convolutional neural networks (CNNs) require both intensive computation and frequent memory access, which lead to low processing speed and large power dissipation. Although the characteristics of the different layers in a CNN are frequently quite different, previous hardware designs have employed common optimization schemes for them. This paper proposes a layer-specific design that employs different organizations optimized for the different layers. The proposed design employs two layer-specific optimizations: layer-specific mixed data flow and layer-specific mixed precision. The mixed data flow aims to minimize off-chip access while demanding minimal on-chip memory (BRAM) resources of an FPGA device. The mixed precision quantization aims to achieve both lossless accuracy and aggressive model compression, thereby further reducing off-chip access. A Bayesian optimization approach is used to select the best sparsity for each layer, achieving the best trade-off between accuracy and compression. This mixing scheme allows the entire network model to be stored in the BRAMs of the FPGA, aggressively reducing off-chip access and thereby achieving a significant performance enhancement. The model size is reduced by 22.66-28.93 times compared to a full-precision network, with negligible degradation of accuracy on the VOC, COCO, and ImageNet datasets. Furthermore, the combination of mixed dataflow and mixed precision significantly outperforms previous works in terms of throughput, off-chip access, and on-chip memory requirements.
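The storage side of layer-specific mixed precision is easy to quantify. The sketch below assumes uniform symmetric per-layer quantization and computes the compression ratio relative to 32-bit weights for a given list of per-layer bitwidths. The bitwidths in the test are hypothetical; the paper selects per-layer settings with Bayesian optimization, and sparsity, scale factors, and index overheads are ignored here.

```python
def quantize_symmetric(weights, bits):
    """Uniform symmetric quantization of floats to signed `bits`-bit integers.

    Returns the integer codes and the scale; dequantize with code * scale.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def compression_ratio(layer_sizes, layer_bits, baseline_bits=32):
    """Weight storage of a full-precision model divided by that of the mixed-precision one."""
    full = sum(n * baseline_bits for n in layer_sizes)
    mixed = sum(n * b for n, b in zip(layer_sizes, layer_bits))
    return full / mixed
```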

12. DESC: Domain Adaptation for Depth Estimation via Semantic Consistency
Adrian Lopez-Rodriguez, Krystian Mikolajczyk
Abstract: Accurate real depth annotations are difficult to acquire, requiring special devices such as a LiDAR sensor. Self-supervised methods try to overcome this problem by processing video or stereo sequences, which may not always be available. Instead, in this paper, we propose a domain adaptation approach to train a monocular depth estimation model using a fully-annotated source dataset and a non-annotated target dataset. We bridge the domain gap by leveraging semantic predictions and low-level edge features to provide guidance for the target domain. We enforce consistency between the main model and a second model trained with semantic segmentation and edge maps, and introduce priors in the form of instance heights. Our approach is evaluated on standard domain adaptation benchmarks for monocular depth estimation and shows consistent improvement over the state-of-the-art.
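Instance heights relate to depth through simple camera geometry. Assuming a pinhole camera with focal length f (in pixels), an object of known real height H that spans h pixels lies at depth roughly Z = f * H / h. The helper below only illustrates that relation; how DESC actually encodes the prior is not specified in the abstract, and the name and example values are hypothetical.

```python
def depth_from_instance_height(focal_px, real_height_m, pixel_height):
    """Pinhole-camera depth estimate Z = f * H / h (assumed model, metres)."""
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_px * real_height_m / pixel_height
```

For example, a 1.5 m object spanning 90 pixels under a 720-pixel focal length would sit about 12 m away; doubling its image height halves the estimated depth.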

13. Auto-Classifier: A Robust Defect Detector Based on an AutoML Head
Vasco Lopes, Luís A. Alexandre
Abstract: The dominant approach for surface defect detection is the use of hand-crafted feature-based methods. However, these methods fall short when the conditions that affect the extracted images vary. So, in this paper, we sought to determine how well several state-of-the-art Convolutional Neural Networks perform in the task of surface defect detection. Moreover, we propose two methods: CNN-Fusion, which fuses the predictions of all the networks into a final one, and Auto-Classifier, a novel proposal that improves a Convolutional Neural Network by modifying its classification component using AutoML. We carried out experiments to evaluate the proposed methods in the task of surface defect detection using different datasets from DAGM2007. We show that Convolutional Neural Networks achieve better results than traditional methods, and that Auto-Classifier outperforms all other methods, achieving 100% accuracy and 100% AUC on all the datasets.
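The abstract does not state how CNN-Fusion combines predictions. A minimal sketch, assuming the simplest rule of averaging the per-class probabilities of all networks and taking the arg-max, looks like this (the function name is hypothetical):

```python
def fuse_predictions(prob_lists):
    """Average per-class probabilities from several networks; return (fused, label).

    Averaging is an assumed fusion rule, not necessarily the one used by CNN-Fusion.
    """
    n = len(prob_lists)
    num_classes = len(prob_lists[0])
    fused = [sum(p[c] for p in prob_lists) / n for c in range(num_classes)]
    label = max(range(num_classes), key=lambda c: fused[c])
    return fused, label
```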

14. 1st Place Solution of LVIS Challenge 2020: A Good Box is not a Guarantee of a Good Mask
Jingru Tan, Gang Zhang, Hanming Deng, Changbao Wang, Lewei Lu, Quanquan Li, Jifeng Dai
Abstract: This article introduces the solution of the team lvisTraveler for the LVIS Challenge 2020. In this work, two characteristics of the LVIS dataset are mainly considered: the long-tailed distribution and the high-quality instance segmentation masks. We adopt a two-stage training pipeline. In the first stage, we incorporate EQL and self-training to learn generalized representations. In the second stage, we utilize Balanced GroupSoftmax to improve the classifier, and propose a novel proposal assignment strategy and a new balanced mask loss for the mask head to obtain more precise mask predictions. Finally, we achieve 41.5 and 41.2 AP on the LVIS v1.0 val and test-dev splits, respectively, outperforming the baseline based on X101-FPN-MaskRCNN by a large margin.
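EQL (Equalization Loss) targets the long-tailed distribution by suppressing the discouraging gradients that rare classes receive as negatives. As we understand the published formulation, each class j of a sigmoid classifier gets weight w_j = 1 - E(r) * T(f_j) * (1 - y_j), where E(r) is 1 for a foreground proposal, T(f_j) is 1 when the class frequency f_j is below a threshold lambda, and y_j is the label. The sketch below computes those weights; the threshold value and helper name are assumptions, and the team's exact configuration is not given in the abstract.

```python
def eql_weights(labels, class_freq, is_foreground, lam=1e-3):
    """Per-class weights for a sigmoid loss following the Equalization Loss idea.

    For a foreground proposal, negative terms (y_j = 0) of rare classes
    (frequency below `lam`) get weight 0; all other terms keep weight 1.
    """
    e = 1.0 if is_foreground else 0.0
    weights = []
    for y, f in zip(labels, class_freq):
        t = 1.0 if f < lam else 0.0
        weights.append(1.0 - e * t * (1.0 - y))
    return weights
```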

15. Physics-based Shading Reconstruction for Intrinsic Image Decomposition
Anil S. Baslamisli, Yang Liu, Sezer Karaoglu, Theo Gevers
Abstract: We investigate the use of photometric invariance and deep learning to compute intrinsic images (albedo and shading). We propose albedo and shading gradient descriptors derived from physics-based models. Using the descriptors, albedo transitions are masked out, and an initial sparse shading map is calculated directly from the corresponding RGB image gradients in a learning-free, unsupervised manner. Then, an optimization method is proposed to reconstruct the full dense shading map. Finally, we integrate the generated shading map into a novel deep learning framework to refine it and to predict the corresponding albedo image, achieving intrinsic image decomposition. By doing so, we are the first to directly address the texture and intensity ambiguity problems of shading estimation. Large-scale experiments show that our approach, steered by physics-based invariant descriptors, achieves superior results on the MIT Intrinsics, NIR-RGB Intrinsics, Multi-Illuminant Intrinsic Images, Spectral Intrinsic Images, and As Realistic As Possible datasets, and competitive results on the Intrinsic Images in the Wild dataset, while achieving state-of-the-art shading estimations.
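The physical intuition behind masking albedo transitions can be shown on a single scanline. Under a Lambertian model with white illumination, a shading change scales all three channels by the same factor, so chromaticity R/(R+G+B) is unchanged, whereas an albedo change usually alters chromaticity. The sketch below, assuming strictly positive RGB values and a hypothetical tolerance, keeps log-intensity gradients where chromaticity is stable and masks the rest; it is a simplified photometric-invariance illustration, not the paper's descriptors.

```python
import math


def chromaticity(pixel):
    """Normalized rgb: each channel divided by the channel sum (assumes sum > 0)."""
    s = sum(pixel)
    return tuple(c / s for c in pixel)


def shading_gradients(scanline, chroma_tol=1e-3):
    """Log-intensity gradients between neighbours; None where an albedo edge is assumed."""
    grads = []
    for p, q in zip(scanline, scanline[1:]):
        cp, cq = chromaticity(p), chromaticity(q)
        if max(abs(a - b) for a, b in zip(cp, cq)) > chroma_tol:
            grads.append(None)  # chromaticity changed: treat as albedo transition
        else:
            # Same chromaticity: the intensity ratio is attributed to shading.
            grads.append(math.log(sum(q)) - math.log(sum(p)))
    return grads
```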

16. A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports
Yikuan Li, Hanyin Wang, Yuan Luo
Abstract: Joint image-text embedding extracted from medical images and associated contextual reports is the bedrock for most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval, and clinical report auto-generation. In this study, we adopt four pre-trained V+L models: LXMERT, VisualBERT, UNITER and PixelBERT to learn multimodal representations from MIMIC-CXR radiographs and associated reports. The extrinsic evaluation on the OpenI dataset shows that, in comparison to the pioneering CNN-RNN model, the joint embedding learned by pre-trained V+L models demonstrates performance improvement in the thoracic findings classification task. We conduct an ablation study to analyze the contribution of certain model components and validate the advantage of joint embedding over text-only embedding. We also visualize attention maps to illustrate the attention mechanism of V+L models.

17. TRACE: Transform Aggregate and Compose Visiolinguistic Representations for Image Search with Text Feedback
Surgan Jandial, Ayush Chopra, Pinkesh Badjatiya, Pranit Chawla, Mausoom Sarkar, Balaji Krishnamurthy
Abstract: The ability to efficiently search for images over an indexed database is the cornerstone of several user experiences. Incorporating user feedback through multi-modal inputs provides flexible interaction that can serve fine-grained, specific requirements. We specifically focus on text feedback, given as descriptive natural language queries. Given a reference image and textual user feedback, our goal is to retrieve images that satisfy constraints specified by both of these input modalities. The task is challenging as it requires understanding the textual semantics of the text feedback and then applying these changes to the visual representation. To address these challenges, we propose a novel architecture, TRACE, which contains a hierarchical feature aggregation module to learn composite visio-linguistic representations. TRACE achieves SOTA performance on 3 benchmark datasets: FashionIQ, Shoes, and Birds-to-Words, with an average improvement of at least ~5.7%, ~3%, and ~5%, respectively, in the R@K metric. Our extensive experiments and ablation studies show that TRACE consistently outperforms existing techniques by significant margins, both quantitatively and qualitatively.
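Results above are reported with the R@K (Recall at K) metric. A minimal sketch of how it is commonly computed, assuming one ground-truth target per query and ranked result lists (names are hypothetical):

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose ground-truth target appears among the top-k results."""
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return hits / len(targets)
```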

18. Modeling Global Body Configurations in American Sign Language
Nicholas Wilkins, Beck Cordes Galbraith, Ifeoma Nwogu
Abstract: American Sign Language (ASL) is the fourth most commonly used language in the United States and is the language most commonly used by Deaf people in the United States and the English-speaking regions of Canada. Unfortunately, until recently, ASL received little research attention. This is due, in part, to its delayed recognition as a language, which came only with William C. Stokoe's publication in 1960. Limited data has been a long-standing obstacle to ASL research and computational modeling. The lack of large-scale datasets has prevented many modern machine-learning techniques, such as Neural Machine Translation, from being applied to ASL. In addition, the modality required to capture sign language (i.e., video) is complex in natural settings (as one must deal with background noise, motion blur, and the curse of dimensionality). Finally, compared with spoken languages such as English, there has been limited research into the linguistics of ASL. We realize a simplified version of Liddell and Johnson's Movement-Hold (MH) Model using a Probabilistic Graphical Model (PGM). We trained our model on ASLing, a dataset collected from three fluent ASL signers. We evaluate our PGM against other models to determine its ability to model ASL. Finally, we interpret various aspects of the PGM and draw conclusions about ASL phonetics. The main contributions of this paper are

19. Adherent Mist and Raindrop Removal from a Single Image Using Attentive Convolutional Network
Da He, Xiaoyu Shang, Jiajia Luo
Abstract: Temperature difference-induced mist adhering to windshields, camera lenses, etc. is often inhomogeneous and obscure, which can easily obstruct vision and severely degrade images. Together with adherent raindrops, such mist poses considerable challenges to various vision systems, yet it has not received enough attention. Recent methods for similar problems typically use hand-crafted priors to generate spatial attention maps. In this work, we propose to visually remove adherent mist and raindrops jointly from a single image using attentive convolutional neural networks. We apply classification activation map attention to our model to strengthen spatial attention without hand-crafted priors. In addition, smoothed dilated convolution is adopted to obtain a large receptive field without spatial information loss, and a dual attention module is used to efficiently select channel and spatial features. Our experiments show that our method achieves state-of-the-art performance, and demonstrate that this underrated practical problem is critical to high-level vision scenes.
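The benefit of dilation for the receptive field follows from simple arithmetic: for stacked stride-1 convolutions with kernel sizes k_i and dilation rates d_i, the receptive field is 1 + sum((k_i - 1) * d_i). The helper below evaluates this; the dilation rates in the example are hypothetical, and the smoothing that distinguishes smoothed dilated convolution is not modelled.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in pixels) of stacked stride-1 convolutions: 1 + sum((k - 1) * d)."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))
```

Three 3x3 layers with dilations 1, 2, and 4 cover 15 pixels, versus 7 without dilation, at the same parameter count.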

20. Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding
Long Chen, Wenbo Ma, Jun Xiao, Hanwang Zhang, Wei Liu, Shih-Fu Chang
Abstract: The prevailing framework for solving referring expression grounding is a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expression with the proposals. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals solely based on detection confidence (i.e., expression-agnostic), yet hope that the proposals contain all the instances referred to in the expression (i.e., expression-aware). Due to this mismatch, current two-stage methods suffer a severe performance drop between detected and ground-truth proposals. To this end, we propose Ref-NMS, the first method to yield expression-aware proposals at the first stage. Ref-NMS regards all nouns in the expression as critical objects and introduces a lightweight module that predicts a score for aligning each box with a critical object. These scores guide the NMS operation to filter out boxes irrelevant to the expression, which increases the recall of critical objects and significantly improves grounding performance. Since Ref-NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method. Extensive ablation studies on several backbones, benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS.
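The following minimal Python sketch illustrates expression-aware NMS. It is not the authors' code. It runs a standard greedy NMS whose ranking score is the detector confidence multiplied by a per-box relatedness score. The abstract does not say how Ref-NMS combines the two scores, so the following are illustrative assumptions:

- the multiplicative fusion;
- the 0.5 IoU threshold;
- the function names `iou` and `expression_aware_nms`.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def expression_aware_nms(boxes: List[Box], det_scores: List[float],
                         rel_scores: List[float], iou_thr: float = 0.5) -> List[int]:
    """Greedy NMS ranked by det_score * rel_score (hypothetical fusion rule).

    rel_scores stands in for the output of a lightweight module scoring how
    well each box aligns with a critical object (noun) in the expression.
    Returns the indices of kept boxes, highest fused score first.
    """
    fused = [d * r for d, r in zip(det_scores, rel_scores)]
    order = sorted(range(len(boxes)), key=lambda i: fused[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep
```

Setting every relatedness score to 1.0 recovers ordinary, expression-agnostic NMS, which is the first-stage behavior the abstract argues against.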

21. Tasks Integrated Networks: Joint Detection and Retrieval for Image Search
Lei Zhang, Zhenwei He, Yi Yang, Liang Wang, Xinbo Gao
Abstract: The traditional object retrieval task aims to learn a discriminative feature representation with intra-similarity and inter-dissimilarity, which assumes that the objects in an image have been precisely pre-cropped, manually or automatically. However, in many real-world search scenarios (e.g., video surveillance), the objects (e.g., persons, vehicles, etc.) are seldom accurately detected or annotated. Therefore, object-level retrieval becomes intractable without bounding-box annotations, which leads to a new and challenging topic, i.e., image-level search. In this paper, to address the image search problem, we first introduce an end-to-end Integrated Net (I-Net), which has three merits: 1) a Siamese architecture and an online pairing strategy for similar and dissimilar objects in the given images; 2) a novel online pairing (OLP) loss with a dynamic feature dictionary, which alleviates the multi-task training stagnation problem by automatically generating a number of negative pairs to restrict the positives; and 3) a hard example priority (HEP) based softmax loss, which improves the robustness of the classification task by selecting hard categories. Following the philosophy of divide and conquer, we further propose an improved I-Net, called DC-I-Net, which makes two new contributions: 1) two modules are tailored to handle different tasks separately within the integrated framework, so that the task specification is guaranteed; and 2) a class-center guided HEP loss (C2HEP) that exploits the stored class centers is proposed, so that intra-similarity and inter-dissimilarity can be captured for the final retrieval. Extensive experiments on well-known image-level search benchmarks demonstrate that the proposed DC-I-Net outperforms state-of-the-art tasks-integrated and tasks-separated image search models.

22. Spatial Transformer Point Convolution
Yuan Fang, Chunyan Xu, Zhen Cui, Yuan Zong, Jian Yang
Abstract: Point clouds are unstructured and unordered in the embedded 3D space. To produce consistent responses under different permutations, most existing methods aggregate local spatial points through max or sum operations. However, such aggregation is essentially isotropic filtering over all the points involved, which tends to lose information about geometric structure. In this paper, we propose a spatial transformer point convolution (STPC) method to achieve anisotropic convolution filtering on point clouds. To capture and represent implicit geometric structures, we introduce a spatial direction dictionary to learn latent geometric components. To better encode unordered neighbor points, we design a sparse deformer that transforms them into the canonical, ordered dictionary space via direction dictionary learning. In the transformed space, standard image-like convolution can be leveraged to perform anisotropic filtering, which more robustly expresses finer variations within local regions. The dictionary learning and encoding processes are encapsulated into a network module and learned jointly in an end-to-end manner. Extensive experiments on several public datasets (including S3DIS, Semantic3D, and SemanticKITTI) demonstrate the effectiveness of our method on the point cloud semantic segmentation task.
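A small sketch can illustrate the contrast between isotropic max/sum aggregation and anisotropic filtering:

1. Each neighbor offset is assigned to its nearest entry in a direction dictionary.
2. This gives an ordered slot vector that does not depend on neighbor order.
3. Direction-specific weights then act on that vector like a convolution kernel.

The fixed six-axis dictionary and the names `bin_neighbors` and `anisotropic_filter` are assumptions for illustration. STPC learns its dictionary and uses a sparse deformer rather than this hard nearest-direction assignment.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical fixed direction dictionary (+/-x, +/-y, +/-z); STPC learns its own.
DIRECTIONS: List[Vec3] = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                          (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def bin_neighbors(center: Vec3, neighbors: Sequence[Vec3],
                  feats: Sequence[float]) -> List[float]:
    """Average neighbor features into one ordered slot per dictionary direction."""
    sums = [0.0] * len(DIRECTIONS)
    counts = [0] * len(DIRECTIONS)
    for p, f in zip(neighbors, feats):
        off = [p[k] - center[k] for k in range(3)]
        norm = math.sqrt(sum(c * c for c in off)) or 1.0
        # Nearest direction = largest cosine similarity with the offset.
        cos = [sum(off[k] * d[k] for k in range(3)) / norm for d in DIRECTIONS]
        slot = max(range(len(DIRECTIONS)), key=lambda j: cos[j])
        sums[slot] += f
        counts[slot] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def anisotropic_filter(slots: Sequence[float], weights: Sequence[float]) -> float:
    """Direction-specific weights, like one output channel of a 1x6 convolution."""
    return sum(w * s for w, s in zip(weights, slots))
```

Max pooling over the same neighbors returns the same value whichever direction each feature came from. By contrast, the slot vector keeps the direction and remains invariant to the order in which neighbors are listed.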

23. Noise-Aware Texture-Preserving Low-Light Enhancement
Zohreh Azizi, Xuejing Lei, C.-C. Jay Kuo
Abstract: This work proposes a simple and effective low-light image enhancement method based on a noise-aware, texture-preserving Retinex model. The new method, called NATLE, attempts to strike a balance between noise removal and natural texture preservation through a low-complexity solution. Its cost function involves an estimated piecewise-smooth illumination map and a noise-free, texture-preserving reflectance map. The illumination map is then adjusted and combined with the reflectance map to form the enhanced image. Extensive experiments on common low-light image enhancement datasets demonstrate the superior performance of NATLE.
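The Retinex decomposition this method builds on (image = illumination × reflectance) can be sketched in a few steps:

1. Estimate illumination as the per-pixel channel maximum.
2. Brighten it with a gamma curve.
3. Recombine it with the reflectance.

The max-channel estimate and the default gamma of 0.5 are common simple choices, assumed here. NATLE instead optimizes a cost function for a piecewise-smooth illumination map and a noise-free reflectance map, which this sketch does not attempt.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # RGB values in [0, 1]


def retinex_enhance(img: List[List[Pixel]], gamma: float = 0.5,
                    eps: float = 1e-3) -> List[List[Pixel]]:
    """Brighten a low-light image via the Retinex model I = L * R.

    Simplified sketch: no smoothing of L and no denoising of R.
    """
    out: List[List[Pixel]] = []
    for row in img:
        new_row: List[Pixel] = []
        for r, g, b in row:
            lum = max(r, g, b, eps)             # illumination estimate L (eps avoids /0)
            refl = (r / lum, g / lum, b / lum)  # reflectance R = I / L
            lum_adj = lum ** gamma              # gamma < 1 lifts dark regions most
            new_row.append(tuple(min(1.0, c * lum_adj) for c in refl))
        out.append(new_row)
    return out
```

Because each pixel is scaled by a single factor, its color ratios are preserved. Any noise in dark regions is amplified along with the signal, which is the problem NATLE's noise-aware reflectance term targets.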

24. Towards Practical Implementations of Person Re-Identification from Full Video Frames
Felix O. Sumari, Luigy Machaca, Jose Huaman, Esteban W. G. Clua, Joris Guérin
Abstract: With the widespread adoption of automation for city security, person re-identification (Re-ID) has been extensively studied recently. In this paper, we argue that the current way of studying person re-identification, i.e., trying to re-identify a person within already detected and pre-cropped images of people, is not sufficient for implementing practical security applications, where the inputs to the system are the full frames of video streams. To support this claim, we introduce the Full Frame Person Re-ID setting (FF-PRID) and define specific metrics to evaluate FF-PRID implementations. To improve robustness, we also formalize the hybrid human-machine collaboration framework that is inherent to any Re-ID security application. To demonstrate the importance of considering the FF-PRID setting, we design an experiment showing that combining a good person detection network with a good Re-ID model does not necessarily produce good results for the final application. This underlines a failure of the current formulation in assessing the quality of a Re-ID model and justifies the use of different metrics. We hope that this work will motivate the research community to consider the full problem in order to develop algorithms that are better suited to real-world scenarios.

25. NITES: A Non-Parametric Interpretable Texture Synthesis Method
Xuejing Lei, Ganning Zhao, C.-C. Jay Kuo
Abstract: This work proposes a non-parametric interpretable texture synthesis method, called NITES. Although deep neural networks can now automatically synthesize visually pleasing textures, the associated generative models are mathematically intractable and their training demands high computational cost. NITES offers a new texture synthesis solution that addresses these shortcomings: it is mathematically transparent and efficient in both training and inference. The input is a single exemplary texture image. NITES crops patches from the input and analyzes the statistical properties of these texture patches to obtain their joint spatial-spectral representations. Then, the probability distributions of samples in the joint spatial-spectral spaces are characterized. Finally, numerous texture images that are visually similar to the exemplary texture image can be generated automatically. Experimental results show the superior quality of the generated texture images and the efficiency of NITES in terms of both training and inference time.

26. Robust Object Classification Approach using Spherical Harmonics
Ayman Mukhaimar, Ruwan Tennakoon, Chow Yin Lai, Reza Hoseinnezhad, Alireza Bab-Hadiashar
Abstract: In this paper, we present a robust spherical harmonics approach for the classification of point cloud-based objects. Spherical harmonics have been used for classification for years, and several frameworks exist in the literature. These approaches use a variety of spherical harmonics-based descriptors to classify objects. We first investigated the robustness of these frameworks to data augmentation, such as outliers and noise, which had not been studied before. We then propose a spherical convolutional neural network framework for robust object classification. The proposed framework uses a voxel grid of concentric spheres to learn features over the unit ball. Owing to the selected sampling strategy and the designed convolution operation, our model learns features that are less sensitive to data augmentation. We tested our model against several types of data augmentation, such as noise and outliers. Our results show that the proposed model outperforms state-of-the-art networks in terms of robustness to data augmentation.

27. Unsupervised Point Cloud Registration via Salient Points Analysis (SPA)
Pranav Kadam, Min Zhang, Shan Liu, C.-C. Jay Kuo
Abstract: This work proposes an unsupervised point cloud registration method called salient points analysis (SPA). SPA can register two point clouds effectively using only a small subset of salient points. It first applies the PointHop++ method to the point clouds, then finds corresponding salient points in the two point clouds based on the local surface characteristics of points, and finally performs registration by matching the corresponding salient points. SPA offers several advantages over recent deep learning-based registration solutions. Deep learning methods such as PointNetLK and DCP train end-to-end networks and rely on full supervision (namely, the ground-truth transformation matrix and class labels). In contrast, SPA is completely unsupervised. Furthermore, SPA's training time and model size are much smaller. The effectiveness of SPA is demonstrated by experiments on seen and unseen classes and on noisy point clouds from the ModelNet-40 dataset.

28. Unsupervised Feedforward Feature (UFF) Learning for Point Cloud Classification and Segmentation
Min Zhang, Pranav Kadam, Shan Liu, C.-C. Jay Kuo
Abstract: In contrast to supervised backpropagation-based feature learning in deep neural networks (DNNs), an unsupervised feedforward feature (UFF) learning scheme for joint classification and segmentation of 3D point clouds is proposed in this work. The UFF method exploits statistical correlations of points in a point cloud set to learn shape and point features in a one-pass feedforward manner through a cascaded encoder-decoder architecture. It learns global shape features through the encoder and local point features through the concatenated encoder-decoder architecture. The extracted features of an input point cloud are fed to classifiers for shape classification and part segmentation. Experiments are conducted to evaluate the performance of the UFF method. For shape classification, the UFF is superior to existing unsupervised methods and on par with state-of-the-art DNNs. For part segmentation, the UFF outperforms semi-supervised methods and performs slightly worse than DNNs.

29. Efficiency in Real-time Webcam Gaze Tracking
Amogh Gudi, Xin Li, Jan van Gemert
Abstract: Efficiency and ease of use are essential for practical applications of camera-based eye/gaze tracking. Gaze tracking involves estimating where a person is looking on a screen based on face images from a computer-facing camera. In this paper, we investigate two complementary forms of efficiency in gaze tracking: 1. the computational efficiency of the system, which is dominated by the inference speed of a CNN predicting gaze vectors; and 2. the usability efficiency, which is determined by the tediousness of the mandatory calibration of the gaze vector to a computer screen. To do so, we evaluate the computational speed/accuracy trade-off of the CNN and the calibration effort/accuracy trade-off of screen calibration. For the CNN, we evaluate full-face, two-eye, and single-eye inputs. For screen calibration, we measure the number of calibration points needed and evaluate three types of calibration: 1. pure geometry, 2. pure machine learning, and 3. hybrid geometric regression. Results suggest that single-eye input and geometric regression calibration achieve the best trade-off.
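Screen calibration maps the CNN's gaze vectors to screen coordinates using a few fixation points. A minimal, regression-only sketch fits an independent least-squares line per axis. The `(yaw, pitch)` input format and the per-axis linear model are assumptions. The paper's hybrid geometric regression also uses screen/camera geometry, which is not modeled here.

```python
from typing import Callable, List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)  # requires >= 2 distinct x values
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def calibrate(gaze: List[Tuple[float, float]],
              screen: List[Tuple[float, float]]
              ) -> Callable[[float, float], Tuple[float, float]]:
    """Fit per-axis linear maps from gaze angles to screen pixels.

    gaze[i] = (yaw, pitch) predicted while the user fixates the known
    calibration target screen[i] = (x_px, y_px).
    """
    ax, bx = fit_line([g[0] for g in gaze], [s[0] for s in screen])
    ay, by = fit_line([g[1] for g in gaze], [s[1] for s in screen])

    def to_screen(yaw: float, pitch: float) -> Tuple[float, float]:
        return ax * yaw + bx, ay * pitch + by

    return to_screen
```

Each extra calibration point costs user effort but tightens the fit. That is the effort/accuracy trade-off the abstract measures.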

30. CNN-Based Ultrasound Image Reconstruction for Ultrafast Displacement Tracking
Dimitris Perdios, Manuel Vonlanthen, Florian Martinez, Marcel Arditi, Jean-Philippe Thiran
Abstract: Thanks to its capability of acquiring full-view frames at multiple kilohertz, ultrafast ultrasound imaging has unlocked the analysis of rapidly changing physical phenomena in the human body, with pioneering applications such as ultrasensitive flow imaging in the cardiovascular system and shear-wave elastography. The accuracy achievable with these motion estimation techniques is strongly contingent upon two contradictory requirements: high quality of consecutive frames and a high frame rate. Indeed, image quality can usually be improved by increasing the number of steered ultrafast acquisitions, but at the expense of a reduced frame rate and possible motion artifacts. To achieve accurate motion estimation at uncompromised frame rates and immune to motion artifacts, the proposed approach relies on single ultrafast acquisitions to reconstruct high-quality frames and on only two consecutive frames to obtain 2-D displacement estimates. To this end, we deployed a convolutional neural network-based image reconstruction method combined with a cross-correlation-based speckle tracking algorithm. Numerical and in vivo experiments, conducted in the context of plane-wave imaging, demonstrate that the proposed approach is capable of estimating displacements in regions where side-lobe and grating-lobe artifacts prevent any displacement estimation with a state-of-the-art technique that relies on conventional delay-and-sum beamforming. The proposed approach may therefore unlock the full potential of ultrafast ultrasound in applications such as ultrasensitive cardiovascular motion and flow analysis or shear-wave elastography.
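Speckle tracking by cross-correlation, applied here to pairs of reconstructed frames, can be sketched in one dimension. A window from the first frame is slid over the second frame, and the lag with the highest zero-mean normalized cross-correlation is kept. This is a simplified illustration using 1-D signals and integer lags, with no sub-sample interpolation. The actual algorithm operates on 2-D frames, and the abstract does not give its settings. The names `ncc` and `estimate_shift` are hypothetical.

```python
import math
from typing import Sequence


def ncc(a: Sequence[float], b: Sequence[float]) -> float:
    """Zero-mean normalized cross-correlation of two equal-length windows."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den > 0 else 0.0


def estimate_shift(frame1: Sequence[float], frame2: Sequence[float],
                   start: int, width: int, max_lag: int) -> int:
    """Integer displacement of the window frame1[start:start+width] in frame2."""
    ref = frame1[start:start + width]
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        lo = start + lag
        if lo < 0 or lo + width > len(frame2):
            continue  # shifted window would fall outside the signal
        score = ncc(ref, frame2[lo:lo + width])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

The estimate is only as good as the speckle pattern in both frames. Side-lobe and grating-lobe clutter decorrelates the windows, which is why the abstract pairs the tracker with a higher-quality reconstruction.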

31. Limited View Tomographic Reconstruction Using a Deep Recurrent Framework with Residual Dense Spatial-Channel Attention Network and Sinogram Consistency
Bo Zhou, S. Kevin Zhou, James S. Duncan, Chi Liu
Abstract: Limited-view tomographic reconstruction aims to reconstruct a tomographic image from a limited number of sinogram or projection views arising from sparse-view or limited-angle acquisitions that reduce radiation dose or shorten scanning time. However, such a reconstruction suffers from high noise and severe artifacts due to the incompleteness of the sinogram. To obtain high-quality reconstructions, previous state-of-the-art methods use UNet-like neural architectures to directly predict the full-view reconstruction from limited-view data; but these methods leave the deep network architecture issue largely unaddressed and cannot guarantee consistency between the sinogram of the reconstructed image and the acquired sinogram, leading to non-ideal reconstructions. In this work, we propose a novel recurrent reconstruction framework that stacks the same block multiple times. The recurrent block consists of a custom-designed residual dense spatial-channel attention network. Further, we develop a sinogram consistency layer interleaved in our recurrent framework to ensure that the sampled sinogram is consistent with the sinogram of the intermediate outputs of the recurrent blocks. We evaluate our method on two datasets. Our experimental results on the AAPM Low Dose CT Grand Challenge datasets demonstrate that our algorithm achieves a consistent and significant improvement over existing state-of-the-art neural methods on both limited-angle reconstruction (over 5 dB better in terms of PSNR) and sparse-view reconstruction (about 4 dB better in terms of PSNR). In addition, our experimental results on the Deep Lesion datasets demonstrate that our method is able to generate high-quality reconstructions for 8 major lesion types.
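The sinogram consistency layer can be illustrated by its core data-consistency step. At projection views that were actually acquired, the sinogram predicted from an intermediate output is replaced by, or blended toward, the measured data. Views that were not acquired keep the prediction. The following are assumptions:

- the dict-of-views layout;
- the blending weight `alpha`;
- the name `sinogram_consistency`.

The forward projection that produces the predicted sinogram and the reconstruction that follows this step are omitted.

```python
from typing import Dict, List

Sinogram = List[List[float]]  # sinogram[view_index][detector_bin]


def sinogram_consistency(predicted: Sinogram, measured: Dict[int, List[float]],
                         alpha: float = 1.0) -> Sinogram:
    """Enforce agreement with acquired projection views.

    measured maps an acquired view (angle) index to its detector readings.
    alpha = 1.0 replaces predicted views outright (hard consistency);
    smaller values blend the prediction toward the measurement.
    """
    out: Sinogram = []
    for view, row in enumerate(predicted):
        if view in measured:
            # Acquired view: move the prediction toward the measurement.
            out.append([(1 - alpha) * p + alpha * m
                        for p, m in zip(row, measured[view])])
        else:
            # Unacquired view: keep the network's estimate unchanged.
            out.append(list(row))
    return out
```

In a recurrent framework, a step like this would sit between blocks, so that every intermediate image is pulled back toward the data that was actually measured.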

32. Software Effort Estimation using parameter tuned Models
Akanksha Baghel, Meemansa Rathod, Pradeep Singh
Abstract: Software estimation is one of the most important activities in a software project. Software effort estimation is required in the early stages of the software life cycle. Project failure is the major problem that software project managers face today, and imprecise estimation is the cause of this problem. As software size grows, systems also become more complex, making it difficult to accurately predict the cost of the software development process. The greatest pitfall of the software industry has been the fast-changing nature of software development, which has made it difficult to develop parametric models that yield high accuracy for software development in all domains. Useful models that accurately predict the cost of developing a software product are therefore needed. This study presents a novel analysis of various regression models with hyperparameter tuning to obtain an effective model. Nine different regression techniques are considered for model development
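As one concrete instance of a regression technique with hyperparameter tuning, the sketch below tunes the neighbor count k of a k-nearest-neighbors effort regressor by leave-one-out grid search. The abstract does not list the nine techniques or the tuning procedure, so the following are all assumptions:

- k-NN as the regressor;
- a single size feature;
- mean absolute error as the selection criterion;
- the names `knn_predict` and `tune_k`.

```python
from typing import List, Sequence, Tuple

Project = Tuple[float, float]  # (size, e.g. KLOC; effort, e.g. person-months)


def knn_predict(train: Sequence[Project], size: float, k: int) -> float:
    """Mean effort of the k training projects whose size is closest to `size`."""
    nearest = sorted(train, key=lambda p: abs(p[0] - size))[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


def tune_k(data: List[Project], candidates: Sequence[int]) -> int:
    """Grid-search k, scoring each candidate by leave-one-out mean absolute error."""
    def loo_mae(k: int) -> float:
        errors = []
        for i, (size, effort) in enumerate(data):
            rest = data[:i] + data[i + 1:]  # hold out project i
            errors.append(abs(knn_predict(rest, size, k) - effort))
        return sum(errors) / len(errors)

    return min(candidates, key=loo_mae)
```

Leave-one-out validation is a common choice for the small project datasets typical of effort estimation. With larger datasets, k-fold cross-validation would be cheaper.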

33. Heightmap Reconstruction of Macula on Color Fundus Images Using Conditional Generative Adversarial Networks
Peyman Tahghighi, Reza A. Zoroofi, Sareh Saffi, Alireza Ramezani
Abstract: For medical diagnosis based on retinal images, a clear understanding of 3D structure is often required, but because the captured images are 2D, this information cannot be inferred from them. However, by using 3D reconstruction methods, we can construct the 3D structure of the macula area in fundus images, which can be helpful for the diagnosis and screening of macular disorders. Recent approaches have used shading information for 3D reconstruction or heightmap prediction, but their output was inaccurate because they ignored the dependency between nearby pixels. Additionally, other methods depend on the availability of more than one image of the eye, which is not available in practice. In this paper, we use conditional generative adversarial networks (cGANs) to generate images that contain height information of the macula area from a fundus image. Results on our dataset show an improvement of 0.6077 in the Structural Similarity Index (SSIM) and of 0.071 in the Mean Squared Error (MSE) metric over the Shape from Shading (SFS) method. Additionally, qualitative studies indicate that our method outperforms recent approaches.
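For reference, the two metrics reported for this paper (SSIM and MSE) can be computed as follows on flattened heightmaps with values in [0, 1]. `global_ssim` evaluates the SSIM formula over the whole image as a single window. This is a simplification of the standard SSIM, which averages over local windows, so its values will generally differ from library implementations. The constants C1 = (0.01·L)² and C2 = (0.03·L)² follow the usual convention, with L the data range.

```python
from typing import Sequence


def mse(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean squared error between two flattened images."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def global_ssim(x: Sequence[float], y: Sequence[float],
                data_range: float = 1.0) -> float:
    """SSIM evaluated over the whole image as one window (simplified)."""
    n = len(x)
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Luminance/contrast/structure terms combined in the usual SSIM form.
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

SSIM is higher-is-better (at most 1), while MSE is lower-is-better. An "improvement" therefore means an increase for SSIM and a decrease for MSE.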

34. Multimodal brain tumor classification
Marvin Lerousseau, Eric Deutsh, Nikos Paragios
Abstract: Cancer is a complex disease that provides various types of information depending on the scale of observation. While most tumor diagnostics are performed by observing histopathological slides, radiology images should yield additional knowledge that improves the efficacy of cancer diagnostics. This work investigates a deep learning method that combines whole slide images and magnetic resonance images to classify tumors. Experiments are prospectively conducted on the 2020 Computational Precision Medicine challenge, in a 3-class imbalanced classification task. We report cross-validation (resp. validation) balanced accuracy, kappa, and F1 of 0.913, 0.897, and 0.951 (resp. 0.91, 0.90, and 0.94). The complete code of the method is open-source at XXXX. It includes histopathological data pre-processing and can therefore be used off-the-shelf for other histopathological and/or radiological classification tasks.
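The metrics reported for this paper can be computed from per-sample labels, as sketched below. The abstract does not say which kappa or F1 variant was used, so unweighted Cohen's kappa and macro-averaged F1 are assumptions here. Balanced accuracy is the mean of the per-class recalls, which makes it suitable for an imbalanced 3-class task.

```python
from collections import Counter
from typing import List, Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean per-class recall; insensitive to class imbalance."""
    recalls: List[float] = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    po = sum(t == p for t, p in zip(y_true, y_pred)) / n
    ct, cp = Counter(y_true), Counter(y_pred)
    pe = sum(ct[c] * cp[c] for c in set(ct) | set(cp)) / (n * n)
    return (po - pe) / (1 - pe)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores: List[float] = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    return sum(scores) / len(scores)
```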

35. Detection-Aware Trajectory Generation for a Drone Cinematographer
Boseong Felipe Jeon, Dongseok Shim, H. Jin Kim
Abstract: This work investigates efficient trajectory generation for chasing a dynamic target while incorporating a detectability objective. The proposed method actively guides the motion of a cinematographer drone so that the color of the target is well distinguished from the colors of the background in the drone's view. To this end, we define a measure of color detectability for a given chasing path. After computing a discrete path optimized for this metric, we generate a dynamically feasible trajectory. The whole pipeline can be updated on the fly to respond to the motion of the target. For efficient discrete path generation, we construct a directed acyclic graph (DAG) whose topological sorting can be determined analytically without depth-first search. The smooth path is obtained within a quadratic programming (QP) framework. We validate the enhanced performance of state-of-the-art object detection and tracking algorithms when the camera drone executes the trajectory obtained from the proposed method.
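One way a DAG can admit an analytic topological order is a layered graph. It has one layer of candidate camera positions per future time step, with edges only from step t to step t + 1. Processing layers in time order is then already a topological sort, and dynamic programming finds the best discrete path. The sketch below assumes that structure, with caller-supplied `node_cost` (for example, negative detectability) and `edge_cost` functions. It is not the paper's exact graph or metric.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def best_layered_path(layers: Sequence[Sequence[Point]],
                      node_cost: Callable[[int, Point], float],
                      edge_cost: Callable[[Point, Point], float]) -> List[Point]:
    """Minimum-cost path through a layered DAG (one layer per time step).

    Edges only go from layer t to layer t + 1, so visiting layers in time
    order is a valid topological order and no depth-first search is needed.
    """
    cost = [node_cost(0, p) for p in layers[0]]
    back: List[List[int]] = []  # back[t - 1][k]: best predecessor of layers[t][k]
    for t in range(1, len(layers)):
        prev = layers[t - 1]
        new_cost: List[float] = []
        ptr: List[int] = []
        for q in layers[t]:
            # Cheapest way to reach candidate q from any node of the previous layer.
            j = min(range(len(prev)), key=lambda i: cost[i] + edge_cost(prev[i], q))
            new_cost.append(cost[j] + edge_cost(prev[j], q) + node_cost(t, q))
            ptr.append(j)
        cost = new_cost
        back.append(ptr)
    # Trace back from the cheapest node in the final layer.
    k = min(range(len(cost)), key=lambda i: cost[i])
    path = [layers[-1][k]]
    for t in range(len(layers) - 1, 0, -1):
        k = back[t - 1][k]
        path.append(layers[t - 1][k])
    return path[::-1]
```

The resulting waypoints form a discrete path only. According to the abstract, a separate QP step then smooths it into a dynamically feasible trajectory.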

36. Fundus Image Analysis for Age Related Macular Degeneration: ADAM-2020 Challenge Report
Sharath M Shankaranarayana
Abstract: Age-related macular degeneration (AMD) is one of the major causes of blindness in the elderly population. In this report, we propose deep learning-based methods for retinal analysis using color fundus images for computer-aided diagnosis of AMD. We leverage recent state-of-the-art deep networks to build an AMD classification pipeline based on a single fundus image. We also propose methods for other directly relevant and auxiliary tasks, such as lesion detection and segmentation, fovea detection, and optic disc segmentation. We propose the use of generative adversarial networks (GANs) for the segmentation and detection tasks. We also propose a novel fovea detection method using GANs.

37. TopoMap: A 0-dimensional Homology Preserving Projection of High-Dimensional Data
Harish Doraiswamy, Julien Tierny, Paulo J. S. Silva, Luis Gustavo Nonato, Claudio Silva
Abstract: Multidimensional projection is a fundamental tool for high-dimensional data analytics and visualization. With very few exceptions, projection techniques are designed to map data from a high-dimensional space to a visual space so as to preserve some dissimilarity (similarity) measure, such as the Euclidean distance. In fact, although they adopt distinct mathematical formulations designed to favor different aspects of the data, most multidimensional projection methods strive to preserve dissimilarity measures that encapsulate geometric properties such as distances or the proximity relations between data objects. However, geometric relations are not the only interesting property to be preserved in a projection. For instance, the analysis of particular structures such as clusters and outliers could be performed more reliably if the mapping process gave some guarantee about topological invariants such as connected components and loops. This paper introduces TopoMap, a novel projection technique that provides topological guarantees during the mapping process. In particular, the proposed method maps a high-dimensional space to a visual space while preserving the 0-dimensional persistence diagram of the Rips filtration of the high-dimensional data, ensuring that the filtrations generate the same connected components when applied to the original and the projected data. The presented case studies show that the topological guarantee provided by TopoMap not only brings confidence to the visual analytic process but can also be used to assist in the assessment of other projection methods.
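The 0-dimensional persistence diagram of a Rips filtration has a simple characterization. Every point is born at scale 0, and the death scales are exactly the edge lengths of a Euclidean minimum spanning tree. This holds because Kruskal's algorithm merges components in the same order as the filtration. The sketch below computes those death values with union-find. It assumes edges enter the filtration at their length; some conventions use half the length. It only computes the diagram that TopoMap preserves, not TopoMap's 2-D layout. The function name `zero_dim_persistence` is illustrative.

```python
import math
from itertools import combinations
from typing import List, Sequence


def zero_dim_persistence(points: Sequence[Sequence[float]]) -> List[float]:
    """Death times of the 0-dim Rips persistence pairs (all births are 0).

    Each edge accepted by Kruskal's algorithm on the complete distance
    graph merges two components, so its length is one death time.
    """
    n = len(points)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths: List[float] = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # edge joins two components: one of them dies at scale d
            parent[ri] = rj
            deaths.append(d)
    return deaths  # n - 1 finite deaths; one component never dies
```

Any 2-D layout whose own Euclidean minimum spanning tree has these same edge lengths yields an identical diagram. This is the sense in which a projection can preserve 0-dimensional topology exactly.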

38. TAP-Net: Transport-and-Pack using Reinforcement Learning
Ruizhen Hu, Juzhan Xu, Bin Chen, Minglun Gong, Hao Zhang, Hui Huang
Abstract: We introduce the transport-and-pack (TAP) problem, a frequently encountered instance of real-world packing, and develop a neural optimization solution based on reinforcement learning. Given an initial spatial configuration of boxes, we seek an efficient method to iteratively transport and pack the boxes compactly into a target container. Due to obstruction and accessibility constraints, our problem adds a new search dimension, i.e., finding an optimal transport sequence, to the already immense search space of packing alone. With a learning-based approach, a trained network can learn and encode solution patterns to guide the solution of new problem instances instead of executing an expensive online search. In our work, we represent the transport constraints using a precedence graph and train a neural network, coined TAP-Net, using reinforcement learning to reward efficient and stable packing. The network is built on an encoder-decoder architecture, where the encoder employs convolution layers to encode the box geometry and precedence graph, and the decoder is a recurrent neural network (RNN) that takes as input the current encoder output and the current box packing state of the target container, and outputs the next box to pack as well as its orientation. We train our network on randomly generated initial box configurations, without supervision, via policy gradients to learn optimal TAP policies that maximize packing efficiency and stability. We demonstrate the performance of TAP-Net on a variety of examples, evaluating the network through ablation studies and comparisons to baselines and alternative network designs. We also show that our network generalizes well to larger problem instances when trained on small inputs.
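The abstract represents the transport constraints with a precedence graph. The sketch below shows one way such a graph can restrict the decoder's choice of the next box. It assumes `precedence[b]` lists the boxes that must be moved before box `b`, for instance boxes stacked on top of it, and masks out infeasible boxes before the softmax. The names and the dict-of-sets encoding are hypothetical; TAP-Net itself encodes the graph with convolution layers.

```python
import math
from typing import Dict, List, Set


def feasible_boxes(precedence: Dict[int, Set[int]], packed: Set[int]) -> List[int]:
    """Boxes not yet packed whose blockers have all been moved already."""
    return sorted(
        b for b, blockers in precedence.items()
        if b not in packed and blockers <= packed
    )


def action_mask(precedence: Dict[int, Set[int]], packed: Set[int], num_boxes: int) -> List[int]:
    """1 for boxes the policy may pick next, 0 otherwise."""
    ok = set(feasible_boxes(precedence, packed))
    return [1 if b in ok else 0 for b in range(num_boxes)]


def masked_softmax(logits: List[float], mask: List[int]) -> List[float]:
    """Softmax over feasible boxes only; infeasible boxes get probability 0."""
    m = max(l for l, ok in zip(logits, mask) if ok)  # for numerical stability
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

Consider box 0 lying under boxes 1 and 2, with box 2 under box 1: `{0: {1, 2}, 1: set(), 2: {1}, 3: set()}`. Only boxes 1 and 3 are initially selectable, however large the network's logit for box 0 is.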

39. Dexterous Robotic Grasping with Object-Centric Visual Affordances
Priyanka Mandikal, Kristen Grauman
Abstract: Dexterous robotic hands are appealing for their agility and human-like morphology, yet their many degrees of freedom make learning to manipulate challenging. We introduce an approach for learning dexterous grasping. Our key idea is to embed an object-centric visual affordance model within a deep reinforcement learning loop to learn grasping policies that favor the same object regions favored by people. Unlike traditional approaches that learn from human demonstration trajectories (e.g., hand joint sequences captured with a glove), the proposed prior is object-centric and image-based, allowing the agent to anticipate useful affordance regions for objects unseen during policy learning. We demonstrate our idea with a 30-DoF five-fingered robotic hand simulator on 40 objects from two datasets, where the agent successfully and efficiently learns policies for stable grasps. Our affordance-guided policies are significantly more effective, generalize better to novel objects, and train 3× faster than the baselines. Our work offers a step towards manipulation agents that learn by watching how people use objects, without requiring state and action information about the human body. Project website: this http URL
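One way an affordance model can enter a reinforcement learning loop is through reward shaping. The sketch below assumes a hypothetical 2D setting:

- the affordance model outputs a per-pixel probability map for the object image;
- fingertip contacts have already been projected to pixel coordinates;
- the grasp-success reward is augmented by the mean affordance at those contacts.

The function names, the projection and the `weight` term are illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def affordance_reward(contact_pixels, affordance_map):
    """Mean affordance probability at the hand's contact points.

    contact_pixels: (k, 2) integer (row, col) coordinates of contacts (hypothetical).
    affordance_map: (H, W) array of per-pixel affordance probabilities in [0, 1].
    """
    if len(contact_pixels) == 0:
        return 0.0
    idx = np.asarray(contact_pixels, dtype=int)
    # Clip so contacts projected just outside the image still index the map.
    rows = np.clip(idx[:, 0], 0, affordance_map.shape[0] - 1)
    cols = np.clip(idx[:, 1], 0, affordance_map.shape[1] - 1)
    return float(affordance_map[rows, cols].mean())


def shaped_reward(success, contact_pixels, affordance_map, weight=0.5):
    """Grasp-success reward plus a weighted affordance bonus (weight is assumed)."""
    return float(success) + weight * affordance_reward(contact_pixels, affordance_map)
```

Because the bonus depends only on the image-based map, it can be computed for objects never seen during policy learning. This is the property the abstract highlights.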

40. Real Image Super Resolution Via Heterogeneous Model using GP-NAS
Zhihong Pan, Baopu Li, Teng Xi, Yanwen Fan, Gang Zhang, Jingtuo Liu, Junyu Han, Errui Ding
Abstract: With advances in deep neural networks (DNNs), recent state-of-the-art (SOTA) image super-resolution (SR) methods have achieved impressive performance using deep residual networks with dense skip connections. These models perform well on benchmark datasets where low-resolution (LR) images are constructed from high-resolution (HR) references with a known blur kernel. Real image SR, where both images in the LR-HR pair are captured with real cameras, is more challenging. Building on existing dense residual networks, we use a Gaussian process based neural architecture search (GP-NAS) scheme to find candidate network architectures in a large search space, varying the number of dense residual blocks, the block size and the number of features. A suite of heterogeneous models with diverse network structures and hyperparameters is selected for model ensembling to achieve outstanding performance in real image SR. The proposed method won first place in all three tracks of the AIM 2020 Real Image Super-Resolution Challenge.
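GP-NAS uses a Gaussian process (GP) as a surrogate that predicts how well an architecture will perform without training it. The sketch below makes three assumptions:

- each candidate is encoded as a numeric vector such as (number of dense residual blocks, block size, number of features);
- the kernel is squared-exponential;
- the prior mean is constant.

The abstract does not give the actual kernel, encoding or score, so these choices and the function names are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def gp_posterior_mean(x_train, y_train, x_query, length_scale=1.0, noise=1e-6):
    """GP posterior mean of a score (e.g. validation PSNR) at unseen architectures.

    x_train: (n, d) encodings of already-evaluated architectures.
    y_train: (n,) their measured scores.
    x_query: (m, d) encodings of candidates to rank.
    """
    x_train = np.asarray(x_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    y = np.asarray(y_train, dtype=float)
    mu = y.mean()  # constant prior mean
    K = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    alpha = np.linalg.solve(K, y - mu)  # solve instead of inverting K explicitly
    return mu + rbf_kernel(x_query, x_train, length_scale) @ alpha
```

In this sketch, the features should be normalized to comparable scales before sharing a single `length_scale`. The candidates with the highest predicted score (for example `np.argsort(-pred)[:k]`) would then be trained. A heterogeneous subset could be ensembled, e.g. by averaging their SR outputs.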

41. An Internal Cluster Validity Index Based on Distance-based Separability Measure
Shuyue Guan, Murray Loew
Abstract: Evaluating clustering results is a significant part of cluster analysis. Because clustering is a typical unsupervised learning task, there are usually no true class labels. Thus, a number of internal evaluations that use only the predicted labels and the data have been created; they are called internal cluster validity indices (CVIs). Without true labels, designing an effective CVI is not simple, because it is similar to creating a clustering method. Having more CVIs is also crucial. No universal CVI can measure all datasets, and there is no specific method for selecting a proper CVI for clusters without true labels. Therefore, applying more CVIs to evaluate clustering results is necessary. In this paper, we propose a novel CVI, called the Distance-based Separability Index (DSI), based on a data separability measure. For comparison, we applied DSI and eight other internal CVIs, ranging from early work by Dunn (1974) to the recent CVDD (2019). We used an external CVI as the ground truth for the clustering results of five clustering algorithms on 12 real and 97 synthetic datasets. Results show that DSI is an effective, unique and competitive CVI compared with the other CVIs. In addition, we summarized the general process for evaluating CVIs and created a new method, rank difference, to compare the results of CVIs.
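DSI builds on a distance-based separability measure. The sketch below assumes the following formulation:

- for each cluster, compare its intra-cluster distances (ICD) with its between-cluster distances (BCD) using a two-sample Kolmogorov-Smirnov (KS) statistic;
- take DSI as the mean of that statistic over clusters.

Values near 1 indicate well-separated clusters. Values near 0 mean the two distance distributions are indistinguishable. The function names are hypothetical, and each cluster is assumed to contain at least two points.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def dsi(X, labels):
    """Mean over clusters of KS(ICD, BCD), using predicted labels only."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))  # full pairwise Euclidean distance matrix
    scores = []
    for c in np.unique(labels):
        inside = np.where(labels == c)[0]
        outside = np.where(labels != c)[0]
        block = D[np.ix_(inside, inside)]
        icd = block[np.triu_indices(len(inside), k=1)]  # each pair counted once
        bcd = D[np.ix_(inside, outside)].ravel()
        scores.append(ks_statistic(icd, bcd))
    return float(np.mean(scores))
```

Like other internal CVIs, this needs no ground-truth labels. In the usage described in the abstract, it would be computed once per clustering result, and the results compared against an external CVI.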

42. When Image Decomposition Meets Deep Learning: A Novel Infrared and Visible Image Fusion Method
Zixiang Zhao, Shuang Xu, Rui Feng, Chunxia Zhang, Junmin Liu, Jiangshe Zhang
Abstract: Infrared and visible image fusion, a hot topic in image processing and image enhancement, aims to produce fused images that retain the detailed texture information of visible images and the thermal radiation information of infrared images. In this paper, we propose a novel two-stream auto-encoder (AE) based fusion network. The core idea is that the encoder decomposes an image into base and detail feature maps carrying low- and high-frequency information, respectively, and the decoder is responsible for reconstructing the original image. To this end, a loss function is designed to make the base feature maps similar and the detail feature maps dissimilar. In the test phase, the base and detail feature maps are merged separately via a fusion module, and the fused image is recovered by the decoder. Qualitative and quantitative results demonstrate that our method generates fused images containing highlighted targets and abundant detailed texture information with strong reproducibility, and that it outperforms state-of-the-art (SOTA) approaches.
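The AE network itself cannot be reproduced from the abstract, but the base/detail idea has a simple classical analogue. The sketch below swaps in a hand-crafted two-scale decomposition:

- a box blur gives the low-frequency base layer;
- the residual is the high-frequency detail layer;
- the two bases are averaged;
- per pixel, the detail with the larger magnitude is kept.

The blur size `k` and both fusion rules are assumptions chosen for illustration. In the paper, learned encoder features and a fusion module play these roles.

```python
import numpy as np


def box_blur(img, k=3):
    """Mean filter with edge padding: the low-frequency 'base' layer."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)


def fuse(ir, vis, k=3):
    """Two-scale fusion of same-sized grayscale infrared and visible images."""
    ir = np.asarray(ir, dtype=float)
    vis = np.asarray(vis, dtype=float)
    base_ir, base_vis = box_blur(ir, k), box_blur(vis, k)
    det_ir, det_vis = ir - base_ir, vis - base_vis
    base = 0.5 * (base_ir + base_vis)  # average the low frequencies
    # Keep whichever source has the stronger high-frequency response per pixel.
    detail = np.where(np.abs(det_ir) >= np.abs(det_vis), det_ir, det_vis)
    return base + detail  # "decoder" here is simply base + detail
```

Fusing an image with itself returns the image unchanged, since base plus detail reconstructs the input exactly. This mirrors the reconstruction role the abstract assigns to the decoder.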

Note: The Chinese text is machine-translated. The cover image is a word cloud of the paper titles.


[arXiv papers] Computer Vision and Pattern Recognition 2020-09-04

Contents

Abstracts