Abstracts

1. Inducing Predictive Uncertainty Estimation for Face Recognition
Weidi Xie, Jeffrey Byrne, Andrew Zisserman
Abstract: Knowing when an output can be trusted is critical for reliably using face recognition systems. While there has been enormous effort in recent research on improving face verification performance, understanding when a model's predictions should or should not be trusted has received far less attention. Our goal is to assign a confidence score for a face image that reflects its quality in terms of recognizable information. To this end, we propose a method for generating image quality training data automatically from 'mated-pairs' of face images, and use the generated data to train a lightweight Predictive Confidence Network, termed as PCNet, for estimating the confidence score of a face image. We systematically evaluate the usefulness of PCNet with its error versus reject performance, and demonstrate that it can be universally paired with and improve the robustness of any verification model. We describe three use cases on the public IJB-C face verification benchmark: (i) to improve 1:1 image-based verification error rates by rejecting low-quality face images; (ii) to improve quality score based fusion performance on the 1:1 set-based verification benchmark; and (iii) its use as a quality measure for selecting high quality (unblurred, good lighting, more frontal) faces from a collection, e.g. for automatic enrolment or display.
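The error versus reject evaluation mentioned above can be illustrated with a minimal sketch. It assumes each sample has one confidence score and one flag saying whether the verification decision was wrong; the function name `error_vs_reject` and the toy data are hypothetical and are not taken from the paper.

```python
def error_vs_reject(confidences, is_error, reject_fractions):
    """Return the error rate among kept samples for each reject fraction.

    Hypothetical helper: the least confident samples are rejected first.
    """
    # Sort sample indices from least to most confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    curve = []
    for frac in reject_fractions:
        n_reject = int(frac * len(order))
        kept = order[n_reject:]
        if not kept:
            # Nothing left to evaluate, so report zero error.
            curve.append(0.0)
            continue
        errors = sum(1 for i in kept if is_error[i])
        curve.append(errors / len(kept))
    return curve


# Toy data (made up for illustration): low confidence tends to mean an error.
confidences = [0.9, 0.2, 0.8, 0.1, 0.7, 0.95, 0.3, 0.6]
is_error = [False, True, False, True, False, False, True, False]
curve = error_vs_reject(confidences, is_error, [0.0, 0.25, 0.5])
```

If the confidence score is informative, the error rate should fall as the reject fraction grows, which is the behaviour this toy data is built to show.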

2. A Short Review on Data Modelling for Vector Fields
Jun Li, Wanrong Hong, Yusheng Xiang
Abstract: Machine learning methods based on statistical principles have proven highly successful in dealing with a wide variety of data analysis and analytics tasks. Traditional data models are mostly concerned with independent identically distributed data. The recent success of end-to-end modelling scheme using deep neural networks equipped with effective structures such as convolutional layers or skip connections allows the extension to more sophisticated and structured practical data, such as natural language, images, videos, etc. On the application side, vector fields are an extremely useful type of data in empirical sciences, as well as signal processing, e.g. non-parametric transformations of 3D point clouds using 3D vector fields, the modelling of the fluid flow in earth science, and the modelling of physical fields. This review article is dedicated to recent computational tools of vector fields, including vector data representations, predictive model of spatial data, as well as applications in computer vision, signal processing, and empirical sciences.

3. MORPH-DSLAM: Model Order Reduction for PHysics-based Deformable SLAM
Alberto Badias, Iciar Alfaro, David Gonzalez, Francisco Chinesta, Elias Cueto
Abstract: We propose a new methodology to estimate the 3D displacement field of deformable objects from video sequences using standard monocular cameras. We solve in real time the complete (possibly visco-)hyperelasticity problem to properly describe the strain and stress fields that are consistent with the displacements captured by the images, constrained by real physics. We do not impose any ad-hoc prior or energy minimization in the external surface, since the real and complete mechanics problem is solved. This means that we can also estimate the internal state of the objects, even in occluded areas, just by observing the external surface and the knowledge of material properties and geometry. Solving this problem in real time using a realistic constitutive law, usually non-linear, is out of reach for current systems. To overcome this difficulty, we solve off-line a parametrized problem that considers each source of variability in the problem as a new parameter and, consequently, as a new dimension in the formulation. Model Order Reduction methods allow us to reduce the dimensionality of the problem, and therefore, its computational cost, while preserving the visualization of the solution in the high-dimensionality space. This allows an accurate estimation of the object deformations, improving also the robustness in the 3D points estimation.

4. A Primer on Motion Capture with Deep Learning: Principles, Pitfalls and Perspectives
Alexander Mathis, Steffen Schneider, Jessy Lauer, Mackenzie W. Mathis
Abstract: Extracting behavioral measurements non-invasively from video is stymied by the fact that it is a hard computational problem. Recent advances in deep learning have tremendously advanced predicting posture from videos directly, which quickly impacted neuroscience and biology more broadly. In this primer we review the budding field of motion capture with deep learning. In particular, we will discuss the principles of those novel algorithms, highlight their potential as well as pitfalls for experimentalists, and provide a glimpse into the future.

5. A High-Level Description and Performance Evaluation of Pupil Invisible
Marc Tonsen, Chris Kay Baumann, Kai Dierkes
Abstract: Head-mounted eye trackers promise convenient access to reliable gaze data in unconstrained environments. Due to several limitations, however, often they can only partially deliver on this promise. Among those are the following: (i) the necessity of performing a device setup and calibration prior to every use of the eye tracker, (ii) a lack of robustness of gaze-estimation results against perturbations, such as outdoor lighting conditions and unavoidable slippage of the eye tracker on the head of the subject, and (iii) behavioral distortion resulting from social awkwardness, due to the unnatural appearance of current head-mounted eye trackers. Recently, Pupil Labs released Pupil Invisible glasses, a head-mounted eye tracker engineered to tackle these limitations. Here, we present an extensive evaluation of its gaze-estimation capabilities. To this end, we designed a data-collection protocol and evaluation scheme geared towards providing a faithful portrayal of the real-world usage of Pupil Invisible glasses. In particular, we develop a geometric framework for gauging gaze-estimation accuracy that goes beyond reporting mean angular accuracy. We demonstrate that Pupil Invisible glasses, without the need of a calibration, provide gaze estimates which are robust to perturbations, including outdoor lighting conditions and slippage of the headset.

6. DropLeaf: a precision farming smartphone application for measuring pesticide spraying methods
Bruno Brandoli, Gabriel Spadon, Travis Esau, Patrick Hennessy, Andre C. P. L. Carvalho, Jose F. Rodrigues-Jr, Sihem Amer-Yahia
Abstract: Pesticide application has been heavily used in the cultivation of major crops, contributing to the increase of crop production over the past decades. However, their appropriate use and calibration of machines rely upon evaluation methodologies that can precisely estimate how well the pesticides' spraying covered the crops. A few strategies have been proposed in former works, yet their elevated costs and low portability do not permit their wide adoption. This work introduces and experimentally assesses a novel tool that functions over a smartphone-based mobile application, named DropLeaf - Spraying Meter. Tests performed using DropLeaf demonstrated that, notwithstanding its versatility, it can estimate the pesticide spraying with high precision. Our methodology is based on image analysis, and the assessment of spraying deposition measures is performed successfully over real and synthetic water-sensitive papers. The proposed tool can be extensively used by farmers and agronomists furnished with regular smartphones, improving the utilization of pesticides with well-being, ecological, and monetary advantages. DropLeaf can be easily used for spray drift assessment of different methods, including emerging UAV (Unmanned Aerial Vehicle) sprayers.

7. Multimodal Aggregation Approach for Memory Vision-Voice Indoor Navigation with Meta-Learning
Liqi Yan, Dongfang Liu, Yaoxian Song, Changbin Yu
Abstract: Vision and voice are two vital keys for agents' interaction and learning. In this paper, we present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands and analyzes multimodal information from visual observation in order to enhance robots' understanding of their environment. We make use of single RGB images taken by a first-view monocular camera. We also apply a self-attention mechanism to keep the agent focused on key areas. Memory is important for the agent to avoid repeating certain tasks unnecessarily and to adapt adequately to new scenes; therefore, we make use of meta-learning. We have experimented with various functional features extracted from visual observation. Comparative experiments show that our methods outperform state-of-the-art baselines.

8. LiftFormer: 3D Human Pose Estimation using attention models
Adrian Llopart
Abstract: Estimating the 3D position of human joints has become a widely researched topic in recent years. Special emphasis has gone into defining novel methods that extrapolate 2-dimensional data (keypoints) into 3D, namely predicting the root-relative coordinates of joints associated with human skeletons. The latest research trends have shown that Transformer encoder blocks aggregate temporal information significantly better than previous approaches. Thus, we propose the usage of these models to obtain more accurate 3D predictions by leveraging temporal information using attention mechanisms on ordered sequences of human poses in videos. On Human3.6M, our method consistently outperforms the previous best results from the literature, both when using 2D keypoint predictors, by 0.3 mm (MPJPE: 44.8, 0.7% improvement), and when using ground truth inputs, by 2 mm (MPJPE: 31.9, 8.4% improvement). It also achieves state-of-the-art performance on the HumanEva-I dataset with 10.5 P-MPJPE (22.2% reduction). The number of parameters in our model is easily tunable and is smaller (9.5M) than in current methodologies (16.95M and 11.25M) whilst still having better performance. Thus, our 3D lifting model's accuracy exceeds that of other end-to-end or SMPL approaches and is comparable to many multi-view methods.
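For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints, usually computed on root-relative coordinates. The sketch below is a minimal illustration under that standard definition; the helper names, the choice of joint 0 as the root, and the toy poses are assumptions, and the Procrustes-aligned variant (P-MPJPE) is not shown.

```python
import math


def root_relative(pose, root_index=0):
    """Express every joint relative to the root joint (assumed to be index 0)."""
    rx, ry, rz = pose[root_index]
    return [(x - rx, y - ry, z - rz) for x, y, z in pose]


def mpjpe(pred, target):
    """Mean Euclidean distance over joints; poses are lists of (x, y, z)."""
    if len(pred) != len(target):
        raise ValueError("poses must have the same number of joints")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


# Toy three-joint poses in millimetres (made up for illustration).
gt = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0), (50.0, 100.0, 0.0)]
pred = [(10.0, 0.0, 0.0), (13.0, 104.0, 0.0), (60.0, 100.0, 0.0)]
error_mm = mpjpe(root_relative(pred), root_relative(gt))
```

In this toy case the constant 10 mm offset of the prediction disappears after root alignment, so only the second joint contributes (5 mm), giving an MPJPE of 5/3 mm.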

9. 3D-DEEP: 3-Dimensional Deep-learning based on elevation patterns for road scene interpretation
A. Hernández, S. Woo, H. Corrales, I. Parra, E. Kim, D. F. Llorca, M. A. Sotelo
Abstract: Road detection and segmentation is a crucial task in computer vision for safe autonomous driving. With this in mind, a new net architecture (3D-DEEP) and its end-to-end training methodology for CNN-based semantic segmentation are described in this paper. The method relies on disparity-filtered and LiDAR-projected images for three-dimensional information, and on image feature extraction through fully convolutional network architectures. The developed models were trained and validated on the Cityscapes dataset, using only the finely annotated examples with 19 different training classes, and on the KITTI road dataset. A mean intersection over union (mIoU) of 72.32% has been obtained for the 19 Cityscapes training classes using the validation images. On the KITTI dataset, the model has achieved an F1 score of 97.85% in validation and 96.02% using the test images.
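The mIoU figure quoted above follows the usual definition: per-class intersection over union, averaged over classes. A minimal sketch on flat label lists is given below; the function name and the toy labels are hypothetical, and classes absent from both prediction and ground truth are skipped here, which is one common convention rather than something the abstract specifies.

```python
def mean_iou(pred_labels, true_labels, num_classes):
    """Average per-class IoU over classes that occur in either label list."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred_labels, true_labels):
        if p == t:
            # True positive: counts once in both intersection and union.
            intersection[p] += 1
            union[p] += 1
        else:
            # False positive for class p, false negative for class t.
            union[p] += 1
            union[t] += 1
    ious = [intersection[c] / union[c] for c in range(num_classes) if union[c] > 0]
    if not ious:
        raise ValueError("no class is present in the inputs")
    return sum(ious) / len(ious)


# Toy per-pixel labels for three classes (made up for illustration).
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred_labels, true_labels, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.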

10. Uncovering Hidden Challenges in Query-Based Video Moment Retrieval
Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä
Abstract: Query-based moment retrieval is the problem of localising a specific clip in an untrimmed video according to a query sentence. This is a challenging task that requires interpretation of both the natural language query and the video content. As in many other areas of computer vision and machine learning, progress in query-based moment retrieval is heavily driven by the benchmark datasets and, therefore, their quality has a significant impact on the field. In this paper, we present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task. Our results indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models. Moreover, we present new sanity check experiments and approaches for visualising the results. Finally, we suggest possible directions to improve temporal sentence grounding in the future. Our code for this paper is available at this https URL.

11. Active Deep Densely Connected Convolutional Network for Hyperspectral Image Classification
Bing Liu, Anzhu Yu, Pengqiang Zhang, Lei Ding, Wenyue Guo, Kuiliang Gao, Xibing Zuo
Abstract: Deep learning based methods have seen a massive rise in popularity for hyperspectral image classification over the past few years. However, the success of deep learning is attributed greatly to numerous labeled samples. It is still very challenging to use only a few labeled samples to train deep learning models to reach a high classification accuracy. An active deep-learning framework trained in an end-to-end manner is, therefore, proposed in this paper in order to minimize hyperspectral image classification costs. First, a deep densely connected convolutional network is considered for hyperspectral image classification. Unlike traditional active learning methods, an additional network is added to the designed deep densely connected convolutional network to predict the loss of input samples. Then, the additional network can be used to suggest unlabeled samples for which the deep densely connected convolutional network is more likely to produce a wrong label. Note that the additional network uses the intermediate features of the deep densely connected convolutional network as input. Therefore, the proposed method is an end-to-end framework. Subsequently, a few of the selected samples are labelled manually and added to the training samples. The deep densely connected convolutional network is then trained using the new training set. Finally, the steps above are repeated to train the whole framework iteratively. Extensive experiments illustrate that the proposed method can reach a high classification accuracy after selecting just a few samples.

12. PIDNet: An Efficient Network for Dynamic Pedestrian Intrusion Detection
Jingchen Sun, Jiming Chen, Tao Chen, Jiayuan Fan, Shibo He
Abstract: Vision-based dynamic pedestrian intrusion detection (PID), judging whether pedestrians intrude an area-of-interest (AoI) by a moving camera, is an important task in mobile surveillance. The dynamically changing AoIs and a number of pedestrians in video frames increase the difficulty and computational complexity of determining whether pedestrians intrude the AoI, which makes previous algorithms incapable of this task. In this paper, we propose a novel and efficient multi-task deep neural network, PIDNet, to solve this problem. PIDNet is mainly designed by considering two factors: accurately segmenting the dynamically changing AoIs from a video frame captured by the moving camera and quickly detecting pedestrians from the generated AoI-contained areas. Three efficient network designs are proposed and incorporated into PIDNet to reduce the computational complexity: 1) a special PID task backbone for feature sharing, 2) a feature cropping module for feature cropping, and 3) a lighter detection branch network for feature compression. In addition, considering there are no public datasets and benchmarks in this field, we establish a benchmark dataset to evaluate the proposed network and give the corresponding evaluation metrics for the first time. Experimental results show that PIDNet can achieve 67.1% PID accuracy and 9.6 fps inference speed on the proposed dataset, which serves as a good baseline for the future vision-based dynamic PID study.

13. Generalized Zero-Shot Learning via VAE-Conditioned Generative Flow
Yu-Chao Gu, Le Zhang, Yun Liu, Shao-Ping Lu, Ming-Ming Cheng
Abstract: Generalized zero-shot learning (GZSL) aims to recognize both seen and unseen classes by transferring knowledge from semantic descriptions to visual representations. Recent generative methods formulate GZSL as a missing data problem and mainly adopt GANs or VAEs to generate visual features for unseen classes. However, GANs often suffer from instability, and VAEs can only optimize the lower bound on the log-likelihood of observed data. To overcome the above limitations, we resort to generative flows, a family of generative models with the advantage of accurate likelihood estimation. More specifically, we propose a conditional version of generative flows for GZSL, i.e., VAE-Conditioned Generative Flow (VAE-cFlow). By using a VAE, the semantic descriptions are first encoded into tractable latent distributions, conditioned on which the generative flow optimizes the exact log-likelihood of the observed visual features. We ensure the conditional latent distribution is both semantically meaningful and inter-class discriminative by i) adopting the VAE reconstruction objective, ii) releasing the zero-mean constraint in VAE posterior regularization, and iii) adding a classification regularization on the latent variables. Our method achieves state-of-the-art GZSL results on five well-known benchmark datasets, with an especially significant improvement in the large-scale setting. Code is released at this https URL.

14. To augment or not to augment? Data augmentation in user identification based on motion sensors
Cezara Benegui, Radu Tudor Ionescu
Abstract: Nowadays, commonly-used authentication systems for mobile device users, e.g. password checking, face recognition or fingerprint scanning, are susceptible to various kinds of attacks. In order to prevent some of the possible attacks, these explicit authentication systems can be enhanced by considering a two-factor authentication scheme, in which the second factor is an implicit authentication system based on analyzing motion sensor data captured by accelerometers or gyroscopes. In order to avoid any additional burdens to the user, the registration process of the implicit authentication system must be performed quickly, i.e. the number of data samples collected from the user is typically small. In the context of designing a machine learning model for implicit user authentication based on motion signals, data augmentation can play an important role. In this paper, we study several data augmentation techniques in the quest of finding useful augmentation methods for motion sensor data. We propose a set of four research questions related to data augmentation in the context of few-shot user identification based on motion sensor signals. We conduct experiments on a benchmark data set, using two deep learning architectures, convolutional neural networks and Long Short-Term Memory networks, showing which and when data augmentation methods bring accuracy improvements. Interestingly, we find that data augmentation is not very helpful, most likely because the signal patterns useful to discriminate users are too sensitive to the transformations brought by certain data augmentation techniques. This result is somewhat contradictory to the common belief that data augmentation is expected to increase the accuracy of machine learning models.

15. Multi-channel Transformers for Multi-articulatory Sign Language Translation
Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden
Abstract: Sign languages use multiple asynchronous information channels (articulators), not just the hands but also the face and body, which computational approaches often ignore. In this paper we tackle the multi-articulatory sign language translation task and propose a novel multi-channel transformer architecture. The proposed architecture allows both the inter and intra contextual relationships between different sign articulators to be modelled within the transformer network itself, while also maintaining channel specific information. We evaluate our approach on the RWTH-PHOENIX-Weather-2014T dataset and report competitive translation performance. Importantly, we overcome the reliance on gloss annotations which underpin other state-of-the-art approaches, thereby removing future need for expensive curated datasets.

16. Personalization in Human Activity Recognition
Anna Ferrari, Daniela Micucci, Marco Mobilio, Paolo Napoletano
Abstract: In recent years there has been a growing interest in techniques able to automatically recognize activities performed by people. This field is known as Human Activity Recognition (HAR). HAR can be crucial in monitoring people's wellbeing, with special regard to the elderly population and those affected by degenerative conditions. One of the main challenges concerns the diversity of the population and how the same activities can be performed in different ways due to physical characteristics and lifestyle. In this paper we explore the possibility of exploiting physical characteristics and signal similarity to achieve better results than deep learning classifiers that do not rely on this information.

17. Distinctive 3D local deep descriptors
Fabio Poiesi, Davide Boscaini
Abstract: We present a simple yet effective method for learning distinctive 3D local deep descriptors (DIPs) that can be used to register point clouds without requiring an initial alignment. Point cloud patches are extracted, canonicalised with respect to their estimated local reference frame and encoded into rotation-invariant compact descriptors by a PointNet-based deep neural network. DIPs can effectively generalise across different sensor modalities because they are learnt end-to-end from locally and randomly sampled points. Because DIPs encode only local geometric information, they are robust to clutter, occlusions and missing regions. We evaluate and compare DIPs against alternative hand-crafted and deep descriptors on several indoor and outdoor datasets consisting of point clouds reconstructed using different sensors. Results show that DIPs (i) achieve comparable results to the state-of-the-art on RGB-D indoor scenes (3DMatch dataset), (ii) outperform state-of-the-art by a large margin on laser-scanner outdoor scenes (ETH dataset), and (iii) generalise to indoor scenes reconstructed with the Visual-SLAM system of Android ARCore. Source code: this https URL.

18. Temporal Continuity Based Unsupervised Learning for Person Re-Identification
Usman Ali, Bayram Bayramli, Hongtao Lu
Abstract: Person re-identification (re-id) aims to match the same person across images taken by multiple cameras. Most existing person re-id methods require a large amount of identity-labeled data to act as a discriminative guideline for representation learning. The difficulty of manually collecting identity-labeled data leads to poor adaptability in practical scenarios. To overcome this problem, we propose an unsupervised center-based clustering approach capable of progressively learning and exploiting the underlying re-id discriminative information from temporal continuity within a camera. We call our framework Temporal Continuity based Unsupervised Learning (TCUL). Specifically, TCUL simultaneously performs center-based clustering of the unlabeled (target) dataset and fine-tunes a convolutional neural network (CNN) pre-trained on an irrelevant labeled (source) dataset to enhance the discriminative capability of the CNN for the target dataset. Furthermore, it exploits the temporally continuous nature of images within a camera jointly with the spatial similarity of feature maps across cameras to generate reliable pseudo-labels for training a re-identification model. As training progresses, the number of reliable samples keeps growing adaptively, which in turn boosts the representation ability of the CNN. Extensive experiments on three large-scale person re-id benchmark datasets are conducted to compare our framework with state-of-the-art techniques, and they demonstrate the superiority of TCUL over existing methods.

19. Heatmap Regression via Randomized Rounding
Baosheng Yu, Dacheng Tao
Abstract: Heatmap regression has become the mainstream methodology for deep learning-based semantic landmark localization, including in facial landmark localization and human pose estimation. Though heatmap regression is robust to large variations in pose, illumination, and occlusion in unconstrained settings, it usually suffers from a sub-pixel localization problem. Specifically, considering that the activation point indices in heatmaps are always integers, quantization error thus appears when using heatmaps as the representation of numerical coordinates. Previous methods to overcome the sub-pixel localization problem usually rely on high-resolution heatmaps. As a result, there is always a trade-off between achieving localization accuracy and computational cost, where the computational complexity of heatmap regression depends on the heatmap resolution in a quadratic manner. In this paper, we formally analyze the quantization error of vanilla heatmap regression and propose a simple yet effective quantization system to address the sub-pixel localization problem. The proposed quantization system induced by the randomized rounding operation 1) encodes the fractional part of numerical coordinates into the ground truth heatmap using a probabilistic approach during training; and 2) decodes the predicted numerical coordinates from a set of activation points during testing. We prove that the proposed quantization system for heatmap regression is unbiased and lossless. Experimental results on four popular facial landmark localization datasets (WFLW, 300W, COFW, and AFLW) demonstrate the effectiveness of the proposed method for efficient and accurate semantic landmark localization. Code is available at this http URL.
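
The two steps named in the abstract (probabilistic encoding of the fractional part, then decoding from a set of activation points) can be illustrated in one dimension. This is a minimal sketch and not the authors' code: the function names are hypothetical, and decoding as a weighted average of activation indices is an assumption made for illustration.

```python
import math
import random


def randomized_round(x, rng):
    """Round x down or up at random so that the expected result equals x."""
    lower = math.floor(x)
    frac = x - lower
    # Round up with probability equal to the fractional part.
    return lower + (1 if rng.random() < frac else 0)


def expected_rounding(x):
    """Closed-form expectation of randomized_round(x)."""
    lower = math.floor(x)
    frac = x - lower
    return (1.0 - frac) * lower + frac * (lower + 1)


def decode(activations):
    """Recover a coordinate as the weighted average of integer indices.

    `activations` maps an integer index to its heatmap value (assumed layout).
    """
    total = sum(activations.values())
    return sum(index * weight for index, weight in activations.items()) / total


rng = random.Random(0)
x = 12.3
samples = [randomized_round(x, rng) for _ in range(20000)]
estimate = sum(samples) / len(samples)
```

Each sample is an integer index, yet the average over many samples approaches the fractional coordinate, which is the sense in which such an encoding is unbiased.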

20. Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition
Yang Liu, Guanbin Li, Liang Lin
Abstract: Existing vision-based action recognition is susceptible to occlusion and appearance variations, while wearable sensors can alleviate these challenges by capturing human motion with one-dimensional time-series signals (e.g. acceleration, gyroscope and orientation). For the same action, the knowledge learned from vision sensors (videos or images) and wearable sensors may be related and complementary. However, there exists a significantly large modality difference between action data captured by wearable sensors and vision sensors in data dimension, data distribution and inherent information content. In this paper, we propose a novel framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos) by adaptively transferring and distilling the knowledge from multiple wearable sensors. SAKDN uses multiple wearable sensors as teacher modalities and RGB videos as the student modality. Specifically, we transform one-dimensional time-series signals of wearable sensors to two-dimensional images by designing a Gramian angular field based virtual image generation model. Then, we build a novel Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM) to adaptively fuse intermediate representation knowledge from different teacher networks. To fully exploit and transfer the knowledge of multiple well-trained teacher networks to the student network, we propose a novel Graph-guided Semantically Discriminative Mapping (GSDM) loss, which utilizes graph-guided ablation analysis to produce a good visual explanation highlighting the important regions across modalities and concurrently preserving the interrelations of original data. Experimental results on the Berkeley-MHAD, UTD-MHAD and MMAct datasets demonstrate the effectiveness of our proposed SAKDN.
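
The step that turns a one-dimensional sensor signal into a two-dimensional image can be sketched with a Gramian angular field. The abstract does not say which variant is used, so the summation form below (entries cos(phi_i + phi_j) with phi = arccos(x)) is an assumption, and the function names are hypothetical.

```python
import math


def rescale(series):
    """Linearly rescale a series to the interval [-1, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [2.0 * (value - lo) / (hi - lo) - 1.0 for value in series]


def gramian_angular_field(series):
    """Return the summation Gramian angular field as a list of rows.

    With phi = arccos(x), each entry is
    cos(phi_i + phi_j) = x_i * x_j - sin(phi_i) * sin(phi_j).
    """
    x = rescale(series)
    # sin(arccos(x)) = sqrt(1 - x^2); the clamp guards against rounding error.
    s = [math.sqrt(max(0.0, 1.0 - value * value)) for value in x]
    n = len(x)
    return [[x[i] * x[j] - s[i] * s[j] for j in range(n)] for i in range(n)]


signal = [0.0, 1.0, 2.0, 3.0, 4.0]
image = gramian_angular_field(signal)
```

A signal of length n yields an n-by-n symmetric matrix, which is what lets an image-based network consume wearable-sensor data.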

21. RangeRCNN: Towards Fast and Accurate 3D Object Detection with Range Image Representation
Zhidong Liang, Ming Zhang, Zehan Zhang, Xian Zhao, Shiliang Pu
Abstract: We present RangeRCNN, a novel and effective 3D object detection framework based on the range image representation. Most existing 3D object detection methods are either voxel-based or point-based. Though several optimizations have been introduced to ease the sparsity issue and speed up the running time, the two representations are still computationally inefficient. Compared to these two representations, the range image representation is dense and compact, so it can exploit powerful 2D convolution and avoid the uncertain receptive field caused by the sparsity issue. Even so, the range image representation is not preferred in 3D object detection due to scale variation and occlusion. In this paper, we utilize the dilated residual block to better adapt to different object scales and obtain a more flexible receptive field on the range image. Considering the scale variation and occlusion of the range image, we propose the RV-PV-BEV (Range View to Point View to Bird's Eye View) module to transfer features from the range view to the bird's eye view. The anchor is defined in the BEV space, which avoids the scale variation and occlusion. Neither RV nor BEV can provide enough information for height estimation, so we propose a two-stage RCNN for better 3D detection performance. The aforementioned point view not only serves as a bridge from RV to BEV but also provides pointwise features for the RCNN. Extensive experiments show that the proposed RangeRCNN achieves state-of-the-art performance on the KITTI 3D object detection dataset. We show that range image based methods can be effective on the KITTI dataset, which provides more possibilities for real-time 3D object detection.

22. Deep Ice Layer Tracking and Thickness Estimation using Fully Convolutional Networks
Maryam Rahnemoonfar, Debvrat Varshney, Masoud Yari, John Paden
Abstract: Global warming is rapidly reducing glaciers and ice sheets across the world. Real time assessment of this reduction is required so as to monitor its global climatic impact. In this paper, we introduce a novel way of estimating the thickness of each internal ice layer using Snow Radar images and Fully Convolutional Networks. The estimated thickness can be analysed to understand snow accumulation each year. To understand the depth and structure of each internal ice layer, we carry out a set of image processing techniques and perform semantic segmentation on the radar images. After detecting each ice layer uniquely, we calculate its thickness and compare it with the available ground truth. Through this procedure we were able to estimate the ice layer thicknesses within a Mean Absolute Error of approximately 3.6 pixels. Such a Deep Learning based method can be used with ever-increasing datasets to make accurate assessments for cryospheric studies.

23. Object Detection-Based Variable Quantization Processing
Likun Liu, Hua Qi
Abstract: In this paper, we propose a preprocessing method for conventional image and video encoders that can make these existing encoders content-aware. By going through our process, a higher quality parameter could be set on a traditional encoder without increasing the output size. A still frame or an image will firstly go through an object detector. Either the properties of the detection result will decide the parameters of the following procedures, or the system will be bypassed if no object is detected in the given frame. The processing method utilizes an adaptive quantization process to determine the portion of data to be dropped. This method is primarily based on the JPEG compression theory and is optimum for JPEG-based encoders such as JPEG encoders and the Motion JPEG encoders. However, other DCT-based encoders like MPEG part 2, H.264, etc. can also benefit from this method. In the experiments, we compare the MS-SSIM under the same bitrate as well as similar MS-SSIM but enhanced bitrate. As this method is based on human perception, even with similar MS-SSIM, the overall watching experience will be better than the direct encoded ones.

24. Utilizing Satellite Imagery Datasets and Machine Learning Data Models to Evaluate Infrastructure Change in Undeveloped Regions
Kyle McCullough, Andrew Feng, Meida Chen, Ryan McAlinden
Abstract: In the globalized economic world, it has become important to understand the purpose behind infrastructural and construction initiatives occurring within developing regions of the earth. This is critical when the financing for such projects must come from external sources, as is occurring throughout major portions of the African continent. When it comes to imagery analysis to research these regions, ground and aerial coverage is either non-existent or not commonly acquired. However, imagery from a large number of commercial, private, and government satellites has produced enormous datasets with global coverage, compiling geospatial resources that can be mined and processed using machine learning algorithms and neural networks. The downside is that a majority of these geospatial data resources are in a state of technical stasis, as it is difficult to quickly parse and determine a plan for request and processing when acquiring satellite image data. A goal of this research is to allow automated monitoring of large-scale infrastructure projects, such as railways, to determine reliable metrics that define and predict the direction construction initiatives could take, allowing for directed monitoring via narrowed and targeted satellite imagery requests. By utilizing photogrammetric techniques on available satellite data to create 3D Meshes and Digital Surface Models (DSM) we hope to effectively predict transport routes. In understanding the potential directions that large-scale transport infrastructure will take through predictive modeling, it becomes much easier to track, understand, and monitor progress, especially in areas with limited imagery coverage.

25. Automatic Radish Wilt Detection Using Image Processing Based Techniques and Machine Learning Algorithm
Asif Ashraf Patankar, Hyeonjoon Moon
Abstract: Image processing, computer vision, and pattern recognition have been playing a vital role in diverse agricultural applications, such as species detection, recognition, classification, identification, plant growth stages, plant disease detection, and many more. On the other hand, there is a growing need to capture high resolution images using unmanned aerial vehicles (UAV) and to develop better algorithms in order to obtain highly accurate, to-the-point results. In this paper, we propose a segmentation and extraction-based technique to detect fusarium wilt in radish crops. Recent wilt detection algorithms are based either on image processing techniques or on conventional machine learning algorithms. However, our methodology is based on a hybrid algorithm, which combines image processing and machine learning. First, the crop image is divided into three segments, viz., healthy vegetation, ground, and packing material. Based on the HSV decision tree algorithm, all three segments are segregated from the image. Second, the extracted segments are summed together into an empty canvas of the same resolution as the image and one new image is produced. Third, this new image is compared with the original image, and a final noisy image, which contains traces of wilt, is extracted. Finally, a k-means algorithm is applied to eliminate the noise and to extract the wilt accurately from it. Moreover, the extracted wilt is mapped onto the original image using the contouring method. The proposed combination of algorithms detects the wilt appropriately, which surpasses the traditional practice of separately using image processing techniques or machine learning.

26. LodoNet: A Deep Neural Network with 2D Keypoint Matching for 3D LiDAR Odometry Estimation
Ce Zheng, Yecheng Lyu, Ming Li, Ziming Zhang
Abstract: Deep learning based LiDAR odometry (LO) estimation attracts increasing research interest in the field of autonomous driving and robotics. Existing works feed consecutive LiDAR frames into neural networks as point clouds and match pairs in the learned feature space. In contrast, motivated by the success of image based feature extractors, we propose to transfer the LiDAR frames to image space and reformulate the problem as image feature extraction. With the help of scale-invariant feature transform (SIFT) for feature extraction, we are able to generate matched keypoint pairs (MKPs) that can be precisely returned to the 3D space. A convolutional neural network pipeline is designed for LiDAR odometry estimation from the extracted MKPs. The proposed scheme, namely LodoNet, is then evaluated on the KITTI odometry estimation benchmark, achieving results on par with or even better than the state of the art.

27. A Review of Single-Source Deep Unsupervised Visual Domain Adaptation
Sicheng Zhao, Xiangyu Yue, Shanghang Zhang, Bo Li, Han Zhao, Bichen Wu, Ravi Krishna, Joseph E. Gonzalez, Alberto L. Sangiovanni-Vincentelli, Sanjit A. Seshia, Kurt Keutzer
Abstract: Large-scale labeled training datasets have enabled deep neural networks to excel across a wide range of benchmark vision tasks. However, in many applications, it is prohibitively expensive and time-consuming to obtain large quantities of labeled data. To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled or unlabeled target domain. Unfortunately, direct transfer across domains often performs poorly due to the presence of domain shift or dataset bias. Domain adaptation is a machine learning paradigm that aims to learn a model from a source domain that can perform well on a different (but related) target domain. In this paper, we review the latest single-source deep unsupervised domain adaptation methods focused on visual tasks and discuss new perspectives for future research. We begin with the definitions of different domain adaptation strategies and the descriptions of existing benchmark datasets. We then summarize and compare different categories of single-source unsupervised domain adaptation methods, including discrepancy-based methods, adversarial discriminative methods, adversarial generative methods, and self-supervision-based methods. Finally, we discuss future research directions with challenges and possible solutions.

28. GIF: Generative Interpretable Faces
Partha Ghosh, Pravir Singh Gupta, Roy Uziel, Anurag Ranjan, Michael Black, Timo Bolkart
Abstract: Photo-realistic visualization and animation of expressive human faces have been a long-standing challenge. On one end of the spectrum, 3D face modeling methods provide parametric control but tend to generate unrealistic images, while on the other end, generative 2D models like GANs (Generative Adversarial Networks) output photo-realistic face images, but lack explicit control. Recent methods gain partial control, either by attempting to disentangle different factors in an unsupervised manner, or by adding control post hoc to a pre-trained model. GANs trained without pre-defined control, however, may entangle factors that are hard to undo later. To guarantee some disentanglement that provides us with the desired kinds of control, we train our generative model conditioned on pre-defined control parameters. Specifically, we condition StyleGAN2 on FLAME, a generative 3D face model. However, we found that naive conditioning on FLAME parameters yields rather unsatisfactory results. Instead, we render the geometry and photometric details of the FLAME mesh and use these for conditioning. This gives us a generative 2D face model named GIF (Generative Interpretable Faces) that shares FLAME's parametric control. Given FLAME parameters for shape, pose, and expressions, parameters for appearance and lighting, and an additional style vector, GIF outputs photo-realistic face images. To evaluate how well GIF follows its conditioning and the impact of different design choices, we perform a perceptual study. The code and trained model are publicly available for research purposes at this https URL.

29. A Framework For Contrastive Self-Supervised Learning And Designing A New Approach
William Falcon, Kyunghyun Cho
Abstract: Contrastive self-supervised learning (CSL) is an approach to learn useful representations by solving a pretext task that selects and compares anchor, negative and positive (APN) features from an unlabeled dataset. We present a conceptual framework that characterizes CSL approaches in five aspects: (1) data augmentation pipeline, (2) encoder selection, (3) representation extraction, (4) similarity measure, and (5) loss function. We analyze three leading CSL approaches (AMDIM, CPC, and SimCLR) and show that despite different motivations, they are special cases under this framework. We show the utility of our framework by designing Yet Another DIM (YADIM), which achieves competitive results on CIFAR-10, STL-10 and ImageNet, and is more robust to the choice of encoder and the representation extraction strategy. To support ongoing CSL research, we release the PyTorch implementation of this conceptual framework along with standardized implementations of AMDIM, CPC (V2), SimCLR, BYOL, Moco (V2) and YADIM.
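
Aspects (4) and (5) of the framework, the similarity measure and the loss function, can be made concrete with a small example. The sketch below pairs cosine similarity with an InfoNCE-style loss, a common choice in this family of methods; it is an illustration under that assumption, not the loss of any specific approach listed in the abstract, and the names and temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss for one anchor, one positive, and a list of negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    # The loss is small when the positive dominates the negatives.
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
aligned_loss = info_nce(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned_loss = info_nce(anchor, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

The loss is lower when the positive feature is the one most similar to the anchor, which is the behaviour any anchor/positive/negative comparison is meant to reward.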

30. Online Multi-Object Tracking and Segmentation with GMPHD Filter and Simple Affinity Fusion
Young-min Song, Moongu Jeon
Abstract: In this paper, we propose a highly practical fully online multi-object tracking and segmentation (MOTS) method that uses instance segmentation results as an input in video. The proposed method exploits the Gaussian mixture probability hypothesis density (GMPHD) filter for an online approach, which is extended with a hierarchical data association (HDA) and a simple affinity fusion (SAF) model. HDA consists of segment-to-track and track-to-track associations. To build the SAF model, one affinity is computed by using the GMPHD filter, which is represented by Gaussian mixture models with position and motion mean vectors, and another affinity for appearance is computed by using the responses from a single-object tracker such as kernelized correlation filters. These two affinities are simply fused by using a score-level fusion method such as min-max normalization. In addition, to reduce false positive segments, we adopt Mask IoU based merging. In experiments, those key modules, i.e., HDA, SAF, and Mask merging, show incremental improvements. For instance, ID-switch decreases by half compared to the baseline method. In conclusion, our tracker achieves state-of-the-art level MOTS performance.
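
Score-level fusion with min-max normalization and Mask IoU can both be shown in a few lines. In this sketch the two normalized affinities are combined by a plain average, which is an assumption (the abstract does not state the combination rule), and masks are represented as sets of pixel coordinates for simplicity; all names are hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(position_motion, appearance):
    """Average two affinity lists after putting them on a common scale."""
    a = min_max(position_motion)
    b = min_max(appearance)
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def mask_iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)


# Affinities of one track against three candidate segments (made-up values).
motion_affinity = [2.0, 4.0, 6.0]
appearance_affinity = [0.9, 0.1, 0.5]
fused = fuse(motion_affinity, appearance_affinity)
overlap = mask_iou({(0, 0), (0, 1)}, {(0, 1), (1, 1)})
```

Normalizing first matters because the two affinities live on different scales; without it the larger-valued one would dominate the fused score.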

31. LaDDer: Latent Data Distribution Modelling with a Generative Prior
Shuyu Lin, Ronald Clark
Abstract: In this paper, we show that the performance of a learnt generative model is closely related to the model's ability to accurately represent the inferred latent data distribution, i.e. its topology and structural properties. We propose LaDDer to achieve accurate modelling of the latent data distribution in a variational autoencoder framework and to facilitate better representation learning. The central idea of LaDDer is a meta-embedding concept, which uses multiple VAE models to learn an embedding of the embeddings, forming a ladder of encodings. We use a non-parametric mixture as the hyper prior for the innermost VAE and learn all the parameters in a unified variational framework. Through extensive experiments, we show that our LaDDer model is able to accurately estimate complex latent distributions and yields an improvement in representation quality. We also propose a novel latent space interpolation method that utilises the derived data distribution.

32. Quality-aware semi-supervised learning for CMR segmentation
Bram Ruijsink, Esther Puyol-Anton, Ye Li, Wenja Bai, Eric Kerfoot, Reza Razavi, Andrew P. King
Abstract: One of the challenges in developing deep learning algorithms for medical image segmentation is the scarcity of annotated training data. To overcome this limitation, data augmentation and semi-supervised learning (SSL) methods have been developed. However, these methods have limited effectiveness as they either exploit the existing data set only (data augmentation) or risk negative impact by adding poor training examples (SSL). Segmentations are rarely the final product of medical image analysis - they are typically used in downstream tasks to infer higher-order patterns to evaluate diseases. Clinicians take into account a wealth of prior knowledge on biophysics and physiology when evaluating image analysis results. We have used these clinical assessments in previous works to create robust quality-control (QC) classifiers for automated cardiac magnetic resonance (CMR) analysis. In this paper, we propose a novel scheme that uses QC of the downstream task to identify high quality outputs of CMR segmentation networks, that are subsequently utilised for further network training. In essence, this provides quality-aware augmentation of training data in a variant of SSL for segmentation networks (semiQCSeg). We evaluate our approach in two CMR segmentation tasks (aortic and short axis cardiac volume segmentation) using UK Biobank data and two commonly used network architectures (U-net and a Fully Convolutional Network) and compare against supervised and SSL strategies. We show that semiQCSeg improves training of the segmentation networks. It decreases the need for labelled data, while outperforming the other methods in terms of Dice and clinical metrics. SemiQCSeg can be an efficient approach for training segmentation networks for medical image data when labelled datasets are scarce.

33. A Deep 2-Dimensional Dynamical Spiking Neuronal Network for Temporal Encoding trained with STDP
Matthew Evanusa, Cornelia Fermuller, Yiannis Aloimonos
Abstract: The brain is known to be a highly complex, asynchronous dynamical system that is highly tailored to encode temporal information. However, recent deep learning approaches do not take advantage of this temporal coding. Spiking Neural Networks (SNNs) can be trained using biologically-realistic learning mechanisms, and can have neuronal activation rules that are biologically relevant. This type of network is also structured fundamentally around accepting temporal information through a time-decaying voltage update, a kind of input that current rate-encoding networks have difficulty with. Here we show that a large, deep layered SNN with dynamical, chaotic activity mimicking the mammalian cortex with biologically-inspired learning rules, such as STDP, is capable of encoding information from temporal data. We argue that the randomness inherent in the network weights allows the neurons to form groups that encode the temporal data being inputted after self-organizing with STDP. We aim to show that precise timing of input stimulus is critical in forming synchronous neural groups in a layered network. We analyze the network in terms of network entropy as a metric of information transfer. We hope to tackle two problems at once: the creation of artificial temporal neural systems for artificial intelligence, as well as solving coding mechanisms in the brain.

34. Training Deep Neural Networks with Constrained Learning Parameters
Prasanna Date, Christopher D. Carothers, John E. Mitchell, James A. Hendler, Malik Magdon-Ismail
Abstract: Today's deep learning models are primarily trained on CPUs and GPUs. Although these models tend to have low error, they consume high power and utilize a large amount of memory owing to double precision floating point learning parameters. Beyond Moore's law, a significant portion of deep learning tasks would run on edge computing systems, which will form an indispensable part of the entire computation fabric. Subsequently, training deep learning models for such systems will have to be tailored and adopted to generate models that have the following desirable characteristics: low error, low memory, and low power. We believe that deep neural networks (DNNs), where learning parameters are constrained to have a set of finite discrete values, running on neuromorphic computing systems would be instrumental for intelligent edge computing systems having these desirable characteristics. To this end, we propose the Combinatorial Neural Network Training Algorithm (CoNNTrA), which leverages a coordinate gradient descent-based approach for training deep learning models with finite discrete learning parameters. Next, we elaborate on the theoretical underpinnings and evaluate the computational complexity of CoNNTrA. As a proof of concept, we use CoNNTrA to train deep learning models with ternary learning parameters on the MNIST, Iris and ImageNet data sets and compare their performance to the same models trained using Backpropagation. We use the following performance metrics for the comparison: (i) Training error; (ii) Validation error; (iii) Memory usage; and (iv) Training time. Our results indicate that CoNNTrA models use 32x less memory and have errors on par with the Backpropagation models.
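
Two ideas in this abstract lend themselves to a small illustration: the memory saving from ternary parameters and a coordinate-wise search over a finite set of values. The sketch below is not CoNNTrA. It assumes 2 bits per ternary weight against 64 bits for a double (one way a 32x ratio can arise) and uses a greedy coordinate search on a toy linear model with made-up data; all names are hypothetical.

```python
TERNARY = (-1, 0, 1)


def memory_ratio(bits_full=64, bits_ternary=2):
    """Storage ratio between a double and an assumed 2-bit ternary encoding."""
    return bits_full // bits_ternary


def loss(weights, rows, targets):
    """Sum of squared errors of a linear model."""
    total = 0.0
    for row, target in zip(rows, targets):
        prediction = sum(w * x for w, x in zip(weights, row))
        total += (prediction - target) ** 2
    return total


def coordinate_search(rows, targets, n_weights, max_sweeps=10):
    """Greedy coordinate search: change one weight at a time if it lowers the loss."""
    weights = [0] * n_weights
    for _ in range(max_sweeps):
        changed = False
        for i in range(n_weights):
            best_value, best_loss = weights[i], loss(weights, rows, targets)
            for candidate in TERNARY:
                trial = weights[:i] + [candidate] + weights[i + 1:]
                trial_loss = loss(trial, rows, targets)
                if trial_loss < best_loss:  # accept strict improvements only
                    best_value, best_loss = candidate, trial_loss
            if best_value != weights[i]:
                weights[i] = best_value
                changed = True
        if not changed:
            break
    return weights


# Toy data generated from the ternary weights [1, -1, 0].
rows = [[2, 1, 0], [0, 2, 1], [1, 0, 2]]
targets = [1, -2, 1]
learned = coordinate_search(rows, targets, n_weights=3)
```

A greedy search of this kind can stall in a local minimum on other data, so it should be read as an illustration of searching over discrete values rather than as a general training method.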

35. Unsupervised Domain Adaptation with Progressive Adaptation of Subspaces
Weikai Li, Songcan Chen
Abstract: Unsupervised Domain Adaptation (UDA) aims to classify an unlabeled target domain by transferring knowledge from a labeled source domain with domain shift. Most of the existing UDA methods try to mitigate the adverse impact induced by the shift via reducing domain discrepancy. However, such approaches easily suffer a notorious mode collapse issue due to the lack of labels in the target domain. Naturally, one of the effective ways to mitigate this issue is to reliably estimate the pseudo labels for the target domain, which itself is hard. To overcome this, we propose a novel UDA method named Progressive Adaptation of Subspaces approach (PAS), in which we utilize an intuition that appears quite reasonable to gradually obtain reliable pseudo labels. Specifically, we progressively and steadily refine the shared subspaces as a bridge of knowledge transfer by adaptively anchoring/selecting and leveraging those target samples with reliable pseudo labels. Subsequently, the refined subspaces can in turn provide more reliable pseudo-labels of the target domain, making the mode collapse highly mitigated. Our thorough evaluation demonstrates that PAS is not only effective for common UDA, but also outperforms the state of the art in the more challenging Partial Domain Adaptation (PDA) situation, where the source label set subsumes the target one.

36. End-to-End Hyperspectral-Depth Imaging with Learned Diffractive Optics
Seung-Hwan Baek, Hayato Ikoma, Daniel S. Jeon, Yuqi Li, Wolfgang Heidrich, Gordon Wetzstein, Min H. Kim
Abstract: To extend the capabilities of spectral imaging, hyperspectral and depth imaging have been combined to capture the higher-dimensional visual information. However, the form factor of the combined imaging systems increases, limiting the applicability of this new technology. In this work, we propose a monocular imaging system for simultaneously capturing hyperspectral-depth (HS-D) scene information with an optimized diffractive optical element (DOE). In the training phase, this DOE is optimized jointly with a convolutional neural network to estimate HS-D data from a snapshot input. To study natural image statistics of this high-dimensional visual data and to enable such a machine learning-based DOE training procedure, we record two HS-D datasets. One is used for end-to-end optimization in deep optical HS-D imaging, and the other is used for enhancing reconstruction performance with a real-DOE prototype. The optimized DOE is fabricated with a grayscale lithography process and inserted into a portable HS-D camera prototype, which is shown to robustly capture HS-D information. In extensive evaluations, we demonstrate that our deep optical imaging system achieves state-of-the-art results for HS-D imaging and that the optimized DOE outperforms alternative optical designs.

37. Image Super-Resolution using Explicit Perceptual Loss
Tomoki Yoshida, Kazutoshi Akita, Muhammad Haris, Norimichi Ukita
Abstract: This paper proposes an explicit way to optimize a super-resolution network for generating visually pleasing images. Previous approaches use several loss functions that are hard to interpret and relate only implicitly to improving the perceptual score. We show how to exploit a machine-learning-based model that is directly trained to predict the perceptual score of generated images. It is believed that such models can be used to optimize the super-resolution network in a way that is easier to interpret. We further analyze the characteristics of the existing losses and of our proposed explicit perceptual loss for better interpretation. The experimental results show that the explicit approach achieves a higher perceptual score than other approaches. Finally, we demonstrate the relation between the explicit perceptual loss and visually pleasing images using subjective evaluation.

38. Recognition Oriented Iris Image Quality Assessment in the Feature Space
Leyuan Wang, Kunbo Zhang, Min Ren, Yunlong Wang, Zhenan Sun
Abstract: A large portion of iris images captured in real-world scenarios are of poor quality due to the uncontrolled environment and non-cooperative subjects. To ensure that the recognition algorithm is not affected by low-quality images, traditional methods based on hand-crafted factors discard most images, which causes system timeouts and disrupts the user experience. In this paper, we propose a recognition-oriented quality metric and assessment method for iris images to deal with this problem. The method regards the Distance in Feature Space (DFS) of iris image embeddings as the quality metric, and the prediction is based on deep neural networks with an attention mechanism. The proposed quality metric can significantly improve the performance of the recognition algorithm while reducing the number of images discarded for recognition, which is an advantage over iris quality assessment methods based on hand-crafted factors. The relationship between Image Rejection Rate (IRR) and Equal Error Rate (EER) is proposed to evaluate the performance of the quality assessment algorithm under the same image quality distribution and the same recognition algorithm. Compared with methods based on hand-crafted factors, the proposed method is an attempt to bridge the gap between image quality assessment and biometric recognition.

39. Gaussian Process Gradient Maps for Loop-Closure Detection in Unstructured Planetary Environments
Cedric Le Gentil, Mallikarjuna Vayugundla, Riccardo Giubilato, Wolfgang Stürzl, Teresa Vidal-Calleja, Rudolph Triebel
Abstract: The ability to recognize previously mapped locations is an essential feature for autonomous systems. Unstructured planetary-like environments pose a major challenge to these systems due to the similarity of the terrain. As a result, the ambiguity of the visual appearance makes state-of-the-art visual place recognition approaches less effective than in urban or man-made environments. This paper presents a method to solve the loop closure problem using only spatial information. The key idea is to use a novel continuous and probabilistic representation of terrain elevation maps. Given 3D point clouds of the environment, the proposed approach exploits Gaussian Process (GP) regression with linear operators to generate continuous gradient maps of the terrain elevation information. Traditional image registration techniques are then used to search for potential matches. Loop closures are verified by leveraging both the spatial characteristic of the elevation maps (SE(2) registration) and the probabilistic nature of the GP representation. A submap-based localization and mapping framework is used to demonstrate the validity of the proposed approach. The performance of this pipeline is evaluated and benchmarked using real data from a rover that is equipped with a stereo camera and navigates in challenging, unstructured planetary-like environments in Morocco and on Mt. Etna.
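As a rough illustration of "GP regression with linear operators", the sketch below fits a one-dimensional GP and reads off the gradient of the posterior mean. Differentiation is a linear operator, so the gradient comes from differentiating the kernel. The RBF kernel, the 1D toy signal, and all function names are assumptions made for this example; the abstract does not specify the kernel or the implementation used in the paper.

```python
import numpy as np

def rbf(a, b, length_scale):
    """RBF kernel matrix between 1D point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def fit_gp(x_train, y_train, length_scale=1.0, noise_var=1e-4):
    """Return alpha = (K + noise_var * I)^-1 y, the weights of the posterior mean."""
    k = rbf(x_train, x_train, length_scale)
    return np.linalg.solve(k + noise_var * np.eye(len(x_train)), y_train)

def predict_mean_and_gradient(x_query, x_train, alpha, length_scale=1.0):
    """Posterior mean and its derivative with respect to the query location."""
    k_star = rbf(x_query, x_train, length_scale)
    d = x_query[:, None] - x_train[None, :]
    # Derivative of the RBF kernel with respect to the query point.
    dk_star = -(d / length_scale ** 2) * k_star
    return k_star @ alpha, dk_star @ alpha

# Toy "elevation profile": samples of sin(x), whose true gradient is cos(x).
x_train = np.linspace(0.0, 2.0 * np.pi, 25)
alpha = fit_gp(x_train, np.sin(x_train))
mean, grad = predict_mean_and_gradient(np.array([np.pi / 2, np.pi]), x_train, alpha)
```

In this toy setting the predicted gradient should be close to cos(x) at the query points. A terrain version would use 2D inputs and produce one gradient map per axis.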

40. On The Usage Of Average Hausdorff Distance For Segmentation Performance Assessment: Hidden Bias When Used For Ranking
Orhun Utku Aydin, Abdel Aziz Taha, Adam Hilbert, Ahmed A. Khalil, Ivana Galinovic, Jochen B. Fiebach, Dietmar Frey, Vince Istvan Madai
Abstract: Average Hausdorff Distance (AVD) is a widely used performance measure to calculate the distance between two point sets. In medical image segmentation, AVD is used to compare ground truth images with segmentation results, allowing their ranking. We identified, however, a ranking bias of AVD that makes it less suitable for segmentation ranking. To mitigate this bias, we present a modified calculation of AVD that we have coined balanced AVD (bAVD). To simulate segmentations for ranking, we manually created non-overlapping segmentation errors common in cerebral vessel segmentation as our use case. Adding the created errors consecutively and randomly to the ground truth, we created sets of simulated segmentations with an increasing number of errors. Each set of simulated segmentations was ranked using AVD and bAVD. We calculated the Kendall rank correlation coefficient between the segmentation ranking and the number of errors in each simulated segmentation. The rankings produced by bAVD had a significantly higher average correlation (0.969) than those of AVD (0.847). In 200 total rankings, bAVD misranked 52 and AVD misranked 179 segmentations. Our proposed evaluation measure, bAVD, alleviates AVD's ranking bias, making it more suitable for rankings and quality assessment of segmentations.
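For reference, the standard average Hausdorff distance between two point sets can be sketched as follows. This uses the common textbook definition, the mean of the two directed average nearest-neighbour distances. The abstract does not give the balanced variant (bAVD), so it is not reproduced here, and the function names are chosen for this example only.

```python
import numpy as np

def directed_mean_distance(a, b):
    """Mean distance from each point in a to its nearest neighbour in b."""
    diff = a[:, None, :] - b[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    return dists.min(axis=1).mean()

def average_hausdorff_distance(ground_truth, segmentation):
    """Standard AVD: average of the two directed mean distances."""
    g_to_s = directed_mean_distance(ground_truth, segmentation)
    s_to_g = directed_mean_distance(segmentation, ground_truth)
    return 0.5 * (g_to_s + s_to_g)

# Ground truth with two points; the segmentation adds one false-positive
# point at distance 3 from the nearest ground-truth point.
G = np.array([[0.0, 0.0], [1.0, 0.0]])
S = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 3.0]])
print(average_hausdorff_distance(G, S))  # 0.5
```

Each directed term is normalized by the size of its own point set, so the contribution of a given error depends on how many points the segmentation contains.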

41. Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering
Jing Yu, Zihao Zhu, Yujing Wang, Weifeng Zhang, Yue Hu, Jianlong Tan
Abstract: Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable for achieving general VQA. One limitation of existing KVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise when reasoning about the correct answer. How to capture question-oriented and information-complementary evidence remains a key challenge in solving the problem. Inspired by human cognition theory, in this paper we depict an image by multiple knowledge graphs from the visual, semantic and factual views. Among these, the visual graph and the semantic graph are regarded as image-conditioned instantiations of the factual graph. On top of these new representations, we re-formulate Knowledge-based Visual Question Answering as a recurrent reasoning process for obtaining complementary evidence from multimodal information. To this end, we decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module that conducts parallel reasoning over both visual and semantic information. By stacking the modules multiple times, our model performs transitive reasoning and obtains question-oriented concept representations under the constraint of different modalities. Finally, we apply graph neural networks to infer the globally optimal answer by jointly considering all the concepts. We achieve new state-of-the-art performance on three popular benchmark datasets (FVQA, Visual7W-KB and OK-VQA) and demonstrate the effectiveness and interpretability of our model with extensive experiments.

42. Adversarial Eigen Attack on Black-Box Models
Linjun Zhou, Peng Cui, Yinan Jiang, Shiqiang Yang
Abstract: Black-box adversarial attacks have attracted a lot of research interest for their practical use in AI safety. Compared with the white-box attack, the black-box setting is more difficult because less information about the attacked model is available and the query budget is an additional constraint. A general way to improve attack efficiency is to draw support from a pre-trained transferable white-box model. In this paper, we propose a novel setting of transferable black-box attack: attackers may use external information from a pre-trained model with available network parameters; however, unlike in previous studies, no additional training data is permitted to further change or tune the pre-trained model. To this end, we further propose a new algorithm, EigenBA, to tackle this problem. Our method aims to explore more gradient information of the black-box model and to improve attack efficiency, while keeping the perturbation to the original attacked image small, by leveraging the Jacobian matrix of the pre-trained white-box model. We show that the optimal perturbations are closely related to the right singular vectors of the Jacobian matrix. Further experiments on ImageNet and CIFAR-10 show that even the unlearnable pre-trained white-box model can significantly boost the efficiency of the black-box attack, and that our proposed method further improves the attack efficiency.
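The link between perturbations and right singular vectors rests on a standard linear-algebra fact: for a Jacobian J, the unit-norm input perturbation that changes the linearized output the most is the top right singular vector of J, and the size of that change is the largest singular value. The sketch below illustrates only this fact on a toy matrix. It is not the EigenBA algorithm, and the function name is an assumption for this example.

```python
import numpy as np

def top_right_singular_direction(jacobian):
    """Return the unit input direction with the largest output change, and that gain."""
    # np.linalg.svd returns singular values in descending order;
    # the rows of vt are the right singular vectors.
    _, singular_values, vt = np.linalg.svd(jacobian, full_matrices=False)
    return vt[0], singular_values[0]

# Toy Jacobian that stretches the first input axis three times more than the second.
J = np.array([[3.0, 0.0],
              [0.0, 1.0]])
direction, gain = top_right_singular_direction(J)

# No random unit perturbation should produce a larger output change.
rng = np.random.default_rng(0)
random_gains = []
for _ in range(100):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)
    random_gains.append(np.linalg.norm(J @ v))
```

Here `direction` is the first coordinate axis up to sign and `gain` is 3, while every entry of `random_gains` stays at or below that value.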

43. Data and Image Prior Integration for Image Reconstruction Using Consensus Equilibrium
Muhammad Usman Ghani, W. Clem Karl
Abstract: Image domain prior models have been shown to improve the quality of reconstructed images, especially when data are limited. Pre-processing of raw data, through the implicit or explicit inclusion of data domain priors, has separately also shown utility in improving reconstructions. In this work, a principled approach is presented allowing the unified integration of both data and image domain priors for improved image reconstruction. The consensus equilibrium framework is extended to integrate physical sensor models, data models, and image models. In order to achieve this integration, the conventional image variables used in consensus equilibrium are augmented with variables representing data domain quantities. The overall result produces combined estimates of both the data and the reconstructed image that are consistent with the physical models and prior models being utilized. The prior models used in both domains in this work are created using deep neural networks. The superior quality allowed by incorporating both data and image domain prior models is demonstrated for two applications: limited-angle CT and accelerated MRI. The prior data model in both these applications is focused on recovering missing data. Experimental results are presented for a 90 degree limited-angle tomography problem from a real checked-bagged CT dataset and a 4x accelerated MRI problem on a simulated dataset. The new framework is very flexible and can be easily applied to other computational imaging problems with imperfect data.

44. Image Reconstruction of Static and Dynamic Scenes through Anisoplanatic Turbulence
Zhiyuan Mao, Nicholas Chimitt, Stanley Chan
Abstract: Ground based long-range passive imaging systems often suffer from degraded image quality due to a turbulent atmosphere. While methods exist for removing such turbulent distortions, many are limited to static sequences which cannot be extended to dynamic scenes. In addition, the physics of the turbulence is often not integrated into the image reconstruction algorithms, making the physics foundations of the methods weak. In this paper, we present a unified method for atmospheric turbulence mitigation in both static and dynamic sequences. We are able to achieve better results compared to existing methods by utilizing (i) a novel space-time non-local averaging method to construct a reliable reference frame, (ii) a geometric consistency and a sharpness metric to generate the lucky frame, (iii) a physics-constrained prior model of the point spread function for blind deconvolution. Experimental results based on synthetic and real long-range turbulence sequences validate the performance of the proposed method.

45. Semantic Segmentation of Neuronal Bodies in Fluorescence Microscopy Using a 2D+3D CNN Training Strategy with Sparsely Annotated Data
Filippo Maria Castelli, Matteo Roffilli, Giacomo Mazzamuto, Irene Costantini, Ludovico Silvestri, Francesco Saverio Pavone
Abstract: Semantic segmentation of neuronal structures in 3D high-resolution fluorescence microscopy imaging of the human brain cortex can take advantage of bidimensional CNNs, which yield good results in neuron localization but lead to inaccurate surface reconstruction. 3D CNNs on the other hand would require manually annotated volumetric data on a large scale, and hence considerable human effort. Semi-supervised alternative strategies which make use only of sparse annotations suffer from longer training times and achieved models tend to have increased capacity compared to 2D CNNs, needing more ground truth data to attain similar results. To overcome these issues we propose a two-phase strategy for training native 3D CNN models on sparse 2D annotations where missing labels are inferred by a 2D CNN model and combined with manual annotations in a weighted manner during loss calculation.
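One way to read "combined with manual annotations in a weighted manner during loss calculation" is a per-pixel weighted cross-entropy in which manually annotated pixels receive full weight and pixels labelled by the 2D model receive a smaller weight. The sketch below follows that reading. The binary cross-entropy loss, the weight value 0.3, and all function names are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def combine_labels(manual, manual_mask, inferred, inferred_weight=0.3):
    """Use manual labels where available, inferred labels elsewhere, with per-pixel weights."""
    target = np.where(manual_mask, manual, inferred)
    weight = np.where(manual_mask, 1.0, inferred_weight)
    return target, weight

def weighted_bce(pred_prob, target, weight, eps=1e-7):
    """Weighted binary cross-entropy, normalized by the total weight."""
    p = np.clip(pred_prob, eps, 1.0 - eps)
    per_pixel = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float((weight * per_pixel).sum() / weight.sum())

# Four pixels: the first two are manually annotated, the last two are
# labelled by a (hypothetical) 2D model.
manual = np.array([1.0, 0.0, 0.0, 0.0])
manual_mask = np.array([True, True, False, False])
inferred = np.array([0.0, 0.0, 1.0, 0.0])

target, weight = combine_labels(manual, manual_mask, inferred)
loss_uninformed = weighted_bce(np.full(4, 0.5), target, weight)
loss_perfect = weighted_bce(target, target, weight)
```

With this weighting, a mistake on an inferred pixel costs less than the same mistake on a manually annotated one, which limits the damage from wrong inferred labels.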

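Note: The Chinese abstracts in the original post are machine translations. The cover image is a word cloud of the paper titles.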

WITH LOVE OF WORLD

[arXiv papers] Computer Vision and Pattern Recognition 2020-09-02

Contents

Abstracts