
[arXiv Papers] Computer Vision and Pattern Recognition 2020-08-07

Contents

1. A Novel Spatial-Spectral Framework for the Classification of Hyperspectral Satellite Imagery [PDF] Abstract
2. Learning to Factorize and Relight a City [PDF] Abstract
3. Generative Adversarial Networks for Image and Video Synthesis: Algorithms and Applications [PDF] Abstract
4. CaSPR: Learning Canonical Spatiotemporal Point Cloud Representations [PDF] Abstract
5. Efficient Non-Line-of-Sight Imaging from Transient Sinograms [PDF] Abstract
6. On the Accuracy of CRNNs for Line-Based OCR: A Multi-Parameter Evaluation [PDF] Abstract
7. Assessing the (Un)Trustworthiness of Saliency Maps for Localizing Abnormalities in Medical Imaging [PDF] Abstract
8. Joint Self-Attention and Scale-Aggregation for Self-Calibrated Deraining Network [PDF] Abstract
9. IV-SLAM: Introspective Vision for Simultaneous Localization and Mapping [PDF] Abstract
10. Unsupervised Learning for Identifying Events in Active Target Experiments [PDF] Abstract
11. The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval [PDF] Abstract
12. Towards Accurate Pixel-wise Object Tracking by Attention Retrieval [PDF] Abstract
13. Shonan Rotation Averaging: Global Optimality by Surfing $SO(p)^n$ [PDF] Abstract
14. Exploring Relations in Untrimmed Videos for Self-Supervised Learning [PDF] Abstract
15. Pairwise Relation Learning for Semi-supervised Gland Segmentation [PDF] Abstract
16. Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards [PDF] Abstract
17. Learnable Graph Inception Network for Emotion Recognition [PDF] Abstract
18. Noisy Student Training using Body Language Dataset Improves Facial Expression Recognition [PDF] Abstract
19. Approach for document detection by contours and contrasts [PDF] Abstract
20. Image Generation for Efficient Neural Network Training in Autonomous Drone Racing [PDF] Abstract
21. MED-TEX: Transferring and Explaining Knowledge with Less Data from Pretrained Medical Imaging Models [PDF] Abstract
22. IIIT-AR-13K: A New Dataset for Graphical Object Detection in Documents [PDF] Abstract
23. Fast Approximate Modelling of the Next Combination Result for Stopping the Text Recognition in a Video [PDF] Abstract
24. Modeling Data Reuse in Deep Neural Networks by Taking Data-Types into Cognizance [PDF] Abstract
25. Handwritten Character Recognition from Wearable Passive RFID [PDF] Abstract
26. Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [PDF] Abstract
27. Dual Gaussian-based Variational Subspace Disentanglement for Visible-Infrared Person Re-Identification [PDF] Abstract
28. Object-based Illumination Estimation with Rendering-aware Neural Networks [PDF] Abstract
29. Gender and Ethnicity Classification based on Palmprint and Palmar Hand Images from Uncontrolled Environment [PDF] Abstract
30. Zero-Shot Multi-View Indoor Localization via Graph Location Networks [PDF] Abstract
31. Few-shot Classification via Adaptive Attention [PDF] Abstract
32. Graph Convolutional Networks for Hyperspectral Image Classification [PDF] Abstract
33. Structured Convolutions for Efficient Neural Network Design [PDF] Abstract
34. Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [PDF] Abstract
35. Group Activity Prediction with Sequential Relational Anticipation Model [PDF] Abstract
36. Data-driven Meta-set Based Fine-Grained Visual Classification [PDF] Abstract
37. GL-GAN: Adaptive Global and Local Bilevel Optimization model of Image Generation [PDF] Abstract
38. Salvage Reusable Samples from Noisy Data for Robust Learning [PDF] Abstract
39. Cross-Model Image Annotation Platform with Active Learning [PDF] Abstract
40. StyleFlow: Attribute-conditioned Exploration of StyleGAN-Generated Images using Conditional Continuous Normalizing Flows [PDF] Abstract
41. Learning Illumination from Diverse Portraits [PDF] Abstract
42. A Neural-Symbolic Framework for Mental Simulation [PDF] Abstract
43. Axiom-based Grad-CAM: Towards Accurate Visualization and Explanation of CNNs [PDF] Abstract
44. A Sensitivity Analysis Approach for Evaluating a Radar Simulation for Virtual Testing of Autonomous Driving Functions [PDF] Abstract
45. Deep Learning Based Defect Detection for Solder Joints on Industrial X-Ray Circuit Board Images [PDF] Abstract
46. Gibbs Sampling with People [PDF] Abstract
47. Optical Flow and Mode Selection for Learning-based Video Coding [PDF] Abstract
48. FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [PDF] Abstract
49. OverNet: Lightweight Multi-Scale Super-Resolution with Overscaling Network [PDF] Abstract
50. A Robot that Counts Like a Child -- a Developmental Model of Counting and Pointing [PDF] Abstract
51. Exploiting Temporal Attention Features for Effective Denoising in Videos [PDF] Abstract
52. Global Voxel Transformer Networks for Augmented Microscopy [PDF] Abstract
53. Can I Pour into It? Robot Imagining Open Containability Affordance of Previously Unseen Objects via Physical Simulations [PDF] Abstract

Abstracts

1. A Novel Spatial-Spectral Framework for the Classification of Hyperspectral Satellite Imagery [PDF] Back to contents
  Shriya TP Gupta, Sanjay K Sahay
Abstract: Hyper-spectral satellite imagery is now widely being used for accurate disaster prediction and terrain feature classification. However, in such classification tasks, most of the present approaches use only the spectral information contained in the images. Therefore, in this paper, we present a novel framework that takes into account both the spectral and spatial information contained in the data for land cover classification. For this purpose, we use the Gaussian Maximum Likelihood (GML) and Convolutional Neural Network methods for the pixel-wise spectral classification and then, using segmentation maps generated by the Watershed algorithm, we incorporate the spatial contextual information into our model with a modified majority vote technique. The experimental analyses on two benchmark datasets demonstrate that our proposed methodology performs better than the earlier approaches by achieving an accuracy of 99.52% and 98.31% on the Pavia University and the Indian Pines datasets respectively. Additionally, our GML based approach, a non-deep learning algorithm, shows comparable performance to the state-of-the-art deep learning techniques, which indicates the importance of the proposed approach for performing a computationally efficient classification of hyper-spectral imagery.
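As a rough illustration of the spatial step described above, the sketch below applies a plain per-segment majority vote: every pixel inside a watershed segment is relabeled with the most frequent pixel-wise class found in that segment. This is a simplified stand-in, assuming NumPy label maps; the paper's modified majority-vote technique is not detailed in the abstract.
```python
import numpy as np

def segment_majority_vote(pixel_labels, segment_map):
    """Relabel each watershed segment with the majority class of its pixels.

    pixel_labels: 2D int array of per-pixel class predictions (e.g., from GML or a CNN).
    segment_map:  2D int array of segment ids produced by the watershed algorithm.
    Returns a smoothed 2D label map. (Plain majority vote, not the paper's modified scheme.)
    """
    smoothed = pixel_labels.copy()
    for seg_id in np.unique(segment_map):
        mask = segment_map == seg_id
        classes, counts = np.unique(pixel_labels[mask], return_counts=True)
        smoothed[mask] = classes[np.argmax(counts)]
    return smoothed

# Toy example: a 4x4 prediction map with one noisy pixel inside segment 0.
pred = np.array([[0, 0, 1, 1],
                 [0, 2, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 1, 1]])
segs = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 1, 1]])
print(segment_majority_vote(pred, segs))  # the stray "2" is voted out
```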

2. Learning to Factorize and Relight a City [PDF] Back to contents
  Andrew Liu, Shiry Ginosar, Tinghui Zhou, Alexei A. Efros, Noah Snavely
Abstract: We propose a learning-based framework for disentangling outdoor scenes into temporally-varying illumination and permanent scene factors. Inspired by the classic intrinsic image decomposition, our learning signal builds upon two insights: 1) combining the disentangled factors should reconstruct the original image, and 2) the permanent factors should stay constant across multiple temporal samples of the same scene. To facilitate training, we assemble a city-scale dataset of outdoor timelapse imagery from Google Street View, where the same locations are captured repeatedly through time. This data represents an unprecedented scale of spatio-temporal outdoor imagery. We show that our learned disentangled factors can be used to manipulate novel images in realistic ways, such as changing lighting effects and scene geometry. Please visit this http URL for animated results.

3. Generative Adversarial Networks for Image and Video Synthesis: Algorithms and Applications [PDF] Back to contents
  Ming-Yu Liu, Xun Huang, Jiahui Yu, Ting-Chun Wang, Arun Mallya
Abstract: The generative adversarial network (GAN) framework has emerged as a powerful tool for various image and video synthesis tasks, allowing the synthesis of visual content in an unconditional or input-conditional manner. It has enabled the generation of high-resolution photorealistic images and videos, a task that was challenging or impossible with prior methods. It has also led to the creation of many new applications in content creation. In this paper, we provide an overview of GANs with a special focus on algorithms and applications for visual synthesis. We cover several important techniques to stabilize GAN training, which has a reputation for being notoriously difficult. We also discuss its applications to image translation, image processing, video synthesis, and neural rendering.

4. CaSPR: Learning Canonical Spatiotemporal Point Cloud Representations [PDF] Back to contents
  Davis Rempe, Tolga Birdal, Yongheng Zhao, Zan Gojcic, Srinath Sridhar, Leonidas J. Guibas
Abstract: We propose CaSPR, a method to learn object-centric canonical spatiotemporal point cloud representations of dynamically moving or evolving objects. Our goal is to enable information aggregation over time and the interrogation of object state at any spatiotemporal neighborhood in the past, observed or not. Different from previous work, CaSPR learns representations that support spacetime continuity, are robust to variable and irregularly spacetime-sampled point clouds, and generalize to unseen object instances. Our approach divides the problem into two subtasks. First, we explicitly encode time by mapping an input point cloud sequence to a spatiotemporally-canonicalized object space. We then leverage this canonicalization to learn a spatiotemporal latent representation using neural ordinary differential equations and a generative model of dynamically evolving shapes using continuous normalizing flows. We demonstrate the effectiveness of our method on several applications including shape reconstruction, camera pose estimation, continuous spatiotemporal sequence reconstruction, and correspondence estimation from irregularly or intermittently sampled observations.

5. Efficient Non-Line-of-Sight Imaging from Transient Sinograms [PDF] Back to contents
  Mariko Isogawa, Dorian Chan, Ye Yuan, Kris Kitani, Matthew O'Toole
Abstract: Non-line-of-sight (NLOS) imaging techniques use light that diffusely reflects off of visible surfaces (e.g., walls) to see around corners. One approach involves using pulsed lasers and ultrafast sensors to measure the travel time of multiply scattered light. Unlike existing NLOS techniques that generally require densely raster scanning points across the entirety of a relay wall, we explore a more efficient form of NLOS scanning that reduces both acquisition times and computational requirements. We propose a circular and confocal non-line-of-sight (C2NLOS) scan that involves illuminating and imaging a common point, and scanning this point in a circular path along a wall. We observe that (1) these C2NLOS measurements consist of a superposition of sinusoids, which we refer to as a transient sinogram, (2) there exists computationally efficient reconstruction procedures that transform these sinusoidal measurements into 3D positions of hidden scatterers or NLOS images of hidden objects, and (3) despite operating on an order of magnitude fewer measurements than previous approaches, these C2NLOS scans provide sufficient information about the hidden scene to solve these different NLOS imaging tasks. We show results from both simulated and real C2NLOS scans.

6. On the Accuracy of CRNNs for Line-Based OCR: A Multi-Parameter Evaluation [PDF] Back to contents
  Bernhard Liebl, Manuel Burghardt
Abstract: We investigate how to train a high quality optical character recognition (OCR) model for difficult historical typefaces on degraded paper. Through extensive grid searches, we obtain a neural network architecture and a set of optimal data augmentation settings. We discuss the influence of factors such as binarization, input line height, network width, network depth, and other network training parameters such as dropout. Implementing these findings into a practical model, we are able to obtain a 0.44% character error rate (CER) model from only 10,000 lines of training data, outperforming currently available pretrained models that were trained on more than 20 times the amount of data. We show ablations for all components of our training pipeline, which relies on the open source framework Calamari.

7. Assessing the (Un)Trustworthiness of Saliency Maps for Localizing Abnormalities in Medical Imaging [PDF] Back to contents
  Nishanth Arun, Nathan Gaw, Praveer Singh, Ken Chang, Mehak Aggarwal, Bryan Chen, Katharina Hoebel, Sharut Gupta, Jay Patel, Mishka Gidwani, Julius Adebayo, Matthew D. Li, Jayashree Kalpathy-Cramer
Abstract: Saliency maps have become a widely used method to make deep learning models more interpretable by providing post-hoc explanations of classifiers through identification of the most pertinent areas of the input medical image. They are increasingly being used in medical imaging to provide clinically plausible explanations for the decisions the neural network makes. However, the utility and robustness of these visualization maps has not yet been rigorously examined in the context of medical imaging. We posit that trustworthiness in this context requires 1) localization utility, 2) sensitivity to model weight randomization, 3) repeatability, and 4) reproducibility. Using the localization information available in two large public radiology datasets, we quantify the performance of eight commonly used saliency map approaches for the above criteria using area under the precision-recall curves (AUPRC) and structural similarity index (SSIM), comparing their performance to various baseline measures. Using our framework to quantify the trustworthiness of saliency maps, we show that all eight saliency map techniques fail at least one of the criteria and are, in most cases, less trustworthy when compared to the baselines. We suggest that their usage in the high-risk domain of medical imaging warrants additional scrutiny and recommend that detection or segmentation models be used if localization is the desired output of the network. Additionally, to promote reproducibility of our findings, we provide the code we used for all tests performed in this work at this link: this https URL.
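Two of the criteria above, repeatability and sensitivity to model weight randomization, reduce to comparing pairs of saliency maps. A minimal sketch using SSIM from scikit-image is shown below; the maps are random placeholders rather than outputs of any particular attribution method, and the min-max normalization is an assumption.
```python
import numpy as np
from skimage.metrics import structural_similarity

def map_similarity(map_a, map_b):
    """SSIM between two saliency maps rescaled to [0, 1]."""
    def norm(m):
        m = m.astype(np.float64)
        return (m - m.min()) / (m.max() - m.min() + 1e-8)
    return structural_similarity(norm(map_a), norm(map_b), data_range=1.0)

# Placeholder maps: in practice these come from the same attribution method run
# twice (repeatability) or before/after randomizing the model weights (sanity check).
saliency_run1 = np.random.rand(224, 224)
saliency_run2 = saliency_run1 + 0.05 * np.random.rand(224, 224)   # near-duplicate map
saliency_after_randomization = np.random.rand(224, 224)           # unrelated map

print(map_similarity(saliency_run1, saliency_run2))                  # high SSIM: repeatable
print(map_similarity(saliency_run1, saliency_after_randomization))   # low SSIM: weight-sensitive
```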

8. Joint Self-Attention and Scale-Aggregation for Self-Calibrated Deraining Network [PDF] Back to contents
  Cong Wang, Yutong Wu, Zhixun Su, Junyang Chen
Abstract: In the field of multimedia, single image deraining is a basic pre-processing work, which can greatly improve the visual effect of subsequent high-level tasks in rainy conditions. In this paper, we propose an effective algorithm, called JDNet, to solve the single image deraining problem and conduct the segmentation and detection task for applications. Specifically, considering the important information on multi-scale features, we propose a Scale-Aggregation module to learn the features with different scales. Simultaneously, Self-Attention module is introduced to match or outperform their convolutional counterparts, which allows the feature aggregation to adapt to each channel. Furthermore, to improve the basic convolutional feature transformation process of Convolutional Neural Networks (CNNs), Self-Calibrated convolution is applied to build long-range spatial and inter-channel dependencies around each spatial location that explicitly expand fields-of-view of each convolutional layer through internal communications and hence enriches the output features. By designing the Scale-Aggregation and Self-Attention modules with Self-Calibrated convolution skillfully, the proposed model has better deraining results both on real-world and synthetic datasets. Extensive experiments are conducted to demonstrate the superiority of our method compared with state-of-the-art methods. The source code will be available at \url{this https URL}.

9. IV-SLAM: Introspective Vision for Simultaneous Localization and Mapping [PDF] Back to contents
  Sadegh Rabiee, Joydeep Biswas
Abstract: Existing solutions to visual simultaneous localization and mapping (V-SLAM) assume that errors in feature extraction and matching are independent and identically distributed (i.i.d), but this assumption is known to not be true - features extracted from low-contrast regions of images exhibit wider error distributions than features from sharp corners. Furthermore, V-SLAM algorithms are prone to catastrophic tracking failures when sensed images include challenging conditions such as specular reflections, lens flare, or shadows of dynamic objects. To address such failures, previous work has focused on building more robust visual frontends, to filter out challenging features. In this paper, we present introspective vision for SLAM (IV-SLAM), a fundamentally different approach for addressing these challenges. IV-SLAM explicitly models the noise process of reprojection errors from visual features to be context-dependent, and hence non-i.i.d. We introduce an autonomously supervised approach for IV-SLAM to collect training data to learn such a context-aware noise model. Using this learned noise model, IV-SLAM guides feature extraction to select more features from parts of the image that are likely to result in lower noise, and further incorporate the learned noise model into the joint maximum likelihood estimation, thus making it robust to the aforementioned types of errors. We present empirical results to demonstrate that IV-SLAM 1) is able to accurately predict sources of error in input images, 2) reduces tracking error compared to V-SLAM, and 3) increases the mean distance between tracking failures by more than 70% on challenging real robot data compared to V-SLAM.

10. Unsupervised Learning for Identifying Events in Active Target Experiments [PDF] Back to contents
  Robert Solli, Daniel Bazin, Michelle P. Kuchera, Ryan R. Strauss, Morten Hjorth-Jensen
Abstract: This article presents novel applications of unsupervised machine learning methods to the problem of event separation in an active target detector, the Active-Target Time Projection Chamber (AT-TPC). The overarching goal is to group similar events in the early stages of the data analysis, thereby improving efficiency by limiting the computationally expensive processing of unnecessary events. The application of unsupervised clustering algorithms to the analysis of two-dimensional projections of particle tracks from a resonant proton scattering experiment on $^{46}$Ar is introduced. We explore the performance of autoencoder neural networks and a pre-trained VGG16 convolutional neural network. We find that a $K$-means algorithm applied to the simulated data in the VGG16 latent space forms almost perfect clusters. Additionally, the VGG16+$K$-means approach finds high purity clusters of proton events for real experimental data. We also explore the application of clustering the latent space of autoencoder neural networks for event separation. While these networks show strong performance, they suffer from high variability in their results. %With autoencoder neural networks we find improved descriptions of data from experiments.
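A minimal sketch of the VGG16 + $K$-means pipeline mentioned above, assuming the 2D track projections are available as image arrays; the preprocessing, the choice of pooled convolutional features as the latent space, and the number of clusters are illustrative assumptions rather than the authors' configuration.
```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.cluster import KMeans

# Pretrained VGG16 used as a frozen feature extractor (pooled conv features as latent space).
vgg = models.vgg16(pretrained=True).eval()
feature_extractor = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten())

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(images):
    """images: uint8 array of shape (N, H, W, 3) holding 2D track projections."""
    batch = torch.stack([preprocess(img) for img in images])
    with torch.no_grad():
        return feature_extractor(batch).numpy()

# Placeholder event projections standing in for real AT-TPC data.
events = np.random.randint(0, 255, size=(16, 128, 128, 3), dtype=np.uint8)
features = embed(events)
clusters = KMeans(n_clusters=3, random_state=0).fit_predict(features)
print(clusters)
```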

11. The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval [PDF] Back to contents
  Giuseppe Amato, Paolo Bolettieri, Fabio Carrara, Franca Debole, Fabrizio Falchi, Claudio Gennaro, Lucia Vadicamo, Claudio Vairo
Abstract: In this paper, we describe VISIONE, a video search system that allows users to search for videos using textual keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial relationships, and image similarity. These modalities can be combined together to express complex queries and satisfy user needs. The peculiarity of our approach is that we encode all the information extracted from the keyframes, such as visual deep features, tags, color and object locations, using a convenient textual encoding indexed in a single text retrieval engine. This offers great flexibility when results corresponding to various parts of the query needs to be merged. We report an extensive analysis of the system retrieval performance, using the query logs generated during the Video Browser Showdown (VBS) 2019 competition. This allowed us to fine-tune the system by choosing the optimal parameters and strategies among the ones that we tested.

12. Towards Accurate Pixel-wise Object Tracking by Attention Retrieval [PDF] Back to contents
  Zhipeng Zhang, Bing Li, Weiming Hu, Houweng Peng
Abstract: The encoding of the target in object tracking moves from the coarse bounding-box to fine-grained segmentation map recently. Revisiting de facto real-time approaches that are capable of predicting mask during tracking, we observed that they usually fork a light branch from the backbone network for segmentation. Although efficient, directly fusing backbone features without considering the negative influence of background clutter tends to introduce false-negative predictions, lagging the segmentation accuracy. To mitigate this problem, we propose an attention retrieval network (ARN) to perform soft spatial constraints on backbone features. We first build a look-up-table (LUT) with the ground-truth mask in the starting frame, and then retrieves the LUT to obtain an attention map for spatial constraints. Moreover, we introduce a multi-resolution multi-stage segmentation network (MMS) to further weaken the influence of background clutter by reusing the predicted mask to filter backbone features. Our approach set a new state-of-the-art on recent pixel-wise object tracking benchmark VOT2020 while running at 40 fps. Notably, the proposed model surpasses SiamMask by 11.7/4.2/5.5 points on VOT2020, DAVIS2016, and DAVIS2017, respectively. We will release our code at this https URL.

13. Shonan Rotation Averaging: Global Optimality by Surfing $SO(p)^n$ [PDF] Back to contents
  Frank Dellaert, David M. Rosen, Jing Wu, Robert Mahony, Luca Carlone
Abstract: Shonan Rotation Averaging is a fast, simple, and elegant rotation averaging algorithm that is guaranteed to recover globally optimal solutions under mild assumptions on the measurement noise. Our method employs semidefinite relaxation in order to recover provably globally optimal solutions of the rotation averaging problem. In contrast to prior work, we show how to solve large-scale instances of these relaxations using manifold minimization on (only slightly) higher-dimensional rotation manifolds, re-using existing high-performance (but local) structure-from-motion pipelines. Our method thus preserves the speed and scalability of current SFM methods, while recovering globally optimal solutions.

14. Exploring Relations in Untrimmed Videos for Self-Supervised Learning [PDF] Back to contents
  Dezhao Luo, Bo Fang, Yu Zhou, Yucan Zhou, Dayan Wu, Weiping Wang
Abstract: Existing video self-supervised learning methods mainly rely on trimmed videos for model training. However, trimmed datasets are manually annotated from untrimmed videos. In this sense, these methods are not really self-supervised. In this paper, we propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV), which can be straightforwardly applied to untrimmed videos (real unlabeled) to learn spatio-temporal features. ERUV first generates single-shot videos by shot change detection. Then a designed sampling strategy is used to model relations for video clips. The strategy is saved as our self-supervision signals. Finally, the network learns representations by predicting the category of relations between the video clips. ERUV is able to compare the differences and similarities of videos, which is also an essential procedure for action and video related tasks. We validate our learned models with action recognition and video retrieval tasks with three kinds of 3D CNNs. Experimental results show that ERUV is able to learn richer representations and it outperforms state-of-the-art self-supervised methods with significant margins.

15. Pairwise Relation Learning for Semi-supervised Gland Segmentation [PDF] Back to contents
  Yutong Xie, Jianpeng Zhang, Zhibin Liao, Chunhua Shen, Johan Verjans, Yong Xia
Abstract: Accurate and automated gland segmentation on histology tissue images is an essential but challenging task in the computer-aided diagnosis of adenocarcinoma. Despite their prevalence, deep learning models always require a myriad number of densely annotated training images, which are difficult to obtain due to extensive labor and associated expert costs related to histology image annotations. In this paper, we propose the pairwise relation-based semi-supervised (PRS^2) model for gland segmentation on histology images. This model consists of a segmentation network (S-Net) and a pairwise relation network (PR-Net). The S-Net is trained on labeled data for segmentation, and PR-Net is trained on both labeled and unlabeled data in an unsupervised way to enhance its image representation ability via exploiting the semantic consistency between each pair of images in the feature space. Since both networks share their encoders, the image representation ability learned by PR-Net can be transferred to S-Net to improve its segmentation performance. We also design the object-level Dice loss to address the issues caused by touching glands and combine it with other two loss functions for S-Net. We evaluated our model against five recent methods on the GlaS dataset and three recent methods on the CRAG dataset. Our results not only demonstrate the effectiveness of the proposed PR-Net and object-level Dice loss, but also indicate that our PRS^2 model achieves the state-of-the-art gland segmentation performance on both benchmarks.

16. Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards [PDF] Back to contents
  Xuewen Yang, Heming Zhang, Di Jin, Yingru Liu, Chi-Hao Wu, Jianchao Tan, Dongliang Xie, Jue Wang, Xin Wang
Abstract: Generating accurate descriptions for online fashion items is important not only for enhancing customers' shopping experiences, but also for the increase of online sales. Besides the need of correctly presenting the attributes of items, the expressions in an enchanting style could better attract customer interests. The goal of this work is to develop a novel learning framework for accurate and expressive fashion captioning. Different from popular work on image captioning, it is hard to identify and describe the rich attributes of fashion items. We seed the description of an item by first identifying its attributes, and introduce attribute-level semantic (ALS) reward and sentence-level semantic (SLS) reward as metrics to improve the quality of text descriptions. We further integrate the training of our model with maximum likelihood estimation (MLE), attribute embedding, and Reinforcement Learning (RL). To facilitate the learning, we build a new FAshion CAptioning Dataset (FACAD), which contains 993K images and 130K corresponding enchanting and diverse descriptions. Experiments on FACAD demonstrate the effectiveness of our model.

17. Learnable Graph Inception Network for Emotion Recognition [PDF] Back to contents
  A. Shirian, S. Tripathi, T. Guha
Abstract: Analyzing emotion from verbal and non-verbal behavioral cues is critical for many intelligent human-centric systems. The emotional cues can be captured using audio, video, motion-capture (mocap) or other modalities. We propose a generalized graph approach to emotion recognition that can take any time-varying (dynamic) data modality as input. To alleviate the problem of optimal graph construction, we cast this as a joint graph learning and classification task. To this end, we present the \emph{Learnable Graph Inception Network} (L-GrIN) that jointly learns to recognize emotion and to identify the underlying graph structure in data. Our architecture comprises multiple novel components: a new graph convolution operation, a graph inception layer, learnable adjacency, and a learnable pooling function that yields a graph-level embedding. We evaluate the proposed architecture on four benchmark emotion recognition databases spanning three different modalities (video, audio, mocap), where each database captures one of the following emotional cues: facial expressions, speech and body gestures. We achieve state-of-the-art performance on all databases outperforming several competitive baselines and relevant existing methods.

18. Noisy Student Training using Body Language Dataset Improves Facial Expression Recognition [PDF] Back to contents
  Vikas Kumar, Shivansh Rao, Li Yu
Abstract: Facial expression recognition from videos in the wild is a challenging task due to the lack of abundant labelled training data. Large DNN (deep neural network) architectures and ensemble methods have resulted in better performance, but soon reach saturation at some point due to data inadequacy. In this paper, we use a self-training method that utilizes a combination of a labelled dataset and an unlabelled dataset (Body Language Dataset - BoLD). Experimental analysis shows that training a noisy student network iteratively helps in achieving significantly better results. Additionally, our model isolates different regions of the face and processes them independently using a multi-level attention mechanism which further boosts the performance. Our results show that the proposed method achieves state-of-the-art performance on benchmark datasets CK+ and AFEW 8.0 when compared to other single models.

19. Approach for document detection by contours and contrasts [PDF] Back to contents
  Daniil V. Tropin, Sergey A. Ilyuhin, Dmitry P. Nikolaev, Vladimir V. Arlazarov
Abstract: This paper considers the task of arbitrary document detection performed on a mobile device. The classical contour-based approach often mishandles cases with occlusion, complex background, or blur. The region-based approach, which relies on the contrast between object and background, does not have these limitations; however, its known implementations are highly resource-consuming. We propose a modification of a contour-based method, in which the competing hypotheses of the contour location are ranked according to the contrast between the areas inside and outside the border. In the performed experiments, such modification leads to a 40% decrease of alternatives ordering errors and a 10% decrease of the overall number of detection errors. We updated state-of-the-art performance on the open MIDV-500 dataset and demonstrated competitive results with the state-of-the-art on the SmartDoc dataset.
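A rough OpenCV sketch of the ranking idea, under simplifying assumptions: quadrilateral hypotheses are taken from Canny edges and scored by the difference between the mean intensity inside the contour and in a thin band just outside it. The candidate generation, margin, and thresholds are placeholders, not the method's actual parameters.
```python
import cv2
import numpy as np

def contrast_score(gray, contour, margin=5):
    """Score a document-boundary hypothesis by inside/outside intensity contrast."""
    inside = np.zeros(gray.shape, np.uint8)
    cv2.drawContours(inside, [contour], -1, 255, thickness=-1)        # filled region
    ring = cv2.dilate(inside, np.ones((2 * margin + 1,) * 2, np.uint8))
    outside = cv2.subtract(ring, inside)                               # thin outer band
    mean_in = cv2.mean(gray, mask=inside)[0]
    mean_out = cv2.mean(gray, mask=outside)[0]
    return abs(mean_in - mean_out)

def rank_candidates(gray):
    """Generate quadrilateral hypotheses and rank them by contrast (highest first). OpenCV 4.x."""
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    quads = [cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
             for c in contours if cv2.contourArea(c) > 1000]
    quads = [q for q in quads if len(q) == 4]
    return sorted(quads, key=lambda q: contrast_score(gray, q), reverse=True)
```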

20. Image Generation for Efficient Neural Network Training in Autonomous Drone Racing [PDF] Back to contents
  Theo Morales, Andriy Sarabakha, Erdal Kayacan
Abstract: Drone racing is a recreational sport in which the goal is to pass through a sequence of gates in a minimum amount of time while avoiding collisions. In autonomous drone racing, one must accomplish this task by flying fully autonomously in an unknown environment by relying only on computer vision methods for detecting the target gates. Due to the challenges such as background objects and varying lighting conditions, traditional object detection algorithms based on colour or geometry tend to fail. Convolutional neural networks offer impressive advances in computer vision but require an immense amount of data to learn. Collecting this data is a tedious process because the drone has to be flown manually, and the data collected can suffer from sensor failures. In this work, a semi-synthetic dataset generation method is proposed, using a combination of real background images and randomised 3D renders of the gates, to provide a limitless amount of training samples that do not suffer from those drawbacks. Using the detection results, a line-of-sight guidance algorithm is used to cross the gates. In several experimental real-time tests, the proposed framework successfully demonstrates fast and reliable detection and navigation.

21. MED-TEX: Transferring and Explaining Knowledge with Less Data from Pretrained Medical Imaging Models [PDF] Back to contents
  Thanh Nguyen-Duc, He Zhao, Jianfei Cai, Dinh Phung
Abstract: Deep neural network based image classification methods usually require a large amount of training data and lack interpretability, which are critical in the medical imaging domain. In this paper, we develop a novel knowledge distillation and model interpretation framework for medical image classification that jointly solves the above two issues. Specifically, to address the data-hungry issue, we propose to learn a small student model with less data by distilling knowledge only from a cumbersome pretrained teacher model. To interpret the teacher model as well as assisting the learning of the student, an explainer module is introduced to highlight the regions of an input medical image that are important for the predictions of the teacher model. Furthermore, the joint framework is trained by a principled way derived from the information-theoretic perspective. Our framework performance is demonstrated by the comprehensive experiments on the knowledge distillation and model interpretation tasks compared to state-of-the-art methods on a fundus disease dataset.

22. IIIT-AR-13K: A New Dataset for Graphical Object Detection in Documents [PDF] Back to contents
  Ajoy Mondal, Peter Lipps, C. V. Jawahar
Abstract: We introduce a new dataset for graphical object detection in business documents, more specifically annual reports. This dataset, IIIT-AR-13k, is created by manually annotating the bounding boxes of graphical or page objects in publicly available annual reports. This dataset contains a total of 13k annotated page images with objects in five different popular categories table, figure, natural image, logo, and signature. It is the largest manually annotated dataset for graphical object detection. Annual reports created in multiple languages for several years from various companies bring high diversity into this dataset. We benchmark IIIT-AR-13K dataset with two state of the art graphical object detection techniques using Faster R-CNN [20] and Mask R-CNN [11] and establish high baselines for further research. Our dataset is highly effective as training data for developing practical solutions for graphical object detection in both business documents and technical articles. By training with IIIT-AR-13K, we demonstrate the feasibility of a single solution that can report superior performance compared to the equivalent ones trained with a much larger amount of data, for table detection. We hope that our dataset helps in advancing the research for detecting various types of graphical objects in business documents.

23. Fast Approximate Modelling of the Next Combination Result for Stopping the Text Recognition in a Video [PDF] Back to contents
  Konstantin Bulatov, Nadezhda Fedotova, Vladimir V. Arlazarov
Abstract: In this paper, we consider a task of stopping the video stream recognition process of a text field, in which each frame is recognized independently and the individual results are combined together. The video stream recognition stopping problem is an under-researched topic with regards to computer vision, but its relevance for building high-performance video recognition systems is clear. Firstly, we describe an existing method of optimally stopping such a process based on a modelling of the next combined result. Then, we describe approximations and assumptions which allowed us to build an optimized computation scheme and thus obtain a method with reduced computational complexity. The methods were evaluated for the tasks of document text field recognition and arbitrary text recognition in a video. The experimental comparison shows that the introduced approximations do not diminish the quality of the stopping method in terms of the achieved combined result precision, while dramatically reducing the time required to make the stopping decision. The results were consistent for both text recognition tasks.

24. Modeling Data Reuse in Deep Neural Networks by Taking Data-Types into Cognizance [PDF] Back to contents
  Nandan Kumar Jha, Sparsh Mittal
Abstract: In recent years, researchers have focused on reducing the model size and number of computations (measured as "multiply-accumulate" or MAC operations) of DNNs. The energy consumption of a DNN depends on both the number of MAC operations and the energy efficiency of each MAC operation. The former can be estimated at design time; however, the latter depends on the intricate data reuse patterns and underlying hardware architecture. Hence, estimating it at design time is challenging. This work shows that the conventional approach to estimate the data reuse, viz. arithmetic intensity, does not always correctly estimate the degree of data reuse in DNNs since it gives equal importance to all the data types. We propose a novel model, termed "data type aware weighted arithmetic intensity" ($DI$), which accounts for the unequal importance of different data types in DNNs. We evaluate our model on 25 state-of-the-art DNNs on two GPUs. We show that our model accurately models data-reuse for all possible data reuse patterns for different types of convolution and different types of layers. We show that our model is a better indicator of the energy efficiency of DNNs. We also show its generality using the central limit theorem.
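For context, the sketch below computes the conventional arithmetic intensity (MACs per byte of data accessed) of a convolutional layer, the baseline metric the abstract argues can misjudge data reuse because it weighs all data types equally; the exact form of the proposed data-type-aware $DI$ is not given in the abstract, so it is not reproduced here.
```python
def conv_arithmetic_intensity(c_in, c_out, k, h_out, w_out, bytes_per_elem=4):
    """Conventional arithmetic intensity of a conv layer: MACs per byte of data.

    Assumes stride 1 with 'same' padding, so the input map is roughly the same size
    as the output map. Weights, input activations, and output activations are each
    counted once, i.e. with equal importance -- the assumption DI is meant to relax.
    """
    macs = c_out * h_out * w_out * c_in * k * k
    weights = c_out * c_in * k * k
    act_in = c_in * h_out * w_out
    act_out = c_out * h_out * w_out
    bytes_moved = (weights + act_in + act_out) * bytes_per_elem
    return macs / bytes_moved

# Example: a 3x3 conv mapping 64 -> 128 channels on a 56x56 feature map.
print(conv_arithmetic_intensity(64, 128, 3, 56, 56))
```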

25. Handwritten Character Recognition from Wearable Passive RFID [PDF] Back to contents
  Leevi Raivio, Han He, Johanna Virkki, Heikki Huttunen
Abstract: In this paper we study the recognition of handwritten characters from data captured by a novel wearable electro-textile sensor panel. The data is collected sequentially, such that we record both the stroke order and the resulting bitmap. We propose a preprocessing pipeline that fuses the sequence and bitmap representations together. The data is collected from ten subjects containing altogether 7500 characters. We also propose a convolutional neural network architecture, whose novel upsampling structure enables successful use of conventional ImageNet pretrained networks, despite the small input size of only 10x10 pixels. The proposed model reaches 72\% accuracy in experimental tests, which can be considered good accuracy for this challenging dataset. Both the data and the model are released to the public.

26. Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [PDF] Back to contents
  Li Tao, Xueting Wang, Toshihiko Yamasaki
Abstract: We propose a self-supervised method to learn feature representations from videos. A standard approach in traditional self-supervised methods uses positive-negative data pairs to train with contrastive learning strategy. In such a case, different modalities of the same video are treated as positives and video clips from a different video are treated as negatives. Because the spatio-temporal information is important for video representation, we extend the negative samples by introducing intra-negative samples, which are transformed from the same anchor video by breaking temporal relations in video clips. With the proposed inter-intra contrastive framework, we can train spatio-temporal convolutional networks to learn video representations. There are many flexible options in our proposed framework and we conduct experiments by using several different configurations. Evaluations are conducted on video retrieval and video recognition tasks using the learned video representation. Our proposed methods outperform current state-of-the-art results by a large margin, such as 16.7% and 9.5% points improvements in top-1 accuracy on UCF101 and HMDB51 datasets for video retrieval, respectively. For video recognition, improvements can also be obtained on these two benchmark datasets.
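A minimal InfoNCE-style sketch of the inter-intra idea described above: for each anchor clip, a second view of the same clip (e.g., a different modality) is the positive, other videos in the batch act as inter-negatives, and a temporally broken clip of the same video acts as an intra-negative. The embedding dimension, temperature, and loss layout are assumptions, not the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def inter_intra_nce(anchor, positive, intra_negative, temperature=0.07):
    """Contrastive loss with inter-negatives (other videos in the batch) and
    intra-negatives (temporally broken clips of the same video).

    anchor, positive, intra_negative: (B, D) embeddings from the two branches.
    """
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    n_intra = F.normalize(intra_negative, dim=1)

    logits_inter = a @ p.t() / temperature                                # (B, B): diagonal = positives
    logits_intra = (a * n_intra).sum(dim=1, keepdim=True) / temperature  # (B, 1): intra-negative column
    logits = torch.cat([logits_inter, logits_intra], dim=1)
    targets = torch.arange(a.size(0), device=a.device)                    # positive for row i is column i
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings standing in for the two-branch encoder outputs.
B, D = 8, 128
loss = inter_intra_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```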

27. Dual Gaussian-based Variational Subspace Disentanglement for Visible-Infrared Person Re-Identification [PDF] Back to contents
  Nan Pu, Wei Chen, Yu Liu, Erwin M. Bakker, Michael S. Lew
Abstract: Visible-infrared person re-identification (VI-ReID) is a challenging and essential task in night-time intelligent surveillance systems. Except for the intra-modality variance that RGB-RGB person re-identification mainly overcomes, VI-ReID suffers from additional inter-modality variance caused by the inherent heterogeneous gap. To solve the problem, we present a carefully designed dual Gaussian-based variational auto-encoder (DG-VAE), which disentangles an identity-discriminable and an identity-ambiguous cross-modality feature subspace, following a mixture-of-Gaussians (MoG) prior and a standard Gaussian distribution prior, respectively. Disentangling cross-modality identity-discriminable features leads to more robust retrieval for VI-ReID. To achieve efficient optimization like conventional VAE, we theoretically derive two variational inference terms for the MoG prior under the supervised setting, which not only restricts the identity-discriminable subspace so that the model explicitly handles the cross-modality intra-identity variance, but also enables the MoG distribution to avoid posterior collapse. Furthermore, we propose a triplet swap reconstruction (TSR) strategy to promote the above disentangling process. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two VI-ReID datasets.

28. Object-based Illumination Estimation with Rendering-aware Neural Networks [PDF] Back to contents
  Xin Wei, Guojun Chen, Yue Dong, Stephen Lin, Xin Tong
Abstract: We present a scheme for fast environment light estimation from the RGBD appearance of individual objects and their local image areas. Conventional inverse rendering is too computationally demanding for real-time applications, and the performance of purely learning-based techniques may be limited by the meager input data available from individual objects. To address these issues, we propose an approach that takes advantage of physical principles from inverse rendering to constrain the solution, while also utilizing neural networks to expedite the more computationally expensive portions of its processing, to increase robustness to noisy input data as well as to improve temporal and spatial stability. This results in a rendering-aware system that estimates the local illumination distribution at an object with high accuracy and in real time. With the estimated lighting, virtual objects can be rendered in AR scenarios with shading that is consistent to the real scene, leading to improved realism.

29. Gender and Ethnicity Classification based on Palmprint and Palmar Hand Images from Uncontrolled Environment [PDF] Back to contents
  Wojciech Michal Matkowski, Adams Wai Kin Kong
Abstract: Soft biometric attributes such as gender, ethnicity or age may provide useful information for biometrics and forensics applications. Researchers used, e.g., face, gait, iris, and hand, etc. to classify such attributes. Even though hand has been widely studied for biometric recognition, relatively less attention has been given to soft biometrics from hand. Previous studies of soft biometrics based on hand images focused on gender and well-controlled imaging environment. In this paper, the gender and ethnicity classification in uncontrolled environment are considered. Gender and ethnicity labels are collected and provided for subjects in a publicly available database, which contains hand images from the Internet. Five deep learning models are fine-tuned and evaluated in gender and ethnicity classification scenarios based on palmar 1) full hand, 2) segmented hand and 3) palmprint images. The experimental results indicate that for gender and ethnicity classification in uncontrolled environment, full and segmented hand images are more suitable than palmprint images.

30. Zero-Shot Multi-View Indoor Localization via Graph Location Networks [PDF] Back to contents
  Meng-Jiun Chiou, Zhenguang Liu, Yifang Yin, Anan Liu, Roger Zimmermann
Abstract: Indoor localization is a fundamental problem in location-based applications. Current approaches to this problem typically rely on Radio Frequency technology, which requires not only supporting infrastructures but human efforts to measure and calibrate the signal. Moreover, data collection for all locations is indispensable in existing methods, which in turn hinders their large-scale deployment. In this paper, we propose a novel neural network based architecture Graph Location Networks (GLN) to perform infrastructure-free, multi-view image based indoor localization. GLN makes location predictions based on robust location representations extracted from images through message-passing networks. Furthermore, we introduce a novel zero-shot indoor localization setting and tackle it by extending the proposed GLN to a dedicated zero-shot version, which exploits a novel mechanism Map2Vec to train location-aware embeddings and make predictions on novel unseen locations. Our extensive experiments show that the proposed approach outperforms state-of-the-art methods in the standard setting, and achieves promising accuracy even in the zero-shot setting where data for half of the locations are not available. The source code and datasets are publicly available at this https URL.

31. Few-shot Classification via Adaptive Attention [PDF] Back to contents
  Zihang Jiang, Bingyi Kang, Kuangqi Zhou, Jiashi Feng
Abstract: Training a neural network model that can quickly adapt to a new task is highly desirable yet challenging for few-shot learning problems. Recent few-shot learning methods mostly concentrate on developing various meta-learning strategies from two aspects, namely optimizing an initial model or learning a distance metric. In this work, we propose a novel few-shot learning method via optimizing and fast adapting the query sample representation based on very few reference samples. To be specific, we devise a simple and efficient meta-reweighting strategy to adapt the sample representations and generate soft attention to refine the representation such that the relevant features from the query and support samples can be extracted for a better few-shot classification. Such an adaptive attention model is also able to explain what the classification model is looking for as the evidence for classification to some extent. As demonstrated experimentally, the proposed model achieves state-of-the-art classification results on various benchmark few-shot classification and fine-grained recognition datasets.

32. Graph Convolutional Networks for Hyperspectral Image Classification [PDF] Back to contents
  Danfeng Hong, Lianru Gao, Jing Yao, Bing Zhang, Antonio Plaza, Jocelyn Chanussot
Abstract: Convolutional neural networks (CNNs) have been attracting increasing attention in hyperspectral (HS) image classification, owing to their ability to capture spatial-spectral feature representations. Nevertheless, their ability in modeling relations between samples remains limited. Beyond the limitations of grid sampling, graph convolutional networks (GCNs) have been recently proposed and successfully applied in irregular (or non-grid) data representation and analysis. In this paper, we thoroughly investigate CNNs and GCNs (qualitatively and quantitatively) in terms of HS image classification. Due to the construction of the adjacency matrix on all the data, traditional GCNs usually suffer from a huge computational cost, particularly in large-scale remote sensing (RS) problems. To this end, we develop a new mini-batch GCN (called miniGCN hereinafter) which allows to train large-scale GCNs in a mini-batch fashion. More significantly, our miniGCN is capable of inferring out-of-sample data without re-training networks and improving classification performance. Furthermore, as CNNs and GCNs can extract different types of HS features, an intuitive solution to break the performance bottleneck of a single model is to fuse them. Since miniGCNs can perform batch-wise network training (enabling the combination of CNNs and GCNs) we explore three fusion strategies: additive fusion, element-wise multiplicative fusion, and concatenation fusion to measure the obtained performance gain. Extensive experiments, conducted on three HS datasets, demonstrate the advantages of miniGCNs over GCNs and the superiority of the tested fusion strategies with regards to the single CNN or GCN models. The codes of this work will be available at this https URL for the sake of reproducibility.
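A minimal sketch of one graph convolution with symmetric adjacency normalization, the standard propagation rule that both full-batch GCNs and the mini-batch miniGCN variant build on; how the per-batch subgraphs and adjacency matrices are actually constructed in miniGCN is not shown, and the details below are assumptions.
```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # adj: dense (N, N) adjacency of the (sub)graph in the current batch.
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)          # add self-loops
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm_adj = deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(norm_adj @ x))

# Toy mini-batch: 32 pixels/nodes with 200 spectral bands and a random symmetric adjacency.
x = torch.randn(32, 200)
adj = (torch.rand(32, 32) > 0.9).float()
adj = ((adj + adj.t()) > 0).float()
out = GCNLayer(200, 16)(x, adj)
print(out.shape)  # torch.Size([32, 16])
```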

33. Structured Convolutions for Efficient Neural Network Design [PDF] 返回目录
  Yash Bhalgat, Yizhe Zhang, Jamie Lin, Fatih Porikli
Abstract: In this work, we tackle model efficiency by exploiting redundancy in the \textit{implicit structure} of the building blocks of convolutional neural networks. We start our analysis by introducing a general definition of Composite Kernel structures that enable the execution of convolution operations in the form of efficient, scaled, sum-pooling components. As its special case, we propose \textit{Structured Convolutions} and show that these allow decomposition of the convolution operation into a sum-pooling operation followed by a convolution with significantly lower complexity and fewer weights. We show how this decomposition can be applied to 2D and 3D kernels as well as the fully-connected layers. Furthermore, we present a Structural Regularization loss that promotes neural network layers to leverage on this desired structure in a way that, after training, they can be decomposed with negligible performance loss. By applying our method to a wide range of CNN architectures, we demonstrate "structured" versions of the ResNets that are up to 2$\times$ smaller and a new Structured-MobileNetV2 that is more efficient while staying within an accuracy loss of 1% on ImageNet and CIFAR-10 datasets. We also show similar structured versions of EfficientNet on ImageNet and HRNet architecture for semantic segmentation on the Cityscapes dataset. Our method performs equally well or superior in terms of the complexity reduction in comparison to the existing tensor decomposition and channel pruning methods.
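
The decomposition claim can be checked numerically in the single-channel 2D case: a convolution with a "structured" kernel, in which each small weight is spread over an n x n all-ones block, equals a stride-1 sum-pooling followed by a convolution with the small kernel. The sketch below verifies this equality; the multi-channel, 3D, and fully-connected variants described in the paper are not covered, and the sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, m = 3, 3                      # sum-pooling block size and small-kernel size
N = m + n - 1                    # size of the full structured kernel (5x5 here)

w = torch.randn(1, 1, m, m)      # the small weights (assumption: single channel)
ones = torch.ones(1, 1, n, n)
# Build the equivalent big structured kernel: each weight spread over an n x n block
K = F.conv2d(w, ones, padding=n - 1)                      # shape (1, 1, N, N)

x = torch.randn(1, 1, 16, 16)
# Direct convolution with the big structured kernel
y_direct = F.conv2d(x, K)
# Decomposition: stride-1 sum-pooling followed by convolution with the small kernel
s = F.avg_pool2d(x, kernel_size=n, stride=1) * (n * n)    # sum-pooling
y_decomposed = F.conv2d(s, w)

print(torch.allclose(y_direct, y_decomposed, atol=1e-4))  # True
```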

34. Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [PDF] 返回目录
  Xiaoye Qu, Pengwei Tang, Zhikang Zhou, Yu Cheng, Jianfeng Dong, Pan Zhou
Abstract: Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query. To tackle this task, designing an effective model to extract grounding information from both visual and textual modalities is crucial. However, most previous attempts in this field only focus on unidirectional interactions from video to query, which emphasize which words to listen to and attend to sentence information via vanilla soft attention, but clues from query-by-video interactions implying where to look are not taken into consideration. In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction. Specifically, in the iterative attention module, each word in the query is first enhanced by attending to each frame in the video through fine-grained attention; then the video iteratively attends to the integrated query. Finally, both video and query information is utilized to provide robust cross-modal representation for further moment localization. In addition, to better predict the target segment, we propose a content-oriented localization strategy instead of applying recent anchor-based localization. We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA. FIAN significantly outperforms the state-of-the-art approaches.
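
A toy, un-trained rendition of one round of the bilateral attention described above: each query word first attends to every frame (fine-grained word-to-frame attention), and the frames then attend to the integrated query. The scaled dot-product form, the mean pooling of the words, and the feature sizes are assumptions for illustration; the paper stacks this iteratively with learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, L, d = 12, 6, 32                       # frames, query words, feature dim (toy sizes)
video = rng.normal(size=(T, d))           # per-frame visual features (assumed given)
query = rng.normal(size=(L, d))           # per-word textual features (assumed given)

# Step 1 (word -> frame): every word attends to every frame ("fine-grained")
w2f = softmax(query @ video.T / np.sqrt(d), axis=1)             # (L, T)
enhanced_words = query + w2f @ video                            # words enriched with visual context

# Step 2 (video -> query): frames attend to the integrated query representation
integrated_query = enhanced_words.mean(axis=0, keepdims=True)   # (1, d)
f2q = softmax(video @ integrated_query.T / np.sqrt(d), axis=0)  # (T, 1)
attended_video = video * f2q                                    # frame features reweighted

print(enhanced_words.shape, attended_video.shape)
```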

35. Group Activity Prediction with Sequential Relational Anticipation Model [PDF] 返回目录
  Junwen Chen, Wentao Bao, Yu Kong
Abstract: In this paper, we propose a novel approach to predict group activities given the beginning frames with incomplete activity executions. Existing action prediction approaches learn to enhance the representation power of the partial observation. However, for group activity prediction, the relation evolution of people's activity and their positions over time is an important cue for predicting group activity. To this end, we propose a sequential relational anticipation model (SRAM) that summarizes the relational dynamics in the partial observation and progressively anticipates the group representations with rich discriminative information. Our model explicitly anticipates both activity features and positions by two graph auto-encoders, aiming to learn a discriminative group representation for group activity prediction. Experimental results on two popularly used datasets demonstrate that our approach significantly outperforms the state-of-the-art activity prediction methods.

36. Data-driven Meta-set Based Fine-Grained Visual Classification [PDF] 返回目录
  Chuanyi Zhang, Yazhou Yao, Xiangbo Shu, Zechao Li, Zhenmin Tang, Qi Wu
Abstract: Constructing fine-grained image datasets typically requires domain-specific expert knowledge, which is not always available for crowd-sourcing platform annotators. Accordingly, learning directly from web images becomes an alternative method for fine-grained visual recognition. However, label noise in the web training set can severely degrade the model performance. To this end, we propose a data-driven meta-set based approach to deal with noisy web images for fine-grained recognition. Specifically, guided by a small amount of clean meta-set, we train a selection net in a meta-learning manner to distinguish in and out-of-distribution noisy images. To further boost the robustness of model, we also learn a labeling net to correct the labels of in-distribution noisy data. In this way, our proposed method can alleviate the harmful effects caused by out-of-distribution noise and properly exploit the in-distribution noisy samples for training. Extensive experiments on three commonly used fine-grained datasets demonstrate that our approach is much superior to state-of-the-art noise-robust methods.
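
The abstract describes a learned selection net and labeling net guided by a small clean meta-set; the sketch below substitutes a much simpler nearest-centroid proxy for both, just to illustrate the overall flow (score web samples against the meta-set, drop likely out-of-distribution ones, relabel the in-distribution ones). The similarity measure, the 30% threshold, and the relabeling rule are assumptions, not the paper's method.

```python
import numpy as np

def l2n(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
n_classes, d = 4, 64
# Small clean meta-set: a few verified embeddings per class (assumed to exist)
meta = l2n(rng.normal(size=(n_classes, 5, d)))
centroids = l2n(meta.mean(axis=1))                        # (n_classes, d)

# Noisy web images: embeddings plus (possibly wrong) labels
web = l2n(rng.normal(size=(100, d)))
web_labels = rng.integers(0, n_classes, size=100)

sim = web @ centroids.T                                   # (100, n_classes)
score = sim.max(axis=1)                                   # agreement with the meta-set
tau = np.quantile(score, 0.3)                             # threshold is an assumption

in_dist = score >= tau                                    # kept for training
corrected = np.where(in_dist, sim.argmax(axis=1), -1)     # relabel in-distribution samples
print(f"kept {in_dist.sum()} of {len(web)} web samples; "
      f"{np.sum(corrected[in_dist] != web_labels[in_dist])} labels corrected")
```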

37. GL-GAN: Adaptive Global and Local Bilevel Optimization model of Image Generation [PDF] 返回目录
  Ying Liu, Wenhong Cai, Xiaohui Yuan, Jinhai Xiang
Abstract: Although Generative Adversarial Networks have shown remarkable performance in image generation, there are some challenges in image realism and convergence speed. The results of some models display the imbalances of quality within a generated image, in which some defective parts appear compared with other regions. Different from general single global optimization methods, we introduce an adaptive global and local bilevel optimization model(GL-GAN). The model achieves the generation of high-resolution images in a complementary and promoting way, where global optimization is to optimize the whole images and local is only to optimize the low-quality areas. With a simple network structure, GL-GAN is allowed to effectively avoid the nature of imbalance by local bilevel optimization, which is accomplished by first locating low-quality areas and then optimizing them. Moreover, by using feature map cues from discriminator output, we propose the adaptive local and global optimization method(Ada-OP) for specific implementation and find that it boosts the convergence speed. Compared with the current GAN methods, our model has shown impressive performance on CelebA, CelebA-HQ and LSUN datasets.

38. Salvage Reusable Samples from Noisy Data for Robust Learning [PDF] 返回目录
  Zeren Sun, Xian-Sheng Hua, Yazhou Yao, Xiu-Shen Wei, Guosheng Hu, Jian Zhang
Abstract: Due to the existence of label noise in web images and the high memorization capacity of deep neural networks, training deep fine-grained (FG) models directly through web images tends to have an inferior recognition ability. In the literature, to alleviate this issue, loss correction methods try to estimate the noise transition matrix, but the inevitable false correction would cause severe accumulated errors. Sample selection methods identify clean ("easy") samples based on the fact that small losses can alleviate the accumulated errors. However, "hard" and mislabeled examples that can both boost the robustness of FG models are also dropped. To this end, we propose a certainty-based reusable sample selection and correction approach, termed as CRSSC, for coping with label noise in training deep FG models with web images. Our key idea is to additionally identify and correct reusable samples, and then leverage them together with clean examples to update the networks. We demonstrate the superiority of the proposed approach from both theoretical and experimental perspectives.
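
A rough sketch of the selection-and-correction idea: small-loss samples are kept as clean, high-loss but confidently predicted samples are treated as reusable and relabeled, and the rest are dropped. The quantile thresholds and the entropy-based certainty measure are assumptions; the paper's actual certainty criterion and training schedule may differ.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=-1)

rng = np.random.default_rng(0)
n, c = 200, 10
probs = rng.dirichlet(alpha=np.ones(c) * 0.3, size=n)    # model predictions (toy stand-in)
labels = rng.integers(0, c, size=n)                       # possibly noisy web labels
losses = -np.log(probs[np.arange(n), labels] + 1e-12)     # per-sample cross-entropy

clean = losses <= np.quantile(losses, 0.5)                # small-loss ("easy") samples
certain = entropy(probs) <= np.quantile(entropy(probs), 0.3)   # low-entropy predictions
reusable = (~clean) & certain                             # hard/mislabeled but confidently predicted
dropped = (~clean) & (~certain)

pseudo_labels = labels.copy()
pseudo_labels[reusable] = probs[reusable].argmax(axis=1)  # correct the reusable samples
print(clean.sum(), reusable.sum(), dropped.sum())
```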

39. Cross-Model Image Annotation Platform with Active Learning [PDF] 返回目录
  Ng Hui Xian Lynnette, Henry Ng Siong Hock, Nguwi Yok Yen
Abstract: We have seen significant leapfrog advancement in machine learning in recent decades. The central idea of machine learnability lies in constructing learning algorithms that learn from good data. The growing amount of publicly available data has also accelerated the growth of AI in recent years. In the domain of computer vision, the quality of image data arises from the accuracy of image annotation. Labeling a large volume of image data is a daunting and tedious task. This work presents an end-to-end pipeline tool for object annotation and recognition that aims at enabling quick image labeling. We have developed a modular image annotation platform which seamlessly incorporates assisted image annotation (annotation assistance), active learning, and model training and evaluation. Our approach provides a number of advantages over current image annotation tools. Firstly, the annotation assistance utilizes a reference hierarchy and reference images to locate the objects in the images, thus reducing the need for annotating the whole object. Secondly, images can be annotated using polygon points, allowing objects of any shape to be annotated. Thirdly, it is also interoperable across several image models, and the tool provides an interface for object model training and evaluation across a series of pre-trained models. We have tested the model and embedded several benchmark deep learning models. The highest accuracy achieved is 74%.

40. StyleFlow: Attribute-conditioned Exploration of StyleGAN-Generated Images using Conditional Continuous Normalizing Flows [PDF] 返回目录
  Rameen Abdal, Peihao Zhu, Niloy Mitra, Peter Wonka
Abstract: High-quality, diverse, and photorealistic images can now be generated by unconditional GANs (e.g., StyleGAN). However, limited options exist to control the generation process using (semantic) attributes, while still preserving the quality of the output. Further, due to the entangled nature of the GAN latent space, performing edits along one attribute can easily result in unwanted changes along other attributes. In this paper, in the context of conditional exploration of entangled latent spaces, we investigate the two sub-problems of attribute-conditioned sampling and attribute-controlled editing. We present StyleFlow as a simple, effective, and robust solution to both the sub-problems by formulating conditional exploration as an instance of conditional continuous normalizing flows in the GAN latent space conditioned by attribute features. We evaluate our method using the face and the car latent space of StyleGAN, and demonstrate fine-grained disentangled edits along various attributes on both real photographs and StyleGAN-generated images. For example, for faces, we vary camera pose, illumination variation, expression, facial hair, gender, and age. We show edits on synthetically generated as well as projected real images. Finally, via extensive qualitative and quantitative comparisons, we demonstrate the superiority of StyleFlow to other concurrent works.

41. Learning Illumination from Diverse Portraits [PDF] 返回目录
  Chloe LeGendre, Wan-Chun Ma, Rohit Pandey, Sean Fanello, Christoph Rhemann, Jason Dourgarian, Jay Busch, Paul Debevec
Abstract: We present a learning-based technique for estimating high dynamic range (HDR), omnidirectional illumination from a single low dynamic range (LDR) portrait image captured under arbitrary indoor or outdoor lighting conditions. We train our model using portrait photos paired with their ground truth environmental illumination. We generate a rich set of such photos by using a light stage to record the reflectance field and alpha matte of 70 diverse subjects in various expressions. We then relight the subjects using image-based relighting with a database of one million HDR lighting environments, compositing the relit subjects onto paired high-resolution background imagery recorded during the lighting acquisition. We train the lighting estimation model using rendering-based loss functions and add a multi-scale adversarial loss to estimate plausible high frequency lighting detail. We show that our technique outperforms the state-of-the-art technique for portrait-based lighting estimation, and we also show that our method reliably handles the inherent ambiguity between overall lighting strength and surface albedo, recovering a similar scale of illumination for subjects with diverse skin tones. We demonstrate that our method allows virtual objects and digital characters to be added to a portrait photograph with consistent illumination. Our lighting inference runs in real-time on a smartphone, enabling realistic rendering and compositing of virtual objects into live video for augmented reality applications.
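
The data-generation step, relighting light-stage captures with HDR environments, boils down to a linear combination of the one-light-at-a-time (OLAT) reflectance images weighted by the environment's per-light RGB intensities, followed by alpha compositing onto a background. A toy version with random arrays, assuming the environment has already been resampled to the light-stage directions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_lights, h, w = 32, 8, 8                  # toy sizes; a real light stage has many more lights
# Reflectance field: one image of the subject per light-stage light (the OLAT basis)
olat = rng.random((n_lights, h, w, 3)).astype(np.float32)
alpha = rng.random((h, w, 1)).astype(np.float32)        # alpha matte of the subject
background = rng.random((h, w, 3)).astype(np.float32)   # paired background image
# HDR environment, assumed already resampled to one RGB intensity per light direction
env = rng.random((n_lights, 3)).astype(np.float32) / n_lights

# Image-based relighting: a linear combination of the OLAT images weighted by the environment
relit = np.einsum('lhwc,lc->hwc', olat, env)
# Composite the relit subject onto the background with the alpha matte
composite = alpha * relit + (1.0 - alpha) * background
print(relit.shape, composite.shape)
```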

42. A Neural-Symbolic Framework for Mental Simulation [PDF] 返回目录
  Michael Kissner
Abstract: We present a neural-symbolic framework for observing the environment and continuously learning visual semantics and intuitive physics to reproduce them in an interactive simulation. The framework consists of five parts, a neural-symbolic hybrid network based on capsules for inverse graphics, an episodic memory to store observations, an interaction network for intuitive physics, a meta-learning agent that continuously improves the framework and a querying language that acts as the framework's interface for simulation. By means of lifelong meta-learning, the capsule network is expanded and trained continuously, in order to better adapt to its environment with each iteration. This enables it to learn new semantics using a few-shot approach and with minimal input from an oracle over its lifetime. From what it learned through observation, the part for intuitive physics infers all the required physical properties of the objects in a scene, enabling predictions. Finally, a custom query language ties all parts together, which allows to perform various mental simulation tasks, such as navigation, sorting and simulation of a game environment, with which we illustrate the potential of our novel approach.

43. Axiom-based Grad-CAM: Towards Accurate Visualization and Explanation of CNNs [PDF] 返回目录
  Ruigang Fu, Qingyong Hu, Xiaohu Dong, Yulan Guo, Yinghui Gao, Biao Li
Abstract: To have a better understanding and usage of Convolution Neural Networks (CNNs), the visualization and interpretation of CNNs has attracted increasing attention in recent years. In particular, several Class Activation Mapping (CAM) methods have been proposed to discover the connection between CNN's decision and image regions. In spite of the reasonable visualization, lack of clear and sufficient theoretical support is the main limitation of these methods. In this paper, we introduce two axioms -- Conservation and Sensitivity -- to the visualization paradigm of the CAM methods. Meanwhile, a dedicated Axiom-based Grad-CAM (XGrad-CAM) is proposed to satisfy these axioms as much as possible. Experiments demonstrate that XGrad-CAM is an enhanced version of Grad-CAM in terms of conservation and sensitivity. It is able to achieve better visualization performance than Grad-CAM, while also be class-discriminative and easy-to-implement compared with Grad-CAM++ and Ablation-CAM. The code is available at this https URL.
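
The class activation map itself has a compact form: channel weights are gradients averaged with activation-proportional weights, followed by a ReLU over the weighted sum of feature maps. The sketch below follows the XGrad-CAM weighting as I understand it from the paper; treat the exact normalization as an assumption, and note that real usage would take activations and gradients from a CNN's target layer rather than random arrays.

```python
import numpy as np

def xgrad_cam(activations, gradients, eps=1e-8):
    """Axiom-motivated class activation map. `activations`: feature maps A_k of the
    target conv layer, shape (K, H, W); `gradients`: d(class score)/dA_k, same shape.
    Channel weights are gradients averaged with activation-proportional weights."""
    norm = activations.sum(axis=(1, 2), keepdims=True) + eps       # (K, 1, 1)
    weights = (activations / norm * gradients).sum(axis=(1, 2))    # alpha_k, shape (K,)
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)  # ReLU
    return cam / (cam.max() + eps)                                 # normalize for display

# Toy feature maps and gradients in place of a real CNN forward/backward pass
rng = np.random.default_rng(0)
A = rng.random((16, 7, 7))
G = rng.normal(size=(16, 7, 7))
print(xgrad_cam(A, G).shape)   # (7, 7)
```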

44. A Sensitivity Analysis Approach for Evaluating a Radar Simulation for Virtual Testing of Autonomous Driving Functions [PDF] 返回目录
  Anthony Ngo, Max Paul Bauer, Michael Resch
Abstract: Simulation-based testing is a promising approach to significantly reduce the validation effort of automated driving functions. Realistic models of environment perception sensors such as camera, radar and lidar play a key role in this testing strategy. A generally accepted method to validate these sensor models does not yet exist. Particularly radar has traditionally been one of the most difficult sensors to model. Although promising as an alternative to real test drives, virtual tests are time-consuming due to the fact that they simulate the entire radar system in detail, using computation-intensive simulation techniques to approximate the propagation of electromagnetic waves. In this paper, we introduce a sensitivity analysis approach for developing and evaluating a radar simulation, with the objective to identify the parameters with the greatest impact regarding the system under test. A modular radar system simulation is presented and parameterized to conduct a sensitivity analysis in order to evaluate a spatial clustering algorithm as the system under test, while comparing the output from the radar model to real driving measurements to ensure a realistic model behavior. The presented approach is evaluated and it is demonstrated that with this approach results from different situations can be traced back to the contribution of the individual sub-modules of the radar simulation.

45. Deep Learning Based Defect Detection for Solder Joints on Industrial X-Ray Circuit Board Images [PDF] 返回目录
  Qianru Zhang, Meng Zhang, Chinthaka Gamanayake, Chau Yuen, Zehao Geng, Hirunima Jayasekaraand, Xuewen Zhang, Chia-wei Woo, Jenny Low, Xiang Liu
Abstract: Quality control is of vital importance during electronics production. As the methods of producing electronic circuits improve, there is an increasing chance of solder defects during assembling the printed circuit board (PCB). Many technologies have been incorporated for inspecting failed soldering, such as X-ray imaging, optical imaging, and thermal imaging. With some advanced algorithms, the new technologies are expected to control the production quality based on the digital images. However, current algorithms sometimes are not accurate enough to meet the quality control. Specialists are needed to do a follow-up checking. For automated X-ray inspection, joint of interest on the X-ray image is located by region of interest (ROI) and inspected by some algorithms. Some incorrect ROIs deteriorate the inspection algorithm. The high dimension of X-ray images and the varying sizes of image dimensions also challenge the inspection algorithms. On the other hand, recent advances on deep learning shed light on image-based tasks and are competitive to human levels. In this paper, deep learning is incorporated in X-ray imaging based quality control during PCB quality inspection. Two artificial intelligence (AI) based models are proposed and compared for joint defect detection. The noised ROI problem and the varying sizes of imaging dimension problem are addressed. The efficacy of the proposed methods are verified through experimenting on a real-world 3D X-ray dataset. By incorporating the proposed methods, specialist inspection workload is largely saved.

46. Gibbs Sampling with People [PDF] 返回目录
  Peter M. C. Harrison, Raja Marjieh, Federico Adolfi, Pol van Rijn, Manuel Anglada-Tort, Ofer Tchernichovski, Pauline Larrouy-Maestri, Nori Jacoby
Abstract: A core problem in cognitive science and machine learning is to understand how humans derive semantic representations from perceptual objects, such as color from an apple, pleasantness from a musical chord, or trustworthiness from a face. Markov Chain Monte Carlo with People (MCMCP) is a prominent method for studying such representations, in which participants are presented with binary choice trials constructed such that the decisions follow a Markov Chain Monte Carlo acceptance rule. However, MCMCP's binary choice paradigm generates relatively little information per trial, and its local proposal function makes it slow to explore the parameter space and find the modes of the distribution. Here we therefore generalize MCMCP to a continuous-sampling paradigm, where in each iteration the participant uses a slider to continuously manipulate a single stimulus dimension to optimize a given criterion such as 'pleasantness'. We formulate both methods from a utility-theory perspective, and show that the new method can be interpreted as 'Gibbs Sampling with People' (GSP). Further, we introduce an aggregation parameter to the transition step, and show that this parameter can be manipulated to flexibly shift between Gibbs sampling and deterministic optimization. In an initial study, we show GSP clearly outperforming MCMCP; we then show that GSP provides novel and interpretable results in three other domains, namely musical chords, vocal emotions, and faces. We validate these results through large-scale perceptual rating experiments. The final experiments combine GSP with a state-of-the-art image synthesis network (StyleGAN) and a recent network interpretability technique (GANSpace), enabling GSP to efficiently explore high-dimensional perceptual spaces, and demonstrating how GSP can be a powerful tool for jointly characterizing semantic representations in humans and machines.
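
GSP can be simulated end-to-end by replacing the human participant with a synthetic utility function: in every iteration one stimulus dimension (the "slider") is resampled from its conditional distribution while the others stay fixed, which is exactly a Gibbs sweep. The Boltzmann temperature below plays the role of the aggregation parameter that interpolates between sampling and deterministic optimization; the utility function, grid, and temperature are all illustrative assumptions.

```python
import numpy as np

def utility(x):
    """Stand-in for a participant's internal preference ('pleasantness'):
    a simple peaked function over a 3-D stimulus space."""
    target = np.array([0.2, -0.5, 0.8])
    return -np.sum((x - target) ** 2)

def gsp_chain(n_iters=30, n_grid=101, temp=0.05, seed=0):
    """Simulated Gibbs Sampling with People: each step resamples one stimulus
    dimension (the slider) from its conditional, holding the others fixed.
    A real experiment replaces `utility` with a human adjusting the slider."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, size=3)
    slider = np.linspace(-1, 1, n_grid)
    samples = []
    for t in range(n_iters):
        dim = t % 3                                    # cycle through dimensions
        candidates = np.tile(x, (n_grid, 1))
        candidates[:, dim] = slider
        u = np.array([utility(c) for c in candidates])
        p = np.exp((u - u.max()) / temp)
        p /= p.sum()
        x = candidates[rng.choice(n_grid, p=p)]        # Boltzmann choice along the slider
        samples.append(x.copy())
    return np.array(samples)

chain = gsp_chain()
print("final sample:", np.round(chain[-1], 2))
```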

47. Optical Flow and Mode Selection for Learning-based Video Coding [PDF] 返回目录
  Théo Ladune, Pierrick Philippe, Wassim Hamidouche, Lu Zhang, Olivier Déforges
Abstract: This paper introduces a new method for inter-frame coding based on two complementary autoencoders: MOFNet and CodecNet. MOFNet aims at computing and conveying the Optical Flow and a pixel-wise coding Mode selection. The optical flow is used to perform a prediction of the frame to code. The coding mode selection enables competition between direct copy of the prediction or transmission through CodecNet. The proposed coding scheme is assessed under the Challenge on Learned Image Compression 2020 (CLIC20) P-frame coding conditions, where it is shown to perform on par with the state-of-the-art video codec ITU/MPEG HEVC. Moreover, the possibility of copying the prediction enables to learn the optical flow in an end-to-end fashion i.e. without relying on pre-training and/or a dedicated loss term.
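
The pixel-wise coding-mode idea reduces to a per-pixel blend between the motion-compensated prediction (a free "copy" path) and the reconstruction that is actually transmitted through CodecNet. A toy sketch, assuming a soft selection map in [0, 1]; the paper's exact parameterization of the mode (soft vs. hard) is not specified here and is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 4, 4
prediction = rng.random((h, w))   # motion-compensated prediction from the optical flow
decoded = rng.random((h, w))      # reconstruction transmitted through CodecNet
mode = rng.random((h, w))         # pixel-wise selection in [0, 1] from MOFNet (toy values)

# Pixel-wise competition between copying the prediction and using the codec output
reconstruction = mode * prediction + (1.0 - mode) * decoded

# Intuition for the rate trade-off: only the codec path costs bits, so pushing the
# mode toward the copy path where the prediction is already good saves rate.
print(reconstruction.shape)
```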

48. FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [PDF] 返回目录
  Jinglin Liu, Yi Ren, Zhou Zhao, Chen Zhang, Baoxing Huai, Jing Yuan
Abstract: Lipreading is an impressive technique and there has been a definite improvement of accuracy in recent years. However, existing methods for lipreading mainly build on the autoregressive (AR) model, which generates target tokens one by one and suffers from high inference latency. To break through this constraint, we propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously. NAR lipreading is a challenging task that has many difficulties: 1) the discrepancy of sequence lengths between source and target makes it difficult to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks the correlation across time, which leads to a poor approximation of the target distribution; 3) the feature representation ability of the encoder can be weak due to the lack of an effective alignment mechanism; and 4) the removal of the AR language model exacerbates the inherent ambiguity problem of lipreading. Thus, in this paper, we introduce three methods to reduce the gap between FastLR and the AR model: 1) to address challenges 1 and 2, we leverage an integrate-and-fire (I\&F) module to model the correspondence between source video frames and the output text sequence; 2) to tackle challenge 3, we add an auxiliary connectionist temporal classification (CTC) decoder on top of the encoder and optimize it with an extra CTC loss, and we also add an auxiliary autoregressive decoder to help the feature extraction of the encoder; 3) to overcome challenge 4, we propose a novel Noisy Parallel Decoding (NPD) scheme for I\&F and bring Byte-Pair Encoding (BPE) into lipreading. Our experiments show that FastLR achieves a speedup of up to 10.97$\times$ compared with the state-of-the-art lipreading model, with a slight WER absolute increase of 1.5\% and 5.5\% on the GRID and LRS2 lipreading datasets respectively, which demonstrates the effectiveness of our proposed method.
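
The integrate-and-fire (I&F) module is the piece that lets a non-autoregressive model decide how many tokens to emit: per-frame weights are accumulated, and a token-level feature fires each time the accumulator crosses a threshold. The sketch follows the common continuous integrate-and-fire formulation; the threshold, the way a frame's weight is split at a firing step, and the random inputs are assumptions.

```python
import numpy as np

def integrate_and_fire(frame_feats, alphas, threshold=1.0):
    """Integrate-and-fire style aggregation: per-frame weights `alphas` are accumulated
    over time; each time the accumulator crosses the threshold, a token-level feature is
    emitted, so the output length is decided without autoregressive decoding."""
    emitted, acc, pooled = [], 0.0, np.zeros(frame_feats.shape[1])
    for feat, a in zip(frame_feats, alphas):
        if acc + a < threshold:
            acc += a
            pooled += a * feat
        else:
            used = threshold - acc            # part of this frame finishes the current token
            emitted.append(pooled + used * feat)
            acc = a - used                    # the remainder starts the next token
            pooled = acc * feat
    return np.stack(emitted) if emitted else np.empty((0, frame_feats.shape[1]))

rng = np.random.default_rng(0)
frames = rng.normal(size=(20, 8))             # 20 video frames, 8-d features (toy)
alphas = rng.random(20) * 0.4                 # predicted per-frame firing weights
tokens = integrate_and_fire(frames, alphas)
print("emitted tokens:", tokens.shape)        # roughly sum(alphas) tokens
```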

49. OverNet: Lightweight Multi-Scale Super-Resolution with Overscaling Network [PDF] 返回目录
  Parichehr Behjati, Pau Rodriguez, Armin Mehri, Isabelle Hupont, Jordi Gonzalez, Carles Fernandez Tena
Abstract: Super-resolution (SR) has achieved great success due to the development of deep convolutional neural networks (CNNs). However, as the depth and width of the networks increase, CNN-based SR methods have been faced with the challenge of computational complexity in practice. Moreover, most of them train a dedicated model for each target resolution, losing generality and increasing memory requirements. To address these limitations we introduce OverNet, a deep but lightweight convolutional network to solve SISR at arbitrary scale factors with a single model. We make the following contributions: first, we introduce a lightweight recursive feature extractor that enforces efficient reuse of information through a novel recursive structure of skip and dense connections. Second, to maximize the performance of the feature extractor we propose a reconstruction module that generates accurate high-resolution images from overscaled feature maps and can be independently used to improve existing architectures. Third, we introduce a multi-scale loss function to achieve generalization across scales. Through extensive experiments, we demonstrate that our network outperforms previous state-of-the-art results in standard benchmarks while using fewer parameters than previous approaches.
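
One way to read the multi-scale loss is that a single overscaled output is resized to several scale factors and compared against the correspondingly resized ground truth, so one model covers multiple scales. A hedged sketch, with the scale set, bilinear resizing, and L1 distance all chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def multiscale_loss(overscaled, hr_target, scales=(2, 3, 4)):
    """Sketch of a multi-scale objective: resize one overscaled output to several target
    scale factors and penalize each against the correspondingly resized ground truth."""
    lr_h = hr_target.shape[-2] // max(scales)
    lr_w = hr_target.shape[-1] // max(scales)
    loss = 0.0
    for s in scales:
        out_s = F.interpolate(overscaled, size=(lr_h * s, lr_w * s),
                              mode='bilinear', align_corners=False)
        tgt_s = F.interpolate(hr_target, size=(lr_h * s, lr_w * s),
                              mode='bilinear', align_corners=False)
        loss = loss + F.l1_loss(out_s, tgt_s)
    return loss / len(scales)

overscaled = torch.rand(1, 3, 128, 128)   # output of the overscaling module (toy stand-in)
hr = torch.rand(1, 3, 96, 96)             # HR ground truth (toy stand-in)
print(multiscale_loss(overscaled, hr).item())
```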

50. A Robot that Counts Like a Child -- a Developmental Model of Counting and Pointing [PDF] 返回目录
  Leszek Pecyna, Angelo Cangelosi, Alessandro Di Nuovo
Abstract: In this paper, a novel neuro-robotics model capable of counting real items is introduced. The model allows us to investigate the interaction between embodiment and numerical cognition. It is composed of a deep neural network capable of image processing and sequential task performance, and a robotic platform providing the embodiment: the iCub humanoid robot. The network is trained using images from the robot's cameras and proprioceptive signals from its joints. The trained model is able to count a set of items and, at the same time, point to them. We investigate the influence of pointing on the counting process and compare our results with those from studies with children. Several training approaches are presented in this paper; all of them use a pre-training routine that allows the network to gain the ability of pointing and number recitation (from 1 to 10) prior to counting training. The impact of the counted set size and of the distance to the objects is investigated. The obtained results on counting performance show similarities with those from human studies.

51. Exploiting Temporal Attention Features for Effective Denoising in Videos [PDF] 返回目录
  Aryansh Omray, Samyak Jain, Utsav Krishnan, Pratik Chattopadhyay
Abstract: Video denoising has significant applications in diverse domains of computer vision, such as video-based object localization, text detection, and several others. An image denoising approach applied to video denoising results in flickering due to ignoring the temporal aspects of video frames. The proposed method makes use of the temporal as well as the spatial characteristics of video frames to form a two-stage denoising pipeline. Each stage uses a channel-wise attention mechanism to forward the encoder signal to the decoder side. The Attention Block used here is based on soft attention to rank the filters for effective learning. A key advantage of our approach is that it does not require prior information related to the amount of noise present in the video. Hence, it is quite suitable for application in real-life scenarios. We train the model on a large set of noisy videos along with their ground-truth. Experimental analysis shows that our approach performs denoising effectively and also surpasses existing methods in terms of efficiency and PSNR/SSIM metrics. In addition to this, we construct a new dataset for training video denoising models and also share the trained model online for further comparative studies.
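
The channel-wise attention that forwards encoder features to the decoder can be sketched as a squeeze-and-excitation style block: global average pooling produces one descriptor per filter, a small bottleneck MLP turns it into soft per-channel weights, and the skip features are reweighted before reaching the decoder. The reduction ratio and exact placement within the two-stage pipeline are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention, as a sketch of the Attention Block
    that softly ranks/reweights filters before encoder features are passed to the decoder."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)   # soft per-channel weights
        return x * w                                            # reweighted skip features

enc_feat = torch.rand(2, 32, 64, 64)           # encoder features for one video frame (toy)
attn = ChannelAttention(32)
dec_in = attn(enc_feat)                        # what gets forwarded to the decoder side
print(dec_in.shape)
```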

52. Global Voxel Transformer Networks for Augmented Microscopy [PDF] 返回目录
  Zhengyang Wang, Yaochen Xie, Shuiwang Ji
Abstract: Advances in deep learning have led to remarkable success in augmented microscopy, enabling us to obtain high-quality microscope images without using expensive microscopy hardware and sample preparation techniques. However, current deep learning models for augmented microscopy are mostly U-Net based neural networks, thus sharing certain drawbacks that limit the performance. In this work, we introduce global voxel transformer networks (GVTNets), an advanced deep learning tool for augmented microscopy that overcomes intrinsic limitations of the current U-Net based models and achieves improved performance. GVTNets are built on global voxel transformer operators (GVTOs), which are able to aggregate global information, as opposed to local operators like convolutions. We apply the proposed methods on existing datasets for three different augmented microscopy tasks under various settings. The performance is significantly and consistently better than previous U-Net based approaches.

53. Can I Pour into It? Robot Imagining Open Containability Affordance of Previously Unseen Objects via Physical Simulations [PDF] 返回目录
  Hongtao Wu, Gregory S. Chirikjian
Abstract: Open containers, i.e., containers without covers, are an important and ubiquitous class of objects in human life. In this letter, we propose a novel method for robots to "imagine" the open containability affordance of a previously unseen object via physical simulations. We implement our imagination method on a UR5 manipulator. The robot autonomously scans the object with an RGB-D camera. The scanned 3D model is used for open containability imagination which quantifies the open containability affordance by physically simulating dropping particles onto the object and counting how many particles are retained in it. This quantification is used for open-container vs. non-open-container binary classification (hereafter referred to as open container classification). If the object is classified as an open container, the robot further imagines pouring into the object, again using physical simulations, to obtain the pouring position and orientation for real robot autonomous pouring. We evaluate our method on open container classification and autonomous pouring of granular material on a dataset containing 130 previously unseen objects with 57 object categories. Although our proposed method uses only 11 objects for simulation calibration (training), its open container classification aligns well with human judgements. In addition, our method endows the robot with the capability to autonomously pour into the 55 containers in the dataset with a very high success rate. We also compare to a deep learning method. Results show that our method achieves the same performance as the deep learning method on open container classification and outperforms it on autonomous pouring. Moreover, our method is fully explainable.
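
The containability score is defined operationally: drop particles onto the scanned object in simulation and count how many stay inside. A heavily simplified, grid-based stand-in (particles roll downhill on a height map instead of a rigid-body simulation on the scanned mesh) still shows how the quantification separates a bowl from a dome; every detail below is an illustrative assumption.

```python
import numpy as np

def open_containability(height, n_particles=500, seed=0):
    """Drop particles onto a height map, let each roll to its lowest neighbour until it
    settles inside the object or rolls off the edge, and report the fraction retained."""
    rng = np.random.default_rng(seed)
    h, w = height.shape

    def z(i, j):                       # off-object cells count as "falls to the ground"
        return height[i, j] if (0 <= i < h and 0 <= j < w) else -np.inf

    retained = 0
    for _ in range(n_particles):
        r, c = int(rng.integers(0, h)), int(rng.integers(0, w))
        for _ in range(h * w):         # roll downhill step by step
            cand = [(r, c)] + [(r + dr, c + dc)
                               for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))]
            br, bc = min(cand, key=lambda p: z(*p))
            if (br, bc) == (r, c):     # local minimum inside the object: particle settles
                retained += 1
                break
            if z(br, bc) == -np.inf:   # rolled past the object boundary: spilled
                break
            r, c = br, bc
    return retained / n_particles

yy, xx = np.mgrid[-1:1:21j, -1:1:21j]
bowl, dome = xx ** 2 + yy ** 2, -(xx ** 2 + yy ** 2)
print("bowl:", open_containability(bowl), "dome:", open_containability(dome))
```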

Note: The cover image is a word cloud of the paper titles.