Table of Contents
10. SPFCN: Select and Prune the Fully Convolutional Networks for Real-time Parking Slot Detection [PDF] Abstract
16. VaB-AL: Incorporating Class Imbalance and Difficulty with Variational Bayes for Active Learning [PDF] Abstract
20. A New Multiple Max-pooling Integration Module and Cross Multiscale Deconvolution Network Based on Image Semantic Segmentation [PDF] Abstract
24. Fusing Wearable IMUs with Multi-View Images for Human Pose Estimation: A Geometric Approach [PDF] Abstract
30. G2L-Net: Global to Local Network for Real-time 6D Pose Estimation with Embedding Vector Features [PDF] Abstract
31. Bootstrapping Weakly Supervised Segmentation-free Word Spotting through HMM-based Alignment [PDF] Abstract
36. Commentaries on "Learning Sensorimotor Control with Neuromorphic Sensors: Toward Hyperdimensional Active Perception" [Science Robotics Vol. 4 Issue 30 (2019) 1-10] [PDF] Abstract
43. ML-SIM: A deep neural network for reconstruction of structured illumination microscopy images [PDF] Abstract
44. COVIDX-Net: A Framework of Deep Learning Classifiers to Diagnose COVID-19 in X-Ray Images [PDF] Abstract
45. CoCoPIE: Making Mobile AI Sweet As PIE -- Compression-Compilation Co-Design Goes a Long Way [PDF] Abstract
Abstracts
1. Learning What to Learn for Video Object Segmentation [PDF] Back to Contents
Goutam Bhat, Felix Järemo Lawin, Martin Danelljan, Andreas Robinson, Michael Felsberg, Luc Van Gool, Radu Timofte
Abstract: Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined during inference with a given first-frame reference mask. The problem of how to capture and utilize this limited target information remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learning module. This internal learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond standard few-shot learning techniques by learning what the few-shot learner should learn. This allows us to achieve a rich internal representation of the target in the current frame, significantly increasing the segmentation accuracy of our approach. We perform extensive experiments on multiple benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result.
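The key mechanism, an internal few-shot learner that fits a parametric target model on the first frame by unrolled, differentiable optimization, can be sketched compactly. Below is a minimal PyTorch sketch under assumed simplifications: the target model is a single 1x1 convolution, and the step count and learning rate are illustrative rather than the paper's actual design.

```python
import torch
import torch.nn.functional as F

def fit_target_model(feats, first_mask, steps=5, lr=0.1):
    """Unrolled inner loop: fit a 1x1-conv 'target model' to the
    first-frame reference mask. Gradients flow through every inner step,
    so an outer (meta) loss can shape what the few-shot learner learns.
    feats: (1, C, H, W) backbone features; first_mask: (1, 1, H, W)."""
    C = feats.shape[1]
    w = torch.zeros(1, C, 1, 1, requires_grad=True)
    for _ in range(steps):
        logits = F.conv2d(feats, w)                       # (1, 1, H, W)
        loss = F.binary_cross_entropy_with_logits(logits, first_mask)
        (g,) = torch.autograd.grad(loss, w, create_graph=True)
        w = w - lr * g                                    # differentiable update
    return w                                              # parametric target model

# usage: fit on the first frame, then segment a later frame
feats_t0 = torch.randn(1, 64, 30, 52)
mask_t0 = (torch.rand(1, 1, 30, 52) > 0.5).float()
w = fit_target_model(feats_t0, mask_t0)
pred_t1 = torch.sigmoid(F.conv2d(torch.randn(1, 64, 30, 52), w))
```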
2. Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need? [PDF] Back to Contents
Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B. Tenenbaum, Phillip Isola
Abstract: The focus of recent meta-learning research has been on the development of learning algorithms that can quickly adapt to test time tasks with limited data and low computational cost. Few-shot learning is widely used as one of the standard benchmarks in meta-learning. In this work, we show that a simple baseline: learning a supervised or self-supervised representation on the meta-training set, followed by training a linear classifier on top of this representation, outperforms state-of-the-art few-shot learning methods. An additional boost can be achieved through the use of self-distillation. This demonstrates that using a good learned embedding model can be more effective than sophisticated meta-learning algorithms. We believe that our findings motivate a rethinking of few-shot image classification benchmarks and the associated role of meta-learning algorithms. Code is available at: this http URL.
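The baseline itself is only a few lines: embed images with a frozen pre-trained network, then fit a linear classifier on the labeled support examples of each episode. A sketch under assumed components, with a torchvision ResNet-18 standing in for the paper's meta-trained embedding and scikit-learn's logistic regression as the linear classifier:

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.linear_model import LogisticRegression

# Frozen ImageNet backbone as the embedding model (an illustrative
# stand-in; the paper trains its embedding on the meta-training set).
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()              # expose penultimate features
backbone.eval()

@torch.no_grad()
def embed(x):                            # x: (N, 3, 224, 224)
    return backbone(x).numpy()           # (N, 512) embeddings

# A 5-way 1-shot episode: fit on the support set, predict the queries.
support_x, support_y = torch.randn(5, 3, 224, 224), [0, 1, 2, 3, 4]
query_x = torch.randn(25, 3, 224, 224)

clf = LogisticRegression(max_iter=1000).fit(embed(support_x), support_y)
query_pred = clf.predict(embed(query_x))
```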
3. HP2IFS: Head Pose estimation exploiting Partitioned Iterated Function Systems [PDF] Back to Contents
Carmen Bisogni, Michele Nappi, Chiara Pero, Stefano Ricciardi
Abstract: Estimating the actual head orientation from 2D images, with regard to its three degrees of freedom, is a well known problem that is highly significant for a large number of applications involving head pose knowledge. Consequently, this topic has been tackled by a plethora of methods and algorithms, most of which exploit neural networks. Machine learning methods, indeed, achieve accurate head rotation values yet require an adequate training stage and, to that end, a substantial number of positive and negative examples. In this paper we take a different approach to this topic by using fractal coding theory, and particularly Partitioned Iterated Function Systems, to extract the fractal code from the input head image and to compare this representation to the fractal code of a reference model through Hamming distance. According to experiments conducted on both the BIWI and the AFLW2000 databases, the proposed PIFS-based head pose estimation method provides accurate yaw/pitch/roll angular values, with a performance approaching that of state-of-the-art machine-learning based algorithms and exceeding most non-training based approaches.
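Once fractal codes are extracted, pose estimation reduces to a nearest-neighbor lookup under Hamming distance. A numpy sketch of that matching step with a hypothetical code length and reference set (the PIFS extraction itself is omitted):

```python
import numpy as np

# Hypothetical setup: each head image reduces to a 256-bit fractal code;
# reference models are codes of known yaw/pitch/roll poses.
rng = np.random.default_rng(0)
reference_codes = rng.integers(0, 2, size=(100, 256), dtype=np.uint8)
reference_poses = rng.uniform(-90, 90, size=(100, 3))  # yaw, pitch, roll

def estimate_pose(query_code):
    """Nearest reference model under Hamming distance."""
    dists = np.count_nonzero(reference_codes != query_code, axis=1)
    return reference_poses[np.argmin(dists)]

yaw, pitch, roll = estimate_pose(rng.integers(0, 2, 256, dtype=np.uint8))
```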
4. Training Binary Neural Networks with Real-to-Binary Convolutions [PDF] Back to Contents
Brais Martinez, Jing Yang, Adrian Bulat, Georgios Tzimiropoulos
Abstract: This paper shows how to train binary networks to within a few percent points ($\sim 3-5 \%$) of the full precision counterpart. We first show how to build a strong baseline, which already achieves state-of-the-art accuracy, by combining recently proposed advances and carefully adjusting the optimization procedure. Secondly, we show that by attempting to minimize the discrepancy between the output of the binary and the corresponding real-valued convolution, additional significant accuracy gains can be obtained. We materialize this idea in two complementary ways: (1) with a loss function, during training, by matching the spatial attention maps computed at the output of the binary and real-valued convolutions, and (2) in a data-driven manner, by using the real-valued activations, available during inference prior to the binarization process, for re-scaling the activations right after the binary convolution. Finally, we show that, when putting all of our improvements together, the proposed model beats the current state of the art by more than 5% top-1 accuracy on ImageNet and reduces the gap to its real-valued counterpart to less than 3% and 5% top-1 accuracy on CIFAR-100 and ImageNet respectively when using a ResNet-18 architecture. Code available at this https URL.
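The first of the two ideas, matching the spatial attention maps computed at the outputs of the binary and real-valued convolutions, is a small loss term. A sketch assuming the common attention-transfer recipe (channel-wise sum of squares, L2-normalized); the paper's exact map definition may differ:

```python
import torch
import torch.nn.functional as F

def attention_map(feats):
    """Spatial attention: channel-wise sum of squared activations,
    flattened and L2-normalized."""
    a = feats.pow(2).sum(dim=1)               # (B, H, W)
    return F.normalize(a.flatten(1), dim=1)   # (B, H*W)

def attention_matching_loss(binary_feats, real_feats):
    """Penalize the discrepancy between the binary branch's attention
    and that of the (detached) real-valued teacher branch."""
    return (attention_map(binary_feats)
            - attention_map(real_feats.detach())).pow(2).sum(1).mean()

loss = attention_matching_loss(torch.randn(8, 64, 14, 14),
                               torch.randn(8, 64, 14, 14))
```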
5. Improved Techniques for Training Single-Image GANs [PDF] Back to Contents
Tobias Hinz, Matthew Fisher, Oliver Wang, Stefan Wermter
Abstract: Recently there has been an interest in the potential of learning generative models from a single image, as opposed to from a large dataset. This task is of practical significance, as it means that generative models can be used in domains where collecting a large dataset is not feasible. However, training a model capable of generating realistic images from only a single sample is a difficult problem. In this work, we conduct a number of experiments to understand the challenges of training these methods and propose some best practices that we found allowed us to generate improved results over previous work in this space. One key piece is that unlike prior single image generation methods, we concurrently train several stages in a sequential multi-stage manner, allowing us to learn models with fewer stages of increasing image resolution. Compared to a recent state of the art baseline, our model is up to six times faster to train, has fewer parameters, and can better capture the global structure of images.
6. PiP: Planning-informed Trajectory Prediction for Autonomous Driving [PDF] Back to Contents
Haoran Song, Wenchao Ding, Yuxuan Chen, Shaojie Shen, Michael Yu Wang, Qifeng Chen
Abstract: It is critical to predict the motion of surrounding vehicles for self-driving planning, especially in a socially compliant and flexible way. However, future prediction is challenging due to the interaction and uncertainty in driving behaviors. We propose planning-informed trajectory prediction (PiP) to tackle the prediction problem in the multi-agent setting. Our approach is differentiated from the traditional manner of prediction, which is only based on historical information and decoupled with planning. By informing the prediction process with the planning of ego vehicle, our method achieves the state-of-the-art performance of multi-agent forecasting on highway datasets. Moreover, our approach enables a novel pipeline which couples the prediction and planning, by conditioning PiP on multiple candidate trajectories of the ego vehicle, which is highly beneficial for autonomous driving in interactive scenarios.
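The coupling of prediction and planning amounts to conditioning: the predictor consumes a candidate ego plan alongside the target's history and is queried once per candidate trajectory. A minimal sketch with illustrative encoders, horizons, and dimensions (not the paper's architecture):

```python
import torch
import torch.nn as nn

class PlanningInformedPredictor(nn.Module):
    """Predict a target vehicle's future track conditioned on an ego plan."""
    def __init__(self, fut_len=30, hidden=64):
        super().__init__()
        self.hist_enc = nn.GRU(2, hidden, batch_first=True)
        self.plan_enc = nn.GRU(2, hidden, batch_first=True)
        self.decoder = nn.Linear(2 * hidden, fut_len * 2)
        self.fut_len = fut_len

    def forward(self, target_hist, ego_plan):   # both (B, T, 2) as (x, y)
        _, h_hist = self.hist_enc(target_hist)
        _, h_plan = self.plan_enc(ego_plan)
        h = torch.cat([h_hist[-1], h_plan[-1]], dim=-1)
        return self.decoder(h).view(-1, self.fut_len, 2)

model = PlanningInformedPredictor()
hist = torch.randn(4, 20, 2)
# one forward pass per candidate ego trajectory couples prediction & planning
for candidate_plan in torch.randn(3, 4, 30, 2):
    future = model(hist, candidate_plan)        # (4, 30, 2)
```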
7. Fisheye Distortion Rectification from Deep Straight Lines [PDF] Back to Contents
Zhu-Cun Xue, Nan Xue, Gui-Song Xia
Abstract: This paper presents a novel line-aware rectification network (LaRecNet) to address the problem of fisheye distortion rectification based on the classical observation that straight lines in 3D space should be still straight in image planes. Specifically, the proposed LaRecNet contains three sequential modules to (1) learn the distorted straight lines from fisheye images; (2) estimate the distortion parameters from the learned heatmaps and the image appearance; and (3) rectify the input images via a proposed differentiable rectification layer. To better train and evaluate the proposed model, we create a synthetic line-rich fisheye (SLF) dataset that contains the distortion parameters and well-annotated distorted straight lines of fisheye images. The proposed method enables us to simultaneously calibrate the geometric distortion parameters and rectify fisheye images. Extensive experiments demonstrate that our model achieves state-of-the-art performance in terms of both geometric accuracy and image quality on several evaluation metrics. In particular, the images rectified by LaRecNet achieve an average reprojection error of 0.33 pixels on the SLF dataset and produce the highest peak signal-to-noise ratio (PSNR) and structure similarity index (SSIM) compared with the groundtruth.
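A differentiable rectification layer can be built from a parameterized sampling grid plus `grid_sample`, so gradients reach the estimated distortion parameters. A sketch that assumes a one-parameter radial model in place of the paper's learned distortion parameters:

```python
import torch
import torch.nn.functional as F

def rectify(img, k):
    """For each output (undistorted) pixel, sample the fisheye input at
    the radially remapped location; a single coefficient k stands in for
    the learned distortion model. img: (B, C, H, W), k: scalar tensor."""
    B, C, H, W = img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    scale = 1.0 / (1.0 + k * (xs ** 2 + ys ** 2))         # radial remapping
    grid = torch.stack([xs * scale, ys * scale], dim=-1)  # (H, W, 2)
    grid = grid.unsqueeze(0).expand(B, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

out = rectify(torch.rand(1, 3, 240, 320), torch.tensor(0.3))
```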
8. Circumventing Outliers of AutoAugment with Knowledge Distillation [PDF] Back to Contents
Longhui Wei, An Xiao, Lingxi Xie, Xin Chen, Xiaopeng Zhang, Qi Tian
Abstract: AutoAugment has been a powerful algorithm that improves the accuracy of many vision tasks, yet it is sensitive to the operator space as well as hyper-parameters, and an improper setting may degenerate network optimization. This paper delves deep into the working mechanism, and reveals that AutoAugment may remove part of discriminative information from the training image and so insisting on the ground-truth label is no longer the best option. To relieve the inaccuracy of supervision, we make use of knowledge distillation that refers to the output of a teacher model to guide network training. Experiments are performed in standard image classification benchmarks, and demonstrate the effectiveness of our approach in suppressing noise of data augmentation and stabilizing training. Upon the cooperation of knowledge distillation and AutoAugment, we claim the new state-of-the-art on ImageNet classification with a top-1 accuracy of 85.8%.
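Replacing the ground-truth label on heavily augmented images with the teacher's output is a standard soft-label distillation loss. A sketch with an assumed temperature; `autoaugment`, `student`, and `teacher` are stand-ins:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label loss: the teacher's prediction on the *augmented*
    image supervises the student, so labels destroyed by aggressive
    augmentation no longer mislead training. T is an assumed temperature."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

# usage inside a training step (all names hypothetical):
# x_aug = autoaugment(x)
# loss = kd_loss(student(x_aug), teacher(x_aug).detach())
```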
9. Data Uncertainty Learning in Face Recognition [PDF] Back to Contents
Jie Chang, Zhonghao Lan, Changmao Cheng, Yichen Wei
Abstract: Modeling data uncertainty is important for noisy images, but seldom explored for face recognition. The pioneer work, PFE, considers uncertainty by modeling each face image embedding as a Gaussian distribution. It is quite effective. However, it uses a fixed feature (the mean of the Gaussian) from an existing model. It only estimates the variance and relies on an ad-hoc and costly metric. Thus, it is not easy to use. It is also unclear how uncertainty affects feature learning. This work applies data uncertainty learning to face recognition, such that the feature (mean) and uncertainty (variance) are learnt simultaneously, for the first time. Two learning methods are proposed. They are easy to use and outperform existing deterministic methods as well as PFE on challenging unconstrained scenarios. We also provide insightful analysis on how incorporating uncertainty estimation helps reduce the adverse effects of noisy samples and affects feature learning.
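Learning the mean and variance simultaneously comes down to a Gaussian embedding head trained with the reparameterization trick. A minimal sketch; the head sizes and the KL regularizer are illustrative choices, not necessarily either of the paper's two proposed methods:

```python
import torch
import torch.nn as nn

class GaussianEmbedding(nn.Module):
    """Each face maps to N(mu, sigma^2); sampling with the
    reparameterization trick keeps both heads trainable end-to-end."""
    def __init__(self, feat_dim=512, emb_dim=256):
        super().__init__()
        self.mu_head = nn.Linear(feat_dim, emb_dim)
        self.logvar_head = nn.Linear(feat_dim, emb_dim)

    def forward(self, feats):
        mu = self.mu_head(feats)
        logvar = self.logvar_head(feats)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # KL term to a standard normal keeps the variances from collapsing
        kl = 0.5 * (logvar.exp() + mu.pow(2) - 1.0 - logvar).sum(1).mean()
        return z, kl

emb, kl = GaussianEmbedding()(torch.randn(8, 512))
# identity loss on `emb` (e.g., softmax over IDs) plus a small weight on `kl`
```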
10. SPFCN: Select and Prune the Fully Convolutional Networks for Real-time Parking Slot Detection [PDF] Back to Contents
Zhuoping Yu, Zhong Gao, Hansheng Chen, Yuyao Huang
Abstract: For passenger cars equipped with an automatic parking function, convolutional neural networks are employed to detect parking slots on the panoramic surround view, an overhead image synthesized from four calibrated fish-eye images. This accuracy comes at the price of low speed or expensive computation equipment, to which many car manufacturers are sensitive. In this paper, the same accuracy is matched by the proposed parking slot detector, which leverages deep convolutional networks for faster speed and a smaller model while keeping the accuracy by simultaneously training and pruning it. To achieve the optimal trade-off, we developed a strategy to select the best receptive fields and prune the redundant channels automatically during training. The proposed model is capable of jointly detecting corners and line features of parking slots while running efficiently in real time on an average CPU. Even without any dedicated computing devices, the model outperforms existing counterparts at a frame rate of about 30 FPS on a 2.3 GHz CPU core, achieving a parking slot corner localization error of 1.51$\pm$2.14 cm (std. err.) and a slot detection accuracy of 98\%, generally satisfying the speed and accuracy requirements of on-board mobile terminals.
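Simultaneous training and pruning can be sketched with the network-slimming recipe: apply L1 pressure to BatchNorm scales during training, then drop the channels whose scales decay toward zero. This is an assumed stand-in for the paper's exact select-and-prune strategy:

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model, weight=1e-4):
    """Sparsity pressure on BatchNorm scales, added to the task loss."""
    return weight * sum(m.weight.abs().sum() for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d))

def channels_to_prune(model, ratio=0.3):
    """Global threshold on |gamma|: the smallest `ratio` of channels go."""
    gammas = torch.cat([m.weight.abs().detach().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    thresh = gammas.sort().values[int(ratio * len(gammas))]
    return {name: (m.weight.abs() < thresh).nonzero().flatten()
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)}
```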
11. DCDLearn: Multi-order Deep Cross-distance Learning for Vehicle Re-Identification [PDF] Back to Contents
Rixing Zhu, Jianwu Fang, Hongke Xu, Hongkai Yu, Jianru Xue
Abstract: Vehicle re-identification (Re-ID) has become a popular research topic owing to its practicability in intelligent transportation systems. Vehicle Re-ID suffers numerous challenges caused by drastic variation in illumination, occlusions, background, resolutions, viewing angles, and so on. To address them, this paper formulates a multi-order deep cross-distance learning (\textbf{DCDLearn}) model for vehicle re-identification, where an efficient one-view CycleGAN model is developed to alleviate the exhaustive and enumerative cross-camera matching problem of previous works and to smooth the domain discrepancy across cameras. Specifically, we treat the transferred images and the reconstructed images generated by the one-view CycleGAN as multi-order augmented data for deep cross-distance learning, where the cross distances of multi-order image sets with distinct identities are learned by optimizing an objective function with a multi-order augmented triplet loss and a center loss to achieve camera-invariance and identity-consistency. Extensive experiments on three vehicle Re-ID datasets demonstrate that the proposed method achieves significant improvement over the state of the art, especially for the small-scale dataset.
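The objective combines a margin-based triplet loss over the augmented (multi-order) embeddings with a center loss that pulls each embedding toward its identity center. A sketch of that joint term, with assumed margin and weighting:

```python
import torch
import torch.nn.functional as F

def triplet_center_loss(anchor, pos, neg, centers, labels,
                        margin=0.3, center_weight=5e-4):
    """Margin-based triplet loss plus a center loss on the anchors.
    centers: (num_ids, D) learnable identity centers; labels: (B,)."""
    triplet = F.relu(F.pairwise_distance(anchor, pos)
                     - F.pairwise_distance(anchor, neg) + margin).mean()
    center = (anchor - centers[labels]).pow(2).sum(dim=1).mean()
    return triplet + center_weight * center

emb = torch.randn(32, 256)
loss = triplet_center_loss(emb, torch.randn(32, 256), torch.randn(32, 256),
                           torch.randn(100, 256), torch.randint(100, (32,)))
```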
12. Cylindrical Convolutional Networks for Joint Object Detection and Viewpoint Estimation [PDF] Back to Contents
Sunghun Joung, Seungryong Kim, Hanjae Kim, Minsu Kim, Ig-Jae Kim, Junghyun Cho, Kwanghoon Sohn
Abstract: Existing techniques to encode spatial invariance within deep convolutional neural networks only model 2D transformation fields. This does not account for the fact that objects in a 2D space are a projection of 3D ones, and thus they have limited ability to handle severe object viewpoint changes. To overcome this limitation, we introduce a learnable module, cylindrical convolutional networks (CCNs), that exploits a cylindrical representation of a convolutional kernel defined in the 3D space. CCNs extract a view-specific feature through a view-specific convolutional kernel to predict object category scores at each viewpoint. With the view-specific feature, we simultaneously determine the object category and viewpoint using the proposed sinusoidal soft-argmax module. Our experiments demonstrate the effectiveness of cylindrical convolutional networks on joint object detection and viewpoint estimation.
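A sinusoidal soft-argmax sidesteps the wrap-around problem of averaging angles directly: take the softmax-weighted expectation of each bin's sine and cosine, then recover the angle with atan2. A sketch with an assumed number of viewpoint bins:

```python
import math
import torch

def sinusoidal_soft_argmax(view_logits):
    """Differentiable viewpoint from per-bin scores. Averaging angles
    fails at the 0/360 wrap-around; averaging their sines and cosines
    and taking atan2 does not. view_logits: (B, K) for K bins."""
    K = view_logits.shape[1]
    probs = torch.softmax(view_logits, dim=1)
    angles = torch.arange(K) * (2 * math.pi / K)     # bin centers
    sin_exp = (probs * angles.sin()).sum(dim=1)
    cos_exp = (probs * angles.cos()).sum(dim=1)
    return torch.atan2(sin_exp, cos_exp)             # (B,), radians

theta = sinusoidal_soft_argmax(torch.randn(4, 16))
```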
13. A Unified Object Motion and Affinity Model for Online Multi-Object Tracking [PDF] Back to Contents
Junbo Yin, Wenguan Wang, Qinghao Meng, Ruigang Yang, Jianbing Shen
Abstract: Current popular online multi-object tracking (MOT) solutions apply single object trackers (SOTs) to capture object motions, while often requiring an extra affinity network to associate objects, especially for the occluded ones. This brings extra computational overhead due to repetitive feature extraction for SOT and affinity computation. Meanwhile, the model size of the sophisticated affinity network is usually non-trivial. In this paper, we propose a novel MOT framework that unifies object motion and affinity model into a single network, named UMA, in order to learn a compact feature that is discriminative for both object motion and affinity measure. In particular, UMA integrates single object tracking and metric learning into a unified triplet network by means of multi-task learning. Such design brings advantages of improved computation efficiency, low memory requirement and simplified training procedure. In addition, we equip our model with a task-specific attention module, which is used to boost task-aware feature learning. The proposed UMA can be easily trained end-to-end, and is elegant - requiring only one training stage. Experimental results show that it achieves promising performance on several MOT Challenge benchmarks.
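A task-specific attention module can be as simple as one squeeze-and-excitation-style gate per task, re-weighting the shared backbone features differently for tracking and for affinity (metric) learning. A sketch under that assumption; the paper's module may differ in detail:

```python
import torch
import torch.nn as nn

class TaskSpecificAttention(nn.Module):
    """One channel-attention gate per task over shared features."""
    def __init__(self, channels=256, tasks=2, r=16):
        super().__init__()
        self.gates = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(channels, channels // r), nn.ReLU(),
                          nn.Linear(channels // r, channels), nn.Sigmoid())
            for _ in range(tasks)])

    def forward(self, feats, task):                  # feats: (B, C, H, W)
        g = self.gates[task](feats)                  # (B, C)
        return feats * g[:, :, None, None]

att = TaskSpecificAttention()
shared = torch.randn(2, 256, 16, 16)
track_feats = att(shared, task=0)      # tracking branch
metric_feats = att(shared, task=1)     # affinity / metric-learning branch
```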
14. SCATTER: Selective Context Attentional Scene Text Recognizer [PDF] Back to Contents
Ron Litman, Oron Anschel, Shahar Tsiper, Roee Litman, Shai Mazor, R. Manmatha
Abstract: Scene Text Recognition (STR), the task of recognizing text against complex image backgrounds, is an active area of research. Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes. In this paper, we introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER). SCATTER utilizes a stacked block architecture with intermediate supervision during training, that paves the way to successfully train a deep BiLSTM encoder, thus improving the encoding of contextual dependencies. Decoding is done using a two-step 1D attention mechanism. The first attention step re-weights visual features from a CNN backbone together with contextual features computed by a BiLSTM layer. The second attention step, similar to previous papers, treats the features as a sequence and attends to the intra-sequence relationships. Experiments show that the proposed approach surpasses SOTA performance on irregular text recognition benchmarks by 3.7\% on average.
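The decoder's two attention steps can be sketched loosely: step one re-weights the visual features fused with the BiLSTM's contextual features, and step two attends over the resulting sequence to produce one glimpse per decoded character. All sizes and the exact gating form below are assumptions:

```python
import torch
import torch.nn as nn

class TwoStep1DAttention(nn.Module):
    """Loose sketch of the two-step decoder attention."""
    def __init__(self, d_vis=256, d_ctx=256, d=256):
        super().__init__()
        self.fuse = nn.Linear(d_vis + d_ctx, d)
        self.step1_gate = nn.Linear(d, d)       # feature re-weighting
        self.step2_score = nn.Linear(d, 1)      # intra-sequence attention

    def forward(self, visual, context):         # both (B, T, d_*)
        h = torch.tanh(self.fuse(torch.cat([visual, context], dim=-1)))
        h = h * torch.sigmoid(self.step1_gate(h))         # step 1
        attn = torch.softmax(self.step2_score(h), dim=1)  # step 2, (B, T, 1)
        return (attn * h).sum(dim=1)            # glimpse for one character

glimpse = TwoStep1DAttention()(torch.randn(2, 26, 256),
                               torch.randn(2, 26, 256))
```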
15. Multiscale Sparsifying Transform Learning for Image Denoising [PDF] Back to Contents
Ashkan Abbasi, Amirhassan Monadjemi, Leyuan Fang, Hossein Rabbani, Neda Noormohammadi
Abstract: Data-driven sparse methods such as synthesis dictionary learning and sparsifying transform learning have been proven to be effective in image denoising. However, these methods are intrinsically single-scale, which ignores the multiscale nature of images. This often leads to suboptimal results. In this paper, we propose several strategies to exploit multiscale information in image denoising through the sparsifying transform learning denoising (TLD) method. To this end, we first employ a simple method of denoising each wavelet subband independently via TLD. Then, we show that this method can be greatly enhanced using wavelet subband mixing, a cheap fusion technique, to combine the results of single-scale and multiscale methods. Finally, we remove the need for denoising detail subbands. This simplification leads to an efficient multiscale denoising method with performance competitive with its baseline. The effectiveness of the proposed methods is experimentally shown over two datasets: 1) classic test images corrupted with Gaussian noise, and 2) fluorescence microscopy images corrupted with real Poisson-Gaussian noise. The proposed multiscale methods improve over the single-scale baseline method by an average of about 0.2 dB (in terms of PSNR) for removing synthetic Gaussian noise from classic test images and real Poisson-Gaussian noise from microscopy images, respectively. Interestingly, the proposed multiscale methods keep their superiority over the baseline even when the noise is relatively weak. More importantly, we show that the proposed methods lead to visually pleasing results, in which edges and textures are better recovered. Extensive experiments over these two different datasets show that the proposed methods offer a good trade-off between performance and complexity.
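Wavelet subband mixing is a cheap fusion: decompose two denoised results, blend the corresponding subbands, and reconstruct. A sketch using PyWavelets; the wavelet, level, and equal blending weight are illustrative, and the two inputs stand in for TLD outputs:

```python
import numpy as np
import pywt

def subband_mix(img_a, img_b, wavelet="db4", level=2, alpha=0.5):
    """Blend corresponding wavelet subbands of two denoised images;
    alpha=0.5 averages, other weights favor one method."""
    ca = pywt.wavedec2(img_a, wavelet, level=level)
    cb = pywt.wavedec2(img_b, wavelet, level=level)
    mixed = [alpha * ca[0] + (1 - alpha) * cb[0]]      # approximation
    for da, db in zip(ca[1:], cb[1:]):                 # detail triples
        mixed.append(tuple(alpha * a + (1 - alpha) * b
                           for a, b in zip(da, db)))
    return pywt.waverec2(mixed, wavelet)

single_scale = np.random.rand(256, 256)   # stand-in denoising results
multi_scale = np.random.rand(256, 256)
fused = subband_mix(single_scale, multi_scale)
```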
16. VaB-AL: Incorporating Class Imbalance and Difficulty with Variational Bayes for Active Learning [PDF] Back to Contents
Jongwon Choi, Kwang Moo Yi, Jihoon Kim, Jincho Choo, Byoungjip Kim, Jin-Yeop Chang, Youngjune Gwon, Hyung Jin Chang
Abstract: Active Learning for discriminative models has largely been studied with the focus on individual samples, with less emphasis on how classes are distributed or which classes are hard to deal with. In this work, we show that this is harmful. We propose a method based on Bayes' rule that can naturally incorporate class imbalance into the Active Learning framework. We derive that three terms should be considered together when estimating the probability of a classifier making a mistake for a given sample: i) the probability of mislabelling a class, ii) the likelihood of the data given a predicted class, and iii) the prior probability on the abundance of a predicted class. Implementing these terms requires a generative model and an intractable likelihood estimation. Therefore, we train a Variational Auto Encoder (VAE) for this purpose. To further tie the VAE with the classifier and facilitate VAE training, we use the classifiers' deep feature representations as input to the VAE. By considering all three probabilities, among them especially the data imbalance, we can substantially improve the potential of existing methods under a limited data budget. We show that our method can be applied to classification tasks on multiple different datasets -- including one that is a real-world dataset with heavy data imbalance -- significantly outperforming the state of the art.
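The three-term decomposition turns directly into an acquisition score via Bayes' rule: P(mistake | x) = sum_c p(x | c) P(mistake | c) P(c) / p(x). A numpy sketch; the per-class log-likelihoods would come from the VAE, and all inputs below are stand-ins:

```python
import numpy as np

def acquisition_scores(log_px_given_c, p_mistake_given_c, p_c):
    """Estimated probability of the classifier erring on each sample.
    log_px_given_c: (N, C) per-class log-likelihoods (e.g., VAE ELBOs);
    p_mistake_given_c, p_c: (C,) estimated on held-out data."""
    log_joint = (log_px_given_c + np.log(p_mistake_given_c)[None, :]
                 + np.log(p_c)[None, :])                       # (N, C)
    log_px = np.logaddexp.reduce(log_px_given_c + np.log(p_c)[None, :],
                                 axis=1)                       # evidence
    return np.exp(np.logaddexp.reduce(log_joint, axis=1) - log_px)

scores = acquisition_scores(np.random.randn(100, 10) - 50,
                            np.random.uniform(0.05, 0.4, 10),
                            np.full(10, 0.1))
pick = np.argsort(scores)[-16:]   # query the most error-prone samples
```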
17. What Deep CNNs Benefit from Global Covariance Pooling: An Optimization Perspective [PDF] Back to Contents
Qilong Wang, Li Zhang, Banggu Wu, Dongwei Ren, Peihua Li, Wangmeng Zuo, Qinghua Hu
Abstract: Recent works have demonstrated that global covariance pooling (GCP) has the ability to improve performance of deep convolutional neural networks (CNNs) on visual classification tasks. Despite considerable advances, the reasons for the effectiveness of GCP on deep CNNs have not been well studied. In this paper, we make an attempt to understand what deep CNNs benefit from GCP from the viewpoint of optimization. Specifically, we explore the effect of GCP on deep CNNs in terms of the Lipschitzness of the optimization loss and the predictiveness of gradients, and show that GCP can make the optimization landscape smoother and the gradients more predictive. Furthermore, we discuss the connection between GCP and second-order optimization for deep CNNs. More importantly, the above findings can account for several merits of covariance pooling for training deep CNNs that have not been recognized previously or fully explored, including significant acceleration of network convergence (i.e., networks trained with GCP can support rapid decay of learning rates, achieving favorable performance while significantly reducing the number of training epochs), stronger robustness to distorted examples generated by image corruptions and perturbations, and good generalization ability to different vision tasks, e.g., object detection and instance segmentation. We conduct extensive experiments using various deep CNN models on diversified tasks, and the results provide strong support for our findings.
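A minimal GCP layer replaces global average pooling with the channel covariance of the final feature map. The sketch below uses element-wise signed square-root normalization as a simpler stand-in for the matrix normalization used in full GCP designs:

```python
import torch
import torch.nn as nn

class GlobalCovariancePooling(nn.Module):
    """Second-order pooling: covariance of channels over spatial
    positions, signed-sqrt normalized and flattened for the classifier."""
    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        z = x.flatten(2)                         # (B, C, N), N = H*W
        z = z - z.mean(dim=2, keepdim=True)      # center spatially
        cov = z @ z.transpose(1, 2) / (H * W - 1)      # (B, C, C)
        cov = cov.sign() * cov.abs().sqrt()      # signed square-root
        return cov.flatten(1)                    # (B, C*C) representation

rep = GlobalCovariancePooling()(torch.randn(2, 64, 7, 7))  # (2, 4096)
```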
18. GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet [PDF] 返回目录
Shan You, Tao Huang, Mingmin Yang, Fei Wang, Chen Qian, Changshui Zhang
Abstract: Training a supernet matters for one-shot neural architecture search (NAS) methods, since it serves as a basic performance estimator for different architectures (paths). Current methods mainly hold the assumption that a supernet should give a reasonable ranking over all paths. They thus treat all paths equally and spend much effort training them. However, it is hard for a single supernet to evaluate accurately over such a huge search space (e.g., $7^{21}$ paths). In this paper, instead of covering all paths, we ease the burden on the supernet by encouraging it to focus on evaluating those potentially-good paths, which are identified using a surrogate portion of validation data. Concretely, during training, we propose a multi-path sampling strategy with rejection, and greedily filter out the weak paths. Training efficiency is thus boosted, since the training space is greedily shrunk from all paths to the potentially-good ones. Moreover, we adopt an exploration-and-exploitation policy by introducing an empirical candidate path pool. Our proposed method, GreedyNAS, is easy to follow, and experimental results on the ImageNet dataset indicate that it achieves better Top-1 accuracy under the same search space and FLOPs or latency level, but with only $\sim$60\% of the supernet training cost. By searching on a larger space, GreedyNAS can also obtain new state-of-the-art architectures.
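A toy sketch of multi-path sampling with rejection and the empirical candidate path pool, assuming a path is a tuple of per-layer operation indices and `evaluate` returns a proxy score on a small surrogate validation split; all names are illustrative, not the authors' code.

```python
import heapq
import random

def sample_path(num_layers, num_ops):
    # a path = one operation index per layer of the supernet
    return tuple(random.randrange(num_ops) for _ in range(num_layers))

def greedy_sample_with_rejection(evaluate, num_layers, num_ops, pool,
                                 num_candidates=10, keep=5,
                                 pool_size=1000, explore=0.2):
    """Sample many paths, keep only the potentially-good ones.

    evaluate(path) -> proxy score on a small surrogate validation split.
    `pool` is the empirical candidate path pool (a min-heap of (score, path))
    used for exploitation; fresh random paths provide exploration.
    """
    candidates = []
    for _ in range(num_candidates):
        if pool and random.random() > explore:
            path = random.choice(pool)[1]            # exploit the pool
        else:
            path = sample_path(num_layers, num_ops)  # explore
        candidates.append((evaluate(path), path))
    candidates.sort(reverse=True)                    # rejection: drop weak paths
    for score, path in candidates[:keep]:
        heapq.heappush(pool, (score, path))
    while len(pool) > pool_size:
        heapq.heappop(pool)                          # evict the weakest
    return [path for _, path in candidates[:keep]]   # train only these
```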
19. ASFD: Automatic and Scalable Face Detector [PDF] 返回目录
Bin Zhang, Jian Li, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Yili Xia, Wenjiang Pei, Rongrong Ji
Abstract: In this paper, we propose a novel Automatic and Scalable Face Detector (ASFD), which is based on a combination of neural architecture search techniques and a new loss design. First, we propose an automatic feature-enhancement module named Auto-FEM, found by an improved differentiable architecture search, which enables efficient multi-scale feature fusion and context enhancement. Second, we use a Distance-based Regression and Margin-based Classification (DRMC) multi-task loss to predict accurate bounding boxes and learn highly discriminative deep features. Third, we adopt compound scaling methods and uniformly scale the backbone, feature modules, and head networks to develop a family of ASFD detectors, which are consistently more efficient than state-of-the-art face detectors. Extensive experiments conducted on popular benchmarks, e.g., WIDER FACE and FDDB, demonstrate that our ASFD-D6 outperforms prior strong competitors, and our lightweight ASFD-D0 runs at more than 120 FPS with a MobileNet backbone on VGA-resolution images.
20. A New Multiple Max-pooling Integration Module and Cross Multiscale Deconvolution Network Based on Image Semantic Segmentation [PDF] 返回目录
Hongfeng You, Shengwei Tian, Long Yu, Xiang Ma, Yan Xing, Ning Xin
Abstract: To better retain the deep features of an image and solve the sparsity problem of end-to-end segmentation models, we propose a new deep convolutional network for medical image pixel segmentation, called MC-Net. The core of this network consists of four parts: an encoder network, a multiple max-pooling integration module, a cross multiscale deconvolution decoder network, and a pixel-level classification layer. In the encoder, we use multiscale convolution instead of the traditional single-channel convolution. The multiple max-pooling integration module first integrates the output features of each submodule of the encoder network and reduces the number of parameters via convolution with a kernel size of 1; at the same time, a max-pooling layer (with a different pooling size per layer) is appended after each convolution to achieve translation invariance of each submodule's feature maps. We use the output feature maps of the multiple max-pooling integration module as the input of the decoder network; the multiscale convolution of each submodule in the decoder network is cross-fused with the feature maps generated by the corresponding multiscale convolution in the encoder network. The above feature-map processing solves the sparsity problem of the matrix generated after max-pooling and enhances the robustness of the classification. We compare our proposed model with the well-known Fully Convolutional Networks for Semantic Segmentation (FCNs), DeconvNet, PSPNet, U-Net, SegNet, and other state-of-the-art segmentation networks such as HyperDenseNet, MS-Dual, ESPNetv2, and DenseASPP, using one binary dataset (the Kaggle 2018 Data Science Bowl) and two multiclass datasets, and obtain encouraging experimental results.
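A rough sketch of what a multiple max-pooling integration module could look like, assuming the encoder submodules emit feature maps at possibly different resolutions; shapes and names are our assumptions, not the released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxPoolIntegration(nn.Module):
    """Toy multiple max-pooling integration module.

    Resizes feature maps from several encoder submodules to a common
    resolution, fuses them with a 1x1 convolution (to cut parameters),
    and applies max-pooling for translation invariance.
    """
    def __init__(self, in_channels_list, out_channels, pool_size=2):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_channels_list), out_channels, kernel_size=1)
        self.pool = nn.MaxPool2d(pool_size)

    def forward(self, feature_maps):
        target = feature_maps[0].shape[-2:]
        resized = [F.interpolate(f, size=target, mode="bilinear",
                                 align_corners=False) for f in feature_maps]
        fused = self.fuse(torch.cat(resized, dim=1))  # 1x1 channel reduction
        return self.pool(F.relu(fused))
```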
21. Two-stage Discriminative Re-ranking for Large-scale Landmark Retrieval [PDF] 返回目录
Shuhei Yokoo, Kohei Ozaki, Edgar Simo-Serra, Satoshi Iizuka
Abstract: We propose an efficient pipeline for large-scale landmark image retrieval that addresses the diversity of the dataset through two-stage discriminative re-ranking. Our approach is based on embedding the images in a feature space using a convolutional neural network trained with a cosine softmax loss. Due to the variance of the images, which includes extreme viewpoint changes such as having to retrieve images of the exterior of a landmark from images of the interior, this is very challenging for approaches based exclusively on visual similarity. Our proposed re-ranking approach improves the results in two steps: in the sort-step, we use $k$-nearest neighbor search with soft-voting to sort the retrieved results based on their label similarity to the query images; in the insert-step, we add samples from the dataset that were not retrieved by image similarity. This approach overcomes the low visual diversity in retrieved images. In-depth experimental results show that the proposed approach significantly outperforms existing approaches on the challenging Google Landmarks datasets. Using our methods, we achieved 1st place in the Google Landmark Retrieval 2019 challenge and 3rd place in the Google Landmark Recognition 2019 challenge on Kaggle. Our code is publicly available here: \url{this https URL}
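A minimal sketch of the sort-step only, assuming L2-normalized embeddings and known gallery labels; the insert-step and the paper's exact voting scheme are not reproduced.

```python
import numpy as np

def soft_vote_rerank(query_emb, gallery_embs, gallery_labels, retrieved_ids, k=5):
    """Sort-step of a two-stage re-ranking (toy version).

    Scores each retrieved image by the soft-voted label affinity of the
    query's k nearest gallery neighbors, then re-orders the retrieved list.
    """
    sims = gallery_embs @ query_emb               # cosine sims (L2-normalized)
    knn = np.argsort(-sims)[:k]
    votes = {}
    for idx in knn:                               # soft voting: weight by sim
        lbl = gallery_labels[idx]
        votes[lbl] = votes.get(lbl, 0.0) + float(sims[idx])
    scores = [votes.get(gallery_labels[i], 0.0) for i in retrieved_ids]
    order = np.argsort(-np.asarray(scores), kind="stable")
    return [retrieved_ids[i] for i in order]
```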
22. Prior-enlightened and Motion-robust Video Deblurring [PDF] 返回目录
Ya Zhou, Jianfeng Xu, Kazuyuki Tasakab, Zhibo Chen, Weiping Li
Abstract: Various blur distortions in video negatively affect both human viewing and video-based applications, which makes motion-robust deblurring methods urgently needed. Most existing works have strong dataset dependency and limited generalization ability in handling challenging scenarios, such as blur in low-contrast or severe-motion areas, and non-uniform blur. Therefore, we propose a PRiOr-enlightened and MOTION-robust video deblurring model (PROMOTION) suitable for challenging blurs. On the one hand, we use 3D group convolution to efficiently encode heterogeneous prior information, explicitly enhancing the scene's perception while mitigating the output's artifacts. On the other hand, we design priors representing the blur distribution to better handle non-uniform blur in the spatio-temporal domain. Besides the global blur caused by classical camera shake, we also demonstrate generalization to downstream tasks suffering from local blur. Extensive experiments demonstrate that we achieve state-of-the-art performance on the well-known REDS and GoPro datasets and bring gains on machine vision tasks.
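As a rough illustration of encoding heterogeneous priors with 3D group convolution: frames are stacked with a per-frame prior map, and `groups` keeps the information sources separate. The channel layout here is an assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Assume two sources: RGB video frames and a 3-channel per-frame blur-prior
# map. groups=2 makes each filter group see only one source, so the prior is
# encoded separately from the frames within one spatio-temporal convolution.
frames = torch.randn(1, 3, 8, 64, 64)   # (N, C, T, H, W)
priors = torch.randn(1, 3, 8, 64, 64)
x = torch.cat([frames, priors], dim=1)  # (1, 6, 8, 64, 64)

encoder = nn.Conv3d(in_channels=6, out_channels=32, kernel_size=3,
                    padding=1, groups=2)
features = encoder(x)                   # (1, 32, 8, 64, 64)
```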
23. Holopix50k: A Large-Scale In-the-wild Stereo Image Dataset [PDF] 返回目录
Yiwen Hua, Puneet Kohli, Pritish Uplavikar, Anand Ravi, Saravana Gunaseelan, Jason Orozco, Edward Li
Abstract: With the mass-market adoption of dual-camera mobile phones, leveraging stereo information in computer vision has become increasingly important. Current state-of-the-art methods utilize learning-based algorithms, where the amount and quality of training samples heavily influence results. Existing stereo image datasets are limited either in size or subject variety. Hence, algorithms trained on such datasets do not generalize well to scenarios encountered in mobile photography. We present Holopix50k, a novel in-the-wild stereo image dataset, comprising 49,368 image pairs contributed by users of the Holopix mobile social platform. In this work, we describe our data collection process and statistically compare our dataset to other popular stereo datasets. We experimentally show that using our dataset significantly improves results for tasks such as stereo super-resolution and self-supervised monocular depth estimation. Finally, we showcase practical applications of our dataset to motivate novel works and use cases. The Holopix50k dataset is available at this http URL
24. Fusing Wearable IMUs with Multi-View Images for Human Pose Estimation: A Geometric Approach [PDF] 返回目录
Zhe Zhang, Chunyu Wang, Wenhu Qin, Wenjun Zeng
Abstract: We propose to estimate 3D human pose from multi-view images and a few IMUs attached to a person's limbs. The approach operates by first detecting 2D poses from the two signals and then lifting them to 3D space. We present a geometric approach to reinforce the visual features of each pair of joints based on the IMUs. This notably improves 2D pose estimation accuracy, especially when one joint is occluded. We call this approach the Orientation Regularized Network (ORN). We then lift the multi-view 2D poses to 3D space with an Orientation Regularized Pictorial Structure Model (ORPSM), which jointly minimizes the projection error between the 3D and 2D poses and the discrepancy between the 3D pose and the IMU orientations. The simple two-step approach reduces the error of the state of the art by a large margin on a public dataset. Our code will be released at this https URL.
25. A Systematic Evaluation: Fine-Grained CNN vs. Traditional CNN Classifiers [PDF] 返回目录
Saeed Anwar, Nick Barnes, Lars Petersson
Abstract: To make the best use of the underlying minute and subtle differences, fine-grained classifiers collect information about inter-class variations. The task is very challenging due to the small differences between the colors, viewpoints, and structures of entities in the same class. Classification becomes even more difficult because viewpoint-induced variations within a class can resemble the variations between classes. In this work, we investigate the performance of landmark general-purpose CNN classifiers, which have presented top-notch results on large-scale classification datasets, on fine-grained datasets, and compare them against state-of-the-art fine-grained classifiers. In this paper, we pose two specific questions: (i) do general CNN classifiers achieve results comparable to fine-grained classifiers? (ii) do general CNN classifiers require any specific information to improve upon fine-grained ones? Throughout this work, we train the general CNN classifiers without introducing any aspect that is specific to fine-grained datasets. We present an extensive evaluation on six datasets to determine whether the fine-grained classifiers are able to elevate the baseline in their experiments.
26. Adversarial Light Projection Attacks on Face Recognition Systems: A Feasibility Study [PDF] 返回目录
Luan Nguyen, Sunpreet S. Arora, Yuhang Wu, Hao Yang
Abstract: Deep learning-based systems have been shown to be vulnerable to adversarial attacks in both digital and physical domains. While feasible, digital attacks have limited applicability in attacking deployed systems, including face recognition systems, where an adversary typically has access to the input and not the transmission channel. In such a setting, physical attacks that directly provide a malicious input through the input channel pose a bigger threat. We investigate the feasibility of conducting real-time physical attacks on face recognition systems using adversarial light projections. A setup comprising a commercially available web camera and a projector is used to conduct the attack. The adversary uses a transformation-invariant adversarial pattern generation method to generate a digital adversarial pattern using one or more images of the target available to the adversary. The digital adversarial pattern is then projected onto the adversary's face in the physical domain to either impersonate a target (impersonation) or evade recognition (obfuscation). We conduct preliminary experiments using two open-source and one commercial face recognition system on a pool of 50 subjects. Our experimental results demonstrate the vulnerability of face recognition systems to light projection attacks in both white-box and black-box attack settings.
27. BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models [PDF] 返回目录
Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xiaodan Song, Ruoming Pang, Quoc Le
Abstract: Neural architecture search (NAS) has shown promising results discovering models that are both accurate and fast. For NAS, training a one-shot model has become a popular strategy to rank the relative quality of different architectures (child models) using a single set of shared weights. However, while one-shot model weights can effectively rank different network architectures, the absolute accuracies from these shared weights are typically far below those obtained from stand-alone training. To compensate, existing methods assume that the weights must be retrained, finetuned, or otherwise post-processed after the search is completed. These steps significantly increase the compute requirements and complexity of architecture search and model deployment. In this work, we propose BigNAS, an approach that challenges the conventional wisdom that post-processing of the weights is necessary to get good prediction accuracies. Without extra retraining or post-processing steps, we are able to train a single set of shared weights on ImageNet and use these weights to obtain child models whose sizes range from 200 to 1000 MFLOPs. Our discovered model family, BigNASModels, achieves top-1 accuracies ranging from 76.5% to 80.9%, surpassing state-of-the-art models in this range, including EfficientNets and Once-for-All networks, without extra retraining or post-processing. We present an ablation study and analysis to further understand the proposed BigNASModels.
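A toy single-stage training step in the spirit of the sandwich rule that this line of work builds on: the largest, smallest, and a few random child models all backpropagate into the shared weights, so children need no retraining afterwards. `sample_child` is a hypothetical supernet API, not the released code.

```python
import torch

def train_step(supernet, images, labels, criterion, optimizer, num_random=2):
    """One single-stage (sandwich-style) training step, toy version.

    Trains shared weights so child models of many sizes stay accurate
    *without* retraining: the largest child, the smallest child, and a few
    random children all contribute gradients. `sample_child` is assumed to
    return a callable child model that shares the supernet's weights.
    """
    optimizer.zero_grad()
    children = [supernet.sample_child("max"), supernet.sample_child("min")]
    children += [supernet.sample_child("random") for _ in range(num_random)]
    for child in children:
        loss = criterion(child(images), labels)
        loss.backward()           # gradients accumulate in the shared weights
    optimizer.step()
```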
28. Joint Deep Cross-Domain Transfer Learning for Emotion Recognition [PDF] 返回目录
Dung Nguyen, Sridha Sridharan, Duc Thanh Nguyen, Simon Denman, Son N. Tran, Rui Zeng, Clinton Fookes
Abstract: Deep learning has been applied to achieve significant progress in emotion recognition. Despite such substantial progress, existing approaches are still hindered by insufficient training data, and the resulting models do not generalize well under mismatched conditions. To address this challenge, we propose a learning strategy which jointly transfers the knowledge learned from rich datasets to source-poor datasets. Our method is also able to learn cross-domain features which lead to improved recognition performance. To demonstrate the robustness of our proposed framework, we conducted experiments on three benchmark emotion datasets including eNTERFACE, SAVEE, and EMODB. Experimental results show that the proposed method surpassed state-of-the-art transfer learning schemes by a significant margin.
29. PADS: Policy-Adapted Sampling for Visual Similarity Learning [PDF] 返回目录
Karsten Roth, Timo Milbich, Björn Ommer
Abstract: Learning visual similarity requires learning relations, typically between triplets of images. Although triplet approaches are powerful, their computational complexity mostly limits training to only a subset of all possible training triplets. Thus, sampling strategies that decide when to use which training sample during learning are crucial. Currently, the prominent paradigm is fixed or curriculum sampling strategies that are predefined before training starts. However, the problem truly calls for a sampling process that adjusts based on the actual state of the similarity representation during training. We therefore employ reinforcement learning and have a teacher network adjust the sampling distribution based on the current state of the learner network, which represents visual similarity. Experiments on benchmark datasets using standard triplet-based losses show that our adaptive sampling strategy significantly outperforms fixed sampling strategies. Moreover, although our adaptive sampling is only applied on top of basic triplet-learning frameworks, we reach results competitive with state-of-the-art approaches that employ diverse additional learning signals or strong ensemble architectures. Code can be found under this https URL.
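A heavily simplified sketch of the adaptive-sampling idea: a categorical distribution over negative-example difficulty bins, nudged by a REINFORCE-style update with the learner's validation improvement as reward. The paper's actual policy network conditions on the learner state; this only keeps the adjustable-distribution core, and all names are illustrative.

```python
import numpy as np

class SamplingPolicy:
    """Toy adjustable sampling distribution over difficulty bins."""
    def __init__(self, num_bins=10, lr=0.1):
        self.logits = np.zeros(num_bins)
        self.lr = lr

    def probs(self):
        e = np.exp(self.logits - self.logits.max())
        return e / e.sum()

    def sample_bin(self):
        # pick the difficulty bin from which the next negative is drawn
        return int(np.random.choice(len(self.logits), p=self.probs()))

    def update(self, chosen_bin, reward):
        # REINFORCE: grad log p(bin) = one_hot(bin) - probs
        grad = -self.probs()
        grad[chosen_bin] += 1.0
        self.logits += self.lr * reward * grad

# usage: reward = validation recall after this interval minus before it
```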
30. G2L-Net: Global to Local Network for Real-time 6D Pose Estimation with Embedding Vector Features [PDF] 返回目录
Wei Chen, Xi Jia, Hyung Jin Chang, Jinming Duan, Ales Leonardis
Abstract: In this paper, we propose a novel real-time 6D object pose estimation framework, named G2L-Net. Our network operates on point clouds from RGB-D detection in a divide-and-conquer fashion. Specifically, our network consists of three steps. First, we extract the coarse object point cloud from the RGB-D image by 2D detection. Second, we feed the coarse object point cloud to a translation localization network to perform 3D segmentation and object translation prediction. Third, via the predicted segmentation and translation, we transfer the fine object point cloud into a local canonical coordinate frame, in which we train a rotation localization network to estimate the initial object rotation. In the third step, we define point-wise embedding vector features to capture viewpoint-aware information. To calculate a more accurate rotation, we adopt a rotation residual estimator that estimates the residual between the initial rotation and the ground truth, which boosts initial pose estimation performance. Our proposed G2L-Net runs in real time despite the fact that multiple steps are stacked. Extensive experiments on two benchmark datasets show that the proposed method achieves state-of-the-art performance in terms of both accuracy and speed.
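A minimal sketch of a rotation head with a residual estimator, assuming an axis-angle parameterization for brevity (the paper's representation and losses differ); names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class RotationWithResidual(nn.Module):
    """Toy rotation head plus residual refinement.

    One head predicts an initial rotation from point features; a second head
    regresses the residual between that initial estimate and the ground
    truth, and the two are summed to form the refined rotation.
    """
    def __init__(self, feat_dim=256):
        super().__init__()
        self.init_head = nn.Linear(feat_dim, 3)        # coarse axis-angle
        self.residual_head = nn.Linear(feat_dim + 3, 3)

    def forward(self, feats):
        r_init = self.init_head(feats)
        r_res = self.residual_head(torch.cat([feats, r_init], dim=-1))
        return r_init + r_res, r_init   # refined and initial rotation

# training sketch: supervise both, e.g.
# loss = ||r_init - r_gt|| + ||(r_init + r_res) - r_gt||
```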
31. Bootstrapping Weakly Supervised Segmentation-free Word Spotting through HMM-based Alignment [PDF] 返回目录
Tomas Wilkinson, Carl Nettelblad
Abstract: Recent work in word spotting in handwritten documents has yielded impressive results. This progress has largely been made by supervised learning systems, which are dependent on manually annotated data, making deployment to new collections a significant effort. In this paper, we propose an approach that utilises transcripts without bounding box annotations to train segmentation-free query-by-string word spotting models, given a partially trained model. This is done through a training-free alignment procedure based on hidden Markov models. This procedure creates a tentative mapping between word region proposals and the transcriptions to automatically create additional weakly annotated training data, without choosing any single alignment possibility as the correct one. When only using between 1% and 7% of the fully annotated training sets for partial convergence, we automatically annotate the remaining training data and successfully train using it. On all our datasets, our final trained model then comes within a few mAP% of the performance from a model trained with the full training set used as ground truth. We believe that this will be a significant advance towards a more general use of word spotting, since digital transcription data will already exist for parts of many collections of interest.
32. A Survey of Methods for Low-Power Deep Learning and Computer Vision [PDF] 返回目录
Abhinav Goel, Caleb Tung, Yung-Hsiang Lu, George K. Thiruvathukal
Abstract: Deep neural networks (DNNs) are successful in many computer vision tasks. However, the most accurate DNNs require millions of parameters and operations, making them energy, computation and memory intensive. This impedes the deployment of large DNNs in low-power devices with limited compute resources. Recent research improves DNN models by reducing the memory requirement, energy consumption, and number of operations without significantly decreasing the accuracy. This paper surveys the progress of low-power deep learning and computer vision, specifically in regards to inference, and discusses the methods for compacting and accelerating DNN models. The techniques can be divided into four major categories: (1) parameter quantization and pruning, (2) compressed convolutional filters and matrix factorization, (3) network architecture search, and (4) knowledge distillation. We analyze the accuracy, advantages, disadvantages, and potential solutions to the problems with the techniques in each category. We also discuss new evaluation metrics as a guideline for future research.
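As a concrete taste of category (1), here is a minimal numpy sketch of magnitude pruning and uniform 8-bit symmetric quantization; real deployments use finer-grained schemes (per-channel scales, structured sparsity, quantization-aware training).

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights (pruning)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    thresh = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= thresh, 0.0, weights)

def quantize_uniform(weights: np.ndarray, bits: int = 8):
    """Uniform symmetric quantization (parameter quantization)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(weights).max()) / qmax, 1e-12)
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale  # dequantize with q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.5)   # 50% of weights set to zero
q, scale = quantize_uniform(w_sparse)         # int8 weights + one fp scale
```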
33. Deformable Style Transfer [PDF] 返回目录
Sunnie S. Y. Kim, Nicholas Kolkin, Jason Salavon, Gregory Shakhnarovich
Abstract: Geometry and shape are fundamental aspects of visual style. Existing style transfer methods focus on texture-like components of style, ignoring geometry. We propose deformable style transfer (DST), an optimization-based approach that integrates texture and geometry style transfer. Our method is the first to allow geometry-aware stylization not restricted to any domain and not requiring training sets of matching style/content pairs. We demonstrate our method on a diverse set of content and style images including portraits, animals, objects, scenes, and paintings.
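A rough sketch of a joint objective in the DST spirit: warp the stylized image by a deformation field, apply standard content/style terms, and regularize the deformation. The dense flow and the `extract` feature hook are our simplifications (the paper guides the deformation via matched keypoints), so treat this as an assumption-laden illustration.

```python
import torch
import torch.nn.functional as F

def dst_objective(output, flow, content_feats, style_grams, extract,
                  tv_weight=1e-4):
    """Toy deformable style transfer objective.

    output: stylized image being optimized, (1, 3, H, W).
    flow:   dense deformation field being optimized, (1, 2, H, W),
            expressed as offsets in [-1, 1] grid coordinates.
    extract(img) -> (feature maps, Gram matrices); assumed given.
    """
    n, _, h, w = output.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0)    # identity grid
    grid = base + flow.permute(0, 2, 3, 1)               # deformed grid
    warped = F.grid_sample(output, grid, align_corners=False)

    feats, grams = extract(warped)
    content_loss = sum(F.mse_loss(f, c) for f, c in zip(feats, content_feats))
    style_loss = sum(F.mse_loss(g, s) for g, s in zip(grams, style_grams))
    tv = (flow[..., 1:, :] - flow[..., :-1, :]).abs().mean() + \
         (flow[..., :, 1:] - flow[..., :, :-1]).abs().mean()
    return content_loss + style_loss + tv_weight * tv    # smooth deformation
```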
34. Visual-Inertial Telepresence for Aerial Manipulation [PDF] 返回目录
Jongseok Lee, Ribin Balachandran, Yuri S. Sarkisov, Marco De Stefano, Andre Coelho, Kashmira Shinde, Min Jun Kim, Rudolph Triebel, Konstantin Kondak
Abstract: This paper presents a novel telepresence system for enhancing aerial manipulation capabilities. It involves not only a haptic device, but also a virtual reality interface that provides 3D visual feedback to a remotely-located teleoperator in real time. We achieve this by utilizing onboard visual and inertial sensors, an object tracking algorithm, and a pre-generated object database. As the virtual reality has to closely match the real remote scene, we propose an extension of a marker tracking algorithm with visual-inertial odometry. Both indoor and outdoor experiments show the benefits of our proposed system in achieving advanced aerial manipulation tasks, namely grasping, placing, force exertion, and peg-in-hole insertion.
35. Not all domains are equally complex: Adaptive Multi-Domain Learning [PDF] 返回目录
Ali Senhaji, Jenni Raitoharju, Moncef Gabbouj, Alexandros Iosifidis
Abstract: Deep learning approaches are highly specialized and require training separate models for different tasks. Multi-domain learning looks at ways to learn a multitude of different tasks, each coming from a different domain, at once. The most common approach in multi-domain learning is to form a domain-agnostic model, the parameters of which are shared among all domains, and to learn a small number of extra domain-specific parameters for each individual new domain. However, different domains come with different levels of difficulty; parameterizing the models of all domains using an augmented version of the domain-agnostic model leads to unnecessarily inefficient solutions, especially for easy-to-solve tasks. We propose an adaptive parameterization approach to deep neural networks for multi-domain learning. The proposed approach performs on par with the original approach while using far fewer parameters, leading to efficient multi-domain learning solutions.
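A minimal sketch of the shared-plus-adapter parameterization that multi-domain learning builds on, assuming residual 1x1 adapters; an adaptive scheme in the spirit of this paper would then vary the adapter capacity (or skip adapters entirely) according to each domain's difficulty. Names are illustrative.

```python
import torch
import torch.nn as nn

class DomainAdapterBlock(nn.Module):
    """Shared backbone convolution plus a cheap per-domain residual adapter."""
    def __init__(self, channels, num_domains):
        super().__init__()
        self.shared = nn.Conv2d(channels, channels, 3, padding=1)  # all domains
        self.adapters = nn.ModuleList(
            nn.Conv2d(channels, channels, 1) for _ in range(num_domains))

    def forward(self, x, domain: int):
        h = self.shared(x)
        return h + self.adapters[domain](h)  # domain-specific correction

# usage: block(images, domain=3) routes through the shared weights plus
# only the 1x1 adapter of domain 3.
```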
36. Commentaries on "Learning Sensorimotor Control with Neuromorphic Sensors: Toward Hyperdimensional Active Perception" [Science Robotics Vol. 4 Issue 30 (2019) 1-10] [PDF] 返回目录
Denis Kleyko, Ross W. Gayler, Evgeny Osipov
Abstract: This correspondence comments on the findings reported in a recent Science Robotics article by Mitrokhin et al. [1]. The main goal of this commentary is to expand on some of the issues touched on in that article. Our experience is that hyperdimensional computing is very different from other approaches to computation and that it can take considerable exposure to its concepts before attaining a practically useful understanding. Therefore, in order to provide an overview of the area to the first-time reader of [1], the commentary includes a brief historical overview and connects the findings of the article to the larger body of literature existing in the area.
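For the first-time reader, a toy example of the basic hyperdimensional-computing operations the commentary surveys (random bipolar hypervectors, binding by elementwise multiplication, bundling by majority vote) may help; this is a generic illustration, not code from [1]:

    import numpy as np

    rng = np.random.default_rng(0)
    D = 10_000  # hypervectors are very wide; two random ones are nearly orthogonal

    def random_hv():
        return rng.choice([-1, 1], size=D)

    def bind(a, b):
        return a * b  # binding: elementwise multiply (XOR-like, self-inverse)

    def bundle(*hvs):
        return np.sign(np.sum(hvs, axis=0))  # bundling: majority vote, ties left at 0

    def sim(a, b):
        return a @ b / D  # normalized similarity in [-1, 1]

    color, shape = random_hv(), random_hv()
    red, ball = random_hv(), random_hv()
    record = bundle(bind(color, red), bind(shape, ball))  # {color: red, shape: ball}
    # Unbinding with a role recovers a noisy copy of its filler.
    print(sim(bind(record, color), red))   # clearly positive (about 0.5)
    print(sim(bind(record, color), ball))  # about 0: unrelated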
37. Content Adaptive and Error Propagation Aware Deep Video Compression [PDF] 返回目录
Guo Lu, Chunlei Cai, Xiaoyun Zhang, Li Chen, Wanli Ouyang, Dong Xu, Zhiyong Gao
Abstract: Recently, learning-based video compression methods have attracted increasing attention. However, previous works suffer from error propagation due to the accumulation of reconstruction error in inter-predictive coding. Meanwhile, previous learning-based video codecs are also not adaptive to different video contents. To address these two problems, we propose a content-adaptive and error-propagation-aware video compression system. Specifically, our method employs a joint training strategy that considers the compression performance of multiple consecutive frames instead of a single frame. Based on the learned long-term temporal information, our approach effectively alleviates error propagation in reconstructed frames. More importantly, instead of using the hand-crafted coding modes of traditional compression systems, we design an online encoder-updating scheme in our system. The proposed approach updates the parameters of the encoder according to the rate-distortion criterion but keeps the decoder unchanged in the inference stage. Therefore, the encoder is adaptive to different video contents and achieves better compression performance by reducing the domain gap between the training and testing datasets. Our method is simple yet effective and outperforms the state-of-the-art learning-based video codecs on benchmark datasets without increasing the model size or decreasing the decoding speed.
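A minimal sketch of what encoder-only online updating under a rate-distortion criterion could look like; encoder, decoder, and rate_fn are placeholder modules (the paper's actual entropy model and training objective are not reproduced here), and the lambda weighting is illustrative:

    import torch

    def online_encoder_update(encoder, decoder, frames, rate_fn,
                              lam=0.01, steps=10):
        # Freeze the decoder so the bitstream stays decodable by the
        # unchanged decoder at inference time.
        for p in decoder.parameters():
            p.requires_grad_(False)
        opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
        for _ in range(steps):
            latents = encoder(frames)       # multiple consecutive frames
            rate = rate_fn(latents)         # estimated bits for the latents
            distortion = torch.mean((decoder(latents) - frames) ** 2)
            loss = distortion + lam * rate  # rate-distortion cost D + lambda*R
            opt.zero_grad()
            loss.backward()
            opt.step()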
38. Safety-Aware Hardening of 3D Object Detection Neural Network Systems [PDF] 返回目录
Chih-Hong Cheng
Abstract: We study how state-of-the-art neural networks for 3D object detection using a single-stage pipeline can be made safety-aware. We start with the safety specification (reflecting the capability of other components) that partitions the 3D input space by criticality, where the critical area employs a separate criterion on robustness under perturbation, quality of bounding boxes, and the tolerance over false negatives demonstrated on the training set. In the architecture design, we consider symbolic error propagation to allow feature-level perturbation. Subsequently, we introduce a specialized loss function reflecting (1) the safety specification, (2) the use of a single-stage detection architecture, and finally, (3) the characterization of robustness under perturbation. We also replace the commonly seen non-max-suppression post-processing algorithm with a safety-aware non-max-inclusion algorithm, in order to maintain the safety claim created by the neural network. The concept is detailed by extending the state-of-the-art PIXOR detector, which creates object bounding boxes in bird's eye view from point cloud inputs.
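The abstract does not define non-max-inclusion; one plausible reading is that, instead of discarding overlapping lower-scored boxes as non-max-suppression does, the kept box is grown to enclose them, erring toward larger (safer) detections. A toy sketch under that speculative assumption:

    import numpy as np

    def iou(a, b):
        # Intersection-over-union of two boxes [x1, y1, x2, y2].
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def non_max_inclusion(boxes, scores, thr=0.5):
        # Visit boxes by descending score; merge an overlapping box into
        # the kept box's enclosing extent instead of suppressing it.
        kept = []
        for i in np.argsort(scores)[::-1]:
            box = boxes[i].astype(float)
            for k in kept:
                if iou(k, box) > thr:
                    k[:2] = np.minimum(k[:2], box[:2])
                    k[2:] = np.maximum(k[2:], box[2:])
                    break
            else:
                kept.append(box)
        return np.array(kept)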
39. Aerial Imagery based LIDAR Localization for Autonomous Vehicles [PDF] 返回目录
Ankit Vora, Siddharth Agarwal, Gaurav Pandey, James McBride
Abstract: This paper presents a localization technique using aerial imagery maps and LIDAR-based ground reflectivity for autonomous vehicles in urban environments. Traditional localization techniques using LIDAR reflectivity rely on high-definition reflectivity maps generated by a mapping vehicle. The cost and effort required to maintain such prior maps are generally very high because a fleet of expensive mapping vehicles is needed. In this work we propose a localization technique in which the vehicle localizes using aerial/satellite imagery, eradicating the need to develop and maintain complex high-definition maps. The proposed technique has been tested on a real-world dataset collected from a test track in Ann Arbor, Michigan. This research concludes that aerial-imagery-based maps provide real-time localization performance similar to state-of-the-art LIDAR-based maps for autonomous vehicles in urban environments, at reduced cost.
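As a rough illustration of the matching step such a system needs, one can imagine sliding a rasterized LIDAR ground-reflectivity patch over the aerial image around the pose predicted by odometry and scoring candidate offsets with normalized cross-correlation; the paper's actual formulation is not reproduced here, and the toy below glosses over the appearance gap between the two modalities:

    import numpy as np

    def ncc(a, b):
        a = (a - a.mean()) / (a.std() + 1e-9)
        b = (b - b.mean()) / (b.std() + 1e-9)
        return float((a * b).mean())

    def localize(aerial_map, lidar_patch, search=20):
        # aerial_map must exceed the patch by 2*search pixels per axis,
        # with the odometry-predicted position at offset (search, search).
        h, w = lidar_patch.shape
        best, best_off = -np.inf, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                win = aerial_map[search + dy:search + dy + h,
                                 search + dx:search + dx + w]
                score = ncc(win, lidar_patch)
                if score > best:
                    best, best_off = score, (dx, dy)
        return best_off, best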
40. Patch-based Non-Local Bayesian Networks for Blind Confocal Microscopy Denoising [PDF] 返回目录
Saeed Izadi, Ghassan Hamarneh
Abstract: Confocal microscopy is essential for histopathologic cell visualization and quantification. Despite its significant role in biology, fluorescence confocal microscopy suffers from the presence of inherent noise during image acquisition. Non-local patch-wise Bayesian mean filtering (NLB) was until recently the state-of-the-art denoising approach. However, classic denoising methods have been outperformed by neural networks in recent years. In this work, we propose to exploit the strengths of NLB in the framework of Bayesian deep learning. We do so by designing a convolutional neural network and training it to learn the parameters of a Gaussian model approximating the prior on noise-free patches, given their nearest, similar yet non-local, neighbors. We then apply Bayesian reasoning to leverage the prior and the information from the noisy patch in the process of approximating the noise-free patch. Specifically, we use the closed-form analytic \textit{maximum a posteriori} (MAP) estimate in the NLB algorithm to obtain the noise-free patch that maximizes the posterior distribution. The performance of our proposed method is evaluated on confocal microscopy images with real Poisson-Gaussian noise. Our experiments reveal the superiority of our approach over state-of-the-art unsupervised denoising techniques.
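For reference, the closed-form estimate the abstract refers to has the standard NLB form: with a Gaussian prior $\mathcal{N}(\mu, \Sigma)$ over noise-free patches (here, $\mu$ and $\Sigma$ are predicted by the network from the nearest non-local neighbors) and additive Gaussian noise of variance $\sigma^2$, the MAP estimate of the clean patch $x$ from the noisy patch $y$ is

    \hat{x} = \mu + \Sigma \left( \Sigma + \sigma^2 I \right)^{-1} (y - \mu),

which coincides with the posterior mean in the Gaussian case. How the method handles the Poisson-Gaussian statistics of real microscopy data (e.g., via a variance-stabilizing transform) is not specified in the abstract.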
41. PoisHygiene: Detecting and Mitigating Poisoning Attacks in Neural Networks [PDF] 返回目录
Junfeng Guo, Zelun Kong, Cong Liu
Abstract: The black-box nature of deep neural networks (DNNs) makes it easy for attackers to manipulate the behavior of a DNN through data poisoning. Being able to detect and mitigate poisoning attacks, typically categorized into backdoor and adversarial poisoning (AP), is critical in enabling the safe adoption of DNNs in many application domains. Although recent works demonstrate encouraging results on the detection of certain backdoor attacks, they exhibit inherent limitations which may significantly constrain their applicability. Indeed, no technique exists for detecting AP attacks, which represents a harder challenge given that such attacks exhibit no common and explicit rules while backdoor attacks do (i.e., embedding backdoor triggers into poisoned data). We believe the key to detecting and mitigating AP attacks is the capability of observing and leveraging essential poisoning-induced properties within an infected DNN model. In this paper, we present PoisHygiene, the first effective and robust detection and mitigation framework against AP attacks. PoisHygiene is fundamentally motivated by the story of Ernest Rutherford (the 1908 Nobel Prize winner) observing the structure of the atom through random electron sampling.
42. Removing Dynamic Objects for Static Scene Reconstruction using Light Fields [PDF] 返回目录
Pushyami Kaveti, Sammie Katt, Hanumant Singh
Abstract: There is a general expectation that robots should operate in environments that consist of static and dynamic entities, including people, furniture and automobiles. These dynamic environments pose challenges to visual simultaneous localization and mapping (SLAM) algorithms by introducing errors into the front-end. Light fields provide one possible method for addressing such problems by capturing more complete visual information of a scene. In contrast to a single ray from a perspective camera, light fields capture a bundle of light rays emerging from a single point in space, allowing us to see through dynamic objects by refocusing past them. In this paper we present a method to synthesize a refocused image of the static background in the presence of dynamic objects, using a light field acquired with a linear camera array. We simultaneously estimate both the depth and the refocused image of the static scene, using semantic segmentation to detect dynamic objects in a single time step. This eliminates the need for initializing a static map. The algorithm is parallelizable and is implemented on a GPU, allowing us to execute it at close to real-time speeds. We demonstrate the effectiveness of our method on real-world data acquired using a small robot with a five-camera array.
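To make the refocusing idea concrete, here is a toy shift-and-average (synthetic-aperture) sketch for a linear camera array; the paper additionally masks dynamic-object pixels with semantic segmentation before averaging, which this simplification omits:

    import numpy as np

    def refocus(views, baselines, disparity):
        # Shift each view horizontally by (baseline * disparity) and average.
        # Points at the chosen focal depth align across views and stay sharp;
        # dynamic objects at other depths shift differently per view and blur out.
        out = np.zeros_like(views[0], dtype=float)
        for img, b in zip(views, baselines):
            out += np.roll(img, int(round(b * disparity)), axis=1)
        return out / len(views)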
43. ML-SIM: A deep neural network for reconstruction of structured illumination microscopy images [PDF] 返回目录
Charles N. Christensen, Edward N. Ward, Pietro Lio, Clemens F. Kaminski
Abstract: Structured illumination microscopy (SIM) has become an important technique for optical super-resolution imaging because it allows a doubling of image resolution at speeds compatible with live-cell imaging. However, the reconstruction of SIM images is often slow and prone to artefacts. Here we propose a versatile reconstruction method, ML-SIM, which makes use of machine learning. The model is an end-to-end deep residual neural network that is trained on a simulated data set to be free of common SIM artefacts. ML-SIM is thus robust to noise and to irregularities in the illumination patterns of the raw SIM input frames. The reconstruction method is widely applicable and does not require the acquisition of experimental training data. Since the training data are generated from simulations of the SIM process on images from generic libraries, the method can be efficiently adapted to specific experimental SIM implementations. The reconstruction quality enabled by our method is compared with that of traditional SIM reconstruction methods, and we demonstrate advantages in terms of noise, reconstruction fidelity and contrast for both simulated and experimental inputs. In addition, reconstruction of one SIM frame typically takes only ~100 ms on PCs with modern Nvidia graphics cards, making the technique compatible with real-time imaging. The full implementation and the trained networks are available at this http URL.
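The exact architecture is in the linked implementation; purely as an illustration of the input/output contract, a toy residual network mapping a raw SIM stack (typically 9 frames: 3 angles x 3 phases) to a single reconstructed image might look like the following, with made-up layer counts and widths:

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, padding=1))

        def forward(self, x):
            return x + self.body(x)

    # 9 raw SIM frames in, one reconstructed image out.
    net = nn.Sequential(
        nn.Conv2d(9, 64, 3, padding=1),
        *[ResBlock(64) for _ in range(8)],
        nn.Conv2d(64, 1, 3, padding=1))
    recon = net(torch.randn(1, 9, 256, 256))  # simulated SIM stack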
44. COVIDX-Net: A Framework of Deep Learning Classifiers to Diagnose COVID-19 in X-Ray Images [PDF] 返回目录
Ezz El-Din Hemdan, Marwa A. Shouman, Mohamed Esmail Karar
Abstract: Background and Purpose: Coronaviruses (CoV) are perilous viruses that may cause Severe Acute Respiratory Syndrome (SARS-CoV) and Middle East Respiratory Syndrome (MERS-CoV). The novel 2019 coronavirus disease (COVID-19) was first identified as a novel pneumonia in the city of Wuhan, China at the end of 2019. It has since become a worldwide coronavirus outbreak, and the numbers of infected people and deaths are increasing rapidly every day according to the updated reports of the World Health Organization (WHO). Therefore, the aim of this article is to introduce a new deep learning framework, namely COVIDX-Net, to assist radiologists in automatically diagnosing COVID-19 in X-ray images. Materials and Methods: Due to the lack of public COVID-19 datasets, the study is validated on 50 chest X-ray images, 25 of which are confirmed positive COVID-19 cases. COVIDX-Net includes seven different deep convolutional neural network architectures, such as the modified Visual Geometry Group Network (VGG19) and the second version of Google MobileNet. Each deep neural network model analyzes the normalized intensities of the X-ray image to classify the patient status as either a negative or a positive COVID-19 case. Results: Experiments and evaluation of COVIDX-Net were successfully performed based on an 80-20% split of the X-ray images between the model training and testing phases. The VGG19 and Dense Convolutional Network (DenseNet) models showed good and similar performance in automated COVID-19 classification, with f1-scores of 0.89 and 0.91 for normal and COVID-19, respectively. Conclusions: This study demonstrates the useful application of deep learning models to classify COVID-19 in X-ray images based on the proposed COVIDX-Net framework. Clinical studies are the next milestone of this research work.
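The abstract does not give the exact heads, input sizes or training hyper-parameters, so the following is only a hedged sketch of what one such transfer-learned classifier could look like, using a VGG19 backbone with a small binary head in Keras:

    import tensorflow as tf

    # ImageNet-pretrained VGG19 backbone, frozen; only the head is trained.
    base = tf.keras.applications.VGG19(weights="imagenet", include_top=False,
                                       input_shape=(224, 224, 3))
    base.trainable = False
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(2, activation="softmax"),  # normal vs COVID-19
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # With an 80-20% split: model.fit(x_train, y_train,
    #                                 validation_data=(x_test, y_test))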
45. CoCoPIE: Making Mobile AI Sweet As PIE --Compression-Compilation Co-Design Goes a Long Way [PDF] 返回目录
Shaoshan Liu, Bin Ren, Xipeng Shen, Yanzhi Wang
Abstract: Assuming hardware is the major constraint for enabling real-time mobile intelligence, the industry has mainly dedicated its efforts to developing specialized hardware accelerators for machine learning and inference. This article challenges that assumption. Drawing on a recent real-time AI optimization framework, CoCoPIE, it maintains that with effective compression-compiler co-design, it is possible to enable real-time artificial intelligence on mainstream end devices without special hardware. CoCoPIE is a software framework that holds numerous records on mobile AI: the first framework that supports all main kinds of DNNs, from CNNs to RNNs, transformers, language models, and so on; the fastest DNN pruning and acceleration framework, up to 180X faster compared with current DNN pruning on other frameworks such as TensorFlow-Lite; making many representative AI applications able to run in real time on off-the-shelf mobile devices, which had previously been regarded as possible only with special hardware support; and making off-the-shelf mobile devices outperform a number of representative ASIC and FPGA solutions in terms of energy efficiency and/or performance.
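CoCoPIE's pruning is pattern-based and co-designed with the compiler, which the abstract does not detail; the toy unstructured magnitude pruning below only illustrates the general idea of zeroing low-magnitude weights:

    import numpy as np

    def magnitude_prune(weights, sparsity=0.9):
        # Zero the smallest |w| so that roughly `sparsity` of them vanish.
        k = int(weights.size * sparsity)
        thresh = np.partition(np.abs(weights).ravel(), k)[k]
        mask = np.abs(weights) >= thresh
        return weights * mask, mask

    w = np.random.randn(256, 256)
    pruned, mask = magnitude_prune(w, sparsity=0.9)
    print(f"kept {mask.mean():.1%} of weights")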