
[arXiv Papers] Computer Vision and Pattern Recognition 2020-05-11

Contents

1. Condensed Movies: Story Based Retrieval with Contextual Embeddings [PDF] Abstract
2. Hyperspectral Image Restoration via Global Total Variation Regularized Local nonconvex Low-Rank matrix Approximation [PDF] Abstract
3. Data-Free Network Quantization With Adversarial Knowledge Distillation [PDF] Abstract
4. NTIRE 2020 Challenge on Real Image Denoising: Dataset, Methods and Results [PDF] Abstract
5. Sparsely-Labeled Source Assisted Domain Adaptation [PDF] Abstract
6. A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird's Eye View [PDF] Abstract
7. Fast Automatic Visibility Optimization for Thermal Synthetic Aperture Visualization [PDF] Abstract
8. TSDM: Tracking by SiamRPN++ with a Depth-refiner and a Mask-generator [PDF] Abstract
9. Introduction of a new Dataset and Method for Detecting and Counting the Pistachios based on Deep Learning [PDF] Abstract
10. On Vocabulary Reliance in Scene Text Recognition [PDF] Abstract
11. RetinaMask: A Face Mask detector [PDF] Abstract
12. Efficient convolutional neural networks with smaller filters for human activity recognition using wearable sensors [PDF] Abstract
13. Learning Generalized Spoof Cues for Face Anti-spoofing [PDF] Abstract
14. OpenEDS2020: Open Eyes Dataset [PDF] Abstract
15. Point Cloud Completion by Skip-attention Network with Hierarchical Folding [PDF] Abstract
16. Where am I looking at? Joint Location and Orientation Estimation by Cross-View Matching [PDF] Abstract
17. Synchronous Bidirectional Learning for Multilingual Lip Reading [PDF] Abstract
18. SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving [PDF] Abstract
19. Projection & Probability-Driven Black-Box Attack [PDF] Abstract
20. One-Shot Object Detection without Fine-Tuning [PDF] Abstract
21. Text Synopsis Generation for Egocentric Videos [PDF] Abstract
22. A Gaussian Process Upsampling Model for Improvements in Optical Character Recognition [PDF] Abstract
23. Recognizing Magnification Levels in Microscopic Snapshots [PDF] Abstract
24. Effective Data Fusion with Generalized Vegetation Index: Evidence from Land Cover Segmentation in Agriculture [PDF] Abstract
25. Neural Object Learning for 6D Pose Estimation Using a Few Cluttered Images [PDF] Abstract
26. Regularized Pooling [PDF] Abstract
27. Source-Relaxed Domain Adaptation for Image Segmentation [PDF] Abstract
28. Preprint: Using RF-DNA Fingerprints To Classify OFDM Transmitters Under Rayleigh Fading Conditions [PDF] Abstract
29. A Hybrid Method for Training Convolutional Neural Networks [PDF] Abstract
30. Multi-Phase Cross-modal Learning for Noninvasive Gene Mutation Prediction in Hepatocellular Carcinoma [PDF] Abstract
31. Hypergraph Learning for Identification of COVID-19 with CT Imaging [PDF] Abstract
32. Convolutional Sparse Support Estimator Based Covid-19 Recognition from X-ray Images [PDF] Abstract
33. DeepHist: Differentiable Joint and Color Histogram Layers for Image-to-Image Translation [PDF] Abstract
34. Compressive sensing with un-trained neural networks: Gradient descent finds the smoothest approximation [PDF] Abstract
35. Beyond CNNs: Exploiting Further Inherent Symmetries in Medical Images for Segmentation [PDF] Abstract
36. Is an Affine Constraint Needed for Affine Subspace Clustering? [PDF] Abstract
37. Synergistic Learning of Lung Lobe Segmentation and Hierarchical Multi-Instance Classification for Automated Severity Assessment of COVID-19 in CT Images [PDF] Abstract
38. Y-Net for Chest X-Ray Preprocessing: Simultaneous Classification of Geometry and Segmentation of Annotations [PDF] Abstract
39. Blind Backdoors in Deep Learning Models [PDF] Abstract
40. MLGaze: Machine Learning-Based Analysis of Gaze Error Patterns in Consumer Eye Tracking Systems [PDF] Abstract
41. Federated Generative Adversarial Learning [PDF] Abstract
42. ProSelfLC: Progressive Self Label Correction for Target Revising in Label Noise [PDF] Abstract
43. Planning from Images with Deep Latent Gaussian Process Dynamics [PDF] Abstract
44. A Hand Motion-guided Articulation and Segmentation Estimation [PDF] Abstract
45. Learning to Segment Actions from Observation and Narration [PDF] Abstract

Abstracts

1. Condensed Movies: Story Based Retrieval with Contextual Embeddings [PDF] Back to Contents
  Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman
Abstract: Our objective in this work is the long range understanding of the narrative structure of movies. Instead of considering the entire movie, we propose to learn from the key scenes of the movie, providing a condensed look at the full storyline. To this end, we make the following four contributions: (i) We create the Condensed Movie Dataset (CMD) consisting of the key scenes from over 3K movies: each key scene is accompanied by a high level semantic description of the scene, character face tracks, and metadata about the movie. Our dataset is scalable, obtained automatically from YouTube, and is freely available for anybody to download and use. It is also an order of magnitude larger than existing movie datasets in the number of movies; (ii) We introduce a new story-based text-to-video retrieval task on this dataset that requires a high level understanding of the plotline; (iii) We provide a deep network baseline for this task on our dataset, combining character, speech and visual cues into a single video embedding; and finally (iv) We demonstrate how the addition of context (both past and future) improves retrieval performance.

2. Hyperspectral Image Restoration via Global Total Variation Regularized Local nonconvex Low-Rank matrix Approximation [PDF] Back to Contents
  Haijin Zeng, Xiaozhen Xie, Jifeng Ning
Abstract: Several bandwise total variation (TV) regularized low-rank (LR)-based models have been proposed to remove mixed noise in hyperspectral images (HSIs). Conventionally, the rank of the LR matrix is approximated using the nuclear norm (NN). The NN is defined by adding all singular values together, which is essentially an $L_1$-norm of the singular values. It results in non-negligible approximation errors, and thus the resulting matrix estimator can be significantly biased. Moreover, these bandwise TV-based methods exploit the spatial information in a separate manner. To cope with these problems, we propose a spatial-spectral TV (SSTV) regularized non-convex local LR matrix approximation (NonLLRTV) method to remove mixed noise in HSIs. From one aspect, the local LR of HSIs is formulated using a non-convex $L_{\gamma}$-norm, which provides a closer approximation to the matrix rank than the traditional NN. From another aspect, HSIs are assumed to be piecewise smooth in the global spatial domain. TV regularization is effective in preserving smoothness and removing Gaussian noise. These facts inspire the integration of the NonLLR with TV regularization. To address the limitations of bandwise TV, we use the SSTV regularization to simultaneously consider the global spatial structure and the spectral correlation of neighboring bands. Experimental results indicate that the use of the local non-convex penalty and global SSTV can boost the preservation of spatial piecewise smoothness and overall structural information.
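
A minimal numpy sketch contrasting the two rank surrogates discussed above. The paper's exact $L_{\gamma}$-norm is not reproduced here; the Schatten-$\gamma$ power form below (with $0 < \gamma < 1$) is an illustrative assumption that, like the paper's penalty, approximates the rank more tightly than the nuclear norm:

```python
import numpy as np

def nuclear_norm(X):
    """Sum of singular values: an L1 norm on the spectrum."""
    return np.linalg.svd(X, compute_uv=False).sum()

def schatten_gamma(X, gamma=0.5, eps=1e-12):
    """Nonconvex surrogate: sum of singular values raised to gamma < 1."""
    s = np.linalg.svd(X, compute_uv=False)
    return ((s + eps) ** gamma).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 50))  # rank-5 matrix
print(np.linalg.matrix_rank(X), nuclear_norm(X), schatten_gamma(X))
```

As gamma decreases toward 0 the surrogate approaches the true rank (here 5), whereas the nuclear norm scales with the magnitude of the singular values.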

3. Data-Free Network Quantization With Adversarial Knowledge Distillation [PDF] Back to Contents
  Yoojin Choi, Jihwan Choi, Mostafa El-Khamy, Jungwon Lee
Abstract: Network quantization is an essential procedure in deep learning for development of efficient fixed-point inference models on mobile or edge platforms. However, as datasets grow larger and privacy regulations become stricter, data sharing for model compression gets more difficult and restricted. In this paper, we consider data-free network quantization with synthetic data. The synthetic data are generated from a generator, while no data are used in training the generator and in quantization. To this end, we propose data-free adversarial knowledge distillation, which minimizes the maximum distance between the outputs of the teacher and the (quantized) student for any adversarial samples from a generator. To generate adversarial samples similar to the original data, we additionally propose matching statistics from the batch normalization layers for generated data and the original data in the teacher. Furthermore, we show the gain of producing diverse adversarial samples by using multiple generators and multiple students. Our experiments show the state-of-the-art data-free model compression and quantization results for (wide) residual networks and MobileNet on SVHN, CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. The accuracy losses compared to using the original datasets are shown to be very minimal.
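
A minimal PyTorch sketch of the adversarial objective described above: the generator is trained to maximize the teacher-student KL divergence on synthetic samples, while the student minimizes it. The models, optimizers, and the batch-normalization statistics-matching term are assumed; this is not the authors' code:

```python
import torch
import torch.nn.functional as F

def kd_divergence(teacher, student, x, T=1.0):
    """KL(teacher || student) on softened logits."""
    p_t = F.softmax(teacher(x) / T, dim=1)
    log_p_s = F.log_softmax(student(x) / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean")

def adversarial_kd_step(generator, teacher, student, opt_g, opt_s, zdim=100, bs=64):
    # Generator step: ascend the divergence (teacher is assumed frozen).
    opt_g.zero_grad()
    (-kd_divergence(teacher, student, generator(torch.randn(bs, zdim)))).backward()
    opt_g.step()
    # Student step: descend the divergence on freshly generated samples.
    opt_s.zero_grad()
    x = generator(torch.randn(bs, zdim)).detach()
    kd_divergence(teacher, student, x).backward()
    opt_s.step()
```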

4. NTIRE 2020 Challenge on Real Image Denoising: Dataset, Methods and Results [PDF] Back to Contents
  Abdelrahman Abdelhamed, Mahmoud Afifi, Radu Timofte, Michael S. Brown, Yue Cao, Zhilu Zhang, Wangmeng Zuo, Xiaoling Zhang, Jiye Liu, Wendong Chen, Changyuan Wen, Meng Liu, Shuailin Lv, Yunchao Zhang, Zhihong Pan, Baopu Li, Teng Xi, Yanwen Fan, Xiyu Yu, Gang Zhang, Jingtuo Liu, Junyu Han, Errui Ding, Songhyun Yu, Bumjun Park, Jechang Jeong, Shuai Liu, Ziyao Zong, Nan Nan, Chenghua Li, Zengli Yang, Long Bao, Shuangquan Wang, Dongwoon Bai, Jungwon Lee, Youngjung Kim, Kyeongha Rho, Changyeop Shin, Sungho Kim, Pengliang Tang, Yiyun Zhao, Yuqian Zhou, Yuchen Fan, Thomas Huang, Zhihao Li, Nisarg A. Shah, Wei Liu, Qiong Yan, Yuzhi Zhao, Marcin Możejko, Tomasz Latkowski, Lukasz Treszczotko, Michał Szafraniuk, Krzysztof Trojanowski, Yanhong Wu, Pablo Navarrete Michelini, Fengshuo Hu, Yunhua Lu
Abstract: This paper reviews the NTIRE 2020 challenge on real image denoising with focus on the newly introduced dataset, the proposed methods and their results. The challenge is a new version of the previous NTIRE 2019 challenge on real image denoising that was based on the SIDD benchmark. This challenge is based on newly collected validation and testing image datasets, and is hence named SIDD+. This challenge has two tracks for quantitatively evaluating image denoising performance in (1) the Bayer-pattern rawRGB and (2) the standard RGB (sRGB) color spaces. Each track had ~250 registered participants. A total of 22 teams, proposing 24 methods, competed in the final phase of the challenge. The proposed methods by the participating teams represent the current state-of-the-art performance in image denoising targeting real noisy images. The newly collected SIDD+ datasets are publicly available at: this https URL.

5. Sparsely-Labeled Source Assisted Domain Adaptation [PDF] Back to Contents
  Wei Wang, Zhihui Wang, Yuankai Xiang, Jing Sun, Haojie Li, Fuming Sun, Zhengming Ding
Abstract: Domain Adaptation (DA) aims to generalize the classifier learned from the source domain to the target domain. Existing DA methods usually assume that rich labels are available in the source domain. However, there are usually a large number of unlabeled data but only a few labeled data in the source domain, and how to transfer knowledge from this sparsely-labeled source domain to the target domain is still a challenge, which greatly limits their application in the wild. This paper proposes a novel Sparsely-Labeled Source Assisted Domain Adaptation (SLSA-DA) algorithm to address the challenge of limited labeled source domain samples. Specifically, due to the label scarcity problem, projected clustering is conducted on both the source and target domains, so that the discriminative structures of the data can be leveraged elegantly. Then label propagation is adopted to progressively propagate the labels from those limited labeled source samples to the whole unlabeled data, so that the cluster labels are revealed correctly. Finally, we jointly align the marginal and conditional distributions to mitigate the cross-domain mismatch problem, and optimize those three procedures iteratively. However, it is nontrivial to incorporate those three procedures into a unified optimization framework seamlessly, since some variables to be optimized are implicitly involved in their formulas, and thus they cannot promote each other. Remarkably, we prove that the projected clustering and conditional distribution alignment can be reformulated as different expressions, so that the implicit variables are revealed in different optimization steps. As such, the variables related to those three quantities can be optimized in a unified optimization framework and facilitate each other, noticeably improving the recognition performance.

6. A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird's Eye View [PDF] Back to Contents
  Lennart Reiher, Bastian Lampe, Lutz Eckstein
Abstract: Accurate environment perception is essential for automated driving. When using monocular cameras, the distance estimation of elements in the environment poses a major challenge. Distances can be more easily estimated when the camera perspective is transformed to a bird's eye view (BEV). For flat surfaces, Inverse Perspective Mapping (IPM) can accurately transform images to a BEV. Three-dimensional objects such as vehicles and vulnerable road users are distorted by this transformation making it difficult to estimate their position relative to the sensor. This paper describes a methodology to obtain a corrected 360° BEV image given images from multiple vehicle-mounted cameras. The corrected BEV image is segmented into semantic classes and includes a prediction of occluded areas. The neural network approach does not rely on manually labeled data, but is trained on a synthetic dataset in such a way that it generalizes well to real-world data. By using semantically segmented images as input, we reduce the reality gap between simulated and real-world data and are able to show that our method can be successfully applied in the real world. Extensive experiments conducted on the synthetic data demonstrate the superiority of our approach compared to IPM. Source code and datasets are available at this https URL

7. Fast Automatic Visibility Optimization for Thermal Synthetic Aperture Visualization [PDF] Back to Contents
  Indrajit Kurmi, David C. Schedl, Oliver Bimber
Abstract: In this article, we describe and validate the first fully automatic parameter optimization for thermal synthetic aperture visualization. It replaces previous manual exploration of the parameter space, which is time consuming and error prone. We prove that the visibility of targets in thermal integral images is proportional to the variance of the targets' image. Since this is invariant to occlusion, it represents a suitable objective function for optimization. Our findings have the potential to enable fully autonomous search and rescue operations with camera drones.
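
The proposed objective is simple enough to state in a few lines: visibility is measured by image variance, and the focus parameter that maximizes it is selected. In the sketch below, `integrate_at` is a hypothetical stand-in for the routine that integrates the aerial images at a given focal depth:

```python
import numpy as np

def visibility(integral_image):
    """The paper's occlusion-invariant objective: variance of the integral image."""
    return np.var(integral_image)

def best_focus(integrate_at, candidate_depths):
    scores = [visibility(integrate_at(d)) for d in candidate_depths]
    return candidate_depths[int(np.argmax(scores))]
```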

8. TSDM: Tracking by SiamRPN++ with a Depth-refiner and a Mask-generator [PDF] Back to Contents
  Pengyao Zhao, Quanli Liu, Wei Wang, Qiang Guo
Abstract: In generic object tracking, depth (D) information provides informative cues for foreground-background separation and target bounding box regression. However, so far, few trackers have used depth information to play this important role, due to the lack of a suitable model. In this paper, an RGB-D tracker named TSDM is proposed, which is composed of a Mask-generator (M-g), SiamRPN++ and a Depth-refiner (D-r). The M-g generates the background masks, and updates them as the target's 3D position changes. The D-r optimizes the target bounding box estimated by SiamRPN++, based on the spatial depth distribution difference between the target and the surrounding background. Extensive evaluation on the Princeton Tracking Benchmark and the Visual Object Tracking challenge shows that our tracker outperforms the state-of-the-art by a large margin while achieving 23 FPS. In addition, a light-weight variant can run at 31 FPS, and thus it is practical for real-world applications. Code and models of TSDM are available at this https URL.

9. Introduction of a new Dataset and Method for Detecting and Counting the Pistachios based on Deep Learning [PDF] Back to Contents
  Mohammad Rahimzadeh, Abolfazl Attar
Abstract: Pistachio is a nutritious nut that has many uses in the food industry. Iran is one of its largest producers, and pistachio is considered a strategic export product for this country. Pistachios are sorted based on the shape of their shell into two categories: open-mouth and closed-mouth. The open-mouth pistachios are higher in price, value, and demand than the closed-mouth pistachios. In the countries that are famous for pistachio production and exporting, there are companies that pack the pistachios picked from the trees and make them ready for exporting. As there are differences between the price and the demand of the open-mouth and closed-mouth pistachios, it is important for these companies to know precisely how many of these two kinds of pistachios exist in each packed package. In this paper, we introduce and share a new dataset of pistachios, which is called Pesteh-Set. Pesteh-Set includes 6 videos with a total length of 164 seconds and 561 moving pistachios. It also contains 423 labeled images that in total include 3927 labeled pistachios. In the first stage, we use RetinaNet, a deep fully convolutional object detector, to detect the pistachios in the video frames. In the second stage, we introduce our method for counting the open-mouth and closed-mouth pistachios in the videos. Pistachios that move and roll on the transportation line may appear as closed-mouth in some frames and as open-mouth in other frames. Under these circumstances, the main challenge of our work is to count these two kinds of pistachios correctly and quickly. Our method runs very fast with no need for a GPU, and it also achieves good counting results. The computed accuracy of our counting method is 94.75%. Our proposed methods can be performed remotely, using the videos taken from the installed cameras that monitor the pistachios.

10. On Vocabulary Reliance in Scene Text Recognition [PDF] Back to Contents
  Zhaoyi Wan, Jielei Zhang, Liang Zhang, Jiebo Luo, Cong Yao
Abstract: The pursuit of high performance on public benchmarks has been the driving force for research in scene text recognition, and notable progress has been achieved. However, a close investigation reveals a startling fact that the state-of-the-art methods perform well on images with words within vocabulary but generalize poorly to images with words outside vocabulary. We call this phenomenon "vocabulary reliance". In this paper, we establish an analytical framework to conduct an in-depth study on the problem of vocabulary reliance in scene text recognition. Key findings include: (1) Vocabulary reliance is ubiquitous, i.e., all existing algorithms more or less exhibit such characteristic; (2) Attention-based decoders prove weak in generalizing to words outside vocabulary and segmentation-based decoders perform well in utilizing visual features; (3) Context modeling is highly coupled with the prediction layers. These findings provide new insights and can benefit future research in scene text recognition. Furthermore, we propose a simple yet effective mutual learning strategy to allow models of two families (attention-based and segmentation-based) to learn collaboratively. This remedy alleviates the problem of vocabulary reliance and improves the overall scene text recognition performance.
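
A minimal PyTorch sketch of the mutual-learning remedy, in the spirit of deep mutual learning: each recognizer adds a KL term pulling its predictions toward the other family's. The two models and their task losses are assumed, and the exact formulation in the paper may differ:

```python
import torch.nn.functional as F

def mutual_losses(logits_a, logits_b, task_loss_a, task_loss_b, lam=1.0):
    """Return the combined loss for each model; each mimics the other's output."""
    kl_ab = F.kl_div(F.log_softmax(logits_a, dim=-1),
                     F.softmax(logits_b.detach(), dim=-1), reduction="batchmean")
    kl_ba = F.kl_div(F.log_softmax(logits_b, dim=-1),
                     F.softmax(logits_a.detach(), dim=-1), reduction="batchmean")
    return task_loss_a + lam * kl_ab, task_loss_b + lam * kl_ba
```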

11. RetinaMask: A Face Mask detector [PDF] Back to Contents
  Mingjie Jiang, Xinqi Fan
Abstract: Coronavirus disease 2019 has seriously affected the world, because people cannot work as usual if infected. One of the effective protection methods for human beings is to wear masks in public areas. Furthermore, many public service providers require customers to use the service only if they wear masks correctly. However, there are only a few research studies on face mask detection. To contribute to public healthcare, we propose RetinaMask, a high-accuracy and efficient face mask detector. The proposed RetinaMask is a one-stage detector, which consists of a feature pyramid network to fuse high-level semantic information with multiple feature maps, and a novel context attention module to focus on detecting face masks. In addition, we also propose a novel cross-class object removal algorithm to reject predictions with low confidence and high intersection over union. Experimental results show that RetinaMask achieves state-of-the-art results on a public face mask dataset, with face and mask detection precision 2.3% and 1.5% higher than the baseline, and recall 11.0% and 5.9% higher than the baseline, respectively. Besides, we also explore the possibility of implementing RetinaMask with a light-weight neural network, MobileNet, for embedded or mobile devices.

12. Efficient convolutional neural networks with smaller filters for human activity recognition using wearable sensors [PDF] Back to Contents
  Yin Tang, Qi Teng, Lei Zhang, Fuhong Min, Jun He
Abstract: Recently, human activity recognition (HAR) has begun to adopt deep learning to substitute for traditional shallow learning techniques that rely on hand-crafted features. CNNs, in particular, have set the latest state of the art on various HAR datasets. However, deep models often require more computing resources, which limits their application in embedded HAR. Although many successful methods have been proposed to reduce the memory and FLOPs of CNNs, they often involve special network architectures designed for visual tasks, which are not suitable for deep HAR tasks with time-series sensor signals, due to the remarkable discrepancy between the two domains. Therefore, it is necessary to develop lightweight deep models to perform HAR. As the filter is the basic unit in constructing CNNs, we must ask whether redesigning smaller filters is applicable for deep HAR. In this paper, inspired by this idea, we propose a lightweight CNN using re-designed Lego filters for HAR. A set of lower-dimensional filters is used as Lego bricks to be stacked into conventional filters, which does not rely on any special network structure. To our knowledge, this is the first paper that proposes a lightweight CNN for HAR in the ubiquitous and wearable computing arena. The experimental results on five public HAR datasets, the UCI-HAR dataset, the OPPORTUNITY dataset, the UNIMIB-SHAR dataset, the PAMAP2 dataset, and the WISDM dataset, indicate that our novel Lego-CNN approach can greatly reduce memory and computation cost over a conventional CNN, while maintaining comparable accuracy. We believe that the proposed approach could be combined with the existing state-of-the-art HAR architectures and easily deployed onto wearable devices for real HAR applications.

13. Learning Generalized Spoof Cues for Face Anti-spoofing [PDF] Back to Contents
  Haocheng Feng, Zhibin Hong, Haixiao Yue, Yang Chen, Keyao Wang, Junyu Han, Jingtuo Liu, Errui Ding
Abstract: Many existing face anti-spoofing (FAS) methods focus on modeling the decision boundaries for some predefined spoof types. However, the diversity of the spoof samples, including unknown ones, hinders effective decision boundary modeling and leads to weak generalization capability. In this paper, we reformulate FAS from an anomaly detection perspective and propose a residual-learning framework to learn the discriminative live-spoof differences, which are defined as the spoof cues. The proposed framework consists of a spoof cue generator and an auxiliary classifier. The generator minimizes the spoof cues of live samples while imposing no explicit constraint on those of spoof samples, so as to generalize well to unseen attacks. In this way, anomaly detection is implicitly used to guide spoof cue generation, leading to discriminative feature learning. The auxiliary classifier serves as a spoof cue amplifier and makes the spoof cues more discriminative. We conduct extensive experiments, and the experimental results show the proposed method consistently outperforms the state-of-the-art methods. The code will be publicly available at this https URL.
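
A minimal PyTorch sketch of the residual spoof-cue objective: the generated cue map is regressed toward zero only for live faces, spoof cues are left unconstrained, and an auxiliary classifier amplifies the cues. Network definitions and the loss weighting are assumptions:

```python
import torch
import torch.nn.functional as F

def spoof_cue_loss(cue_map, aux_logits, is_live):
    """cue_map: (B, C, H, W); aux_logits: (B, 2); is_live: (B,) bool."""
    if is_live.any():
        live_reg = (cue_map[is_live] ** 2).mean()   # push live cues to zero
    else:
        live_reg = cue_map.sum() * 0                # keep the graph, zero loss
    aux_cls = F.cross_entropy(aux_logits, is_live.long())
    return live_reg + aux_cls
```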

14. OpenEDS2020: Open Eyes Dataset [PDF] Back to Contents
  Cristina Palmero, Abhishek Sharma, Karsten Behrendt, Kapil Krishnakumar, Oleg V. Komogortsev, Sachin S. Talathi
Abstract: We present the second edition of the OpenEDS dataset, OpenEDS2020, a novel dataset of eye-image sequences captured at a frame rate of 100 Hz under controlled illumination, using a virtual-reality head-mounted display mounted with two synchronized eye-facing cameras. The dataset, which is anonymized to remove any personally identifiable information on participants, consists of 80 participants of varied appearance performing several gaze-elicited tasks, and is divided into two subsets: 1) the Gaze Prediction Dataset, with up to 66,560 sequences containing 550,400 eye-images and respective gaze vectors, created to foster research in spatio-temporal gaze estimation and prediction approaches; and 2) the Eye Segmentation Dataset, consisting of 200 sequences sampled at 5 Hz, with up to 29,500 images, of which 5% contain a semantic segmentation label, devised to encourage the use of temporal information to propagate labels to contiguous frames. Baseline experiments have been evaluated on OpenEDS2020, one for each task, with an average angular error of 5.37 degrees when performing gaze prediction 1 to 5 frames into the future, and a mean intersection-over-union score of 84.1% for semantic segmentation. Like its predecessor, the OpenEDS dataset, we anticipate that this new dataset will continue creating opportunities for researchers in the eye tracking, machine learning and computer vision communities to advance the state of the art for virtual reality applications. The dataset is available for download upon request at this http URL.

15. Point Cloud Completion by Skip-attention Network with Hierarchical Folding [PDF] Back to Contents
  Xin Wen, Tianyang Li, Zhizhong Han, Yu-Shen Liu
Abstract: Point cloud completion aims to infer the complete geometries of missing regions of 3D objects from incomplete ones. Previous methods usually predict the complete point cloud based on a global shape representation extracted from the incomplete input. However, the global representation often suffers from the loss of information about structure details in local regions of the incomplete point cloud. To address this problem, we propose the Skip-Attention Network (SA-Net) for 3D point cloud completion. Our main contributions are two-fold. First, we propose a skip-attention mechanism to effectively exploit the local structure details of incomplete point clouds during the inference of missing parts. The skip-attention mechanism selectively conveys geometric information from the local regions of incomplete point clouds for the generation of complete ones at different resolutions, where the skip-attention reveals the completion process in an interpretable way. Second, in order to fully utilize the selected geometric information encoded by the skip-attention mechanism at different resolutions, we propose a novel structure-preserving decoder with hierarchical folding for complete shape generation. The hierarchical folding preserves the structure of the complete point cloud generated in the upper layer by progressively detailing the local regions, using the skip-attentioned geometry at the same resolution. We conduct comprehensive experiments on the ShapeNet and KITTI datasets, which demonstrate that the proposed SA-Net outperforms the state-of-the-art point cloud completion methods.

16. Where am I looking at? Joint Location and Orientation Estimation by Cross-View Matching [PDF] Back to Contents
  Yujiao Shi, Xin Yu, Dylan Campbell, Hongdong Li
Abstract: Cross-view geo-localization is the problem of estimating the position and orientation (latitude, longitude and azimuth angle) of a camera at ground level given a large-scale database of geo-tagged aerial (e.g., satellite) images. Existing approaches treat the task as a pure location estimation problem by learning discriminative feature descriptors, but neglect orientation alignment. It is well-recognized that knowing the orientation between ground and aerial images can significantly reduce matching ambiguity between these two views, especially when the ground-level images have a limited Field of View (FoV) instead of a full field-of-view panorama. Therefore, we design a Dynamic Similarity Matching network to estimate cross-view orientation alignment during localization. In particular, we address the cross-view domain gap by applying a polar transform to the aerial images to approximately align the images up to an unknown azimuth angle. Then, a two-stream convolutional network is used to learn deep features from the ground and polar-transformed aerial images. Finally, we obtain the orientation by computing the correlation between cross-view features, which also provides a more accurate measure of feature similarity, improving location recall. Experiments on standard datasets demonstrate that our method significantly improves state-of-the-art performance. Remarkably, we improve the top-1 location recall rate on the CVUSA dataset by a factor of 1.5x for panoramas with known orientation, by a factor of 3.3x for panoramas with unknown orientation, and by a factor of 6x for 180-degree FoV images with unknown orientation.
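
After the polar transform, the azimuth of the aerial image becomes a horizontal shift, so the relative orientation can be read off a circular cross-correlation of the two feature maps along the width axis. A minimal numpy sketch of that step (the feature extractor itself is assumed):

```python
import numpy as np

def estimate_azimuth(ground_feat, aerial_feat):
    """Both inputs: (H, W) feature maps; returns the best column shift."""
    F_g = np.fft.fft(ground_feat, axis=1)
    F_a = np.fft.fft(aerial_feat, axis=1)
    corr = np.fft.ifft(F_g * np.conj(F_a), axis=1).real.sum(axis=0)
    return int(np.argmax(corr))

g = np.random.rand(8, 64)
a = np.roll(g, 10, axis=1)       # simulate a rotated aerial view
print(estimate_azimuth(g, a))    # 54, i.e. -10 mod 64, consistent with the roll
```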

17. Synchronous Bidirectional Learning for Multilingual Lip Reading [PDF] Back to Contents
  Mingshuang Luo, Shuang Yang, Xilin Chen, Zitao Liu, Shiguang Shan
Abstract: Lip reading has received increasing attention in recent years. This paper focuses on the synergy of multilingual lip reading. There are more than 7,000 languages in the world, which implies that it is impractical to train separate lip reading models by collecting large-scale data for each language. Although each language has its own linguistic and pronunciation features, the lip movements of all languages share similar patterns. Based on this idea, we explore the synergized learning of multilingual lip reading and propose a synchronous bidirectional learning (SBL) framework for effective synergy of multilingual lip reading. Firstly, we introduce phonemes as our modeling units for the multilingual setting. Similar phonemes always lead to similar visual patterns. The multilingual setting increases both the quantity and the diversity of each phoneme shared among different languages, so learning the multilingual target should improve the prediction of phonemes. Then, an SBL block is proposed to infer the target unit given its previous and later context. The rules of each specific language, which the model judges on its own, are learned in this fill-in-the-blank manner. To make the learning process more targeted at each particular language, we introduce an extra task of predicting the language identity during learning. Finally, we perform a thorough comparison on LRW (English) and LRW-1000 (Mandarin). The results outperform the existing state of the art by a large margin, and show the promising benefits of the synergized learning of different languages.

18. SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving [PDF] Back to Contents
  Zhenpei Yang, Yuning Chai, Dragomir Anguelov, Yin Zhou, Pei Sun, Dumitru Erhan, Sean Rafferty, Henrik Kretzschmar
Abstract: Autonomous driving system development is critically dependent on the ability to replay complex and diverse traffic scenarios in simulation. In such scenarios, the ability to accurately simulate the vehicle sensors such as cameras, lidar or radar is essential. However, current sensor simulators leverage gaming engines such as Unreal or Unity, requiring manual creation of environments, objects and material properties. Such approaches have limited scalability and fail to produce realistic approximations of camera, lidar, and radar data without significant additional work. In this paper, we present a simple yet effective approach to generate realistic scenario sensor data, based only on a limited amount of lidar and camera data collected by an autonomous vehicle. Our approach uses texture-mapped surfels to efficiently reconstruct the scene from an initial vehicle pass or set of passes, preserving rich information about object 3D geometry and appearance, as well as the scene conditions. We then leverage a SurfelGAN network to reconstruct realistic camera images for novel positions and orientations of the self-driving vehicle and moving objects in the scene. We demonstrate our approach on the Waymo Open Dataset and show that it can synthesize realistic camera data for simulated scenarios. We also create a novel dataset that contains cases in which two self-driving vehicles observe the same scene at the same time. We use this dataset to provide additional evaluation and demonstrate the usefulness of our SurfelGAN model.

19. Projection & Probability-Driven Black-Box Attack [PDF] Back to Contents
  Jie Li, Rongrong Ji, Hong Liu, Jianzhuang Liu, Bineng Zhong, Cheng Deng, Qi Tian
Abstract: Generating adversarial examples in a black-box setting remains a significant challenge with vast practical application prospects. In particular, existing black-box attacks suffer from the need for excessive queries, as it is non-trivial to find an appropriate direction to optimize in the high-dimensional space. In this paper, we propose Projection & Probability-driven Black-box Attack (PPBA) to tackle this problem by reducing the solution space and providing better optimization. For reducing the solution space, we first model the adversarial perturbation optimization problem as a process of recovering frequency-sparse perturbations with compressed sensing, under the setting that random noise in the low-frequency space is more likely to be adversarial. We then propose a simple method to construct a low-frequency constrained sensing matrix, which works as a plug-and-play projection matrix to reduce the dimensionality. Such a sensing matrix is shown to be flexible enough to be integrated into existing methods like NES and Bandits$_{TD}$. For better optimization, we perform a random walk with a probability-driven strategy, which utilizes all queries over the whole progress to make full use of the sensing matrix for a smaller query budget. Extensive experiments show that our method requires at most 24% fewer queries with a higher attack success rate compared with state-of-the-art approaches. Finally, the attack method is evaluated on a real-world online service, i.e., Google Cloud Vision API, which further demonstrates our practical potential.
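
A minimal sketch of the low-frequency sensing idea: draw random coefficients only in the top-left (low-frequency) corner of the DCT domain and map them back to image space with an inverse DCT. This illustrates the projection that shrinks the search space; the probability-driven random walk and query logic are omitted:

```python
import numpy as np
from scipy.fft import idctn

def low_freq_perturbation(h, w, freq_dims=8, eps=8 / 255, rng=None):
    rng = rng or np.random.default_rng()
    coef = np.zeros((h, w))
    coef[:freq_dims, :freq_dims] = rng.choice([-1.0, 1.0], size=(freq_dims, freq_dims))
    delta = idctn(coef, norm="ortho")         # back to the pixel domain
    return eps * delta / np.abs(delta).max()  # scale to the perturbation budget

print(low_freq_perturbation(32, 32).shape)    # (32, 32)
```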

20. One-Shot Object Detection without Fine-Tuning [PDF] Back to Contents
  Xiang Li, Lin Zhang, Yau Pun Chen, Yu-Wing Tai, Chi-Keung Tang
Abstract: Deep learning has revolutionized object detection thanks to large-scale datasets, but their object categories are still arguably very limited. In this paper, we attempt to enrich such categories by addressing the one-shot object detection problem, where the number of annotated training examples for learning an unseen class is limited to one. We introduce a two-stage model consisting of a first stage Matching-FCOS network and a second stage Structure-Aware Relation Module, the combination of which integrates metric learning with an anchor-free Faster R-CNN-style detection pipeline, eventually eliminating the need to fine-tune on the support images. We also propose novel training strategies that effectively improve detection performance. Extensive quantitative and qualitative evaluations were performed and our method exceeds the state-of-the-art one-shot performance consistently on multiple datasets.

21. Text Synopsis Generation for Egocentric Videos [PDF] Back to Contents
  Aidean Sharghi, Niels da Vitoria Lobo, Mubarak Shah
Abstract: Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remains a challenge. Hence, in this work, we propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video. Users can read the short text to gain insight about the video, and more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, using video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We input a set of video shots, and the network generates a text description for each shot. Next, a visual-language content matching unit, trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called the purport network, evaluates all the descriptions together to select the ones containing crucial information. Out of the thousands of descriptions generated for a video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is 3 to 5 hours long and associated with over 3000 textual descriptions on average. The generated textual summaries, including only 5 percent (or less) of the generated descriptions, are compared to groundtruth summaries in the text domain using well-established metrics in natural language processing.

22. A Gaussian Process Upsampling Model for Improvements in Optical Character Recognition [PDF] Back to Contents
  Steven I Reeves, Dongwook Lee, Anurag Singh, Kunal Verma
Abstract: Optical Character Recognition and extraction is a key tool in the automatic evaluation of documents in a financial context. However, the image data provided to automated systems can have unreliable quality, and can be inherently low-resolution or downsampled and compressed by a transmitting program. In this paper, we illustrate the efficacy of a Gaussian Process upsampling model for the purposes of improving OCR and extraction through upsampling low resolution documents.
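
A minimal 1-D illustration of Gaussian Process upsampling with scikit-learn: fit a GP to the low-resolution samples and query it on a denser grid. The paper applies the idea to document images before OCR; the RBF kernel and its length scale here are assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

x_lo = np.linspace(0, 1, 16)[:, None]      # low-resolution sample positions
y_lo = np.sin(2 * np.pi * x_lo).ravel()    # stand-in for pixel intensities
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1)).fit(x_lo, y_lo)

x_hi = np.linspace(0, 1, 64)[:, None]      # 4x denser grid
y_hi = gp.predict(x_hi)                    # upsampled signal, shape (64,)
```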

23. Recognizing Magnification Levels in Microscopic Snapshots [PDF] Back to Contents
  Manit Zaveri, Shivam Kalra, Morteza Babaie, Sultaan Shah, Savvas Damskinos, Hany Kashani, H.R. Tizhoosh
Abstract: Recent advances in digital imaging have transformed computer vision and machine learning into new tools for analyzing pathology images. This trend could automate some of the tasks in diagnostic pathology and alleviate the pathologist's workload. The final step of any cancer diagnosis procedure is performed by the expert pathologist. These experts use microscopes with a high level of optical magnification to observe minute characteristics of the tissue acquired through biopsy and fixed on glass slides. Switching between different magnifications and finding the magnification level at which to identify the presence or absence of malignant tissues is important. As the majority of pathologists still use light microscopy, compared to digital scanners, in many instances a camera mounted on the microscope is used to capture snapshots from significant fields of view. Repositories of such snapshots usually do not contain the magnification information. In this paper, we extract deep features of the images available in the TCGA dataset, with known magnification, to train a classifier for magnification recognition. We compared the results with LBP, a well-known handcrafted feature extraction method. The proposed approach achieved a mean accuracy of 96% when a multi-layer perceptron was trained as the classifier.
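
A minimal sketch of the pipeline: extract deep features with a pretrained backbone, then train a multi-layer perceptron to recognize the magnification level. The random arrays below are stand-ins for the TCGA features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-ins: (N, D) deep features and magnification labels (e.g. 5x/10x/20x/40x).
feats = np.random.rand(400, 1024)
mags = np.random.randint(0, 4, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(feats, mags, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```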

24. Effective Data Fusion with Generalized Vegetation Index: Evidence from Land Cover Segmentation in Agriculture [PDF] Back to Contents
  Hao Sheng, Xiao Chen, Jingyi Su, Ram Rajagopal, Andrew Ng
Abstract: How can we effectively leverage domain knowledge from remote sensing to better segment agricultural land cover in satellite images? In this paper, we propose a novel, model-agnostic, data-fusion approach for vegetation-related computer vision tasks. Motivated by the various Vegetation Indices (VIs), which were introduced by domain experts, we systematically reviewed the VIs that are widely used in remote sensing and their feasibility for incorporation into deep neural networks. To fully leverage the Near-Infrared channel, the traditional Red-Green-Blue channels, and the Vegetation Index or its variants, we propose a Generalized Vegetation Index (GVI), a lightweight module that can be easily plugged into many neural network architectures to serve as an additional information input. To smoothly train models with our GVI, we developed an Additive Group Normalization (AGN) module that does not require extra parameters of the prescribed neural networks. Our approach improved the IoUs of vegetation-related classes by 0.9-1.3 percent and consistently improves the overall mIoU by 2 percent over our baseline.
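
Classic indices such as NDVI = (NIR - R) / (NIR + R) are fixed band ratios; the GVI generalizes them into a learnable module. A minimal PyTorch sketch of that idea (the paper's exact GVI parameterization is not reproduced, so the learnable rational form below is an assumption):

```python
import torch
import torch.nn as nn

class LearnableVI(nn.Module):
    """A learnable generalization of band-ratio vegetation indices."""
    def __init__(self, n_bands=4, eps=1e-6):
        super().__init__()
        self.num = nn.Conv2d(n_bands, 1, kernel_size=1, bias=False)  # numerator
        self.den = nn.Conv2d(n_bands, 1, kernel_size=1, bias=False)  # denominator
        self.eps = eps

    def forward(self, x):                  # x: (B, n_bands, H, W), e.g. R,G,B,NIR
        return self.num(x) / (self.den(x).abs() + self.eps)

x = torch.rand(2, 4, 64, 64)
print(LearnableVI()(x).shape)              # torch.Size([2, 1, 64, 64])
```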

25. Neural Object Learning for 6D Pose Estimation Using a Few Cluttered Images [PDF] Back to Contents
  Kiru Park, Timothy Patten, Markus Vincze
Abstract: Recent methods for 6D pose estimation of objects assume either textured 3D models or real images that cover the entire range of target poses. However, it is difficult to obtain textured 3D models and annotate the poses of objects in real scenarios. This paper proposes a method, Neural Object Learning (NOL), that creates synthetic images of objects in arbitrary poses by combining only a few observations from cluttered images. A novel refinement step is proposed to align inaccurate poses of objects in source images, which results in better quality images. Evaluations performed on two public datasets show that the rendered images created by NOL lead to state-of-the-art performance in comparison to methods that use 10 times the number of real images. Evaluations on our new dataset show multiple objects can be trained and recognized simultaneously using a sequence of a fixed scene.

26. Regularized Pooling [PDF] Back to Contents
  Takato Otsuzuki, Hideaki Hayashi, Yuchen Zheng, Seiichi Uchida
Abstract: In convolutional neural networks (CNNs), pooling operations play important roles such as dimensionality reduction and deformation compensation. In general, max pooling, which is the most widely used operation for local pooling, is performed independently for each kernel. However, the deformation may be spatially smooth over the neighboring kernels. This means that max pooling is too flexible to compensate for actual deformations. In other words, its excessive flexibility risks canceling the essential spatial differences between classes. In this paper, we propose regularized pooling, which enables the value selection direction in the pooling operation to be spatially smooth across adjacent kernels so as to compensate only for actual deformations. The results of experiments on handwritten character images and texture images showed that regularized pooling not only improves recognition accuracy but also accelerates the convergence of learning compared with conventional pooling operations.
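
A minimal numpy sketch of the idea: perform 2x2 max pooling, but smooth the per-window argmax "selection directions" over neighboring windows (here with a simple 3x3 box filter and rounding) before gathering values, so adjacent kernels make spatially consistent selections. The smoothing scheme in the paper may differ:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def regularized_pool2x2(x):
    H, W = x.shape
    win = x.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(H // 2, W // 2, 4)
    idx = win.argmax(axis=-1)                   # raw selection per window
    dy, dx = idx // 2, idx % 2                  # selection direction (offsets)
    dy = np.rint(uniform_filter(dy.astype(float), 3)).astype(int)  # smooth
    dx = np.rint(uniform_filter(dx.astype(float), 3)).astype(int)
    rows = np.arange(H // 2)[:, None] * 2 + dy
    cols = np.arange(W // 2)[None, :] * 2 + dx
    return x[rows, cols]

print(regularized_pool2x2(np.random.rand(8, 8)).shape)  # (4, 4)
```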

27. Source-Relaxed Domain Adaptation for Image Segmentation [PDF] Back to Contents
  Mathilde Bateson, Hoel Kervadec, Jose Dolz, Herve Lombaert, Ismail Ben Ayed
Abstract: Domain adaptation (DA) has drawn great interest for its capacity to adapt a model trained on labeled source data to perform well on unlabeled or weakly labeled target data from a different domain. Most common DA techniques require concurrent access to the input images of both the source and target domains. However, in practice, it is common that the source images are not available in the adaptation phase. This is a very frequent DA scenario in medical imaging, for instance, when the source and target images come from different clinical sites. We propose a novel formulation for adapting segmentation networks, which relaxes such a constraint. Our formulation is based on minimizing a label-free entropy loss defined over target-domain data, which we further guide with a domain-invariant prior on the segmentation regions. Many priors can be used, derived from anatomical information. Here, a class-ratio prior is learned via an auxiliary network and integrated in the form of a Kullback-Leibler (KL) divergence in our overall loss function. We show the effectiveness of our prior-aware entropy minimization in adapting spine segmentation across different MRI modalities. Our method yields results comparable to several state-of-the-art adaptation techniques, even though it has access to less information, since the source images are absent in the adaptation phase. Our straightforward adaptation strategy only uses one network, contrary to popular adversarial techniques, which cannot perform without the presence of the source images. Our framework can be readily used with various priors and segmentation problems.
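
A minimal PyTorch sketch of the source-relaxed objective: minimize the entropy of target-domain predictions, guided by a KL term between the predicted class proportions and a class-ratio prior. The auxiliary network that learns the prior and the loss weighting are assumed:

```python
import torch

def adaptation_loss(logits, class_ratio_prior, lam=0.1, eps=1e-8):
    """logits: (B, K, H, W) segmentation scores; prior: (K,) class ratios."""
    p = torch.softmax(logits, dim=1)
    entropy = -(p * (p + eps).log()).sum(dim=1).mean()   # pixel-wise entropy
    ratio = p.mean(dim=(0, 2, 3))                        # predicted class ratios
    kl = (class_ratio_prior * ((class_ratio_prior + eps) / (ratio + eps)).log()).sum()
    return entropy + lam * kl
```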

28. Preprint: Using RF-DNA Fingerprints To Classify OFDM Transmitters Under Rayleigh Fading Conditions [PDF] Back to Contents
  Mohamed Fadul, Donald Reising, T. Daniel Loveless, Abdul Ofoli
Abstract: The Internet of Things (IoT) is a collection of Internet connected devices capable of interacting with the physical world and computer systems. It is estimated that the IoT will consist of approximately fifty billion devices by the year 2020. In addition to the sheer numbers, the need for IoT security is exacerbated by the fact that many of the edge devices employ weak to no encryption of the communication link. It has been estimated that almost 70% of IoT devices use no form of encryption. Previous research has suggested the use of Specific Emitter Identification (SEI), a physical layer technique, as a means of augmenting bit-level security mechanism such as encryption. The work presented here integrates a Nelder-Mead based approach for estimating the Rayleigh fading channel coefficients prior to the SEI approach known as RF-DNA fingerprinting. The performance of this estimator is assessed for degrading signal-to-noise ratio and compared with least square and minimum mean squared error channel estimators. Additionally, this work presents classification results using RF-DNA fingerprints that were extracted from received signals that have undergone Rayleigh fading channel correction using Minimum Mean Squared Error (MMSE) equalization. This work also performs radio discrimination using RF-DNA fingerprints generated from the normalized magnitude-squared and phase response of Gabor coefficients as well as two classifiers. Discrimination of four 802.11a Wi-Fi radios achieves an average percent correct classification of 90% or better for signal-to-noise ratios of 18 and 21 dB or greater using a Rayleigh fading channel comprised of two and five paths, respectively.
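
A minimal scipy sketch of the channel-estimation step: recover complex Rayleigh channel taps by minimizing the residual between the received signal and the channel-filtered reference with the Nelder-Mead simplex, the optimizer named in the abstract. A two-tap channel and a known preamble are assumed:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.signal import lfilter

rng = np.random.default_rng(1)
preamble = rng.standard_normal(128) + 1j * rng.standard_normal(128)
h_true = np.array([1.0 + 0.3j, 0.4 - 0.2j])              # unknown channel taps
received = lfilter(h_true, [1.0], preamble)

def residual(params):
    h = params[0::2] + 1j * params[1::2]                 # re/im interleaved
    return np.sum(np.abs(received - lfilter(h, [1.0], preamble)) ** 2)

res = minimize(residual, x0=np.zeros(4), method="Nelder-Mead")
print(res.x[0::2] + 1j * res.x[1::2])                    # close to h_true
```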
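As an illustration of the channel-estimation step, here is a toy Nelder-Mead fit of two complex Rayleigh taps against a known preamble; all names and sizes are stand-ins, and the real pipeline operates on OFDM signals before MMSE equalization and fingerprint extraction:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n_taps = 2
    tx = rng.choice([-1.0, 1.0], size=64) + 1j * rng.choice([-1.0, 1.0], size=64)  # known preamble
    h_true = (rng.normal(size=n_taps) + 1j * rng.normal(size=n_taps)) / np.sqrt(2)  # Rayleigh taps
    rx = np.convolve(tx, h_true)[: tx.size]  # received preamble (noise-free toy)

    def residual(taps_ri):
        # Interleaved real/imag parameters -> complex taps; score the model fit.
        taps = taps_ri[0::2] + 1j * taps_ri[1::2]
        model = np.convolve(tx, taps)[: tx.size]
        return np.sum(np.abs(rx - model) ** 2)

    res = minimize(residual, x0=np.zeros(2 * n_taps), method="Nelder-Mead")
    h_est = res.x[0::2] + 1j * res.x[1::2]

The estimated taps then feed the MMSE equalizer (per-subcarrier weight conj(H) / (|H|^2 + 1/SNR)) before the RF-DNA fingerprints are computed.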

29. A Hybrid Method for Training Convolutional Neural Networks [PDF]
  Vasco Lopes, Paulo Fazendeiro
Abstract: Artificial Intelligence algorithms have been steadily increasing in popularity and usage. Deep Learning allows neural networks to be trained on huge datasets and removes the need for hand-extracted features, as it automates the feature learning process. At the heart of training deep neural networks, such as Convolutional Neural Networks, lies backpropagation: by computing the gradient of the loss function with respect to the weights of the network for a given input, it allows the weights to be adjusted so the network performs better on the given task. In this paper, we propose a hybrid method that uses both backpropagation and evolutionary strategies to train Convolutional Neural Networks, where the evolutionary strategies help to avoid local minima and fine-tune the weights, so that the network achieves higher accuracy. We show that the proposed hybrid method improves upon regular training in the task of image classification on CIFAR-10, where a VGG16 model was used and the final test accuracy increased by 0.61% on average compared to using only backpropagation.
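One plausible way to realize such a hybrid scheme is a simple (1+λ) evolution-strategy step interleaved with backpropagation epochs; this sketch is an assumption about the mechanics, not the authors' exact operator:

    import copy
    import torch

    def es_refine(model, loss_fn, xb, yb, pop=10, sigma=0.01):
        # Perturb the weights with Gaussian noise and keep the best candidate.
        # Run between/after backprop epochs to escape local minima and fine-tune.
        with torch.no_grad():
            best, best_loss = model, loss_fn(model(xb), yb).item()
            for _ in range(pop):
                cand = copy.deepcopy(model)
                for p in cand.parameters():
                    p.add_(sigma * torch.randn_like(p))
                cand_loss = loss_fn(cand(xb), yb).item()
                if cand_loss < best_loss:
                    best, best_loss = cand, cand_loss
        return best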

30. Multi-Phase Cross-modal Learning for Noninvasive Gene Mutation Prediction in Hepatocellular Carcinoma [PDF]
  Jiapan Gu, Ziyuan Zhao, Zeng Zeng, Yuzhe Wang, Zhengyiren Qiu, Bharadwaj Veeravalli, Brian Kim Poh Goh, Glenn Kunnath Bonney, Krishnakumar Madhavan, Chan Wan Ying, Lim Kheng Choon, Thng Choon Hua, Pierce KH Chow
Abstract: Hepatocellular carcinoma (HCC) is the most common type of primary liver cancer and the fourth most common cause of cancer-related death worldwide. Understanding the underlying gene mutations in HCC provides great prognostic value for treatment planning and targeted therapy. Radiogenomics has revealed an association between non-invasive imaging features and molecular genomics. However, imaging feature identification is laborious and error-prone. In this paper, we propose an end-to-end deep learning framework for mutation prediction in the APOB, COL11A1 and ATRX genes using multiphasic CT scans. Considering intra-tumour heterogeneity (ITH) in HCC, a multi-region sampling strategy is used to generate the dataset for the experiments. Experimental results demonstrate the effectiveness of the proposed model.

31. Hypergraph Learning for Identification of COVID-19 with CT Imaging [PDF]
  Donglin Di, Feng Shi, Fuhua Yan, Liming Xia, Zhanhao Mo, Zhongxiang Ding, Fei Shan, Shengrui Li, Ying Wei, Ying Shao, Miaofei Han, Yaozong Gao, He Sui, Yue Gao, Dinggang Shen
Abstract: The coronavirus disease, named COVID-19, has become the largest global public health crisis since it started in early 2020. CT imaging has been used as a complementary tool to assist early screening, especially for the rapid identification of COVID-19 cases from community-acquired pneumonia (CAP) cases. The main challenge in early screening is how to model the confusing cases in the COVID-19 and CAP groups, which have very similar clinical manifestations and imaging features. To tackle this challenge, we propose an Uncertainty Vertex-weighted Hypergraph Learning (UVHL) method to identify COVID-19 from CAP using CT images. In particular, multiple types of features (including regional features and radiomics features) are first extracted from the CT image of each case. Then, the relationship among different cases is formulated by a hypergraph structure, with each case represented as a vertex in the hypergraph. The uncertainty of each vertex is further computed with an uncertainty score measurement and used as a weight in the hypergraph. Finally, a learning process on the vertex-weighted hypergraph is used to predict whether a new testing case belongs to COVID-19 or not. Experiments on a large multi-center pneumonia dataset, consisting of 2,148 COVID-19 cases and 1,182 CAP cases from five hospitals, are conducted to evaluate the performance of the proposed method. Results demonstrate the effectiveness and robustness of our proposed method on the identification of COVID-19 in comparison to state-of-the-art methods.
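For intuition, a toy numpy sketch of label propagation on a vertex-weighted hypergraph follows; the paper's uncertainty score and learning process are more involved, so every name and formula here is illustrative only:

    import numpy as np

    def propagate(H, v_weight, y0, alpha=0.9, iters=50):
        # H: (n_vertices, n_edges) incidence matrix (no empty hyperedges);
        # v_weight: uncertainty-based vertex weights; y0: initial class scores.
        De_inv = np.diag(1.0 / H.sum(axis=0))     # inverse hyperedge degrees
        U = np.diag(v_weight)                     # uncertainty vertex weights
        A = U @ H @ De_inv @ H.T @ U              # vertex-weighted adjacency
        S = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-8)
        y = y0.copy()
        for _ in range(iters):                    # smooth scores over the hypergraph
            y = alpha * S @ y + (1 - alpha) * y0
        return y                                  # row-wise argmax: COVID-19 vs CAP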

32. Convolutional Sparse Support Estimator Based Covid-19 Recognition from X-ray Images [PDF]
  Mehmet Yamac, Mete Ahishali, Aysen Degerli, Serkan Kiranyaz, Muhammad E. H. Chowdhury, Moncef Gabbouj
Abstract: Coronavirus disease (Covid-19) has been the main agenda of the whole world since it emerged in December 2019. It has already caused thousands of casualties and infected several million people worldwide. Any technological tool that can be provided to healthcare practitioners to save time, effort, and possibly lives is of crucial importance. The main tools practitioners currently use to diagnose Covid-19 are Reverse Transcription-Polymerase Chain Reaction (RT-PCR) and Computed Tomography (CT), which require significant time, resources and acknowledged experts. X-ray imaging is a common and easily accessible tool with great potential for Covid-19 diagnosis. In this study, we propose a novel approach for Covid-19 recognition from chest X-ray images. Despite the importance of the problem, recent studies in this domain have produced unsatisfactory results due to the limited datasets available for training. While Deep Learning techniques can generally provide state-of-the-art performance in many classification tasks when trained properly over large datasets, such data scarcity can be a crucial obstacle when using them for Covid-19 detection. Alternative approaches such as representation-based classification (collaborative or sparse representation) might provide satisfactory performance with limited-size datasets, but they generally fall short in performance or speed compared to Deep Learning methods. To address this deficiency, the Convolutional Support Estimation Network (CSEN) has recently been proposed as a bridge between model-based and Deep Learning approaches: it provides a non-iterative, real-time mapping from a query sample to the support of an ideally sparse representation coefficient vector, which is critical information for class decisions in representation-based techniques.
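A minimal sketch of what a CSEN-style support estimator can look like: a small fully convolutional network that maps a coarse proxy reconstruction to a support map in a single, non-iterative forward pass. Layer sizes here are assumptions, not the paper's configuration:

    import torch.nn as nn

    # proxy: (B, 1, H, W) coarse estimate, e.g., a (learned) transpose operator
    # applied to the compressive measurements; output: pixel-wise support map.
    csen = nn.Sequential(
        nn.Conv2d(1, 48, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(48, 24, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(24, 1, kernel_size=3, padding=1), nn.Sigmoid(),
    )
    # support_map = csen(proxy)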

33. DeepHist: Differentiable Joint and Color Histogram Layers for Image-to-Image Translation [PDF]
  Mor Avi-Aharon, Assaf Arbelle, Tammy Riklin Raviv
Abstract: We present DeepHist, a novel Deep Learning framework for augmenting a network with histogram layers, and demonstrate its strength by addressing image-to-image translation problems. Specifically, given an input image and a reference color distribution, we aim to generate an output image with the structural appearance (content) of the input (source) yet with the colors of the reference. The key idea is a new technique for a differentiable construction of joint and color histograms of the output images. We further define a color distribution loss based on the Earth Mover's Distance between the output's and the reference's color histograms, and a Mutual Information loss based on the joint histograms of the source and the output images. Promising results are shown for the tasks of color transfer, image colorization and edges $\rightarrow$ photo, where the color distribution of the output image is controlled. Comparisons to Pix2Pix and CycleGAN are shown.
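The two core ingredients can be sketched in a few lines of PyTorch: a differentiable (soft) histogram and a 1-D Earth Mover's Distance. The Gaussian kernel, bin count, and bandwidth below are assumptions rather than the paper's exact layer:

    import torch

    def soft_histogram(x, bins=64, sigma=0.02):
        # Soft-assign intensities in [0, 1] to bin centers with a Gaussian
        # kernel, keeping the histogram differentiable w.r.t. the image x.
        centers = torch.linspace(0.0, 1.0, bins, device=x.device)
        w = torch.exp(-0.5 * ((x.reshape(-1, 1) - centers) / sigma) ** 2)
        h = w.sum(dim=0)
        return h / h.sum()

    def emd_1d(h1, h2):
        # EMD between two 1-D histograms = L1 distance between their CDFs.
        return (torch.cumsum(h1, 0) - torch.cumsum(h2, 0)).abs().sum()

The color loss is then emd_1d(soft_histogram(output_channel), soft_histogram(reference_channel)) summed over channels; an analogous joint histogram drives the Mutual Information content loss.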

34. Compressive sensing with un-trained neural networks: Gradient descent finds the smoothest approximation [PDF]
  Reinhard Heckel, Mahdi Soltanolkotabi
Abstract: Un-trained convolutional neural networks have emerged as highly successful tools for image recovery and restoration. They are capable of solving standard inverse problems such as denoising and compressive sensing with excellent results by simply fitting a neural network model to measurements from a single image or signal, without the need for any additional training data. For some applications, this critically requires additional regularization in the form of early stopping of the optimization. For signal recovery from a few measurements, however, un-trained convolutional networks have an intriguing self-regularizing property: even though the network can perfectly fit any image, the network recovers a natural image from few measurements when trained with gradient descent until convergence. In this paper, we provide numerical evidence for this property and study it theoretically. We show that, without any further regularization, an un-trained convolutional neural network can approximately reconstruct signals and images that are sufficiently structured from a near-minimal number of random measurements.
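The recovery procedure amounts to fitting the weights of an un-trained generator to the measurements. A compact sketch, assuming a convolutional generator G, a fixed random input z, a measurement matrix A, and measurements y (all names hypothetical):

    import torch

    def recover(G, z, A, y, steps=2000, lr=1e-3):
        # Fit the un-trained network so that A @ G(z) matches y; no training
        # data is involved, only gradient descent on the weights of G.
        opt = torch.optim.Adam(G.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((A @ G(z).reshape(-1) - y) ** 2).sum()
            loss.backward()
            opt.step()
        return G(z).detach()

Consistent with the paper's few-measurement setting, no early stopping is used: gradient descent run to convergence already selects a smooth reconstruction.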

35. Beyond CNNs: Exploiting Further Inherent Symmetries in Medical Images for Segmentation [PDF]
  Shuchao Pang, Anan Du, Mehmet A. Orgun, Yan Wang, Quanzheng Sheng, Shoujin Wang, Xiaoshui Huang, Zhemei Yu
Abstract: Automatic tumor segmentation is a crucial step in medical image analysis for computer-aided diagnosis. Although existing methods based on convolutional neural networks (CNNs) have achieved state-of-the-art performance, many challenges still remain in medical tumor segmentation. This is because regular CNNs can only exploit translation invariance, ignoring further symmetries inherent in medical images, such as rotations and reflections. To mitigate this shortcoming, we propose a novel group equivariant segmentation framework that encodes these inherent symmetries for learning more precise representations. First, kernel-based equivariant operations are devised for every orientation, which effectively addresses the gaps in symmetry learning of existing approaches. Then, to keep segmentation networks globally equivariant, we design distinctive group layers with layerwise symmetry constraints. By exploiting further symmetries, the novel segmentation CNNs can dramatically reduce the sample complexity and the redundancy of filters (by roughly 2/3) over regular CNNs. More importantly, based on our novel framework, we show that a newly built GER-UNet outperforms its regular CNN-based counterpart and the state-of-the-art segmentation methods on real-world clinical data. Specifically, the group layers of our segmentation framework can be seamlessly integrated into any popular CNN-based segmentation architecture.
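A toy example of an orientation-wise, kernel-based operation: a C4 (90-degree rotation) lifting convolution that applies one kernel in all four orientations. The full framework also handles reflections and layerwise constraints; this sketch only illustrates the basic idea:

    import torch
    import torch.nn.functional as F

    def c4_lifting_conv(x, weight, bias=None):
        # x: (B, C_in, H, W); weight: (C_out, C_in, k, k). Convolving with all
        # four rotations of the kernel makes the response transform predictably
        # (equivariantly) when the input is rotated by multiples of 90 degrees.
        outs = [F.conv2d(x, torch.rot90(weight, k, dims=(2, 3)),
                         bias, padding=weight.shape[-1] // 2)
                for k in range(4)]
        return torch.stack(outs, dim=2)  # (B, C_out, 4, H, W): extra group axis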

36. Is an Affine Constraint Needed for Affine Subspace Clustering? [PDF]
  Chong You, Chun-Guang Li, Daniel P. Robinson, Rene Vidal
Abstract: Subspace clustering methods based on expressing each data point as a linear combination of other data points have achieved great success in computer vision applications such as motion segmentation, face and digit clustering. In face clustering, the subspaces are linear and subspace clustering methods can be applied directly. In motion segmentation, the subspaces are affine and an additional affine constraint on the coefficients is often enforced. However, since affine subspaces can always be embedded into linear subspaces of one extra dimension, it is unclear if the affine constraint is really necessary. This paper shows, both theoretically and empirically, that when the dimension of the ambient space is high relative to the sum of the dimensions of the affine subspaces, the affine constraint has a negligible effect on clustering performance. Specifically, our analysis provides conditions that guarantee the correctness of affine subspace clustering methods both with and without the affine constraint, and shows that these conditions are satisfied for high-dimensional data. Underlying our analysis is the notion of affinely independent subspaces, which not only provides geometrically interpretable correctness conditions, but also clarifies the relationships between existing results for affine subspace clustering.
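The embedding mentioned above is one line of numpy: appending a constant coordinate turns a d-dimensional affine subspace into a (d+1)-dimensional linear one. A quick synthetic check:

    import numpy as np

    rng = np.random.default_rng(0)
    # 100 points on a 2-D affine subspace of R^3 (offset outside the span).
    X = rng.normal(size=(3, 2)) @ rng.normal(size=(2, 100)) + np.array([[5.0], [1.0], [-2.0]])
    # Lift to R^4 by appending a constant coordinate of 1.
    X_lift = np.vstack([X, np.ones((1, 100))])
    print(np.linalg.matrix_rank(X_lift))  # 3 = d + 1: a linear subspace of R^4

Self-expressive affine methods additionally constrain each coefficient vector to sum to one; the paper's result is that in high ambient dimension this extra constraint barely changes the clustering.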

37. Synergistic Learning of Lung Lobe Segmentation and Hierarchical Multi-Instance Classification for Automated Severity Assessment of COVID-19 in CT Images [PDF]
  Kelei He, Wei Zhao, Xingzhi Xie, Wen Ji, Mingxia Liu, Zhenyu Tang, Feng Shi, Yang Gao, Jun Liu, Junfeng Zhang, Dinggang Shen
Abstract: Understanding chest CT imaging of the coronavirus disease 2019 (COVID-19) will help detect infections early and assess the disease progression. Especially, automated severity assessment of COVID-19 in CT images plays an essential role in identifying cases that are in great need of intensive clinical care. However, it is often challenging to accurately assess the severity of this disease in CT images, due to small infection regions in the lungs, similar imaging biomarkers, and large inter-case variations. To this end, we propose a synergistic learning framework for automated severity assessment of COVID-19 in 3D CT images, by jointly performing lung lobe segmentation and multi-instance classification. Considering that only a few infection regions in a CT image are related to the severity assessment, we first represent each input image by a bag that contains a set of 2D image patches (with each one cropped from a specific slice). A multi-task multi-instance deep network (called M2UNet) is then developed to assess the severity of COVID-19 patients and segment the lung lobe simultaneously. Our M2UNet consists of a patch-level encoder, a segmentation sub-network for lung lobe segmentation, and a classification sub-network for severity assessment (with a unique hierarchical multi-instance learning strategy). Here, the context information provided by segmentation can be implicitly employed to improve the performance of severity assessment. Extensive experiments were performed on a real COVID-19 CT image dataset consisting of 666 chest CT images, with results suggesting the effectiveness of our proposed method compared to several state-of-the-art methods.

38. Y-Net for Chest X-Ray Preprocessing: Simultaneous Classification of Geometry and Segmentation of Annotations [PDF]
  John McManigle, Raquel Bartz, Lawrence Carin
Abstract: Over the last decade, convolutional neural networks (CNNs) have emerged as the leading algorithms in image classification and segmentation. The recent publication of large medical imaging databases has accelerated their use in the biomedical arena. While training data for photograph classification benefits from aggressive geometric augmentation, medical diagnosis, especially in chest radiographs, depends more strongly on feature location. Diagnosis classification results may be artificially enhanced by reliance on radiographic annotations. This work introduces a general pre-processing step for chest x-ray input into machine learning algorithms. A modified Y-Net architecture based on the VGG11 encoder is used to simultaneously learn the geometric orientation (similarity transform parameters) of the chest and the segmentation of radiographic annotations. Chest x-rays were obtained from published databases. The algorithm was trained with 1000 manually labeled images, with augmentation. Results were evaluated by expert clinicians, with acceptable geometry in 95.8% and acceptable annotation masks in 96.2% (n=500), compared to 27.0% and 34.9% respectively in control images (n=241). We hypothesize that this pre-processing step will improve robustness in future diagnostic algorithms.
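A toy two-branch network in the spirit of the modified Y-Net: a shared encoder feeding a decoder that segments annotations and a head that regresses similarity-transform parameters. All layer sizes are hypothetical stand-ins for the VGG11-based model:

    import torch.nn as nn

    class TinyYNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1),
                                     nn.ReLU(), nn.MaxPool2d(2))
            # Branch 1: segmentation of radiographic annotations.
            self.seg = nn.Sequential(nn.Upsample(scale_factor=2),
                                     nn.Conv2d(16, 1, 3, padding=1))
            # Branch 2: chest geometry as similarity-transform parameters
            # (angle, scale, tx, ty).
            self.geo = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(16, 4))

        def forward(self, x):
            f = self.enc(x)
            return self.seg(f), self.geo(f)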

39. Blind Backdoors in Deep Learning Models [PDF]
  Eugene Bagdasaryan, Vitaly Shmatikov
Abstract: We investigate a new method for injecting backdoors into machine learning models, based on poisoning the loss computation in the model-training code. Our attack is blind: the attacker cannot modify the training data, nor observe the execution of his code, nor access the resulting model. We develop a new technique for blind backdoor training using multi-objective optimization to achieve high accuracy on both the main and backdoor tasks while evading all known defenses. We then demonstrate the efficacy of the blind attack with new classes of backdoors strictly more powerful than those in prior literature: single-pixel backdoors in ImageNet models, backdoors that switch the model to a different, complex task, and backdoors that do not require inference-time input modifications. Finally, we discuss defenses.

40. MLGaze: Machine Learning-Based Analysis of Gaze Error Patterns in Consumer Eye Tracking Systems [PDF]
  Anuradha Kar
Abstract: Analyzing the gaze accuracy characteristics of an eye tracker is a critical task, as its gaze data is frequently affected by non-ideal operating conditions in various consumer eye tracking applications. In this study, gaze error patterns produced by a commercial eye tracking device were studied with the help of machine learning algorithms, such as classifiers and regression models. Gaze data were collected from a group of participants under multiple conditions that commonly affect eye trackers operating on desktop and handheld platforms. These conditions (referred to here as error sources) include user distance, head pose, and eye-tracker pose variations, and the collected gaze data were used to train the classifier and regression models. It was seen that while the impact of the different error sources on gaze data characteristics was nearly impossible to distinguish by visual inspection or from data statistics, machine learning models were successful in identifying the impact of the different error sources and predicting the variability in gaze error levels due to these conditions. The objective of this study was to investigate the efficacy of machine learning methods for the detection and prediction of gaze error patterns, which would enable an in-depth understanding of the data quality and reliability of eye trackers under unconstrained operating conditions. Coding resources for all the machine learning methods adopted in this study are included in an open repository named MLGaze to allow researchers to replicate the principles presented here using data from their own eye trackers.

41. Federated Generative Adversarial Learning [PDF]
  Chenyou Fan, Ping Liu
Abstract: This work studies the training of generative adversarial networks under the federated learning setting. Generative adversarial networks (GANs) have achieved advancements in various real-world applications, such as image editing, style transfer, scene generation, etc. However, like other deep learning models, GANs also suffer from data limitation problems in real cases. To boost the performance of GANs on target tasks, collecting as many images as possible from different sources becomes not only important but essential. For example, to build a robust and accurate biometric verification system, huge amounts of images might be collected from surveillance cameras and/or uploaded from cellphones by users who have accepted the relevant agreements. In an ideal case, utilizing all the data uploaded from public and private devices for model training would be straightforward. Unfortunately, in real scenarios this is hard, for a few reasons. First, some data raise serious concerns about privacy leakage, so it is prohibitive to upload them to a third-party server for model training. Second, the images collected by different kinds of devices probably have distinctive biases due to various factors, e.g., collector preferences and geo-location differences, which is also known as "domain shift". To handle these problems, we propose a novel generative learning scheme utilizing a federated learning framework. Following the configuration of federated learning, we conduct model training and aggregation on one center and a group of clients. Specifically, our method learns distributed generative models on the clients, while the models trained on each client are fused into one unified and versatile model at the center. We perform extensive experiments to compare different federation strategies, and empirically examine the effectiveness of federation under different levels of parallelism and data skewness.
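The aggregation step under a FedAvg-style strategy, one of the federation strategies such a comparison would include, can be sketched as below; it would be applied once per round to the generator and once to the discriminator (the helper name is hypothetical, and batch-norm buffers may need special handling):

    import copy

    def fed_avg(global_model, client_models, weights=None):
        # Weighted average of client weights into the shared center model.
        n = len(client_models)
        weights = weights or [1.0 / n] * n
        avg = copy.deepcopy(global_model.state_dict())
        for key in avg:
            avg[key] = sum(w * cm.state_dict()[key].float()
                           for w, cm in zip(weights, client_models))
        global_model.load_state_dict(avg)
        return global_model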

42. ProSelfLC: Progressive Self Label Correction for Target Revising in Label Noise [PDF]
  Xinshao Wang, Yang Hua, Elyor Kodirov, Neil M. Robertson
Abstract: In this work, we address robust deep learning under label noise (semi-supervised learning) from the perspective of target revising. We make three main contributions. First, we present a comprehensive mathematical study of existing target modification techniques, including Pseudo-Label [1], label smoothing [2], bootstrapping [3], knowledge distillation [4], confidence penalty [5], and joint optimisation [6]. Consequently, we reveal their relationships and drawbacks. Second, we propose ProSelfLC, a progressive and adaptive self label correction method, endorsed by learning time and predictive confidence. It addresses the disadvantages of existing algorithms and has many practical merits: (1) it is end-to-end trainable; (2) given an example, ProSelfLC can revise a one-hot target by adding information about its similarity structure and correcting its semantic class; (3) no auxiliary annotations or extra learners are required. Our proposal is designed according to two well-known findings: deep neural networks learn simple, meaningful patterns before fitting noisy ones [7-9], and the entropy regularisation principle [10, 11]. Third, label smoothing, confidence penalty and naive label correction perform on par with the state-of-the-art in our implementation, which probably indicates they were not benchmarked properly in prior work. Furthermore, our ProSelfLC outperforms them significantly.
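A hedged sketch of the correction rule as described, where trust in the model's own prediction grows with training time and with the prediction's confidence; the exact ramp g below is an assumption, not necessarily the paper's schedule:

    import torch

    def proselflc_target(onehot, pred, t, T):
        # onehot: (B, C) given (possibly noisy) targets; pred: (B, C) softmax
        # outputs; t / T: current / total training steps.
        g = (t / T) ** 2                                    # global time trust (assumed ramp)
        ent = -(pred * torch.log(pred + 1e-8)).sum(1, keepdim=True)
        max_ent = torch.log(torch.tensor(float(pred.shape[1])))
        conf = 1.0 - ent / max_ent                          # per-example confidence in [0, 1]
        eps = g * conf                                      # progressive, adaptive trust
        # The revised target carries similarity structure and can correct the class.
        return (1.0 - eps) * onehot + eps * pred

Training then minimizes cross entropy between this revised target and the model's log-probabilities, keeping the method end-to-end trainable with no extra learners.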

43. Planning from Images with Deep Latent Gaussian Process Dynamics [PDF]
  Nathanael Bosch, Jan Achterhold, Laura Leal-Taixé, Jörg Stückler
Abstract: Planning is a powerful approach to control problems with known environment dynamics. In unknown environments the agent needs to learn a model of the system dynamics to make planning applicable. This is particularly challenging when the underlying states are only indirectly observable through images. We propose to learn a deep latent Gaussian process dynamics (DLGPD) model that learns low-dimensional system dynamics from environment interactions with visual observations. The method infers latent state representations from observations using neural networks and models the system dynamics in the learned latent space with Gaussian processes. All parts of the model can be trained jointly by optimizing a lower bound on the likelihood of transitions in image space. We evaluate the proposed approach on the pendulum swing-up task while using the learned dynamics model for planning in latent space in order to solve the control problem. We also demonstrate that our method can quickly adapt a trained agent to changes in the system dynamics from just a few rollouts. We compare our approach to a state-of-the-art purely deep learning based method and demonstrate the advantages of combining Gaussian processes with deep learning for data efficiency and transfer learning.
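As a stand-in for the latent dynamics component, the sketch below fits a GP to latent transitions with scikit-learn; the paper instead trains the GP jointly with the encoder through a likelihood bound, so shapes and data here are purely illustrative:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    Z = rng.normal(size=(200, 3))        # latent states (would come from an encoder)
    A = rng.normal(size=(200, 1))        # actions
    Z_next = Z + 0.1 * A                 # stand-in transitions
    gp = GaussianProcessRegressor(kernel=RBF()).fit(np.hstack([Z, A]), Z_next)
    mean, std = gp.predict(np.hstack([Z[:1], A[:1]]), return_std=True)
    # 'mean' is the predicted next latent state used by the planner; 'std'
    # quantifies the model's uncertainty about that transition.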

44. A Hand Motion-guided Articulation and Segmentation Estimation [PDF]
  Richard Sahala Hartanto, Ryoichi Ishikawa, Menandro Roxas, Takeshi Oishi
Abstract: In this paper, we present a method for simultaneous articulation model estimation and segmentation of an articulated object in RGB-D images using human hand motion. Our method uses the hand motion in three stages: the initial articulation model estimation, ICP-based model parameter optimization, and region selection of the target object. The hand motion gives an initial guess of the articulation model: a prismatic or revolute joint. The method estimates the joint parameters by aligning the RGB-D images under the constraint of the hand motion. Finally, the target regions are selected from the cluster regions which move symmetrically along with the articulation model. Our experimental results show the robustness of the proposed method for various objects.

45. Learning to Segment Actions from Observation and Narration [PDF]
  Daniel Fried, Jean-Baptiste Alayrac, Phil Blunsom, Chris Dyer, Stephen Clark, Aida Nematzadeh
Abstract: We apply a generative segmental model of task structure, guided by narration, to action segmentation in video. We focus on unsupervised and weakly-supervised settings where no action labels are known during training. Despite its simplicity, our model performs competitively with previous work on a dataset of naturalistic instructional videos. Our model allows us to vary the sources of supervision used in training, and we find that both task structure and narrative language provide large benefits in segmentation quality.
