Contents
2. Vision-Aided Radio: User Identity Match in Radio and Video Domains Using Machine Learning [PDF] Abstract
3. Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning [PDF] Abstract
8. Multi-class segmentation under severe class imbalance: A case study in roof damage assessment [PDF] Abstract
9. Online Anomaly Detection in Surveillance Videos with Asymptotic Bounds on False Alarm Rate [PDF] Abstract
10. A New Distributional Ranking Loss With Uncertainty: Illustrated in Relative Depth Estimation [PDF] Abstract
14. Deep Learning from Small Amount of Medical Data with Noisy Labels: A Meta-Learning Approach [PDF] Abstract
15. PP-LinkNet: Improving Semantic Segmentation of High Resolution Satellite Imagery with Multi-stage Training [PDF] Abstract
28. On Deep Learning Techniques to Boost Monocular Depth Estimation for Autonomous Navigation [PDF] Abstract
30. Privacy-Preserving Object Detection & Localization Using Distributed Machine Learning: A Case Study of Infant Eyeblink Conditioning [PDF] Abstract
34. 3D Segmentation Networks for Excessive Numbers of Classes: Distinct Bone Segmentation in Upper Bodies [PDF] Abstract
36. Fast meningioma segmentation in T1-weighted MRI volumes using a lightweight 3D deep learning architecture [PDF] Abstract
42. GreedyFool: An Imperceptible Black-box Adversarial Example Attack against Neural Networks [PDF] Abstract
43. Differential diagnosis and molecular stratification of gastrointestinal stromal tumors on CT images using a radiomics approach [PDF] Abstract
45. Low-rank Convex/Sparse Thermal Matrix Approximation for Infrared-based Diagnostic System [PDF] Abstract
46. Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [PDF] Abstract
51. LiDAM: Semi-Supervised Learning with Localized Domain Adaptation and Iterative Matching [PDF] Abstract
Abstracts
1. Self-Supervised Ranking for Representation Learning [PDF] Back to contents
Ali Varamesh, Ali Diba, Tinne Tuytelaars, Luc Van Gool
Abstract: We present a new framework for self-supervised representation learning by positing it as a ranking problem in an image retrieval context on a large number of random views from random sets of images. Our work is based on two intuitive observations: first, a good representation of images must yield a high-quality image ranking in a retrieval task; second, we would expect random views of an image to be ranked closer to a reference view of that image than random views of other images. Hence, we model representation learning as a learning-to-rank problem in an image retrieval context, and train it by maximizing average precision (AP) for ranking. Specifically, given a mini-batch of images, we generate a large number of positive/negative samples and calculate a ranking loss term by separately treating each image view as a retrieval query. The new framework, dubbed S2R2, enables computing a global objective compared to the local objective in the popular contrastive learning framework calculated on pairs of views. A global objective leads S2R2 to faster convergence in terms of the number of epochs. In principle, by using a ranking criterion, we eliminate reliance on object-centered curated datasets (e.g., ImageNet). When trained on STL10 and MS-COCO, S2R2 outperforms SimCLR and performs on par with the state-of-the-art clustering-based contrastive learning model, SwAV, while being much simpler both conceptually and implementation-wise. Furthermore, when trained on a small subset of MS-COCO with fewer similar scenes, S2R2 significantly outperforms both SwAV and SimCLR. This indicates that S2R2 is potentially more effective on diverse scenes and decreases the need for a large training dataset for self-supervised learning.
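The quantity S2R2 optimizes is per-query average precision over the views in a mini-batch. For orientation only, the plain (non-differentiable) metric can be sketched as below; the paper trains a smooth surrogate of this value, and all names here are illustrative.

```python
import numpy as np

def average_precision(similarities, is_positive):
    """AP of a single retrieval query inside a mini-batch: `similarities` scores the query
    view against every other view, `is_positive` marks views cut from the same source image.
    (Sketch of the target metric only; the paper optimizes a differentiable surrogate.)"""
    order = np.argsort(-similarities)                  # rank views by decreasing similarity
    hits = np.asarray(is_positive)[order].astype(bool)
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return precision_at_k[hits].mean()                 # average precision over the positive ranks
```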
2. Vision-Aided Radio: User Identity Match in Radio and Video Domains Using Machine Learning [PDF] Back to contents
Vinicius M. de Pinho, Marcello L. R. de Campos, Luis Uzeda Garcia, Dalia Popescu
Abstract: 5G is designed to be an essential enabler and a leading infrastructure provider in the communication technology industry by supporting the demand for the growing data traffic and a variety of services with distinct requirements. The use of deep learning and computer vision tools has the means to increase the environmental awareness of the network with information from visual data. Information extracted via computer vision tools such as user position, movement direction, and speed can be promptly available for the network. However, the network must have a mechanism to match the identity of a user in both visual and radio systems. This mechanism is absent in the present literature. Therefore, we propose a framework to match the information from both visual and radio domains. This is an essential step to practical applications of computer vision tools in communications. We detail the proposed framework training and deployment phases for a presented setup. We carried out practical experiments using data collected in different types of environments. The work compares the use of Deep Neural Network and Random Forest classifiers and shows that the former performed better across all experiments, achieving classification accuracy greater than 99%.
3. Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning [PDF] Back to contents
Xinyu Yang, Majid Mirmehdi, Tilo Burghardt
Abstract: In this paper we show that learning video feature spaces in which temporal cycles are maximally predictable benefits action classification. In particular, we propose a novel learning approach termed Cycle Encoding Prediction~(CEP) that is able to effectively represent high-level spatio-temporal structure of unlabelled video content. CEP builds a latent space wherein the concept of closed forward-backward as well as backward-forward temporal loops is approximately preserved. As a self-supervision signal, CEP leverages the bi-directional temporal coherence of the video stream and applies loss functions that encourage both temporal cycle closure as well as contrastive feature separation. Architecturally, the underpinning network structure utilises a single feature encoder for all video snippets, adding two predictive modules that learn temporal forward and backward transitions. We apply our framework for pretext training of networks for action recognition tasks. We report significantly improved results for the standard datasets UCF101 and HMDB51. Detailed ablation studies support the effectiveness of the proposed components. We publish source code for the CEP components in full with this paper.
4. Manifold-Net: Using Manifold Learning for Point Cloud Classification [PDF] Back to contents
Dinghao Yang, Wei Gao
Abstract: In this paper, we propose a point cloud classification method based on graph neural network and manifold learning. Different from the conventional point cloud analysis methods, this paper uses manifold learning algorithms to embed point cloud features for better considering the geometric continuity on the surface. Then, the nature of point cloud can be acquired in low dimensional space, and after being concatenated with features in the original three-dimensional (3D) space, both the capability of feature representation and the classification network performance can be improved. We propose two manifold learning modules, where one is based on locally linear embedding algorithm, and the other is a non-linear projection method based on neural network architecture. Both of them can obtain better performances than the state-of-the-art baseline. Afterwards, the graph model is constructed by using the k nearest neighbors algorithm, where the edge features are effectively aggregated for the implementation of point cloud classification. Experiments show that the proposed point cloud classification methods obtain the mean class accuracy (mA) of 90.2% and the overall accuracy (oA) of 93.2%, which reach competitive performances compared with the existing state-of-the-art related methods.
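For orientation, the classical locally linear embedding step that the first proposed module builds on can be sketched with scikit-learn. This is only the off-the-shelf algorithm applied to hypothetical per-point features, not the paper's learnable module, and the neighbour count and target dimensionality are arbitrary choices.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

point_features = np.random.rand(1024, 64)               # stand-in for per-point features (N x D)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=3)
manifold_feats = lle.fit_transform(point_features)       # low-dimensional embedding, later concatenated with 3D coordinates
```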
5. Learning Propagation Rules for Attribution Map Generation [PDF] Back to contents
Yiding Yang, Jiayan Qiu, Mingli Song, Dacheng Tao, Xinchao Wang
Abstract: Prior gradient-based attribution-map methods rely on handcrafted propagation rules for the non-linear/activation layers during the backward pass, so as to produce gradients of the input and then the attribution map. Despite the promising results achieved, such methods are sensitive to the non-informative high-frequency components and lack adaptability for various models and samples. In this paper, we propose a dedicated method to generate attribution maps that allow us to learn the propagation rules automatically, overcoming the flaws of the handcrafted ones. Specifically, we introduce a learnable plugin module, which enables adaptive propagation rules for each pixel, to the non-linear layers during the backward pass for mask generating. The masked input image is then fed into the model again to obtain new output that can be used as a guidance when combined with the original one. The introduced learnable module can be trained under any auto-grad framework with higher-order differential support. As demonstrated on five datasets and six network architectures, the proposed method yields state-of-the-art results and gives cleaner and more visually plausible attribution maps.
6. A Vector-based Representation to Enhance Head Pose Estimation [PDF] Back to contents
Zhiwen Cao, Zongcheng Chu, Dongfang Liu, Yingjie Chen
Abstract: This paper proposes to use the three vectors in a rotation matrix as the representation in head pose estimation and develops a new neural network based on the characteristic of such representation. We address two potential issues existed in current head pose estimation works: 1. Public datasets for head pose estimation use either Euler angles or quaternions to annotate data samples. However, both of these annotations have the issue of discontinuity and thus could result in some performance issues in neural network training. 2. Most research works report Mean Absolute Error (MAE) of Euler angles as the measurement of performance. We show that MAE may not reflect the actual behavior especially for the cases of profile views. To solve these two problems, we propose a new annotation method which uses three vectors to describe head poses and a new measurement Mean Absolute Error of Vectors (MAEV) to assess the performance. We also train a new neural network to predict the three vectors with the constraints of orthogonality. Our proposed method achieves state-of-the-art results on both AFLW2000 and BIWI datasets. Experiments show our vector-based annotation method can effectively reduce prediction errors for large pose angles.
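A minimal sketch of the vector-based annotation and metric, assuming MAEV is the mean absolute error averaged over the three column vectors of the predicted and ground-truth rotation matrices; the Euler convention used below is an illustrative assumption.

```python
import numpy as np

def euler_to_rotation_matrix(yaw, pitch, roll):
    """Compose a rotation matrix from Euler angles in radians (Z-Y-X order, assumed here)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return Rz @ Ry @ Rx                                   # the three columns are the annotation vectors

def maev(R_pred, R_gt):
    """Mean absolute error of the three rotation-matrix vectors (MAEV, as assumed here)."""
    per_vector = np.abs(R_pred - R_gt).mean(axis=0)       # one error value per column vector
    return per_vector.mean()
```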
7. WeightAlign: Normalizing Activations by Weight Alignment [PDF] Back to contents
Xiangwei Shi, Yunqiang Li, Xin Liu, Jan van Gemert
Abstract: Batch normalization (BN) allows training very deep networks by normalizing activations by mini-batch sample statistics which renders BN unstable for small batch sizes. Current small-batch solutions such as Instance Norm, Layer Norm, and Group Norm use channel statistics which can be computed even for a single sample. Such methods are less stable than BN as they critically depend on the statistics of a single input sample. To address this problem, we propose a normalization of activation without sample statistics. We present WeightAlign: a method that normalizes the weights by the mean and scaled standard deviation computed within a filter, which normalizes activations without computing any sample statistics. Our proposed method is independent of batch size and stable over a wide range of batch sizes. Because weight statistics are orthogonal to sample statistics, we can directly combine WeightAlign with any method for activation normalization. We experimentally demonstrate these benefits for classification on CIFAR-10, CIFAR-100, ImageNet, for semantic segmentation on PASCAL VOC 2012 and for domain adaptation on Office-31.
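A minimal PyTorch sketch of the central idea, standardizing each filter's weights by their mean and standard deviation instead of normalizing activations. The exact scaling constant used by WeightAlign is not reproduced here, so treat this as an illustration rather than a reference implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class WeightAlignedConv2d(nn.Conv2d):
    """Conv2d that standardizes each output filter's weights before every forward pass
    (a sketch of weight alignment; the paper's scaling constant may differ)."""
    def forward(self, x):
        w = self.weight
        flat = w.view(w.size(0), -1)                     # one row per output filter
        mean = flat.mean(dim=1).view(-1, 1, 1, 1)
        std = flat.std(dim=1).view(-1, 1, 1, 1) + 1e-5
        w = (w - mean) / std                             # normalize the weights, not the activations
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```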
8. Multi-class segmentation under severe class imbalance: A case study in roof damage assessment [PDF] Back to contents
Jean-Baptiste Boin, Nat Roth, Jigar Doshi, Pablo Llueca, Nicolas Borensztein
Abstract: The task of roof damage classification and segmentation from overhead imagery presents unique challenges. In this work we choose to address the challenge posed due to strong class imbalance. We propose four distinct techniques that aim at mitigating this problem. Through a new scheme that feeds the data to the network by oversampling the minority classes, and three other network architectural improvements, we manage to boost the macro-averaged F1-score of a model by 39.9 percentage points, thus achieving improved segmentation performance, especially on the minority classes.
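One common way to realize the oversampling scheme is an inverse-frequency weighted sampler. The sketch below assumes one class label per training example, which differs from the paper's segmentation setting, and all names are illustrative.

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

def make_oversampling_sampler(labels):
    """Draw rare classes more often using inverse-frequency weights.
    `labels` holds one integer class id per training example (an assumption for illustration)."""
    labels = np.asarray(labels)
    class_counts = np.bincount(labels)
    weights = 1.0 / class_counts[labels]                 # rarer class -> larger sampling weight
    return WeightedRandomSampler(weights.tolist(), num_samples=len(labels), replacement=True)
```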
9. Online Anomaly Detection in Surveillance Videos with Asymptotic Bounds on False Alarm Rate [PDF] Back to contents
Keval Doshi, Yasin Yilmaz
Abstract: Anomaly detection in surveillance videos is attracting an increasing amount of attention. Despite the competitive performance of recent methods, they lack theoretical performance analysis, particularly due to the complex deep neural network architectures used in decision making. Additionally, online decision making is an important but mostly neglected factor in this domain. Many of the existing methods that claim to be online depend on batch or offline processing in practice. Motivated by these research gaps, we propose an online anomaly detection method in surveillance videos with asymptotic bounds on the false alarm rate, which in turn provides a clear procedure for selecting a proper decision threshold that satisfies the desired false alarm rate. Our proposed algorithm consists of a multi-objective deep learning module along with a statistical anomaly detection module, and its effectiveness is demonstrated on several publicly available data sets where we outperform the state-of-the-art algorithms. All codes are available at this https URL.
10. A New Distributional Ranking Loss With Uncertainty: Illustrated in Relative Depth Estimation [PDF] Back to contents
Alican Mertan, Yusuf Huseyin Sahin, Damien Jade Duff, Gozde Unal
Abstract: We propose a new approach for the problem of relative depth estimation from a single image. Instead of directly regressing over depth scores, we formulate the problem as estimation of a probability distribution over depth and aim to learn the parameters of the distributions which maximize the likelihood of the given data. To train our model, we propose a new ranking loss, Distributional Loss, which tries to increase the probability of farther pixel's depth being greater than the closer pixel's depth. Our proposed approach allows our model to output confidence in its estimation in the form of standard deviation of the distribution. We achieve state of the art results against a number of baselines while providing confidence in our estimations. Our analysis shows that estimated confidence is actually a good indicator of accuracy. We investigate the usage of confidence information in a downstream task of metric depth estimation, to increase its performance.
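One way to picture the loss: assume each pixel's depth is modelled as a Gaussian with a predicted mean and standard deviation, and maximize the probability that the farther pixel of an ordered pair is indeed deeper. The parameterization below is an assumption for illustration, not the paper's exact formulation.

```python
import torch

def distributional_ranking_loss(mu_far, sigma_far, mu_near, sigma_near, eps=1e-8):
    """Sketch: negative log-probability that the 'far' pixel is deeper than the 'near' pixel,
    assuming independent Gaussian depth estimates per pixel (an illustrative assumption)."""
    z = (mu_far - mu_near) / torch.sqrt(sigma_far ** 2 + sigma_near ** 2 + eps)
    p_correct = torch.distributions.Normal(0.0, 1.0).cdf(z)   # P(depth_far - depth_near > 0)
    return -torch.log(p_correct + eps).mean()
```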
11. Better Patch Stitching for Parametric Surface Reconstruction [PDF] Back to contents
Zhantao Deng, Jan Bednařík, Mathieu Salzmann, Pascal Fua
Abstract: Recently, parametric mappings have emerged as highly effective surface representations, yielding low reconstruction error. In particular, the latest works represent the target shape as an atlas of multiple mappings, which can closely encode object parts. Atlas representations, however, suffer from one major drawback: The individual mappings are not guaranteed to be consistent, which results in holes in the reconstructed shape or in jagged surface areas. We introduce an approach that explicitly encourages global consistency of the local mappings. To this end, we introduce two novel loss terms. The first term exploits the surface normals and requires that they remain locally consistent when estimated within and across the individual mappings. The second term further encourages better spatial configuration of the mappings by minimizing novel stitching error. We show on standard benchmarks that the use of normal consistency requirement outperforms the baselines quantitatively while enforcing better stitching leads to much better visual quality of the reconstructed objects as compared to the state-of-the-art.
12. FC-DCNN: A densely connected neural network for stereo estimation [PDF] Back to contents
Dominik Hirner, Friedrich Fraundorfer
Abstract: We propose a novel lightweight network for stereo estimation. Our network consists of a fully-convolutional densely connected neural network (FC-DCNN) that computes matching costs between rectified image pairs. Our FC-DCNN method learns expressive features and performs some simple but effective post-processing steps. The densely connected layer structure connects the output of each layer to the input of each subsequent layer. This network structure and the fact that we do not use any fully-connected layers or 3D convolutions leads to a very lightweight network. The output of this network is used in order to calculate matching costs and create a cost-volume. Instead of using time and memory-inefficient cost-aggregation methods such as semi-global matching or conditional random fields in order to improve the result, we rely on filtering techniques, namely median filter and guided filter. By computing a left-right consistency check we get rid of inconsistent values. Afterwards we use a watershed foreground-background segmentation on the disparity image with removed inconsistencies. This mask is then used to refine the final prediction. We show that our method works well for both challenging indoor and outdoor scenes by evaluating it on the Middlebury, KITTI and ETH3D benchmarks respectively. Our full framework is available at this https URL
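The left-right consistency check mentioned above can be sketched in a few lines: a left-view disparity is kept only when the disparity at its matching location in the right view agrees within a tolerance. The tolerance value is an arbitrary choice for illustration.

```python
import numpy as np

def left_right_consistency_mask(disp_left, disp_right, max_diff=1.0):
    """Return a boolean mask of left-view pixels whose disparities pass the consistency check."""
    h, w = disp_left.shape
    xs = np.tile(np.arange(w), (h, 1))
    x_right = np.clip(np.round(xs - disp_left).astype(int), 0, w - 1)      # matching column in the right view
    disp_right_at_match = np.take_along_axis(disp_right, x_right, axis=1)
    return np.abs(disp_left - disp_right_at_match) <= max_diff
```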
13. Relative Depth Estimation as a Ranking Problem [PDF] Back to contents
Alican Mertan, Damien Jade Duff, Gozde Unal
Abstract: We present a formulation of relative depth estimation from a single image as a ranking problem. By reformulating the problem this way, we were able to utilize literature on the ranking problem, and apply the existing knowledge to achieve better results. To this end, we have introduced a listwise ranking loss borrowed from ranking literature, weighted ListMLE, to the relative depth estimation problem. We have also brought a new metric which considers pixel depth ranking accuracy, on which our method is stronger.
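For reference, plain ListMLE is the negative log-likelihood of the ground-truth ordering under a Plackett-Luce model of the predicted scores; the weighted variant multiplies the per-position terms by a weight whose exact schedule in the paper is not reproduced here.

```python
import torch

def listmle_loss(scores, true_order, position_weights=None):
    """Sketch of (weighted) ListMLE. `scores` are predicted relevance scores for the items,
    `true_order` lists item indices from most to least relevant, and `position_weights`
    (optional) is the weighting of weighted ListMLE (the paper's scheme is an assumption here)."""
    s = scores[true_order]                                     # scores arranged in the true order
    # log Plackett-Luce likelihood: sum_i [ s_i - log(sum_{j >= i} exp(s_j)) ]
    log_tail_sums = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    log_lik = s - log_tail_sums
    if position_weights is not None:
        log_lik = log_lik * position_weights
    return -log_lik.sum()
```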
14. Deep Learning from Small Amount of Medical Data with Noisy Labels: A Meta-Learning Approach [PDF] Back to contents
Görkem Algan, Ilkay Ulusoy, Şaban Gönül, Banu Turgut, Berker Bakbak
Abstract: Computer vision systems recently made a big leap thanks to deep neural networks. However, these systems require correctly labeled large datasets in order to be trained properly, which is very difficult to obtain for medical applications. Two main reasons for label noise in medical applications are the high complexity of the data and conflicting opinions of experts. Moreover, medical imaging datasets are commonly tiny, which makes each data very important in learning. As a result, if not handled properly, label noise significantly degrades the performance. Therefore, we propose a label-noise-robust learning algorithm that makes use of the meta-learning paradigm. We tested our proposed solution on retinopathy of prematurity (ROP) dataset with a very high label noise of 68%. Our results show that the proposed algorithm significantly improves the classification algorithm's performance in the presence of noisy labels.
15. PP-LinkNet: Improving Semantic Segmentation of High Resolution Satellite Imagery with Multi-stage Training [PDF] Back to contents
An Tran, Ali Zonoozi, Jagannadan Varadarajan, Hannes Kruppa
Abstract: Road network and building footprint extraction is essential for many applications such as updating maps, traffic regulations, city planning, ride-hailing, disaster response, etc. Mapping road networks is currently both expensive and labor-intensive. Recently, improvements in image segmentation through the application of deep neural networks have shown promising results in extracting road segments from large scale, high resolution satellite imagery. However, significant challenges remain due to lack of enough labeled training data needed to build models for industry grade applications. In this paper, we propose a two-stage transfer learning technique to improve robustness of semantic segmentation for satellite images that leverages noisy pseudo ground truth masks obtained automatically (without human labor) from crowd-sourced OpenStreetMap (OSM) data. We further propose Pyramid Pooling-LinkNet (PP-LinkNet), an improved deep neural network for segmentation that uses focal loss, poly learning rate, and context module. We demonstrate the strengths of our approach through evaluations done on three popular datasets over two tasks, namely, road extraction and building foot-print detection. Specifically, we obtain 78.19% meanIoU on SpaceNet building footprint dataset, 67.03% and 77.11% on the road topology metric on SpaceNet and DeepGlobe road extraction dataset, respectively.
16. AMPA-Net: Optimization-Inspired Attention Neural Network for Deep Compressed Sensing [PDF] Back to contents
Nanyu Li, Charles C. Zhou
Abstract: Compressed sensing (CS) is a challenging problem in image processing due to reconstructing an almost complete image from a limited measurement. To achieve fast and accurate CS reconstruction, we synthesize the advantages of two well-known methods (neural network and optimization algorithm) to propose a novel optimization-inspired neural network dubbed AMP-Net. AMP-Net realizes the fusion of the Approximate Message Passing (AMP) algorithm and a neural network. All of its parameters are learned automatically. Furthermore, we propose an AMPA-Net which uses three attention networks to improve the representation ability of AMP-Net. Finally, we demonstrate the effectiveness of AMP-Net and AMPA-Net on four CS reconstruction benchmark data sets.
17. Development of Open Informal Dataset Affecting Autonomous Driving [PDF] Back to contents
Yong-Gu Lee, Seong-Jae Lee, Sang-Jin Lee, Tae-Seung Baek, Dong-Whan Lee, Kyeong-Chan Jang, Ho-Jin Sohn, Jin-Soo Kim
Abstract: This document describes the procedures and methods for collecting objects and unstructured dynamic data on the road for the development of object recognition technology for self-driving cars, and outlines the methods of collecting data, annotation data, object classifier criteria, and data processing methods. On-road object and unstructured dynamic data were collected in various environments, such as weather, time and traffic conditions, and additional reception calls for police and safety personnel were collected. Finally, 100,000 images of various objects existing on pedestrians and roads, 200,000 images of police and traffic safety personnel, 5,000 images of police and traffic safety personnel, and data sets consisting of 5,000 image data were collected and built.
18. Adaptive-Attentive Geolocalization from few queries: a hybrid approach [PDF] Back to contents
Gabriele Moreno Berton, Valerio Paolicelli, Carlo Masone, Barbara Caputo
Abstract: We address the task of cross-domain visual place recognition, where the goal is to geolocalize a given query image against a labeled gallery, in the case where the query and the gallery belong to different visual domains. To achieve this, we focus on building a domain robust deep network by leveraging over an attention mechanism combined with few-shot unsupervised domain adaptation techniques, where we use a small number of unlabeled target domain images to learn about the target distribution. With our method, we are able to outperform the current state of the art while using two orders of magnitude fewer target domain images. Finally we propose a new large-scale dataset for cross-domain visual place recognition, called SVOX. Upon acceptance of the paper, code and dataset will be released.
19. Semantic Segmentation for Partially Occluded Apple Trees Based on Deep Learning [PDF] Back to contents
Zijue Chen, David Ting, Rhys Newbury, Chao Chen
Abstract: Fruit tree pruning and fruit thinning require a powerful vision system that can provide high resolution segmentation of the fruit trees and their branches. However, recent works only consider the dormant season, where there are minimal occlusions on the branches or fit a polynomial curve to reconstruct branch shape and hence, losing information about branch thickness. In this work, we apply two state-of-the-art supervised learning models U-Net and DeepLabv3, and a conditional Generative Adversarial Network Pix2Pix (with and without the discriminator) to segment partially occluded 2D-open-V apple trees. Binary accuracy, Mean IoU, Boundary F1 score and Occluded branch recall were used to evaluate the performances of the models. DeepLabv3 outperforms the other models at Binary accuracy, Mean IoU and Boundary F1 score, but is surpassed by Pix2Pix (without discriminator) and U-Net in Occluded branch recall. We define two difficulty indices to quantify the difficulty of the task: (1) Occlusion Difficulty Index and (2) Depth Difficulty Index. We analyze the worst 10 images in both difficulty indices by means of Branch Recall and Occluded Branch Recall. U-Net outperforms the other two models in the current metrics. On the other hand, Pix2Pix (without discriminator) provides more information on branch paths, which are not reflected by the metrics. This highlights the need for more specific metrics on recovering occluded information. Furthermore, this shows the usefulness of image-transfer networks for hallucination behind occlusions. Future work is required to further enhance the models to recover more information from occlusions such that this technology can be applied to automating agricultural tasks in a commercial environment.
20. Semantic Flow-guided Motion Removal Method for Robust Mapping [PDF] 返回目录
Xudong Lv, Boya Wang, Dong Ye, Shuo Wang
Abstract: Moving objects in scenes remain a severe challenge for SLAM systems. Many efforts have tried to remove the motion regions in the images by detecting moving objects; in this way, keypoints belonging to motion regions are ignored in later calculations. In this paper, we propose a novel motion removal method that leverages semantic information and optical flow to extract motion regions. Unlike previous works, we do not predict moving objects or motion regions directly from image sequences. We compute rigid optical flow, synthesized from depth and pose, and compare it against the estimated optical flow to obtain initial motion regions. Then, we use K-means together with instance segmentation masks to refine the motion region masks. ORB-SLAM2 integrated with the proposed motion removal method achieves the best performance in both indoor and outdoor dynamic environments.
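A minimal NumPy sketch of the rigid-flow comparison step described above: back-project pixels with the depth map, transform them by the relative camera pose, re-project, and flag pixels whose estimated optical flow disagrees with this camera-induced flow. The threshold value and the omission of the K-means/instance-mask refinement are simplifications, not the paper's exact procedure.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Optical flow induced purely by camera motion (R, t) on a static scene."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N homogeneous pixels
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                 # back-project to 3D
    pts2 = R @ pts + t.reshape(3, 1)                                    # move into the second frame
    proj = K @ pts2
    proj = proj[:2] / proj[2:3]                                         # perspective divide
    return (proj - pix[:2]).T.reshape(h, w, 2)

def initial_motion_mask(rigid, estimated, thresh=3.0):
    """Pixels whose estimated flow disagrees with the rigid flow become motion candidates."""
    residual = np.linalg.norm(estimated - rigid, axis=-1)
    return residual > thresh
```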
21. Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage Optimization [PDF] 返回目录
Cheng Yu, Bo Wang, Bo Yang, Robby T. Tan
Abstract: Estimating 3D human poses from a monocular video is still a challenging task. Many existing methods' performance drops when the target person is occluded by other objects, or when the motion is too fast or too slow relative to the scale and speed of the training data. Moreover, many of these methods are not explicitly designed or trained under severe occlusion, compromising their performance in handling occlusion. Addressing these problems, we introduce a spatio-temporal network for robust 3D human pose estimation. As humans in videos may appear at different scales and have various motion speeds, we apply multi-scale spatial features for 2D joint or keypoint prediction in each individual frame, and multi-stride temporal convolutional networks (TCNs) to estimate 3D joints or keypoints. Furthermore, we design a spatio-temporal discriminator based on body structures as well as limb motions to assess whether the predicted pose forms a valid pose and a valid movement. During training, we explicitly mask out some keypoints to simulate various occlusion cases, from minor to severe occlusion, so that our network can learn better and become robust to various degrees of occlusion. As there are limited 3D ground-truth data, we further utilize 2D video data to inject a semi-supervised learning capability into our network. Moreover, we observe a discrepancy between 3D pose prediction and 2D pose estimation due to different pose variations between video and image training datasets. We therefore propose a confidence-based inference-stage optimization to adaptively enforce the 3D pose projection to match the 2D pose estimation and further improve final pose prediction accuracy. Experiments on public datasets validate the effectiveness of our method, and our ablation studies show the strengths of our network's individual submodules.
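The keypoint-masking augmentation is simple to illustrate; below is a hedged sketch of how 2D keypoints and their confidences might be randomly zeroed out to simulate occlusion during training. The drop probability and array layout are assumptions, not the paper's settings.

```python
import numpy as np

def mask_keypoints(keypoints_2d, confidences, drop_prob=0.2, rng=None):
    """Randomly zero out 2D keypoints (and their confidences) to simulate occlusion.

    keypoints_2d: (T, J, 2) per-frame joint locations.
    confidences:  (T, J) detection confidences.
    """
    rng = rng or np.random.default_rng()
    drop = rng.random(confidences.shape) < drop_prob   # per joint, per frame
    kp = keypoints_2d.copy()
    conf = confidences.copy()
    kp[drop] = 0.0
    conf[drop] = 0.0
    return kp, conf
```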
22. Towards Optimal Filter Pruning with Balanced Performance and Pruning Speed [PDF] 返回目录
Dong Li, Sitong Chen, Xudong Liu, Yunda Sun, Li Zhang
Abstract: Filter pruning has drawn increasing attention since resource-constrained platforms require more compact models for deployment. However, current pruning methods suffer either from the inferior performance of one-shot methods or from the expensive time cost of iterative training methods. In this paper, we propose a filter pruning method that balances performance and pruning speed. Based on filter importance criteria, our method prunes each layer at an approximately optimal layer-wise pruning rate for a preset loss variation. The network is pruned layer by layer without the time-consuming prune-retrain iteration. If a pre-defined pruning rate for the entire network is given, we also introduce a fast-converging method to find the corresponding loss variation threshold. Moreover, we propose a layer-group pruning and channel selection mechanism for channel alignment in networks with short connections. The proposed pruning method is widely applicable to common architectures and does not involve any additional training except the final fine-tuning. Comprehensive experiments show that our method outperforms many state-of-the-art approaches.
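To make the "prune a layer at a preset loss variation" idea concrete, here is an illustrative greedy variant (not necessarily the paper's exact criterion): filters are dropped in order of increasing importance for as long as the measured loss change stays within the preset budget.

```python
import numpy as np

def prune_layer(filter_importance, eval_loss, base_loss, max_delta):
    """Greedily drop the least-important filters while the loss change stays below max_delta.

    filter_importance: 1-D array, one score per filter (e.g. the L1 norm of its weights).
    eval_loss(keep_idx): callable returning the loss with only `keep_idx` filters active.
    """
    order = np.argsort(filter_importance)        # least important first
    kept = list(range(len(filter_importance)))
    for idx in order:
        trial = [i for i in kept if i != idx]
        if not trial:
            break
        if eval_loss(trial) - base_loss <= max_delta:
            kept = trial                         # pruning this filter is acceptable
        else:
            break                                # loss-variation budget exhausted
    return kept
```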
23. Ferrograph image classification [PDF] 返回目录
Peng Peng, Jiugen Wang
Abstract: Identifying ferrograph images is challenging when only a small dataset is available and wear particles appear at various scales. This study proposes a novel model to cope with these problems. To address the problem of insufficient samples, we first propose a data augmentation algorithm based on the permutation of image patches. Then, an auxiliary loss function for image-patch permutation recognition is proposed to identify the images generated by the data augmentation algorithm. Moreover, we design a feature extraction loss function that forces the model to extract richer features and to reduce redundant representations. To handle the wide range of wear particle sizes, we propose a multi-scale feature extraction block to obtain multi-scale representations of wear particles. We carry out experiments on a ferrograph image dataset and a mini-CIFAR-10 dataset. Experimental results show that the proposed model improves accuracy on the two datasets by 9% and 20%, respectively, compared with the baseline.
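A small sketch of the patch-permutation augmentation described above: the image is split into a grid of patches and reshuffled, and the permutation itself can serve as the label for the auxiliary recognition loss. In practice one would typically restrict to a fixed set of permutations and predict its index; the grid size here is a placeholder.

```python
import numpy as np

def permute_patches(image, grid=3, rng=None):
    """Shuffle a grid of image patches; return the new image and the applied permutation.

    The permutation can supervise an auxiliary 'which permutation was applied?' head.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    perm = rng.permutation(grid * grid)
    out = image.copy()
    for dst, src in enumerate(perm):
        sy, sx = divmod(src, grid)
        dy, dx = divmod(dst, grid)
        out[dy*ph:(dy+1)*ph, dx*pw:(dx+1)*pw] = image[sy*ph:(sy+1)*ph, sx*pw:(sx+1)*pw]
    return out, perm
```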
24. Rotation Averaging with Attention Graph Neural Networks [PDF] 返回目录
Joshua Thorpe, Ruwan Tennakoon, Alireza Bab-Hadiashar
Abstract: In this paper we propose a real-time and robust solution to large-scale multiple rotation averaging. Until recently, the multiple rotation averaging problem was solved using conventional iterative optimization algorithms. Such methods employed robust cost functions that were chosen based on assumptions made about the sensor noise and outlier distribution. In practice, these assumptions do not always fit real datasets very well. A recent work showed that the noise distribution could be learnt using a graph neural network. This solution required a second network for outlier detection and removal, as the averaging network was sensitive to a poor initialization. In this paper we propose a single-stage graph neural network that can robustly perform rotation averaging in the presence of noise and outliers. Our method uses all observations, suppressing outlier effects through weighted averaging and an attention mechanism within the network design. The result is a network that is faster, more robust, and can be trained with fewer samples than the previous neural approach, ultimately outperforming conventional iterative algorithms in both accuracy and inference time.
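The weighted-averaging building block can be illustrated with a weighted chordal mean of rotation matrices, projected back onto SO(3) with an SVD; this is a generic construction, not the paper's attention GNN.

```python
import numpy as np

def weighted_rotation_mean(rotations, weights):
    """Weighted chordal mean: average 3x3 rotation matrices, then project back onto SO(3)."""
    rotations = np.asarray(rotations, dtype=float)     # (N, 3, 3)
    weights = np.asarray(weights, dtype=float)         # (N,)
    M = np.tensordot(weights / weights.sum(), rotations, axes=1)   # weighted sum of matrices
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:                           # keep a proper rotation (det = +1)
        U[:, -1] *= -1
        R = U @ Vt
    return R
```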
25. Are all negatives created equal in contrastive instance discrimination? [PDF] 返回目录
Tiffany, Jonathan Frankle, David J. Schwab, Ari S. Morcos
Abstract: Self-supervised learning has recently begun to rival supervised learning on computer vision tasks. Many of the recent approaches have been based on contrastive instance discrimination (CID), in which the network is trained to recognize two augmented versions of the same instance (a query and positive) while discriminating against a pool of other instances (negatives). The learned representation is then used on downstream tasks such as image classification. Using methodology from MoCo v2 (Chen et al., 2020), we divided negatives by their difficulty for a given query and studied which difficulty ranges were most important for learning useful representations. We found a minority of negatives -- the hardest 5% -- were both necessary and sufficient for the downstream task to reach nearly full accuracy. Conversely, the easiest 95% of negatives were unnecessary and insufficient. Moreover, the very hardest 0.1% of negatives were unnecessary and sometimes detrimental. Finally, we studied the properties of negatives that affect their hardness, and found that hard negatives were more semantically similar to the query, and that some negatives were more consistently easy or hard than we would expect by chance. Together, our results indicate that negatives vary in importance and that CID may benefit from more intelligent negative treatment.
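One plausible way to reproduce the per-query difficulty ranking is to score negatives by cosine similarity to the query embedding and keep the most similar fraction, as sketched below; the paper uses MoCo v2 embeddings for this, and the 5% cut-off simply matches the figure quoted in the abstract.

```python
import numpy as np

def hardest_negatives(query, negatives, fraction=0.05):
    """Rank negatives by cosine similarity to the query and keep the hardest fraction."""
    query = np.asarray(query, dtype=float)
    negatives = np.asarray(negatives, dtype=float)
    q = query / np.linalg.norm(query)
    n = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    sims = n @ q                                   # higher similarity = harder negative
    k = max(1, int(len(negatives) * fraction))
    idx = np.argsort(sims)[::-1][:k]               # indices of the top-k most similar negatives
    return idx, sims[idx]
```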
26. Intrapersonal Parameter Optimization for Offline Handwritten Signature Augmentation [PDF] 返回目录
Teruo M. Maruyama, Luiz S. Oliveira, Alceu S. Britto Jr, Robert Sabourin
Abstract: Usually, in a real-world scenario, few signature samples are available to train an automatic signature verification system (ASVS). However, such systems do indeed need a lot of signatures to achieve an acceptable performance. Neuromotor signature duplication methods and feature space augmentation methods may be used to meet the need for an increase in the number of samples. Such techniques manually or empirically define a set of parameters to introduce a degree of writer variability. Therefore, in the present study, a method to automatically model the most common writer variability traits is proposed. The method is used to generate offline signatures in the image and the feature space and train an ASVS. We also introduce an alternative approach to evaluate the quality of samples considering their feature vectors. We evaluated the performance of an ASVS with the generated samples using three well-known offline signature datasets: GPDS, MCYT-75, and CEDAR. In GPDS-300, when the SVM classifier was trained using one genuine signature per writer and the duplicates generated in the image space, the Equal Error Rate (EER) decreased from 5.71% to 1.08%. Under the same conditions, the EER decreased to 1.04% using the feature space augmentation technique. We also verified that the model that generates duplicates in the image space reproduces the most common writer variability traits in the three different datasets.
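Since the results are reported as Equal Error Rates, a short sketch of how EER is commonly computed from verification scores may help; it assumes higher scores indicate genuine signatures (an assumption about the score convention) and sweeps thresholds until the false-acceptance and false-rejection rates meet.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """EER: operating point where false-acceptance and false-rejection rates are equal."""
    genuine_scores = np.asarray(genuine_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        frr = np.mean(genuine_scores < t)      # genuine signatures rejected
        far = np.mean(impostor_scores >= t)    # forgeries accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return float(eer)
```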
27. Video Action Understanding: A Tutorial [PDF] 返回目录
Matthew Hutchinson, Vijay Gadepally
Abstract: Many believe that the successes of deep learning on image understanding problems can be replicated in the realm of video understanding. However, the span of video action problems and the set of proposed deep learning solutions is arguably wider and more diverse than those of their 2D image siblings. Finding, identifying, and predicting actions are a few of the most salient tasks in video action understanding. This tutorial clarifies a taxonomy of video action problems, highlights datasets and metrics used to baseline each problem, describes common data preparation methods, and presents the building blocks of state-of-the-art deep learning model architectures.
28. On Deep Learning Techniques to Boost Monocular Depth Estimation for Autonomous Navigation [PDF] 返回目录
Raul de Queiroz Mendes, Eduardo Godinho Ribeiro, Nicolas dos Santos Rosa, Valdir Grassi Jr
Abstract: Inferring the depth of images is a fundamental inverse problem in Computer Vision, since depth information must be recovered from 2D images, which can arise from an infinite number of observed real scenes. Benefiting from the progress of Convolutional Neural Networks (CNNs) in exploring structural features and spatial image information, Single Image Depth Estimation (SIDE) has become a prominent topic of scientific and technological innovation, as it offers low implementation cost and robustness to environmental conditions. In the context of autonomous vehicles, state-of-the-art CNNs optimize the SIDE task by producing high-quality depth maps, which are essential during the autonomous navigation process in different locations. However, such networks are usually supervised by sparse and noisy depth data from Light Detection and Ranging (LiDAR) laser scans, and run at high computational cost, requiring high-performance Graphics Processing Units (GPUs). Therefore, we propose a new lightweight and fast supervised CNN architecture combined with novel feature extraction models designed for real-world autonomous navigation. We also introduce an efficient surface normals module, jointly with a simple geometric 2.5D loss function, to solve SIDE problems. We further innovate by incorporating multiple Deep Learning techniques, such as densification algorithms and additional semantic, surface-normal and depth information, to train our framework. The method introduced in this work focuses on robotic applications in indoor and outdoor environments, and its results are evaluated on the competitive and publicly available NYU Depth V2 and KITTI Depth datasets.
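As a rough illustration of coupling depth with surface normals (not the paper's actual module or loss), the sketch below derives per-pixel normals from a depth map by finite differences and scores their agreement with reference normals.

```python
import numpy as np

def normals_from_depth(depth):
    """Approximate per-pixel surface normals from a depth map via finite differences.

    Uses image-plane gradients only, so it is a coarse stand-in for a calibrated
    back-projection; enough to illustrate a depth/normal consistency term.
    """
    dzdx = np.gradient(depth, axis=1)
    dzdy = np.gradient(depth, axis=0)
    normals = np.stack([-dzdx, -dzdy, np.ones_like(depth)], axis=-1)
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)

def normal_consistency_loss(pred_depth, ref_normals):
    """1 minus cosine similarity between normals derived from predicted depth and reference normals."""
    pred_normals = normals_from_depth(pred_depth)
    cos = np.sum(pred_normals * ref_normals, axis=-1)
    return float(np.mean(1.0 - cos))
```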
29. A spatial model checker in GPU (extended version) [PDF] 返回目录
Laura Bussi, Vincenzo Ciancia, Fabio Gadducci
Abstract: The tool voxlogica merges ITK, a state-of-the-art library of computational imaging algorithms, with the declarative specification and optimised execution provided by spatial logic model checking. On an existing benchmark for the segmentation of brain tumours, analysis via a simple logical specification reached state-of-the-art accuracy. We present a new, GPU-based version of voxlogica and discuss its implementation, scalability, and applications.
30. Privacy-Preserving Object Detection & Localization Using Distributed Machine Learning: A Case Study of Infant Eyeblink Conditioning [PDF] 返回目录
Stefan Zwaard, Henk-Jan Boele, Hani Alers, Christos Strydis, Casey Lew-Williams, Zaid Al-Ars
Abstract: Distributed machine learning is becoming a popular model-training method due to privacy, computational scalability, and bandwidth capacities. In this work, we explore scalable distributed-training versions of two algorithms commonly used in object detection. A novel distributed training algorithm using Mean Weight Matrix Aggregation (MWMA) is proposed for Linear Support Vector Machine (L-SVM) object detection based on Histogram of Oriented Gradients (HOG). In addition, a novel Weighted Bin Aggregation (WBA) algorithm is proposed for distributed training of Ensemble of Regression Trees (ERT) landmark localization. Neither algorithm restricts the location of model aggregation, and both allow custom architectures for model distribution. For this work, a Pool-Based Local Training and Aggregation (PBLTA) architecture for both algorithms is explored. The application of both algorithms in the medical field is examined using a paradigm from the fields of psychology and neuroscience, eyeblink conditioning with infants, where models need to be trained on facial images while protecting participant privacy. Using distributed learning, models can be trained without sending image data to other nodes. The custom software has been made available for public use on GitHub: this https URL. Results show that the aggregation of models for the HOG algorithm using MWMA not only preserves the accuracy of the model but also allows for distributed learning, with an accuracy increase of 0.9% compared with traditional learning. Furthermore, WBA allows for ERT model aggregation with an accuracy increase of 8% when compared to single-node models.
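The Mean Weight Matrix Aggregation step for the linear SVM can be illustrated as a plain element-wise average of the locally trained weight vectors and biases, as sketched below; any per-node weighting the paper may apply is omitted here.

```python
import numpy as np

def aggregate_svm_weights(node_weights, node_biases):
    """Aggregate locally trained L-SVM parameters by element-wise averaging.

    node_weights: list of (d,) weight vectors, one per training node (d = HOG descriptor size).
    node_biases:  list of scalar biases.
    """
    w = np.mean(np.stack(node_weights, axis=0), axis=0)
    b = float(np.mean(node_biases))
    return w, b
```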
31. Fader Networks for domain adaptation on fMRI: ABIDE-II study [PDF] 返回目录
Marina Pominova, Ekaterina Kondrateva, Maxim Sharaev, Alexander Bernstein, Evgeny Burnaev
Abstract: ABIDE is the largest open-source autism spectrum disorder database with both fMRI data and full phenotype descriptions. These data have been extensively studied based on functional connectivity analysis as well as with deep learning on raw data, with top-model accuracy close to 75% for separate scanning sites. Yet model transferability between different scanning sites within ABIDE remains a problem. In this paper, we for the first time perform domain adaptation for a brain pathology classification problem on raw neuroimaging data. We use 3D convolutional autoencoders to build a domain-irrelevant latent-space image representation and demonstrate that this method outperforms existing approaches on ABIDE data.
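A minimal PyTorch sketch of a 3D convolutional autoencoder of the kind mentioned above, whose bottleneck can serve as the latent representation; the channel counts, kernel sizes, and input resolution are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Tiny3DAutoencoder(nn.Module):
    """Minimal 3D convolutional autoencoder; the bottleneck acts as the latent representation."""
    def __init__(self, in_channels=1, latent_channels=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, latent_channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(8, in_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Toy usage: a batch of two 32^3 fMRI-like volumes, trained with a reconstruction loss.
model = Tiny3DAutoencoder()
x = torch.randn(2, 1, 32, 32, 32)
recon, latent = model(x)
loss = nn.functional.mse_loss(recon, x)
```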
32. Domain Shift in Computer Vision models for MRI data analysis: An Overview [PDF] 返回目录
Ekaterina Kondrateva, Marina Pominova, Elena Popova, Maxim Sharaev, Alexander Bernstein, Evgeny Burnaev
Abstract: Machine learning and computer vision methods are showing good performance in medical imagery analysis. Yet only a few applications are now in clinical use, and one of the reasons for that is the poor transferability of the models to data from different sources or acquisition domains. Development of new methods and algorithms for the transfer of training and domain adaptation in multi-modal medical imaging data is crucial for the development of accurate models and their use in clinics. In the present work, we overview methods used to tackle the domain shift problem in machine learning and computer vision. The algorithms discussed in this survey include advanced data processing, model architecture enhancement and featured training, as well as prediction in a domain-invariant latent space. The application of autoencoding neural networks and their domain-invariant variations is discussed at length in the survey. We review the latest methods applied to magnetic resonance imaging (MRI) data analysis, conclude on their performance, and propose directions for further research.
33. Data Augmentation for Meta-Learning [PDF] 返回目录
Renkun Ni, Micah Goldblum, Amr Sharaf, Kezhi Kong, Tom Goldstein
Abstract: Conventional image classifiers are trained by randomly sampling mini-batches of images. To achieve state-of-the-art performance, sophisticated data augmentation schemes are used to expand the amount of training data available for sampling. In contrast, meta-learning algorithms sample not only images, but classes as well. We investigate how data augmentation can be used not only to expand the number of images available per class, but also to generate entirely new classes. We systematically dissect the meta-learning pipeline and investigate the distinct ways in which data augmentation can be integrated at both the image and class levels. Our proposed meta-specific data augmentation significantly improves the performance of meta-learners on few-shot classification benchmarks.
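One simple way to "generate entirely new classes" is to apply a fixed transformation, such as a 90-degree rotation, to every image of an existing class and treat the result as a new class; the snippet below sketches this under that assumption (the paper's actual class-level augmentations may differ).

```python
import numpy as np

def augment_episode_classes(class_to_images):
    """Add synthetic classes by rotating every image of an existing class by 90 degrees.

    class_to_images: dict mapping class name -> array of images with shape (N, H, W, C).
    Returns a new dict containing the originals plus one rotated variant per class.
    """
    augmented = dict(class_to_images)
    for name, images in class_to_images.items():
        rotated = np.rot90(images, k=1, axes=(1, 2))   # rotate the spatial axes
        augmented[f"{name}_rot90"] = rotated
    return augmented
```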
34. 3D Segmentation Networks for Excessive Numbers of Classes: Distinct Bone Segmentation in Upper Bodies [PDF] 返回目录
Eva Schnider, Antal Horváth, Georg Rauter, Azhar Zam, Magdalena Müller-Gerbl, Philippe C. Cattin
Abstract: Segmentation of distinct bones plays a crucial role in diagnosis, planning, navigation, and the assessment of bone metastasis. It supplies semantic knowledge to visualisation tools for the planning of surgical interventions and the education of health professionals. Fully supervised segmentation of 3D data using Deep Learning methods has been extensively studied for many tasks but is usually restricted to distinguishing only a handful of classes. With 125 distinct bones, our case includes many more labels than typical 3D segmentation tasks. For this reason, the direct adaptation of most established methods is not possible. This paper discusses the intricacies of training a 3D segmentation network in a many-label setting and shows necessary modifications in network architecture, loss function, and data augmentation. As a result, we demonstrate the robustness of our method by automatically segmenting over one hundred distinct bones simultaneously in an end-to-end learnt fashion from a CT-scan.
35. Understanding bias in facial recognition technologies [PDF] 返回目录
David Leslie
Abstract: Over the past couple of years, the growing debate around automated facial recognition has reached a boiling point. As developers have continued to swiftly expand the scope of these kinds of technologies into an almost unbounded range of applications, an increasingly strident chorus of critical voices has sounded concerns about the injurious effects of the proliferation of such systems. Opponents argue that the irresponsible design and use of facial detection and recognition technologies (FDRTs) threatens to violate civil liberties, infringe on basic human rights and further entrench structural racism and systemic marginalisation. They also caution that the gradual creep of face surveillance infrastructures into every domain of lived experience may eventually eradicate the modern democratic forms of life that have long provided cherished means to individual flourishing, social solidarity and human self-creation. Defenders, by contrast, emphasise the gains in public safety, security and efficiency that digitally streamlined capacities for facial identification, identity verification and trait characterisation may bring. In this explainer, I focus on one central aspect of this debate: the role that dynamics of bias and discrimination play in the development and deployment of FDRTs. I examine how historical patterns of discrimination have made inroads into the design and implementation of FDRTs from their very earliest moments. And, I explain the ways in which the use of biased FDRTs can lead distributional and recognitional injustices. The explainer concludes with an exploration of broader ethical questions around the potential proliferation of pervasive face-based surveillance infrastructures and makes some recommendations for cultivating more responsible approaches to the development and governance of these technologies.
36. Fast meningioma segmentation in T1-weighted MRI volumes using a lightweight 3D deep learning architecture [PDF] 返回目录
David Bouget, André Pedersen, Sayied Abdol Mohieb Hosainey, Johanna Vanel, Ole Solheim, Ingerid Reinertsen
Abstract: Automatic and consistent meningioma segmentation in T1-weighted MRI volumes and corresponding volumetric assessment is of use for diagnosis, treatment planning, and tumor growth evaluation. In this paper, we optimized the segmentation and processing speed performances using a large number of both surgically treated meningiomas and untreated meningiomas followed at the outpatient clinic. We studied two different 3D neural network architectures: (i) a simple encoder-decoder similar to a 3D U-Net, and (ii) a lightweight multi-scale architecture (PLS-Net). In addition, we studied the impact of different training schemes. For the validation studies, we used 698 T1-weighted MR volumes from St. Olav University Hospital, Trondheim, Norway. The models were evaluated in terms of detection accuracy, segmentation accuracy and training/inference speed. While both architectures reached a similar Dice score of 70% on average, the PLS-Net was more accurate with an F1-score of up to 88%. The highest accuracy was achieved for the largest meningiomas. Speed-wise, the PLS-Net architecture tended to converge in about 50 hours while 130 hours were necessary for U-Net. Inference with PLS-Net takes less than a second on GPU and about 15 seconds on CPU. Overall, with the use of mixed precision training, it was possible to train competitive segmentation models in a relatively short amount of time using the lightweight PLS-Net architecture. In the future, the focus should be brought toward the segmentation of small meningiomas (less than 2ml) to improve clinical relevance for automatic and early diagnosis as well as speed of growth estimates.
37. Using satellite imagery to understand and promote sustainable development [PDF] 返回目录
Marshall Burke, Anne Driscoll, David B. Lobell, Stefano Ermon
Abstract: Accurate and comprehensive measurements of a range of sustainable development outcomes are fundamental inputs into both research and policy. We synthesize the growing literature that uses satellite imagery to understand these outcomes, with a focus on approaches that combine imagery with machine learning. We quantify the paucity of ground data on key human-related outcomes and the growing abundance and resolution (spatial, temporal, and spectral) of satellite imagery. We then review recent machine learning approaches to model-building in the context of scarce and noisy training data, highlighting how this noise often leads to incorrect assessment of models' predictive performance. We quantify recent model performance across multiple sustainable development domains, discuss research and policy applications, explore constraints to future progress, and highlight key research directions for the field.
38. Practical Deep Raw Image Denoising on Mobile Devices [PDF] 返回目录
Yuzhi Wang, Haibin Huang, Qin Xu, Jiaming Liu, Yiqun Liu, Jue Wang
Abstract: Deep learning-based image denoising approaches have been extensively studied in recent years, prevailing in many public benchmark datasets. However, the state-of-the-art networks are computationally too expensive to be directly applied on mobile devices. In this work, we propose a light-weight, efficient neural network-based raw image denoiser that runs smoothly on mainstream mobile devices, and produces high quality denoising results. Our key insights are twofold: (1) by measuring and estimating sensor noise level, a smaller network trained on synthetic sensor-specific data can out-perform larger ones trained on general data; (2) the large noise level variation under different ISO settings can be removed by a novel k-Sigma Transform, allowing a small network to efficiently handle a wide range of noise levels. We conduct extensive experiments to demonstrate the efficiency and accuracy of our approach. Our proposed mobile-friendly denoising model runs at ~70 milliseconds per megapixel on the Qualcomm Snapdragon 855 chipset, and it is the basis of the night shot feature of several flagship smartphones released in 2019.
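To make the k-Sigma idea concrete, a minimal NumPy sketch follows, assuming the common Poisson-Gaussian raw noise model Var[x] ≈ k·x + σ² with ISO-dependent calibration constants k and σ; the function names, the toy calibration values and the identity "denoiser" in the usage example are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def k_sigma_transform(x, k, sigma):
    """Map raw intensities so noise statistics no longer depend on (k, sigma).

    Assumes the Poisson-Gaussian raw noise model Var[x] ~= k*x + sigma^2, where
    k and sigma are ISO-dependent sensor parameters (calibrated offline). After
    the transform, Var[y] ~= E[y], so one small denoiser can cover all ISO settings.
    """
    return x / k + (sigma ** 2) / (k ** 2)

def k_sigma_inverse(y, k, sigma):
    """Undo the transform after denoising."""
    return k * y - (sigma ** 2) / k

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.uniform(0.0, 1000.0, size=(64, 64))             # clean raw signal
    k, sigma = 0.8, 5.0                                          # hypothetical calibration
    noisy = clean + rng.normal(0.0, np.sqrt(k * clean + sigma ** 2))
    y = k_sigma_transform(noisy, k, sigma)                       # a network would denoise y here
    restored = k_sigma_inverse(y, k, sigma)                      # placeholder: identity "denoiser"
    print(np.allclose(restored, noisy))                          # True: the transform is invertible
```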
39. Efficient and high accuracy 3-D OCT angiography motion correction in pathology [PDF] 返回目录
Stefan B. Ploner, Martin F. Kraus, Eric M. Moult, Lennart Husvogt, Julia Schottenhamml, A. Yasin Alibhai, Nadia K. Waheed, Jay S. Duker, James G. Fujimoto, Andreas K. Maier
Abstract: We propose a novel method for non-rigid 3-D motion correction of orthogonally raster-scanned optical coherence tomography angiography volumes. This is the first approach that aligns predominantly axial structural features like retinal layers and transverse angiographic vascular features in a joint optimization. Combined with the use of orthogonal scans and favorization of kinematically more plausible displacements, the approach allows subpixel alignment and micrometer-scale distortion correction in all 3 dimensions. As no specific structures or layers are segmented, the approach is by design robust to pathologic changes. It is furthermore designed for highly parallel implementation and brief runtime, allowing its integration in clinical routine even for high density or wide-field scans. We evaluated the algorithm with metrics related to clinically relevant features in a large-scale quantitative evaluation based on 204 volumetric scans of 17 subjects including both a wide range of pathologies and healthy controls. Using this method, we achieve state-of-the-art axial performance and show significant advances in both transverse co-alignment and distortion correction, especially in the pathologic subgroup.
40. Identifying Wrongly Predicted Samples: A Method for Active Learning [PDF] 返回目录
Rahaf Aljundi, Nikolay Chumerin, Daniel Olmeda Reino
Abstract: State-of-the-art machine learning models require access to a significant amount of annotated data in order to achieve the desired level of performance. While unlabelled data can be largely available and even abundant, the annotation process can be quite expensive and limiting. Under the assumption that some samples are more important for a given task than others, active learning targets the problem of identifying the most informative samples that one should acquire annotations for. Instead of the conventional reliance on model uncertainty as a proxy to leverage new unknown labels, in this work we propose a simple sample selection criterion that moves beyond uncertainty. By first accepting the model prediction and then judging its effect on the generalization error, we can better identify wrongly predicted samples. We further present an approximation to our criterion that is very efficient and provides a similarity-based interpretation. In addition to evaluating our method on the standard benchmarks of active learning, we consider the challenging yet realistic scenario of imbalanced data where categories are not equally represented. We show state-of-the-art results and better rates at identifying wrongly predicted samples. Our method is simple, model agnostic and relies on the current model status without the need for re-training from scratch.
41. Deep Ensembles for Low-Data Transfer Learning [PDF] 返回目录
Basil Mustafa, Carlos Riquelme, Joan Puigcerver, André Susano Pinto, Daniel Keysers, Neil Houlsby
Abstract: In the low-data regime, it is difficult to train good supervised models from scratch. Instead practitioners turn to pre-trained models, leveraging transfer learning. Ensembling is an empirically and theoretically appealing way to construct powerful predictive models, but the predominant approach of training multiple deep networks with different random initialisations collides with the need for transfer via pre-trained weights. In this work, we study different ways of creating ensembles from pre-trained models. We show that the nature of pre-training itself is a performant source of diversity, and propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset. The approach is simple: Use nearest-neighbour accuracy to rank pre-trained models, fine-tune the best ones with a small hyperparameter sweep, and greedily construct an ensemble to minimise validation cross-entropy. When evaluated together with strong baselines on 19 different downstream tasks (the Visual Task Adaptation Benchmark), this achieves state-of-the-art performance at a much lower inference budget, even when selecting from over 2,000 pre-trained models. We also assess our ensembles on ImageNet variants and show improved robustness to distribution shift.
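The greedy construction step can be made concrete with a small sketch: given the held-out softmax outputs of each fine-tuned candidate, repeatedly add (with replacement) the member whose inclusion most reduces the validation cross-entropy of the averaged prediction. The helper names, shapes and the with-replacement choice are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true labels under `probs` (N x C)."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def greedy_ensemble(candidate_probs, val_labels, max_size=5):
    """Greedily pick fine-tuned candidates whose averaged predictions minimise
    validation cross-entropy.

    candidate_probs: list of (N x C) softmax outputs on a held-out set, one per
    fine-tuned pre-trained model. Returns indices of the selected members,
    possibly with repeats (selection with replacement).
    """
    selected, running_sum = [], None
    for _ in range(max_size):
        best_idx, best_loss = None, np.inf
        for i, probs in enumerate(candidate_probs):
            trial = probs if running_sum is None else (running_sum + probs) / (len(selected) + 1)
            loss = cross_entropy(trial, val_labels)
            if loss < best_loss:
                best_idx, best_loss = i, loss
        selected.append(best_idx)
        running_sum = (candidate_probs[best_idx] if running_sum is None
                       else running_sum + candidate_probs[best_idx])
    return selected
```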
42. GreedyFool: An Imperceptible Black-box Adversarial Example Attack against Neural Networks [PDF] 返回目录
Hui Liu, Bo Zhao, Jiabao Guo, Pengyuan Zhao, Peng Liu
Abstract: Deep neural networks (DNNs) are inherently vulnerable to well-designed input samples called adversarial examples. The adversary can easily fool DNNs by adding slight perturbations to the input. In this paper, we propose a novel black-box adversarial example attack named GreedyFool, which synthesizes adversarial examples based on differential evolution and greedy approximation. The differential evolution is utilized to evaluate the effects of perturbed pixels on the confidence of the DNN-based classifier. The greedy approximation is an approximate optimization algorithm to automatically obtain adversarial perturbations. Existing works synthesize adversarial examples by leveraging simple metrics to penalize the perturbations, which lack sufficient consideration of the human visual system (HVS), resulting in noticeable artifacts. In order to achieve sufficient imperceptibility, we investigate the HVS extensively and design an integrated metric considering just-noticeable distortion (JND), the Weber-Fechner law, texture masking and channel modulation, which is proven to be a better metric for measuring the perceptual distance between benign examples and adversarial ones. The experimental results demonstrate that GreedyFool has several remarkable properties, including black-box operation, a 100% success rate, flexibility and automation, and can synthesize more imperceptible adversarial examples than state-of-the-art pixel-wise methods.
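Only the greedy ingredient of such a black-box attack is easy to sketch generically: score candidate single-pixel perturbations by how much they lower the classifier's confidence in the true class, keep the best one, and repeat until the label flips. The sketch below follows that generic recipe; it does not reproduce the paper's differential-evolution search or its JND-based perceptual metric, and the hyperparameters and `predict_fn` interface are assumptions.

```python
import numpy as np

def greedy_perturb(image, predict_fn, true_label, step=8.0, max_pixels=50):
    """Query-only (black-box) greedy pixel attack sketch.

    predict_fn(batch) -> (B x C) class probabilities. Each round tries nudging
    every pixel up or down by `step`, keeps the single change that most lowers
    the confidence in the true class, and stops once the label flips. Very
    query-hungry; for illustration only.
    """
    adv = image.astype(np.float64).copy()
    h, w = adv.shape[:2]
    for _ in range(max_pixels):
        base_conf = predict_fn(adv[None])[0, true_label]
        best = None
        for y in range(h):
            for x in range(w):
                for delta in (step, -step):
                    trial = adv.copy()
                    trial[y, x] = np.clip(trial[y, x] + delta, 0, 255)
                    conf = predict_fn(trial[None])[0, true_label]
                    if best is None or conf < best[0]:
                        best = (conf, y, x, delta)
        conf, y, x, delta = best
        if conf >= base_conf:          # no single-pixel change helps any more
            break
        adv[y, x] = np.clip(adv[y, x] + delta, 0, 255)
        if predict_fn(adv[None])[0].argmax() != true_label:
            break
    return adv
```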
43. Differential diagnosis and molecular stratification of gastrointestinal stromal tumors on CT images using a radiomics approach [PDF] 返回目录
Martijn P.A. Starmans, Milea J.M. Timbergen, Melissa Vos, Michel Renckens, Dirk J. Grünhagen, Geert J.L.H. van Leenders, Roy S. Dwarkasing, François E. J. A. Willemssen, Wiro J. Niessen, Cornelis Verhoef, Stefan Sleijfer, Jacob J. Visser, Stefan Klein
Abstract: Distinguishing gastrointestinal stromal tumors (GISTs) from other intra-abdominal tumors and molecular analysis of GISTs are necessary for treatment planning, but challenging due to the rarity of these tumors. The aim of this study was to evaluate radiomics for distinguishing GISTs from other intra-abdominal tumors, and, in GISTs, for predicting the \textit{c-KIT}, \textit{PDGFRA} and \textit{BRAF} mutational status and the mitotic index (MI). All 247 included patients (125 GISTs, 122 non-GISTs) underwent a contrast-enhanced venous phase CT. The GIST vs. non-GIST radiomics model, including imaging, age, sex and location, had a mean area under the curve (AUC) of 0.82. Three radiologists had an AUC of 0.69, 0.76, and 0.84, respectively. The radiomics model had an AUC of 0.52 for \textit{c-KIT}, 0.56 for \textit{c-KIT} exon 11, and 0.52 for the MI. Hence, our radiomics model was able to distinguish GISTs from non-GISTs with a performance similar to the three radiologists, but was not able to predict the \textit{c-KIT} mutation or the MI.
44. Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout [PDF] 返回目录
Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, Dragomir Anguelov
Abstract: The vast majority of deep models use multiple gradient signals, typically corresponding to a sum of multiple loss terms, to update a shared set of trainable weights. However, these multiple updates can impede optimal training by pulling the model in conflicting directions. We present Gradient Sign Dropout (GradDrop), a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency. GradDrop is implemented as a simple deep layer that can be used in any deep net and synergizes with other gradient balancing approaches. We show that GradDrop outperforms the state-of-the-art multiloss methods within traditional multitask and transfer learning settings, and we discuss how GradDrop reveals links between optimal multiloss training and gradient stochasticity.
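The masking idea can be sketched as follows, assuming per-task gradients are available at one activation: estimate an element-wise sign "purity", draw one uniform sample per element, and keep positive gradients where the sample falls below the purity and negative gradients otherwise, so conflicting directions are dropped stochastically rather than averaged away. This is a standalone NumPy sketch of that idea; the paper implements it as a layer inside the network and discusses additional scaling choices.

```python
import numpy as np

def grad_drop(task_grads, rng=None):
    """Combine per-task gradients at an activation with sign-consistency masking.

    task_grads: list of arrays of identical shape, one gradient signal per loss.
    For every element we estimate a positive-sign purity
        P = 0.5 * (1 + sum_i g_i / sum_i |g_i|),
    draw one shared uniform sample U, and keep positive gradients where U < P
    and negative gradients where U >= P.
    """
    rng = rng or np.random.default_rng()
    grads = np.stack(task_grads)                      # (T, ...) stacked task gradients
    denom = np.sum(np.abs(grads), axis=0) + 1e-12
    purity = 0.5 * (1.0 + np.sum(grads, axis=0) / denom)
    u = rng.uniform(size=purity.shape)
    keep_pos = (u < purity)                           # elements where positive signs win
    mask = np.where(grads > 0, keep_pos, ~keep_pos)   # per-task keep/drop decision
    return np.sum(grads * mask, axis=0)
```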
45. Low-rank Convex/Sparse Thermal Matrix Approximation for Infrared-based Diagnostic System [PDF] 返回目录
Bardia Yousefi, Clemente Ibarra Castanedo, Xavier P.V. Maldague
Abstract: Active and passive thermography are two efficient techniques extensively used to measure heterogeneous thermal patterns indicative of subsurface defects for diagnostic evaluations. This study conducts a comparative analysis of low-rank matrix approximation methods in thermography, applying semi-, convex-, and sparse non-negative matrix factorization (NMF) methods to detect subsurface thermal patterns. These methods inherit the advantages of principal component thermography (PCT) and sparse PCT, while tackling the negative bases of sparse PCT through non-negative constraints, and exhibit a clustering property when processing data. The practicality and efficiency of these methods are demonstrated by experimental results for subsurface defect detection in three specimens (with defects of different depths and sizes) and for preserving thermal heterogeneity to distinguish breast abnormality in a breast cancer screening dataset (accuracies of 74.1%, 75.8%, and 77.8%).
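For orientation, a plain NMF factorisation of a thermal sequence (frames flattened into a frames x pixels matrix) already yields low-rank basis images of the kind analysed here; the semi-, convex- and sparse-NMF variants studied in the paper would replace this factorisation step. The sketch below uses scikit-learn's standard NMF with illustrative parameters.

```python
import numpy as np
from sklearn.decomposition import NMF

def thermal_nmf(sequence, n_components=5):
    """Low-rank non-negative factorisation of a thermographic sequence.

    sequence: (T, H, W) array of thermal frames. Returns (T, n_components)
    temporal weights and n_components basis images of shape (H, W); in
    PCT-style analysis the leading components tend to highlight subsurface
    defect patterns.
    """
    t, h, w = sequence.shape
    x = sequence.reshape(t, h * w)
    x = x - x.min()                                   # enforce non-negativity
    model = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    weights = model.fit_transform(x)                  # (T, n_components) temporal weights
    bases = model.components_.reshape(n_components, h, w)
    return weights, bases
```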
46. Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [PDF] 返回目录
Hao Tan, Mohit Bansal
Abstract: Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG. Code and pre-trained models publicly available at this https URL
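Assigning a "voken" to each token is, at inference time, a retrieval step: score every image in a fixed pool against a contextual token embedding and keep the best match. The sketch below approximates that step with cosine similarity between placeholder embeddings; the paper instead trains a dedicated cross-modal matching model on image-captioning data.

```python
import numpy as np

def assign_vokens(token_embs, image_embs):
    """Contextual token-to-image retrieval ("vokenization") sketch.

    token_embs: (L, D) contextual embeddings of the tokens in one sentence.
    image_embs: (N, D) embeddings of a fixed image pool.
    Each token is assigned the index of its most relevant image; those indices
    ("vokens") then serve as an extra prediction target when pre-training the
    language model. The embedding models and image pool are placeholders.
    """
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = t @ v.T                       # (L, N) cosine relevance scores
    return scores.argmax(axis=1)           # one voken id per token
```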
47. Measuring Visual Generalization in Continuous Control from Pixels [PDF] 返回目录
Jake Grigsby, Yanjun Qi
Abstract: Self-supervised learning and data augmentation have significantly reduced the performance gap between state- and image-based reinforcement learning agents in continuous control tasks. However, it is still unclear whether current techniques can cope with the variety of visual conditions required by real-world environments. We propose a challenging benchmark that tests agents' visual generalization by adding graphical variety to existing continuous control domains. Our empirical analysis shows that current methods struggle to generalize across a diverse set of visual changes, and we examine the specific factors of variation that make these tasks difficult. We find that data augmentation techniques outperform self-supervised learning approaches and that more significant image transformations provide better visual generalization.\footnote{The benchmark and our augmented actor-critic implementation are open-sourced at this https URL}
48. Random Network Distillation as a Diversity Metric for Both Image and Text Generation [PDF] 返回目录
Liam Fowl, Micah Goldblum, Arjun Gupta, Amr Sharaf, Tom Goldstein
Abstract: Generative models are increasingly able to produce remarkably high quality images and text. The community has developed numerous evaluation metrics for comparing generative models. However, these metrics do not effectively quantify data diversity. We develop a new diversity metric that can readily be applied to data, both synthetic and natural, of any type. Our method employs random network distillation, a technique introduced in reinforcement learning. We validate and deploy this metric on both images and text. We further explore diversity in few-shot image generation, a setting which was previously difficult to evaluate.
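One plausible way to instantiate RND as a diversity score is sketched below: embed samples with a frozen, randomly initialised target network, train a small predictor to imitate it on the generated set, and read off the residual error, on the intuition that a mode-collapsed set is easy to memorise (low error) while a diverse set is not. The network choices and the fit-on-the-same-set protocol are assumptions; the paper's exact setup may differ.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def rnd_diversity(samples, feat_dim=32, seed=0):
    """Random Network Distillation as a diversity score (sketch).

    samples: (N, D) feature vectors of generated data. A fixed random "target"
    network embeds the samples; a limited-capacity predictor is trained to
    imitate it on the same set, and the mean squared residual is returned as
    the diversity score (higher = harder to memorise = more diverse).
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(samples.shape[1], feat_dim))      # frozen random target weights
    target = np.tanh(samples @ w)
    predictor = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300, random_state=seed)
    predictor.fit(samples, target)
    return float(np.mean((predictor.predict(samples) - target) ** 2))
```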
49. Handwriting Quality Analysis using Online-Offline Models [PDF] 返回目录
Yahia Hamdi, Hanen Akouaydi, Houcine Boubaker, Adel M. Alimi
Abstract: This work is part of an innovative e-learning project allowing the development of an advanced digital educational tool that provides feedback during the process of learning handwriting for young school children (three to eight years old). In this paper, we describe a new method for analysing the quality of children's handwriting. It automatically detects mistakes, gives real-time on-line feedback on children's writing, and helps teachers comprehend and evaluate children's writing skills. The proposed method assesses five main criteria: shape, direction, stroke order, position with respect to the reference lines, and kinematics of the trace. It analyzes the handwriting quality and automatically gives feedback based on the combination of three extracted models: a Beta-Elliptic Model (BEM) using similarity detection (SD) and dissimilarity distance (DD) measures, a Fourier Descriptor Model (FDM), and a perceptive Convolutional Neural Network (CNN) with a Support Vector Machine (SVM) comparison engine. The originality of our work lies partly in the system architecture, which captures complementary dynamic, geometric, and visual representations of the examined handwritten scripts, and in the efficient selected features adapted to various handwriting styles and multiple script languages such as Arabic, Latin, digits, and symbol drawing. The application offers two interactive interfaces, dedicated respectively to learners and to educators, experts or teachers, and allows them to adapt it easily to the specificity of their pupils. The evaluation of our framework is supported by a database collected in a Tunisian primary school from 400 children. Experimental results show the efficiency and robustness of our suggested framework, which helps teachers and children by offering positive feedback throughout the handwriting learning process using tactile digital devices.
50. A Multi-Modal Method for Satire Detection using Textual and Visual Cues [PDF] 返回目录
Lily Li, Or Levi, Pedram Hosseini, David A. Broniatowski
Abstract: Satire is a form of humorous critique, but it is sometimes misinterpreted by readers as legitimate news, which can lead to harmful consequences. We observe that the images used in satirical news articles often contain absurd or ridiculous content and that image manipulation is used to create fictional scenarios. While previous work has studied text-based methods, in this work we propose a multi-modal approach based on the state-of-the-art visiolinguistic model ViLBERT. To this end, we create a new dataset consisting of images and headlines of regular and satirical news for the task of satire detection. We fine-tune ViLBERT on the dataset and train a convolutional neural network that uses an image forensics technique. Evaluation on the dataset shows that our proposed multi-modal approach outperforms image-only, text-only, and simple fusion baselines.
51. LiDAM: Semi-Supervised Learning with Localized Domain Adaptation and Iterative Matching [PDF] 返回目录
Qun Liu, Matthew Shreve, Raja Bala
Abstract: Although data is abundant, data labeling is expensive. Semi-supervised learning methods combine a few labeled samples with a large corpus of unlabeled data to effectively train models. This paper introduces our proposed method LiDAM, a semi-supervised learning approach rooted in both domain adaptation and self-paced learning. LiDAM first performs localized domain shifts to extract better domain-invariant features for the model that results in more accurate clusters and pseudo-labels. These pseudo-labels are then aligned with real class labels in a self-paced fashion using a novel iterative matching technique that is based on majority consistency over high-confidence predictions. Simultaneously, a final classifier is trained to predict ground-truth labels until convergence. LiDAM achieves state-of-the-art performance on the CIFAR-100 dataset, outperforming FixMatch (73.50% vs. 71.82%) when using 2500 labels.
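The "majority consistency over high-confidence predictions" step can be sketched as a cluster-to-class vote: each feature cluster is mapped to the class the current classifier predicts most often among that cluster's confident members, and its members inherit that class as a pseudo-label. The threshold and the -1 "unmatched" convention below are illustrative choices, not the paper's settings.

```python
import numpy as np

def match_clusters_to_classes(cluster_ids, probs, threshold=0.9):
    """Majority-consistency matching of clusters to real class labels (sketch).

    cluster_ids: (N,) cluster index per unlabelled sample.
    probs: (N, C) current classifier probabilities for the same samples.
    Each cluster is mapped to the class most frequently predicted among its
    high-confidence members; samples in matched clusters get that class as a
    pseudo-label, and -1 marks clusters with no confident match yet.
    """
    pseudo = np.full(len(cluster_ids), -1, dtype=int)
    conf = probs.max(axis=1) >= threshold
    preds = probs.argmax(axis=1)
    for c in np.unique(cluster_ids):
        members = (cluster_ids == c) & conf
        if members.sum() == 0:
            continue
        votes = np.bincount(preds[members], minlength=probs.shape[1])
        pseudo[cluster_ids == c] = votes.argmax()
    return pseudo
```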
52. Training independent subnetworks for robust prediction [PDF] 返回目录
Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew M. Dai, Dustin Tran
Abstract: Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. However, these methods still require multiple forward passes for prediction, leading to a significant computational cost. In this work, we show a surprising result: the benefits of using multiple predictions can be achieved `for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand. By ensembling the predictions made by the subnetworks, we improve model robustness without increasing compute. We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods.
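A minimal PyTorch sketch of the MIMO configuration: M images are concatenated along the channel axis, the network emits M sets of logits, each slot is trained on an independent example, and at test time the same image is repeated across all slots so the averaged softmax acts as a free ensemble. The tiny convolutional backbone and the hyperparameters are placeholders, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class MIMOClassifier(nn.Module):
    """Multi-input multi-output wrapper (sketch of the MIMO idea)."""

    def __init__(self, num_classes=10, m=3, in_channels=3):
        super().__init__()
        self.m = m
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels * m, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_classes * m)     # M groups of logits

    def forward(self, images):                          # images: (B, M, C, H, W)
        b = images.shape[0]
        x = images.flatten(1, 2)                        # concatenate the M inputs channel-wise
        logits = self.head(self.backbone(x))
        return logits.view(b, self.m, -1)               # (B, M, num_classes)

# Training pairs each of the M slots with an *independent* example and its label;
# at inference the same image is repeated M times and the M predictions averaged:
#   probs = model(img.unsqueeze(1).repeat(1, m, 1, 1, 1)).softmax(-1).mean(dim=1)
```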
Note: the cover image is a word cloud of the paper titles.