Abstracts

1. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains
Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, Ren Ng
Abstract: We show that passing input points through a simple Fourier feature mapping enables a multilayer perceptron (MLP) to learn high-frequency functions in low-dimensional problem domains. These results shed light on recent advances in computer vision and graphics that achieve state-of-the-art results by using MLPs to represent complex 3D objects and scenes. Using tools from the neural tangent kernel (NTK) literature, we show that a standard MLP fails to learn high frequencies both in theory and in practice. To overcome this spectral bias, we use a Fourier feature mapping to transform the effective NTK into a stationary kernel with a tunable bandwidth. We suggest an approach for selecting problem-specific Fourier features that greatly improves the performance of MLPs for low-dimensional regression tasks relevant to the computer vision and graphics communities.
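
Illustrative sketch (not the authors' code): the abstract does not spell out the mapping, so this assumes the commonly used form in which a point v is mapped to [cos(2πBv), sin(2πBv)], with the rows of B drawn from a zero-mean Gaussian. The scale `sigma` plays the role of the tunable bandwidth. The function names and the values of `num_features` and `sigma` are hypothetical.

```python
import math
import random


def sample_frequencies(num_features, input_dim, sigma, seed=0):
    """Draw a num_features x input_dim matrix B with entries ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(input_dim)]
            for _ in range(num_features)]


def fourier_features(v, B):
    """Map a low-dimensional point v to [cos(2*pi*Bv), sin(2*pi*Bv)]."""
    projections = [2.0 * math.pi * sum(b * x for b, x in zip(row, v))
                   for row in B]
    return ([math.cos(p) for p in projections]
            + [math.sin(p) for p in projections])


# Assumed toy setting: 2-D inputs (e.g. pixel coordinates), 4 random frequencies.
B = sample_frequencies(num_features=4, input_dim=2, sigma=10.0)
features = fourier_features([0.25, 0.75], B)  # this vector would be fed to the MLP
```

Because cos(a)cos(b) + sin(a)sin(b) = cos(a - b), the dot product of two mapped points depends only on their difference, which is what makes the induced kernel stationary.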

2. Differentiable Augmentation for Data-Efficient GAN Training
Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, Song Han
Abstract: The performance of generative adversarial networks (GANs) heavily deteriorates given a limited amount of training data. This is mainly because the discriminator is memorizing the exact training set. To combat it, we propose Differentiable Augmentation (DiffAugment), a simple method that improves the data efficiency of GANs by imposing various types of differentiable augmentations on both real and fake samples. Previous attempts to directly augment the training data manipulate the distribution of real images, yielding little benefit; DiffAugment enables us to adopt the differentiable augmentation for the generated samples, effectively stabilizes training, and leads to better convergence. Experiments demonstrate consistent gains of our method over a variety of GAN architectures and loss functions for both unconditional and class-conditional generation. With DiffAugment, we achieve a state-of-the-art FID of 6.80 with an IS of 100.8 on ImageNet 128x128. Furthermore, with only 20% training data, we can match the top performance on CIFAR-10 and CIFAR-100. Finally, our method can generate high-fidelity images using only 100 images without pre-training, while being on par with existing transfer learning algorithms. Code is available at this https URL.

3. Spin-Weighted Spherical CNNs
Carlos Esteves, Ameesh Makadia, Kostas Daniilidis
Abstract: Learning equivariant representations is a promising way to reduce sample and model complexity and improve the generalization performance of deep neural networks. The spherical CNNs are successful examples, producing SO(3)-equivariant representations of spherical inputs. There are two main types of spherical CNNs. The first type lifts the inputs to functions on the rotation group SO(3) and applies convolutions on the group, which are computationally expensive since SO(3) has one extra dimension. The second type applies convolutions directly on the sphere, which are limited to zonal (isotropic) filters, and thus have limited expressivity. In this paper, we present a new type of spherical CNN that allows anisotropic filters in an efficient way, without ever leaving the spherical domain. The key idea is to consider spin-weighted spherical functions, which were introduced in physics in the study of gravitational waves. These are complex-valued functions on the sphere whose phases change upon rotation. We define a convolution between spin-weighted functions and build a CNN based on it. Experiments show that our method outperforms the isotropic spherical CNNs while still being much more efficient than using SO(3) convolutions. The spin-weighted functions can also be interpreted as spherical vector fields, allowing applications to tasks where the inputs or outputs are vector fields.

4. Diverse Image Generation via Self-Conditioned GANs
Steven Liu, Tongzhou Wang, David Bau, Jun-Yan Zhu, Antonio Torralba
Abstract: We introduce a simple but effective unsupervised method for generating realistic and diverse images. We train a class-conditional GAN model without using manually annotated class labels. Instead, our model is conditional on labels automatically derived from clustering in the discriminator's feature space. Our clustering step automatically discovers diverse modes, and explicitly requires the generator to cover them. Experiments on standard mode collapse benchmarks show that our method outperforms several competing methods when addressing mode collapse. Our method also performs well on large-scale datasets such as ImageNet and Places365, improving both image diversity and standard quality metrics, compared to previous methods.

5. Cyclic Differentiable Architecture Search
Hongyuan Yu, Houwen Peng
Abstract: Recently, differentiable architecture search has drawn great attention due to its high efficiency and competitive performance. It searches for the optimal architecture in a shallow network, and then measures its performance in a deep evaluation network. As a result, the optimization of the architecture search is independent of the target evaluation network, and the discovered architecture is sub-optimal. To address this issue, we propose a novel cyclic differentiable architecture search framework (CDARTS). Considering the structure difference, CDARTS builds a cyclic feedback mechanism between the search and evaluation networks. First, the search network generates an initial topology for evaluation, so that the weights of the evaluation network can be optimized. Second, the architecture topology in the search network is further optimized by the label supervision in classification, as well as the regularization from the evaluation network through feature distillation. Repeating the above cycle results in a joint optimization of the search and evaluation networks, and thus enables the evolution of the topology to fit the final evaluation network. The experiments and analysis on CIFAR, ImageNet and NAS-Bench-201 demonstrate the efficacy of the proposed approach.

6. Ocean: Object-aware Anchor-free Tracking
Zhipeng Zhang, Houwen Peng
Abstract: Anchor-based Siamese trackers have achieved remarkable advancements in accuracy, yet further improvement is restricted by their lagging tracking robustness. We find the underlying reason is that the regression network in anchor-based methods is only trained on the positive anchor boxes (i.e., $\mathrm{IoU} \geq 0.6$). This mechanism makes it difficult to refine the anchors whose overlap with the target objects is small. In this paper, we propose a novel object-aware anchor-free network to address this issue. First, instead of refining the reference anchor boxes, we directly predict the position and scale of target objects in an anchor-free fashion. Since each pixel in groundtruth boxes is well trained, the tracker is capable of rectifying inexact predictions of target objects during inference. Second, we introduce a feature alignment module to learn an object-aware feature from predicted bounding boxes. The object-aware feature can further contribute to the classification of target objects and background. Moreover, we present a novel tracking framework based on the anchor-free model. The experiments show that our anchor-free tracker achieves state-of-the-art performance on five benchmarks, including VOT-2018, VOT-2019, OTB-100, GOT-10k and LaSOT. The source code is available at this https URL.
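
Illustrative sketch (not the authors' code): the positive-anchor criterion mentioned in the abstract can be made concrete with a plain intersection-over-union computation. The `(x1, y1, x2, y2)` box layout, the function names, and the example boxes are assumptions made for this sketch; only the 0.6 threshold comes from the abstract.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive_anchor(anchor, target, threshold=0.6):
    """An anchor contributes to regression training only if IoU >= threshold."""
    return iou(anchor, target) >= threshold


target = (0.0, 0.0, 10.0, 10.0)
well_aligned = (1.0, 1.0, 10.0, 10.0)     # large overlap: used for training
poorly_aligned = (5.0, 5.0, 15.0, 15.0)   # small overlap: never seen by the regressor
```

Anchors like `poorly_aligned` fall below the threshold, so an anchor-based regressor gets no training signal for them. This is the gap the anchor-free formulation is meant to close.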

7. Unsupervised out-of-distribution detection using kernel density estimation
Ertunc Erdil, Krishna Chaitanya, Ender Konukoglu
Abstract: Deep neural networks have significantly advanced the state of the art in many computer vision tasks. However, the accuracy of the networks may drop drastically when test data come from a different distribution than the training data. Therefore, detecting out-of-distribution (OOD) examples in neural networks arises as a crucial problem. Although the majority of existing methods focus on OOD detection in classification networks, the problem exists for any type of network. In this paper, we propose an unsupervised OOD detection method that can work with both classification and non-classification networks by using kernel density estimation (KDE). The proposed method estimates probability density functions (pdfs) of activations at various levels of the network by performing KDE on the in-distribution dataset. At test time, the pdfs are evaluated on the test data to obtain a confidence score for each layer, which is expected to be higher for in-distribution data and lower for OOD data. The scores are combined into a final score using logistic regression. We perform experiments on 2 different classification networks trained on CIFAR-10 and CIFAR-100, and on a segmentation network trained on Pascal VOC datasets. In CIFAR-10, our method achieves better results than the other methods in 4 of 6 OOD datasets while being the second best in the remaining ones. In CIFAR-100, we obtain the best results in 2 and the second best in 3 OOD datasets. In the segmentation network, we achieve the highest scores according to most of the evaluation metrics among all other OOD detection methods. The results demonstrate that the proposed method achieves results competitive with the state of the art in classification networks and leads to improvement on the segmentation network.
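
Illustrative sketch (not the authors' code): a one-dimensional Gaussian KDE fitted on in-distribution activations and evaluated at test points. The Gaussian kernel, the bandwidth value, and the toy activation values are assumptions made for this sketch. The logistic-regression step that combines per-layer scores is not shown.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of 1-D samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


# Hypothetical activations of one layer on in-distribution training data.
in_distribution_activations = [0.9, 1.0, 1.1, 1.2, 0.8, 1.05]
pdf = gaussian_kde(in_distribution_activations, bandwidth=0.2)

score_in = pdf(1.0)   # test activation close to the training activations
score_ood = pdf(4.0)  # test activation far from anything seen in training
```

A test activation that lies far from the training activations receives a much lower density, which is the per-layer confidence signal the abstract describes.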

8. Latent Video Transformer
Ruslan Rakhimov, Denis Volkhonskiy, Alexey Artemov, Denis Zorin, Evgeny Burnaev
Abstract: The video generation task can be formulated as a prediction of future video frames given some past frames. Recent generative models for videos face the problem of high computational requirements. Some models require up to 512 Tensor Processing Units for parallel training. In this work, we address this problem via modeling the dynamics in a latent space. After the transformation of frames into the latent space, our model predicts latent representation for the next frames in an autoregressive manner. We demonstrate the performance of our approach on BAIR Robot Pushing and Kinetics-600 datasets. The approach tends to reduce requirements to 8 Graphical Processing Units for training the models while maintaining comparable generation quality.

9. Semi-Supervised Recognition under a Noisy and Fine-grained Dataset
Cheng Cui, Zhi Ye, Yangxi Li, Xinjian Li, Min Yang, Kai Wei, Bing Dai, Yanmei Zhao, Zhongji Liu, Rong Pang
Abstract: Semi-Supervised Recognition Challenge-FGVC7 is a challenging fine-grained recognition competition. One of the difficulties of this competition is how to use unlabeled data. We adopted pseudo-tag data mining to increase the amount of training data. The other is how to distinguish similar birds that differ only slightly, especially when the main body of the bird occupies a relatively small part of the image. We combined generic image recognition and fine-grained image recognition methods to solve the problem. All generic image recognition models were trained using PaddleClas. Combining the two kinds of deep recognition models, we finally won third place in the competition.

10. Dissecting Deep Networks into an Ensemble of Generative Classifiers for Robust Predictions
Lokender Tiwari, Anish Madan, Saket Anand, Subhashis Banerjee
Abstract: Deep Neural Networks (DNNs) are often criticized for being susceptible to adversarial attacks. Most successful defense strategies adopt adversarial training or random input transformations that typically require retraining or fine-tuning the model to achieve reasonable performance. In this work, our investigations of intermediate representations of a pre-trained DNN lead to an interesting discovery pointing to intrinsic robustness to adversarial attacks. We find that we can learn a generative classifier by statistically characterizing the neural response of an intermediate layer to clean training samples. The predictions of multiple such intermediate-layer based classifiers, when aggregated, show unexpected robustness to adversarial attacks. Specifically, we devise an ensemble of these generative classifiers that rank-aggregates their predictions via a Borda count-based consensus. Our proposed approach uses a subset of the clean training data and a pre-trained model, and yet is agnostic to network architectures or the adversarial attack generation method. We show extensive experiments to establish that our defense strategy achieves state-of-the-art performance on the ImageNet validation set.
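
Illustrative sketch (not the authors' code): a plain Borda count over class rankings, as one way to read the "Borda count-based consensus" in the abstract. Each intermediate-layer classifier is assumed to output a full ranking of the classes, best first. The class labels, the three example rankings, and the alphabetical tie-break are hypothetical.

```python
def borda_aggregate(rankings):
    """Combine rankings (best class first) into one consensus ranking.

    With n classes, a class in position p of a ranking earns n - 1 - p points.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] = scores.get(label, 0) + (n - 1 - position)
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical rankings from three intermediate-layer classifiers.
layer_rankings = [
    ["cat", "dog", "fox"],
    ["dog", "cat", "fox"],
    ["cat", "fox", "dog"],
]
consensus = borda_aggregate(layer_rankings)
prediction = consensus[0]
```

Because every classifier contributes a full ranking rather than a single vote, one layer whose top choice has been flipped by an attack can be outvoted by the others.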

11. Multi-Density Sketch-to-Image Translation Network
Jialu Huang, Jing Liao, Zhifeng Tan, Sam Kwong
Abstract: Sketch-to-image (S2I) translation plays an important role in image synthesis and manipulation tasks, such as photo editing and colorization. Some specific S2I translation including sketch-to-photo and sketch-to-painting can be used as powerful tools in the art design industry. However, previous methods only support S2I translation with a single level of density, which gives less flexibility to users for controlling the input sketches. In this work, we propose the first multi-level density sketch-to-image translation framework, which allows the input sketch to cover a wide range from rough object outlines to micro structures. Moreover, to tackle the problem of noncontinuous representation of multi-level density input sketches, we project the density level into a continuous latent space, which can then be linearly controlled by a parameter. This allows users to conveniently control the densities of input sketches and generation of images. Moreover, our method has been successfully verified on various datasets for different applications including face editing, multi-modal sketch-to-photo translation, and anime colorization, providing coarse-to-fine levels of controls to these applications.

12. Online Deep Clustering for Unsupervised Representation Learning
Xiaohang Zhan, Jiahao Xie, Ziwei Liu, Yew Soon Ong, Chen Change Loy
Abstract: Joint clustering and feature learning methods have shown remarkable performance in unsupervised representation learning. However, a training schedule that alternates between feature clustering and network parameter updates leads to unstable learning of visual representations. To overcome this challenge, we propose Online Deep Clustering (ODC), which performs clustering and network updates simultaneously rather than alternately. Our key insight is that the cluster centroids should evolve steadily to keep the classifier stably updated. Specifically, we design and maintain two dynamic memory modules, i.e., a samples memory to store sample labels and features, and a centroids memory for centroid evolution. We break down the abrupt global clustering into steady memory updates and batch-wise label re-assignment. The process is integrated into network update iterations. In this way, labels and the network evolve shoulder-to-shoulder rather than alternately. Extensive experiments demonstrate that ODC stabilizes the training process and boosts the performance effectively. Code: this https URL.

13. Use of in-the-wild images for anomaly detection in face anti-spoofing
Latifah Abduh, Ioannis Ivrissimtzis
Abstract: The traditional approach to face anti-spoofing sees it as a binary classification problem, and binary classifiers are trained and validated on specialized anti-spoofing databases. One of the drawbacks of this approach is that, due to the variability of face spoofing attacks, environmental factors, and the typically small sample size, such classifiers do not generalize well to previously unseen databases. Anomaly detection, which approaches face anti-spoofing as a one-class classification problem, is emerging as an increasingly popular alternative approach. Nevertheless, in all existing work on anomaly detection for face anti-spoofing, the proposed training protocols utilize images from specialized anti-spoofing databases only, even though only common images of real faces are needed. Here, we explore the use of in-the-wild images, and images from non-specialized face databases, to train one-class classifiers for face anti-spoofing. Employing a well-established technique, we train a convolutional autoencoder on real faces and compare the reconstruction error of the input against a threshold to classify a face image accordingly as either client or imposter. Our results show that the inclusion in the training set of in-the-wild images increases the discriminating power of the classifier significantly on an unseen database, as evidenced by a large increase in the value of the Area Under the Curve. In a limitation of our approach, we note that the problem of finding a suitable operating point on the unseen database remains a challenge, as evidenced by the values of the Half Total Error Rate.
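
Illustrative sketch (not the authors' code): thresholding a reconstruction error and scoring the result with the Half Total Error Rate, computed as the mean of the false acceptance and false rejection rates. The autoencoder is not modelled here. The mean-squared-error measure, the error values, and the threshold are hypothetical, and low error is assumed to indicate a real face because the autoencoder is trained on real faces only.

```python
def reconstruction_error(image, reconstruction):
    """Mean squared error between a flattened input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(image, reconstruction)) / len(image)


def classify(error, threshold):
    """Low error: looks like the real faces seen in training, so 'client'."""
    return "client" if error <= threshold else "imposter"


def half_total_error_rate(errors, labels, threshold):
    """HTER = (false acceptance rate + false rejection rate) / 2."""
    predictions = [classify(e, threshold) for e in errors]
    on_imposters = [p for p, y in zip(predictions, labels) if y == "imposter"]
    on_clients = [p for p, y in zip(predictions, labels) if y == "client"]
    far = on_imposters.count("client") / len(on_imposters)
    frr = on_clients.count("imposter") / len(on_clients)
    return (far + frr) / 2.0


# Hypothetical per-image reconstruction errors with ground-truth labels.
errors = [0.01, 0.02, 0.08, 0.03, 0.09, 0.12]
labels = ["client", "client", "client", "imposter", "imposter", "imposter"]
hter = half_total_error_rate(errors, labels, threshold=0.05)
```

The HTER depends on the chosen threshold, whereas the Area Under the Curve summarizes all thresholds. A classifier can therefore separate the classes well on an unseen database and still have a poor HTER if its threshold was tuned on other data.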

14. SatImNet: Structured and Harmonised Training Data for Enhanced Satellite Imagery Classification
Vasileios Syrris, Ondrej Pesek, Pierre Soille
Abstract: Automatic supervised classification of satellite images with complex modelling such as deep neural networks requires the availability of representative training datasets. While there exists a plethora of datasets that can be used for this purpose, they are usually very heterogeneous and not interoperable. This prevents the combination of two or more training datasets for improving image classification tasks based on machine learning. To alleviate these problems, we propose a methodology for structuring and harmonising open training datasets on the basis of a series of fundamental attributes we put forward for any such dataset. By applying this methodology to seven representative open training datasets, we generate a harmonised collection called SatImNet. Its usefulness is demonstrated for enhanced satellite image classification and segmentation based on convolutional neural networks. Data and open source code are provided to ensure the reproducibility of all obtained results and facilitate the ingestion of additional datasets in SatImNet.

15. Neural Graphics Pipeline for Controllable Image Generation
Xuelin Chen, Daniel Cohen-Or, Baoquan Chen, Niloy J. Mitra
Abstract: We present Neural Graphics Pipeline (NGP), a hybrid generative model that brings together neural and traditional image formation models. NGP generates coarse 3D models that are fed into neural rendering modules to produce view-specific interpretable 2D maps, which are then composited into the final output image using a traditional image formation model. Our approach offers control over image generation by providing direct handles controlling illumination and camera parameters, in addition to control over shape and appearance variations. The key challenge is to learn these controls through unsupervised training that links generated coarse 3D models with unpaired real images via neural and traditional (e.g., Blinn-Phong) rendering functions without establishing an explicit correspondence between them. We evaluate our hybrid modeling framework, compare with neural-only generation methods (namely, DCGAN, LSGAN, WGAN-GP, VON, and SRNs), report improvement in FID scores against real images, and demonstrate that NGP supports direct controls common in traditional forward rendering. Code, data, and trained models will be released on acceptance.

16. MOSQUITO-NET: A deep learning based CADx system for malaria diagnosis along with model interpretation using GradCam and class activation maps
Aayush Kumar, Sanat B Singh, Suresh Chandra Satapathy, Minakhi Rout
Abstract: Malaria is considered one of the deadliest diseases in today's world, causing thousands of deaths per year. The parasites responsible for malaria are scientifically known as Plasmodium, which infects the red blood cells in human beings. The parasites are transmitted by female mosquitoes known as Anopheles. The diagnosis of malaria requires identification and manual counting of parasitized cells by medical practitioners in microscopic blood smears. Because resources are scarce, diagnostic accuracy suffers in large-scale screening. State-of-the-art computer-aided diagnostic techniques based on deep learning algorithms such as CNNs, with end-to-end feature extraction and classification, have contributed widely to various image recognition tasks. In this paper, we evaluate the performance of Mosquito-Net, a custom-made convnet, in classifying infected and uninfected cells for malaria diagnosis; it could be deployed on edge and mobile devices owing to its small number of parameters and low computational cost. Therefore, it can be widely preferred for diagnosis in remote and rural areas that lack medical facilities.

17. Contrastive learning of global and local features for medical image segmentation with limited annotations
Krishna Chaitanya, Ertunc Erdil, Neerav Karani, Ender Konukoglu
Abstract: A key requirement for the success of supervised deep learning is a large labeled dataset - a condition that is difficult to meet in medical image analysis. Self-supervised learning (SSL) can help in this regard by providing a strategy to pre-train a neural network with unlabeled data, followed by fine-tuning for a downstream task with limited annotations. Contrastive learning, a particular variant of SSL, is a powerful technique for learning image-level representations. In this work, we propose strategies for extending the contrastive learning framework for segmentation of volumetric medical images in the semi-supervised setting with limited annotations, by leveraging domain-specific and problem-specific cues. Specifically, we propose (1) novel contrasting strategies that leverage structural similarity across volumetric medical images (domain-specific cue) and (2) a local version of the contrastive loss to learn distinctive representations of local regions that are useful for per-pixel segmentation (problem-specific cue). We carry out an extensive evaluation on three Magnetic Resonance Imaging (MRI) datasets. In the limited annotation setting, the proposed method yields substantial improvements compared to other self-supervision and semi-supervised learning techniques. When combined with a simple data augmentation technique, the proposed method reaches within 8% of benchmark performance using only two labeled MRI volumes for training, corresponding to only 4% (for ACDC) of the training data used to train the benchmark.
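
Illustrative sketch (not the authors' code): a generic InfoNCE-style contrastive loss for one anchor, one positive, and a set of negatives. The abstract does not give the exact loss, so the cosine similarity, the temperature of 0.1, and the toy 2-D embeddings are assumptions. In the local variant the abstract describes, the vectors would presumably be features of local regions rather than whole images.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Negative log-softmax of the positive's similarity to the anchor."""
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(math.exp(cosine_similarity(anchor, n) / temperature)
              for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = contrastive_loss(anchor, [0.9, 0.1], negatives)     # positive near anchor
loss_misaligned = contrastive_loss(anchor, [0.0, 1.0], negatives)  # positive far from anchor
```

Minimizing this loss pulls the positive toward the anchor and pushes the negatives away, so `loss_aligned` is smaller than `loss_misaligned`.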

18. Distillation of neural network models for detection and description of key points of images
A.V. Yashchenko, A.V. Belikov, M.V. Peterson, A.S. Potapov
Abstract: Image matching and classification methods, as well as simultaneous localization and mapping, are widely used on embedded and mobile devices. Their most resource-intensive part is the detection and description of the key points of the images. While the classical methods of detecting and describing key points can be executed in real time on mobile devices, such use is difficult for modern neural network methods, which give the best quality. Thus, it is important to increase the speed of neural network models for the detection and description of key points. The subject of the research is distillation as one of the methods for reducing neural network models. The aim of the study is to obtain a more compact model for the detection and description of key points, as well as a description of the procedure for obtaining this model. A method for the distillation of neural networks for the task of detecting and describing key points was tested. The objective function and training parameters that provide the best results within the study are proposed. A new data set for testing key point detection methods is introduced, along with a new quality indicator for the extracted key points and their corresponding local features. As a result of training in the described way, the new model, with the same number of parameters, showed greater accuracy in matching key points than the original model. A new model with a significantly smaller number of parameters shows point-matching accuracy close to that of the original model.

19. ReenactNet: Real-time Full Head Reenactment
Mohammad Rami Koujan, Michail Christos Doukas, Anastasios Roussos, Stefanos Zafeiriou
Abstract: Video-to-video synthesis is a challenging problem aiming at learning a translation function between a sequence of semantic maps and a photo-realistic video depicting the characteristics of a driving video. We propose a head-to-head system of our own implementation capable of fully transferring the human head 3D pose, facial expressions and eye gaze from a source to a target actor, while preserving the identity of the target actor. Our system produces high-fidelity, temporally-smooth and photo-realistic synthetic videos faithfully transferring the human time-varying head attributes from the source to the target actor. Our proposed implementation: 1) works in real time ($\sim 20$ fps), 2) runs on a commodity laptop with a webcam as the only input, 3) is interactive, allowing the participant to drive a target person, e.g., a celebrity or politician, instantly by varying their expressions, head pose, and eye gaze, while visualising the synthesised video concurrently.

20. Real-Time Monocular 4D Face Reconstruction using the LSFM models
Mohammad Rami Koujan, Nikolai Dochev, Anastasios Roussos
Abstract: 4D face reconstruction from a single camera is a challenging task, especially when it is required to be performed in real time. We demonstrate a system of our own implementation that solves this task accurately and runs in real time on a commodity laptop, using a webcam as the only input. Our system is interactive, allowing the user to freely move their head and show various expressions while standing in front of the camera. As a result, the proposed system both reconstructs and visualises the identity of the subject in the correct pose along with the acted facial expressions in real time. The 4D reconstruction in our framework is based on the recently-released Large-Scale Facial Models (LSFM), which are the largest-scale 3D Morphable Models of facial shapes ever constructed, based on a dataset of more than 10,000 facial identities from a wide range of gender, age and ethnicity combinations. This is the first real-time demo that gives users the opportunity to test in practice the capabilities of the recently-released Large-Scale Facial Models (LSFM).

21. Language Guided Networks for Cross-modal Moment Retrieval
Kun Liu, Xun Yang, Tat-seng Chua, Huadong Ma, Chuang Gan
Abstract: We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment from an untrimmed video described by a natural language query. It poses great challenges over the proper semantic alignment between vision and linguistic domains. Most existing methods only leverage sentences in the multi-modal fusion stage and independently extract the features of videos and sentences, which does not make full use of the potential of language. In this paper, we present Language Guided Networks (LGN), a new framework that tightly integrates cross-modal features in multiple stages. In the first feature extraction stage, we introduce an early modulation unit to capture the discriminative visual features which can cover the complex semantics in the sentence query. Specifically, the early modulation unit is designed to modulate convolutional feature maps by a linguistic embedding. Then we adopt a multi-modal fusion module in the second fusion stage. Finally, to get a precise localizer, the sentence information is utilized to guide the process of predicting temporal positions. Specifically, the late guidance module is developed to further bridge vision and language domain via the channel attention mechanism. We evaluate the proposed model on two popular public datasets: Charades-STA and TACoS. The experimental results demonstrate the superior performance of our proposed modules on moment retrieval (improving 5.8% in terms of R1@IoU5 on Charades-STA and 5.2% on TACoS). We include the code in the supplementary material and will make it publicly available.

22. Deep Multitask Learning for Pervasive BMI Estimation and Identity Recognition in Smart Beds
Vandad Davoodnia, Monet Slinowsky, Ali Etemad
Abstract: Smart devices in the Internet of Things (IoT) paradigm provide a variety of unobtrusive and pervasive means for continuous monitoring of bio-metrics and health information. Furthermore, automated personalization and authentication through such smart systems can enable better user experience and security. In this paper, simultaneous estimation and monitoring of body mass index (BMI) and user identity recognition through a unified machine learning framework using smart beds is explored. To this end, we utilize pressure data collected from textile-based sensor arrays integrated onto a mattress to estimate the BMI values of subjects and classify their identities in different positions by using a deep multitask neural network. First, we filter and extract 14 features from the data and subsequently employ deep neural networks for BMI estimation and subject identification on two different public datasets. Finally, we demonstrate that our proposed solution outperforms prior works and several machine learning benchmarks by a considerable margin, while also estimating users' BMI in a 10-fold cross-validation scheme.

23. Learning High-Resolution Domain-Specific Representations with a GAN Generator
Danil Galeev, Konstantin Sofiiuk, Danila Rukhovich, Mikhail Romanov, Olga Barinova, Anton Konushin
Abstract: In recent years generative models of visual data have made great progress, and now they are able to produce images of high quality and diversity. In this work we study representations learnt by a GAN generator. First, we show that these representations can be easily projected onto a semantic segmentation map using a lightweight decoder. We find that such semantic projection can be learnt from just a few annotated images. Based on this finding, we propose the LayerMatch scheme for approximating the representation of a GAN generator that can be used for unsupervised domain-specific pretraining. We consider the semi-supervised learning scenario when a small amount of labeled data is available along with a large unlabeled dataset from the same domain. We find that the use of a LayerMatch-pretrained backbone leads to superior accuracy compared to standard supervised pretraining on ImageNet. Moreover, this simple approach also outperforms recent semi-supervised semantic segmentation methods that use both labeled and unlabeled data during training. Source code for reproducing our experiments will be available at the time of publication.

24. Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax
Yu Li, Tao Wang, Bingyi Kang, Sheng Tang, Chunfeng Wang, Jintao Li, Jiashi Feng
Abstract: Solving long-tail large vocabulary object detection with deep learning based models is a challenging and demanding task, which is however under-explored. In this work, we provide the first systematic analysis on the underperformance of state-of-the-art models in front of long-tail distribution. We find existing detection methods are unable to model few-shot classes when the dataset is extremely skewed, which can result in classifier imbalance in terms of parameter magnitude. Directly adapting long-tail classification models to detection frameworks cannot solve this problem due to the intrinsic difference between detection and classification. In this work, we propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training. It implicitly modulates the training process for the head and tail classes and ensures they are both sufficiently trained, without requiring any extra sampling for the instances from the tail classes. Extensive experiments on the very recent long-tail large vocabulary object recognition benchmark LVIS show that our proposed BAGS significantly improves the performance of detectors with various backbones and frameworks on both object detection and instance segmentation. It beats all state-of-the-art methods transferred from long-tail image classification and establishes a new state of the art. Code is available at this https URL.
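The group-wise idea in the abstract above can be illustrated with a small sketch. It is not the authors' BAGS module: it only assumes that classes are partitioned by training-instance count and that softmax is applied inside each group, so head and tail logits never compete in the same normalisation. The group boundaries, instance counts and function names are hypothetical.

```python
import math

def assign_groups(instance_counts, boundaries):
    """Map each class index to a group according to its training-instance count.

    boundaries is an ascending list of thresholds (hypothetical values below).
    """
    groups = [[] for _ in range(len(boundaries) + 1)]
    for cls, count in enumerate(instance_counts):
        group_index = sum(count >= b for b in boundaries)
        groups[group_index].append(cls)
    return groups

def group_softmax(logits, groups):
    """Apply a numerically stable softmax separately inside each group."""
    probs = [0.0] * len(logits)
    for members in groups:
        if not members:
            continue
        peak = max(logits[c] for c in members)
        exps = {c: math.exp(logits[c] - peak) for c in members}
        total = sum(exps.values())
        for c in members:
            probs[c] = exps[c] / total
    return probs

# Hypothetical long-tail class statistics: two head classes, three rare ones.
counts = [5000, 3000, 40, 8, 3]
groups = assign_groups(counts, boundaries=[10, 100])
probs = group_softmax([5.0, 4.0, 1.0, 0.5, 0.5], groups)
```

Because each group is normalised on its own, the two rarest classes here share probability mass only with each other, regardless of how large the head-class logits are.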

25. Fourth-Order Anisotropic Diffusion for Inpainting and Image Compression
Ikram Jumakulyyev, Thomas Schultz
Abstract: Edge-enhancing diffusion (EED) can reconstruct a close approximation of an original image from a small subset of its pixels. This makes it an attractive foundation for PDE based image compression. In this work, we generalize second-order EED to a fourth-order counterpart. It involves a fourth-order diffusion tensor that is constructed from the regularized image gradient in a similar way as in traditional second-order EED, permitting diffusion along edges, while applying a non-linear diffusivity function across them. We show that our fourth-order diffusion tensor formalism provides a unifying framework for all previous anisotropic fourth-order diffusion based methods, and that it provides additional flexibility. We achieve an efficient implementation using a fast semi-iterative scheme. Experimental results on natural and medical images suggest that our novel fourth-order method produces more accurate reconstructions compared to the existing second-order EED.

26. SceneAdapt: Scene-based domain adaptation for semantic segmentation using adversarial learning
Daniele Di Mauro, Antonino Furnari, Giuseppe Patanè, Sebastiano Battiato, Giovanni Maria Farinella
Abstract: Semantic segmentation methods have achieved outstanding performance thanks to deep learning. Nevertheless, when such algorithms are deployed to new contexts not seen during training, it is necessary to collect and label scene-specific data in order to adapt them to the new domain using fine-tuning. This process is required whenever an already installed camera is moved or a new camera is introduced in a camera network due to the different scene layouts induced by the different viewpoints. To limit the amount of additional training data to be collected, it would be ideal to train a semantic segmentation method using labeled data already available and only unlabeled data coming from the new camera. We formalize this problem as a domain adaptation task and introduce a novel dataset of urban scenes with the related semantic labels. As a first approach to address this challenging task, we propose SceneAdapt, a method for scene adaptation of semantic segmentation algorithms based on adversarial learning. Experiments and comparisons with state-of-the-art approaches to domain adaptation highlight that promising performance can be achieved using adversarial learning both when the two scenes have different but points of view, and when they comprise images of completely different scenes. To encourage research on this topic, we made our code available at our web page: this https URL.

27. 3D Pipe Network Reconstruction Based on Structure from Motion with Incremental Conic Shape Detection and Cylindrical Constraint
Sho Kagami, Hajime Taira, Naoyuki Miyashita, Akihiko Torii, Masatoshi Okutomi
Abstract: Pipe inspection is a critical task for many industries and infrastructure of a city. The 3D information of a pipe can be used for revealing the deformation of the pipe surface and position of the camera during the inspection. In this paper, we propose a 3D pipe reconstruction system using sequential images captured by a monocular endoscopic camera. Our work extends a state-of-the-art incremental Structure-from-Motion (SfM) method to incorporate prior constraints given by the target shape into bundle adjustment (BA). Using this constraint, we can minimize the scale-drift that is the general problem in SfM. Moreover, our method can reconstruct a pipe network composed of multiple parts including straight pipes, elbows, and tees. In the experiments, we show that the proposed system enables more accurate and robust pipe mapping from a monocular camera in comparison with existing state-of-the-art methods.

28. Video Semantic Segmentation with Distortion-Aware Feature Correction
Jiafan Zhuang, Zilei Wang, Bingke Wang
Abstract: Video semantic segmentation has been active in recent years, benefiting from the great progress of image semantic segmentation. For such a task, per-frame image segmentation is generally unacceptable in practice due to its high computation cost. To tackle this issue, many works use flow-based feature propagation to reuse the features of previous frames. However, optical flow estimation is inevitably inaccurate, which distorts the propagated features. In this paper, we propose distortion-aware feature correction to alleviate the issue, which improves video segmentation performance by correcting distorted propagated features. To be specific, we first propose to transfer distortion patterns from feature space into image space and conduct effective distortion map prediction. Guided by the distortion maps, we propose a Feature Correction Module (FCM) to rectify propagated features in the distorted areas. Our proposed method can significantly boost the accuracy of video semantic segmentation at a low cost. The extensive experimental results on Cityscapes and CamVid show that our method outperforms the recent state-of-the-art methods.

29. On the Robustness of Active Learning
Lukas Hahn, Lutz Roese-Koerner, Peet Cremer, Urs Zimmermann, Ori Maoz, Anton Kummert
Abstract: Active Learning is concerned with the question of how to identify the most useful samples for a Machine Learning algorithm to be trained with. When applied correctly, it can be a very powerful tool to counteract the immense data requirements of Artificial Neural Networks. However, we find that it is often applied with not enough care and domain knowledge. As a consequence, unrealistic hopes are raised and transfer of the experimental results from one dataset to another becomes unnecessarily hard. In this work we analyse the robustness of different Active Learning methods with respect to classifier capacity, exchangeability and type, as well as hyperparameters and falsely labelled data. Experiments reveal possible biases towards the architecture used for sample selection, resulting in suboptimal performance for other classifiers. We further propose the new "Sum of Squared Logits" method based on the Simpson diversity index and investigate the effect of using the confusion matrix for balancing in sample selection.
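The abstract above does not define the "Sum of Squared Logits" method, so the sketch below only illustrates the quantity it is said to build on: the Simpson diversity index, computed here on softmax class probabilities and used as a hypothetical acquisition score (a low index means the probability mass is spread over many classes). The pool values and function names are illustrative assumptions, not the paper's procedure.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def simpson_index(probs):
    """Simpson index: probability that two independent draws fall in the same class."""
    return sum(p * p for p in probs)

def select_most_uncertain(pool_logits, k):
    """Return indices of the k pool samples with the lowest Simpson index."""
    scores = [simpson_index(softmax(z)) for z in pool_logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:k]

# Hypothetical unlabelled pool: one confident sample and two ambiguous ones.
pool = [
    [4.0, 0.0, 0.0],  # confident prediction, index close to 1
    [1.0, 1.0, 1.0],  # uniform prediction, index equals 1/3
    [2.0, 1.9, 0.0],  # two-way ambiguity
]
picked = select_most_uncertain(pool, 2)
```

For a uniform prediction over C classes the index reaches its minimum of 1/C, which is why the uniform sample is selected first.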

30. Automated Radiological Report Generation For Chest X-Rays With Weakly-Supervised End-to-End Deep Learning
Shuai Zhang, Xiaoyan Xin, Yang Wang, Yachong Guo, Qiuqiao Hao, Xianfeng Yang, Jun Wang, Jian Zhang, Bing Zhang, Wei Wang
Abstract: The chest X-Ray (CXR) is one of the most common clinical exams used to diagnose thoracic diseases and abnormalities. The volume of CXR scans generated daily in hospitals is huge. Therefore, an automated diagnosis system able to save the effort of doctors is of great value. At present, the applications of artificial intelligence in CXR diagnosis usually use pattern recognition to classify the scans. However, such methods rely on labeled databases, which are costly and usually have large error rates. In this work, we built a database containing more than 12,000 CXR scans and radiological reports, and developed a model based on a deep convolutional neural network and a recurrent network with an attention mechanism. The model learns features from the CXR scans and the associated raw radiological reports directly; no additional labeling of the scans is needed. The model provides automated recognition of given scans and generation of reports. The quality of the generated reports was evaluated both with CIDEr scores and by radiologists. The CIDEr scores are found to be around 5.8 on average for the testing dataset. Further blind evaluation suggested performance comparable to that of human radiologists.

31. Cascaded Regression Tracking: Towards Online Hard Distractor Discrimination
Ning Wang, Wengang Zhou, Qi Tian, Houqiang Li
Abstract: Visual tracking can be easily disturbed by similar surrounding objects. Such hard distractors, even though they are a minority among negative samples, increase the risk of target drift and model corruption, and deserve additional attention in online tracking and model update. To enhance the tracking robustness, in this paper, we propose a cascaded regression tracker with two sequential stages. In the first stage, we filter out abundant easily-identified negative candidates via an efficient convolutional regression. In the second stage, a discrete sampling based ridge regression is designed to double-check the remaining ambiguous hard samples, which serves as an alternative to fully-connected layers and benefits from the closed-form solver for efficient learning. Extensive experiments are conducted on 11 challenging tracking benchmarks including OTB-2013, OTB-2015, VOT2018, VOT2019, UAV123, Temple-Color, NfS, TrackingNet, LaSOT, UAV20L, and OxUvA. The proposed method achieves state-of-the-art performance on prevalent benchmarks, while running at real-time speed.
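The closed-form solver mentioned in the abstract above refers to the standard ridge regression solution w = (XᵀX + λI)⁻¹Xᵀy. The sketch below shows that generic formula in plain Python; it is not the tracker's discrete-sampling variant, and the candidate features, labels and regularisation weight are made-up illustrative values.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ridge_fit(features, targets, lam):
    """Closed-form ridge weights: (X^T X + lam * I)^(-1) X^T y."""
    d = len(features[0])
    gram = [[sum(x[i] * x[j] for x in features) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(x[i] * y for x, y in zip(features, targets)) for i in range(d)]
    return solve_linear(gram, rhs)

def ridge_score(weights, x):
    """Regression response for one candidate feature vector."""
    return sum(w * v for w, v in zip(weights, x))

# Hypothetical candidates: label 1.0 for the target, 0.0 for distractors.
X = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
y = [1.0, 1.0, 0.0, 0.0]
w = ridge_fit(X, y, lam=0.1)
```

Since the weights come from a single linear solve rather than iterative gradient steps, refitting on a handful of ambiguous candidates is cheap, which is the property the abstract appeals to.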

32. Joint Contrastive Learning for Unsupervised Domain Adaptation
Changhwa Park, Jonghyun Lee, Jaeyoon Yoo, Minhoe Hur, Sungroh Yoon
Abstract: Enhancing feature transferability by matching marginal distributions has led to improvements in domain adaptation, although this is at the expense of feature discrimination. In particular, the ideal joint hypothesis error in the target error upper bound, which was previously considered to be minute, has been found to be significant, impairing its theoretical guarantee. In this paper, we propose an alternative upper bound on the target error that explicitly considers the joint error to render it more manageable. With the theoretical analysis, we suggest a joint optimization framework that combines the source and target domains. Further, we introduce Joint Contrastive Learning (JCL) to find class-level discriminative features, which is essential for minimizing the joint error. With a solid theoretical framework, JCL employs contrastive loss to maximize the mutual information between a feature and its label, which is equivalent to maximizing the Jensen-Shannon divergence between conditional distributions. Experiments on two real-world datasets demonstrate that JCL outperforms the state-of-the-art methods.

33. Video Moment Localization using Object Evidence and Reverse Captioning
Madhawa Vidanapathirana, Supriya Pandhre, Sonia Raychaudhuri, Anjali Khurana
Abstract: We address the problem of language-based temporal localization of moments in untrimmed videos. Compared to temporal localization with fixed categories, this problem is more challenging as the language-based queries have no predefined activity classes and may also contain complex descriptions. The current state-of-the-art model, MAC, addresses it by mining activity concepts from both video and language modalities. This method encodes the semantic activity concepts from the verb/object pair in a language query and leverages visual activity concepts from video activity classification prediction scores. We propose "Multi-faceted VideoMoment Localizer" (MML), an extension of the MAC model that introduces visual object evidence via object segmentation masks and video understanding features via video captioning. Furthermore, we improve language modelling in sentence embedding. We experimented on the Charades-STA dataset and found that MML outperforms the MAC baseline by 4.93% and 1.70% on the R@1 and R@5 metrics respectively. Our code and pre-trained model are publicly available at this https URL.

34. Progressively Unfreezing Perceptual GAN
Jinxuan Sun, Yang Chen, Junyu Dong, Guoqiang Zhong
Abstract: Generative adversarial networks (GANs) are widely used in image generation tasks, yet the generated images usually lack texture details. In this paper, we propose a general framework, called Progressively Unfreezing Perceptual GAN (PUPGAN), which can generate images with fine texture details. In particular, we propose an adaptive perceptual discriminator with a pre-trained perceptual feature extractor, which can efficiently measure the discrepancy between multi-level features of the generated and real images. In addition, we propose a progressively unfreezing scheme for the adaptive perceptual discriminator, which ensures a smooth transfer process from a large scale classification task to a specified image generation task. Qualitative and quantitative experiments against classical baselines on three image generation tasks, i.e. single image super-resolution, paired image-to-image translation and unpaired image-to-image translation, demonstrate the superiority of PUPGAN over the compared approaches.

35. Sequential Graph Convolutional Network for Active Learning
Razvan Caramalau, Binod Bhattarai, Tae-Kyun Kim
Abstract: We propose a novel generic sequential Graph Convolution Network (GCN) training for Active Learning. Each of the unlabelled and labelled examples is represented through a pre-trained learner as nodes of a graph and their similarities as edges. With the available few labelled examples as seed annotations, the parameters of the Graphs are optimised to minimise the binary cross-entropy loss to identify labelled vs unlabelled. Based on the confidence score of the nodes in the graph we sub-sample unlabelled examples to annotate where inherited uncertainties correlate. With the newly annotated examples along with the existing ones, the parameters of the graph are optimised to minimise the modified objective. We evaluated our method on four publicly available image classification benchmarks. Our method outperforms several competitive baselines and existing arts. The implementations of this paper can be found here: this https URL

36. MediaPipe Hands: On-device Real-time Hand Tracking
Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, Matthias Grundmann
Abstract: We present a real-time on-device hand tracking pipeline that predicts hand skeleton from single RGB camera for AR/VR applications. The pipeline consists of two models: 1) a palm detector, 2) a hand landmark model. It's implemented via MediaPipe, a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrates real-time inference speed on mobile GPUs and high prediction quality. MediaPipe Hands is open sourced at https://mediapipe.dev.

37. UV-Net: Learning from Curve-Networks and Solids
Pradeep Kumar Jayaraman, Aditya Sanghi, Joseph Lambourne, Thomas Davies, Hooman Shayani, Nigel Morris
Abstract: Parametric curves, surfaces and boundary representations are the basis for 2D vector graphics and 3D industrial designs. Despite their prevalence, there exists limited research on applying modern deep neural networks directly to such representations. The unique challenges in working with such representations arise from the combination of continuous non-Euclidean geometry domain and discrete topology, as well as a lack of labeled datasets, benchmarks and baseline models. In this paper, we propose a unified representation for parametric curve-networks and solids by exploiting the u- and uv-parameter domains of curve and surfaces, respectively, to model the geometry, and an adjacency graph to explicitly model the topology. This leads to a unique and efficient network architecture based on coupled image and graph convolutional neural networks to extract features from curve-networks and solids. Inspired by the MNIST image dataset, we create and publish WireMNIST (for 2D curve-networks) and SolidMNIST (for 3D solids), two related labeled datasets depicting alphabets to encourage future research in this area. We demonstrate the effectiveness of our method using supervised and self-supervised tasks on our new datasets, as well as the publicly available ABC dataset. The results demonstrate the effectiveness of our representation and provide a competitive baseline for learning tasks involving curve-networks and solids.

38. BlazePose: On-device Real-time Body Pose tracking
Valentin Bazarevsky, Ivan Grishchenko, Karthik Raveendran, Tyler Zhu, Fan Zhang, Matthias Grundmann
Abstract: We present BlazePose, a lightweight convolutional neural network architecture for human pose estimation that is tailored for real-time inference on mobile devices. During inference, the network produces 33 body keypoints for a single person and runs at over 30 frames per second on a Pixel 2 phone. This makes it particularly suited to real-time use cases like fitness tracking and sign language recognition. Our main contributions include a novel body pose tracking solution and a lightweight body pose estimation neural network that uses both heatmaps and regression to keypoint coordinates.

39. HyNet: Local Descriptor with Hybrid Similarity Measure and Triplet Loss
Yurun Tian, Axel Barroso-Laguna, Tony Ng, Vassileios Balntas, Krystian Mikolajczyk
Abstract: Recent works show that local descriptor learning benefits from the use of L2 normalisation; however, an in-depth analysis of this effect is lacking in the literature. In this paper, we investigate how L2 normalisation affects the back-propagated descriptor gradients during training. Based on our observations, we propose HyNet, a new local descriptor that leads to state-of-the-art results in matching. HyNet introduces a hybrid similarity measure for triplet margin loss, a regularisation term constraining the descriptor norm, and a new network architecture that performs L2 normalisation of all intermediate feature maps and the output descriptors. HyNet surpasses previous methods by a significant margin on standard benchmarks that include patch matching, verification, and retrieval, as well as outperforming full end-to-end methods on 3D reconstruction tasks.
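As background for the terms used in the abstract above, the sketch below shows the standard triplet margin loss on L2-normalised descriptors with Euclidean distance. It is a baseline formulation only: HyNet's hybrid similarity measure and norm regulariser are not specified in the abstract and are not reproduced here. The descriptor values and the margin are illustrative assumptions.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a descriptor to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pushing d(anchor, positive) below d(anchor, negative) by a margin."""
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, margin + euclidean(a, p) - euclidean(a, n))

# Hypothetical 2-D descriptors: the positive shares the anchor's direction.
easy = triplet_margin_loss([3.0, 4.0], [6.0, 8.0], [-3.0, -4.0])
hard = triplet_margin_loss([3.0, 4.0], [6.0, 8.0], [6.0, 8.0])
```

After normalisation only the direction of a descriptor matters, so the distance between two descriptors is bounded by 2 and the margin has a fixed scale.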

40. Head2Head++: Deep Facial Attributes Re-Targeting
Michail Christos Doukas, Mohammad Rami Koujan, Viktoriia Sharmanska, Anastasios Roussos
Abstract: Facial video re-targeting is a challenging problem aiming to modify the facial attributes of a target subject in a seamless manner by a driving monocular sequence. We leverage the 3D geometry of faces and Generative Adversarial Networks (GANs) to design a novel deep learning architecture for the task of facial and head reenactment. Our method is different to purely 3D model-based approaches, or recent image-based methods that use Deep Convolutional Neural Networks (DCNNs) to generate individual frames. We manage to capture the complex non-rigid facial motion from the driving monocular performances and synthesise temporally consistent videos, with the aid of a sequential Generator and an ad-hoc Dynamics Discriminator network. We conduct a comprehensive set of quantitative and qualitative tests and demonstrate experimentally that our proposed method can successfully transfer facial expressions, head pose and eye gaze from a source video to a target subject, in a photo-realistic and faithful fashion, better than other state-of-the-art methods. Most importantly, our system performs end-to-end reenactment in nearly real-time speed (18 fps).
摘要：面部视频重新定位是瞄准由驱动单眼序列修改以无缝方式的目标对象的脸部属性的具有挑战性的问题。我们利用的面孔，生殖对抗式网络（甘斯）的三维几何设计的面部和头部重演的任务，一个新的深度学习建筑。我们的方法是使用深卷积神经网络（DCNNs）生成单独的帧纯粹基于3D模型的方法，或者最近基于图像的方法不同。我们设法捕获驱动单眼表演和合成时间一致的视频的复杂的非刚性面部运动，以连续生成的援助和一个特设的动态鉴别网络。我们进行了全面的定量和定性测试和演示实验，我们提出的方法能成功传输的面部表情，头部姿态和眼睛从视频源到目标对象的目光，在照片般逼真的和忠实的方式，比其他国家-of最先进的方法。最重要的是，我们的系统进行最后到终端重演近实时速度（18个FPS）。

41. TearingNet: Point Cloud Autoencoder to Learn Topology-Friendly Representations [PDF]
Jiahao Pang, Duanshun Li, Dong Tian
Abstract: Topology matters. Despite the recent success of point cloud processing with geometric deep learning, it remains arduous to capture the complex topologies of point cloud data with a learning model. Given a point cloud dataset containing objects with various genera or scenes with multiple objects, we propose an autoencoder, TearingNet, which tackles the challenging task of representing the point clouds using a fixed-length descriptor. Unlike existing works that deform primitives of genus zero (e.g., a 2D square patch) into an object-level point cloud, we propose a function which tears the primitive during deformation, letting it emulate the topology of a target point cloud. From the torn primitive, we construct a locally-connected graph to further enforce the learned topology via filtering. Moreover, we analyze a widespread problem, which we call point-collapse, that arises when processing point clouds with diverse topologies. Correspondingly, we propose a subtractive sculpture strategy to train our TearingNet model. Finally, experiments show the superiority of our proposal in reconstructing more faithful point clouds and in generating more topology-friendly representations than the benchmarks.
摘要：拓扑事项。尽管点云处理的几何深度学习的最近的成功，但它仍然艰巨捕捉点云数据的复杂的拓扑结构与学习模式。给出含各种属或场景与多个对象的对象的点云数据集，我们提出了一种自动编码，TearingNet，从而能够解决使用固定长度的描述符表示的点云的具有挑战性的任务。不同于现有作品属的变形为零的原语（例如，2D方形贴片）至对象级的点云，我们提出一种变形期间眼泪原始，让它模拟目标点云的拓扑的函数。从撕裂原始的，我们构建了一个本地连接的图形经由滤波，以进一步执行学习拓扑。此外，我们分析一个广泛存在的问题，我们称之为点崩溃与不同的拓扑处理点云的时候。相应地，我们提出了一个消减雕塑策略来训练我们的TearingNet模型。实验终于显示了我们的建议，在重建更忠实点云以及产生比基准多拓扑友好交涉方面的优势。

42. Sky Optimization: Semantically aware image processing of skies in low-light photography [PDF]
Orly Liba, Longqi Cai, Yun-Ta Tsai, Elad Eban, Yair Movshovitz-Attias, Yael Pritch, Huizhong Chen, Jonathan T. Barron
Abstract: The sky is a major component of the appearance of a photograph, and its color and tone can strongly influence the mood of a picture. In nighttime photography, the sky can also suffer from noise and color artifacts. For this reason, there is a strong desire to process the sky in isolation from the rest of the scene to achieve an optimal look. In this work, we propose an automated method, which can run as a part of a camera pipeline, for creating accurate sky alpha-masks and using them to improve the appearance of the sky. Our method performs end-to-end sky optimization in less than half a second per image on a mobile device. We introduce a method for creating an accurate sky-mask dataset that is based on partially annotated images that are inpainted and refined by our modified weighted guided filter. We use this dataset to train a neural network for semantic sky segmentation. Due to the compute and power constraints of mobile devices, sky segmentation is performed at a low image resolution. Our modified weighted guided filter is used for edge-aware upsampling to resize the alpha-mask to a higher resolution. With this detailed mask we automatically apply post-processing steps to the sky in isolation, such as automatic spatially varying white-balance, brightness adjustments, contrast enhancement, and noise reduction.
摘要：天空是照片的外观的主要组成部分，它的颜色和色调可以强烈地影响图片的情绪。在夜间拍摄时，天空还可能遭受噪音和伪色。出于这个原因，还有就是从现场的其余部分处理天空隔离，能获得最佳外观的强烈愿望。在这项工作中，我们提出了一个自动化的方法，它可以作为一个摄像头管道的一部分运行，用于创建精确的天空阿尔法口罩并用它们来改善天空的外观。我们的方法进行端至端天空优化在每个图像小于半秒在移动设备上。我们介绍用于创建基于被补绘和由我们的改性加权引导过滤精制部分注释的图像的准确的天空掩模数据集的方法。我们用这个数据集来训练神经网络语义天空分割。由于移动设备的计算和功率约束，天空分割以低的图像分辨率进行。我们的改性加权引导滤波器用于边感知上采样到所述α-掩模调整到更高的分辨率。与此详细掩模我们自动应用后处理步骤，将天空隔离，如自动空间上变化的白平衡，亮度调整，对比度增强和降噪。

43. Deep Network for Scatterer Distribution Estimation for Ultrasound Image Simulation [PDF]
Lin Zhang, Valery Vishnevskiy, Orcun Goksel
Abstract: Simulation-based ultrasound training can be an essential educational tool. Realistic ultrasound image appearance with typical speckle texture can be modeled as convolution of a point spread function with point scatterers representing tissue microstructure. Such scatterer distribution, however, is in general not known and its estimation for a given tissue type is fundamentally an ill-posed inverse problem. In this paper, we demonstrate a convolutional neural network approach for probabilistic scatterer estimation from observed ultrasound data. We herein propose to impose a known statistical distribution on scatterers and learn the mapping between ultrasound image and distribution parameter map by training a convolutional neural network on synthetic images. In comparison with several existing approaches, we demonstrate in numerical simulations and with in-vivo images that the synthesized images from scatterer representations estimated with our approach closely match the observations with varying acquisition parameters such as compression and rotation of the imaged domain.
摘要：基于仿真的超声培训可以是一个重要的教育工具。逼真的超声图像的外观与典型斑点纹理可被建模为与表示组织显微点散射的点扩散函数的卷积。这样散射体分布，但是，在一般不知道其对于给定的组织类型估计基本上是一个不适定反演问题。在本文中，我们展示了从观测到的超声数据的概率散射估计的卷积神经网络方法。在此我们建议征收散射已知的统计分布，并通过对合成图像训练卷积神经网络学会超声图像和分布参数映射之间的映射。在与几个现有的方法比较，我们证明在数值模拟，并与体内图像，从散射体的表示合成图像与我们的方法估计的紧密匹配具有不同采集参数诸如压缩和成像域的旋转的意见。
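
The forward model mentioned in this abstract, speckle appearance as the convolution of a point spread function (PSF) with point scatterers, can be illustrated in one dimension. This is only a toy sketch: the scatterer amplitudes and the three-tap PSF below are made-up values, and real ultrasound simulation works on 2D or 3D data with a modulated PSF.

```python
def convolve_1d(signal, kernel):
    """Full discrete convolution of two sequences (output length n + m - 1)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


# Hypothetical scatterer amplitudes along one scan line.
scatterers = [0.0, 1.0, 0.0, 0.0, -0.5, 0.0]
# Hypothetical smoothing PSF; each scatterer is smeared over three samples.
psf = [0.25, 0.5, 0.25]

simulated_line = convolve_1d(scatterers, psf)
```

Estimating `scatterers` from `simulated_line` is the ill-posed inverse problem the abstract refers to: many different scatterer sequences produce nearly the same convolved signal.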

44. M2Net: Multi-modal Multi-channel Network for Overall Survival Time Prediction of Brain Tumor Patients [PDF]
Tao Zhou, Huazhu Fu, Yu Zhang, Changqing Zhang, Xiankai Lu, Jianbing Shen, Ling Shao
Abstract: Early and accurate prediction of overall survival (OS) time can help to obtain better treatment planning for brain tumor patients. Although many OS time prediction methods have been developed and obtain promising results, there are still several issues. First, conventional prediction methods rely on radiomic features at the local lesion area of a magnetic resonance (MR) volume, which may not represent the full image or model complex tumor patterns. Second, different types of scanners (i.e., multi-modal data) are sensitive to different brain regions, which makes it challenging to effectively exploit the complementary information across multiple modalities and also preserve the modality-specific properties. Third, existing methods focus on prediction models, ignoring complex data-to-label relationships. To address the above issues, we propose an end-to-end OS time prediction model; namely, Multi-modal Multi-channel Network (M2Net). Specifically, we first project the 3D MR volume onto 2D images in different directions, which reduces computational costs, while preserving important information and enabling pre-trained models to be transferred from other tasks. Then, we use a modality-specific network to extract implicit and high-level features from different MR scans. A multi-modal shared network is built to fuse these features using a bilinear pooling model, exploiting their correlations to provide complementary information. Finally, we integrate the outputs from each modality-specific network and the multi-modal shared network to generate the final prediction result. Experimental results demonstrate the superiority of our M2Net model over other methods.
摘要：总生存期（OS）的时间早，精确的预测可以帮助获得脑瘤患者更好的治疗计划。尽管许多OS时间预测方法已被开发，并取得可喜的成果，但仍有几个问题。首先，传统的预测方法依赖于radiomic特征在磁共振（MR）体积，其可能不能代表整个图像或复杂的肿瘤图案模型的局部病变区域。第二，不同类型的扫描仪（即，多模态数据）对不同脑区敏感，这使得它具有挑战性的有效地利用在多个方式的补充信息，并且还保持特定的模态特性。第三，现有的方法集中在预测模型，忽略了复杂的数据到标签的关系。为了解决上述问题，我们提出了一个终端到终端OS时间预测模型;即，多模态多通道网络（M2Net）。具体地讲，我们首先投射3D MR体积到不同的方向，从而降低计算成本的2D图像，同时保存重要信息和能够预先训练模型来从其它任务被传送。然后，我们用一个模式专用的网络来提取不同MR扫描隐式和高层次的特点。一种多模态的共享网络上建立融合使用双线性模型池，利用其相关以提供补充信息这些功能。最后，我们整合来自每个模态的特定网络和多模态的共享网络的输出，以产生最终预测结果。实验结果表明，我们的模型M2Net比其他方法的优越性。

45. Interpreting the Latent Space of GANs via Correlation Analysis for Controllable Concept Manipulation [PDF]
Ziqiang Li, Rentuo Tao, Hongjing Niu, Bin Li
Abstract: Generative adversarial nets (GANs) have been successfully applied in many fields, such as image generation, inpainting, super-resolution and drug discovery, yet the inner workings of GANs are still far from being understood. To gain deeper insight into the intrinsic mechanism of GANs, this paper proposes a method for interpreting the latent space of GANs by analyzing the correlation between latent variables and the corresponding semantic content in generated images. Unlike previous methods that focus on dissecting models via feature visualization, this work puts the emphasis on the variables in latent space, i.e., on a quantitative analysis of how the latent variables affect the generated results. Given a pretrained GAN model with fixed weights, we intervene on the latent variables to analyze their effect on the semantic content of generated images. A set of controlling latent variables can be derived for specific content generation, so that controllable manipulation of semantic content can be achieved. The proposed method is validated on the Fashion-MNIST and UT Zappos50K datasets, and experimental results show its effectiveness.
摘要：剖成对抗网（甘斯）已在像图像生成许多领域得到成功应用，图像修复，超分辨率和药物发现等，由现在，甘斯的内过程远未被理解。为了得到甘斯的内在机制的更深入的了解，在本文中，用于通过分析潜在变量和在生成的图像的相应的语义内容之间的相关性解释甘斯的潜在空间提出一种方法。与专注于通过功能可视化解剖模型前面的方法，这项工作的重点放在了变量潜在空间，潜在变量，即如何影响产生结果的定量分析。考虑到与固定权重预训练GAN模型，潜变量进行干预，分析其对生成的图像语义内容的效果。一组控制潜在变量的可以衍生为特定内容生成，并且可以实现可控的语义内容操纵。该方法被证明对数据集时尚MNIST和UT Zappos50K，实验结果证明了其有效性。

46. Sustainable Recreational Fishing Using a Novel Electrical Muscle Stimulation (EMS) Lure and Ensemble Network Algorithm to Maximize Catch and Release Survivability [PDF]
Petteri Haverinen, Krithik Ramesh, Nathan Wang
Abstract: With 200-700 million anglers in the world, sportfishing is nearly five times more common than commercial trawling. Worldwide, hundreds of thousands of jobs are linked to the sportfishing industry, which generates billions of dollars for water-side communities and fisheries conservatories alike. However, the sheer popularity of recreational fishing poses threats to aquatic biodiversity that are hard to regulate. For example, as much as 25% of overfished populations can be traced to anglers. This alarming statistic is explained by the average catch and release mortality rate of 43%, which primarily results from hook-related injuries and careless out-of-water handling. The provisional-patented design proposed in this paper addresses both of these problems separately. First, a novel electrical muscle stimulation based fishing lure is proposed as a harmless and low-cost alternative to sharp hooks. Early prototypes show that a constant electrical current of 90 mA applied through a 200 g European perch's jaw can support a reeling tension of 2 N, safely within the necessary ranges. Second, a fish-eye camera bob is designed to wirelessly relay underwater footage to a smartphone app, where an ensemble convolutional neural network automatically classifies the fish's species, estimates its length, and cross-references local and state fishing regulations (i.e., minimum size, maximum bag limit, and catch season). This capability reduces overfishing by helping anglers avoid accidentally violating guidelines and eliminates the need to reel the fish in and expose it to negligent handling. In conjunction, this cheap, lightweight, yet high-tech invention is a paradigm shift in preserving a worldwide favorite pastime while making recreational fishing more sustainable.
摘要：随着世界200-700亿钓鱼，垂钓是近五倍比商业拖网更常见。在世界范围内，数以十万计的就业岗位都与游钓产业，产生数十亿美元用于水侧的社区和渔业温室的一致好评。然而，休闲渔业的绝对人气构成对水生生物多样性，是很难监管的威胁。例如，过度捕捞种群的多达25％可追溯到垂钓者。这一惊人的统计数字是由平均捕捉和释放死亡率的43％，这主要是从钩有关的伤害和粗心外的水的处理结果进行说明。本文解决了这两个问题提出单独首先将临时专利设计，一种新型的，肌肉电刺激的基于鱼饵被提议为无害和低成本的替代尖锐的钩子。早期原型显示通过200克欧洲鲈鱼的钳口可支持2N的张力卷取施加90毫安恒定电流 - 安全地内的必要的范围。其次，鱼眼摄像机鲍勃设计水下录像无线中继到智能手机应用程序，其中合奏卷积神经网络自动分类鱼的种类，估计它的长度，并与当地和国家渔业法规（即交叉引用。最小尺寸，最大包的限制，和捕获季节）。这种能力降低，帮助垂钓者避免意外违反指南，过度捕捞和无需卷轴鱼和它暴露于疏忽处理。结合，这种价格便宜，重量轻，但高科技发明是在维护一个世界最喜欢的消遣的模式转变;而在同一时间，使休闲渔业可持续发展。

47. Overcoming Statistical Shortcuts for Open-ended Visual Counting [PDF]
Corentin Dancette, Remi Cadene, Xinlei Chen, Matthieu Cord
Abstract: Machine learning models tend to over-rely on statistical shortcuts. These spurious correlations between parts of the input and the output labels do not hold in real-world settings. We target this issue on the recent open-ended visual counting task, which is well suited to studying statistical shortcuts. We aim to develop models that learn a proper mechanism of counting regardless of the output label. First, we propose the Modifying Count Distribution (MCD) protocol, which penalizes models that over-rely on statistical shortcuts. It is based on pairs of training and testing sets that do not follow the same count label distribution, such as the odd-even sets. Intuitively, models that have learned a proper mechanism of counting on odd numbers should perform well on even numbers. Second, we introduce the Spatial Counting Network (SCN), which is dedicated to visual analysis and counting based on natural language questions. Our model selects relevant image regions, scores them with fusion and self-attention mechanisms, and provides a final counting score. We apply our protocol on the recent dataset, TallyQA, and show superior performance compared to state-of-the-art models. We also demonstrate the ability of our model to select the correct instances to count in the image. Code and datasets are available: this https URL
摘要：机器学习模型往往过分依赖统计的快捷方式。输入部分和输出标签之间的这些虚假相关不真实世界的设置保持。我们的目标上非常适合学习统计捷径近期开放式的视觉计数任务这个问题。我们的目标是开发出学习计数的适当机制，无论输出标签的机型。首先，我们提出了修改计数分发（MCD）协议，该惩罚的模型，过分依赖统计的快捷方式。它是基于对训练和测试不遵循相同的计数标签分发诸如奇偶台套。直观地说，这已经学会计数的适当机制上奇数模式应在偶数表现良好。其次，我们引入了空间计数网络（SCN），它是基于自然语言问题，致力于可视化分析和计算。我们的模式选择有关的图像区域，以融合和自我关注机制评分他们，并提供了最终计分。我们采用了最近的数据集，TallyQA我们的协议，并与国家的最先进的机型显示出优异的性能。我们还表明我们的模型来选择正确的情况下，图像中的计数能力。代码和数据集可用：此HTTPS URL
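
The odd-even setting that the MCD abstract gives as an example can be sketched as a simple split on the count label. This is an assumed reading of the protocol (train on odd counts, test on even counts); the `(question_id, count)` sample layout and the function name are hypothetical and are not taken from the released code.

```python
def odd_even_split(samples):
    """Split (question_id, count) pairs so train and test share no count label.

    Training keeps odd counts and testing keeps even counts, so a model
    cannot score well on the test set by memorising the training label prior.
    """
    train = [s for s in samples if s[1] % 2 == 1]
    test = [s for s in samples if s[1] % 2 == 0]
    return train, test


# Hypothetical samples: (question id, ground-truth count).
samples = [("q1", 1), ("q2", 2), ("q3", 3), ("q4", 4), ("q5", 0)]
train_set, test_set = odd_even_split(samples)
```

Swapping the two returned lists gives the complementary even-to-odd setting.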

48. Forward Prediction for Physical Reasoning [PDF]
Rohit Girdhar, Laura Gustafson, Aaron Adcock, Laurens van der Maaten
Abstract: Physical reasoning requires forward prediction: the ability to forecast what will happen next given some initial world state. We study the performance of state-of-the-art forward-prediction models in complex physical-reasoning tasks. We do so by incorporating models that operate on object or pixel-based representations of the world, into simple physical-reasoning agents. We find that forward-prediction models improve the performance of physical-reasoning agents, particularly on complex tasks that involve many objects. However, we also find that these improvements are contingent on the training tasks being similar to the test tasks, and that generalization to different tasks is more challenging. Surprisingly, we observe that forward predictors with better pixel accuracy do not necessarily lead to better physical-reasoning performance. Nevertheless, our best models set a new state-of-the-art on the PHYRE benchmark for physical reasoning.
摘要：物理推理需要前向预测：能够预测未来会发生什么给出一些初步的世态。我们研究的国家的最先进的前向预测模型的复杂的物理推理任务的性能。我们通过结合对物体或世界的基于像素的表示操作，到简单的物理推理代理模型这么做。我们发现，前向预测模型，从而提高身体的推理剂的性能，特别是在涉及许多对象的复杂任务。然而，我们也发现，这些改进队伍在训练任务中类似的测试任务，并推广到不同的任务更具挑战性。出人意料的是，我们观察到，向前预测具有更好的像素精度并不一定带来更好的物理性能的推理。不过，我们最好的车型设置上PHYRE基准物理推理一个新的国家的最先进的。

49. Fully Test-time Adaptation by Entropy Minimization [PDF]
Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, Trevor Darrell
Abstract: Faced with new and different data during testing, a model must adapt itself. We consider the setting of fully test-time adaptation, in which a supervised model confronts unlabeled test data from a different distribution, without the help of its labeled training data. We propose an entropy minimization approach for adaptation: we take the model's confidence as our objective as measured by the entropy of its predictions. During testing, we adapt the model by modulating its representation with affine transformations to minimize entropy. Our experiments show improved robustness to corruptions for image classification on CIFAR-10/100 and ILSVRC and demonstrate the feasibility of target-only domain adaptation for digit classification on MNIST and SVHN.
摘要：在测试过程中新的和不同的数据面前，模型必须调整自己。我们充分考虑测试时间适应的设定，其中有监督模式面临来自不同的分布未标记的测试数据，而其标记的训练数据的帮助。我们提出了适应熵最小化的方法：我们采取的模式的信心，我们的目标是通过其预测的熵测量。在测试过程中，我们通过调节与仿射变换表示，以尽量减少熵适应的模式。我们的实验显示出改进的鲁棒性损坏对图像分类上CIFAR-10/100和ILSVRC并证明只有目标域的适应对MNIST和SVHN数字分类的可行性。
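
The objective described in this abstract, the entropy of the model's own predictions, can be computed without labels. The sketch below shows only the quantity being minimised, assuming the model outputs raw logits; the gradient updates to the affine parameters that the paper performs are not shown, and the function names are chosen for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over classes."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A confident prediction such as `prediction_entropy([10.0, 0.0])` yields a lower value than an uncertain one such as `prediction_entropy([1.0, 0.0])`, which is why minimising this quantity pushes the adapted model toward confident predictions on the test distribution.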

50. Zero-Shot Learning with Common Sense Knowledge Graphs [PDF]
Nihal V. Nayak, Stephen H. Bach
Abstract: Zero-shot learning relies on semantic class representations such as attributes or pretrained embeddings to predict classes without any labeled examples. We propose to learn class representations from common sense knowledge graphs. Common sense knowledge graphs are an untapped source of explicit high-level knowledge that requires little human effort to apply to a range of tasks. To capture the knowledge in the graph, we introduce ZSL-KG, a framework based on graph neural networks with non-linear aggregators to generate class representations. Whereas most prior work on graph neural networks uses linear functions to aggregate information from neighboring nodes, we find that non-linear aggregators such as LSTMs or transformers lead to significant improvements on zero-shot tasks. On two natural language tasks across three datasets, ZSL-KG shows an average improvement of 9.2 points of accuracy versus state-of-the-art methods. In addition, on an object classification task, ZSL-KG shows a 2.2 accuracy point improvement versus the best methods that do not require hand-engineered class representations. Finally, we find that ZSL-KG outperforms the best performing graph neural networks with linear aggregators by an average of 3.8 points of accuracy across these four datasets.
摘要：零射门的学习依赖于语义类的表示，如属性或预训练的嵌入预测类没有任何标记的例子。我们建议借鉴常识知识图类表示。常识性知识图是明确的知识层次高的未开发的来源，只需要很少的人的努力，以适用于一系列任务。为了捕捉图中的知识，我们引入ZSL-KG，基于图形神经网络与非线性的聚合，生成类表示了一个框架。而从邻居节点上图的神经网络应用最优先的工作线性函数来汇总信息，我们发现非线性的聚合，如LSTMs或变压器导致对零射门的任务显著的改善。跨三个数据集两个自然语言任务，ZSL-KG显示的精度与国家的最先进方法的9.2分的平均改善。此外，对象类别的任务，ZSL-KG显示精度2.2点的改善对不需要手工设计类表示最好的方法。最后，我们发现，ZSL-KG平均跨越这四个数据集精度的3.8点优于线性聚合表现最佳的图形神经网络。

51. Set Distribution Networks: a Generative Model for Sets of Images [PDF]
Shuangfei Zhai, Walter Talbott, Miguel Angel Bautista, Carlos Guestrin, Josh M. Susskind
Abstract: Images with shared characteristics naturally form sets. For example, in a face verification benchmark, images of the same identity form sets. For generative models, the standard way of dealing with sets is to represent each as a one-hot vector, and learn a conditional generative model $p(\mathbf{x}|\mathbf{y})$. This representation assumes that the number of sets is limited and known, such that the distribution over sets reduces to a simple multinomial distribution. In contrast, we study a more generic problem where the number of sets is large and unknown. We introduce Set Distribution Networks (SDNs), a novel framework that learns to autoencode and freely generate sets. We achieve this by jointly learning a set encoder, set discriminator, set generator, and set prior. We show that SDNs are able to reconstruct image sets that preserve salient attributes of the inputs in our benchmark datasets, and are also able to generate novel objects/identities. We examine the sets generated by SDN with a pre-trained 3D reconstruction network and a face verification network, respectively, as a novel way to evaluate the quality of generated sets of images.
摘要：图片共享特性自然形成集。例如，在面部检验基准，相同的标识形式集的图像。对于生成模型，处理组的标准方法是代表每个作为一个热载体和学习条件生成模型$ P（\ mathbf {X} | \ mathbf {Y}）$。这种表示假定的集合的数目是有限的和已知的，例如，分配在组简化为一个简单的多项式分布。与此相反，我们研究一个更通用的问题，其中组数是大的和未知的。我们引入集合分发网络（的SDN），一个新的框架，学会autoencode和自由产生套。我们通过共同学习一组编码器，一套鉴别，发电机组，并设置之前实现这一目标。我们表明的SDN能够重建图像集，在我们的基准数据集保护的投入显着属性，并且还能够产生新的对象/身份。我们检查由SDN分别与预训练的3D重建网络和人脸验证网络，产生的套，以评价图像生成集的质量的新方法。

52. On the Predictability of Pruning Across Scales [PDF]
Jonathan S. Rosenfeld, Jonathan Frankle, Michael Carbin, Nir Shavit
Abstract: We show that the error of magnitude-pruned networks follows a scaling law, and that this law is of a fundamentally different nature than that of unpruned networks. We functionally approximate the error of the pruned networks, showing that it is predictable in terms of an invariant tying width, depth, and pruning level, such that networks of vastly different sparsities are freely interchangeable. We demonstrate the accuracy of this functional approximation over scales spanning orders of magnitude in depth, width, dataset size, and sparsity for CIFAR-10 and ImageNet. As neural networks become ever larger and more expensive to train, our findings enable a framework for reasoning conceptually and analytically about pruning.
摘要：我们发现，大小，修剪网络的误差遵循比例定律，而这个法律是根本不同的性质比未修剪的网络。我们功能近似修剪网络的错误，这表明它是在捆扎宽度，深度和修剪的水平，使得千差万别sparsities的网络是随意互换的不变的方面可预测的。我们证明这个函数逼近的过磅秤跨数量级的深度，宽度，数据集的大小，以及稀疏的CIFAR-10和ImageNet的准确性。由于神经网络变得越来越大，越来越昂贵的火车，我们的研究结果启用概念和分析推理修剪的框架。
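
Magnitude pruning, the operation whose error this abstract studies, removes the weights with the smallest absolute value. The sketch below is a generic one-shot version on a flat list of weights, given for orientation only. It is not the paper's pruning schedule or its scaling-law fit, and the function name and the `sparsity` parameter are assumptions made for this example.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    Returns a new list; the input is left unchanged.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```

In this convention a sparsity of 0.5 removes half of the weights, so the remaining density is `1 - sparsity`.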

53. Shapeshifter Networks: Cross-layer Parameter Sharing for Scalable and Effective Deep Learning [PDF]
Bryan A. Plummer, Nikoli Dryden, Julius Frost, Torsten Hoefler, Kate Saenko
Abstract: We present Shapeshifter Networks (SSNs), a flexible neural network framework that improves performance and reduces memory requirements on a diverse set of scenarios over standard neural networks. Our approach is based on the observation that many neural networks are severely overparameterized, resulting in significant waste in computational resources as well as being susceptible to overfitting. SSNs address this by learning where and how to share parameters between layers in a neural network while avoiding degenerate solutions that result in underfitting. Specifically, we automatically construct parameter groups that identify where parameter sharing is most beneficial. Then, we map each group's weights to construct layers with learned combinations of candidates from a shared parameter pool. SSNs can share parameters across layers even when they have different sizes, perform different operations, and/or operate on features from different modalities. We evaluate our approach on a diverse set of tasks, including image classification, bidirectional image-sentence retrieval, and phrase grounding, creating high performing models even when using as little as 1% of the parameters. We also apply SSNs to knowledge distillation, where we obtain state-of-the-art results when combined with traditional distillation methods.
摘要：我们提出变形者网络（核潜艇），灵活的神经网络框架，提高了性能并减少了多样化的超标神经网络场景的内存需求。我们的做法是基于这样的观察，许多神经网络受到严重overparameterized，造成计算资源浪费显著以及易受过度拟合。核潜艇通过学习在哪里以及如何各层之间共享参数的神经网络，同时避免退化的解决方案，导致欠拟合解决这个问题。具体来说，我们会自动构建查明参数共享是最有利的参数组。然后，我们从一个共享参数池候选人的教训组合各组的权重来构造层映射。的SSN可以共享跨层参数，即使他们具有不同的尺寸，执行不同的操作，和/或在来自不同模态的特征操作。我们评估我们在一组不同的任务，包括图像分类，双向图像句子检索和短语接地的方法，少用为参数的1％时，甚至创造高性能的机型。我们也在申请社会安全号码知识蒸馏，在那里我们与传统的蒸馏方法相结合获得国家的先进成果。

54. A Review of 1D Convolutional Neural Networks toward Unknown Substance Identification in Portable Raman Spectrometer [PDF]
M. Hamed Mozaffari, Li-Lin Tay
Abstract: Raman spectroscopy is a powerful analytical tool with applications ranging from quality control to cutting-edge biomedical research. One particular area which has seen tremendous advances in the past decade is the development of powerful handheld Raman spectrometers. They have been adopted widely by first responders and law enforcement agencies for the field analysis of unknown substances. Field detection and identification of unknown substances with Raman spectroscopy rely heavily on the spectral matching capability of the devices on hand. Conventional spectral matching algorithms (such as correlation, dot product, etc.) have been used to identify an unknown Raman spectrum by comparing it to a large reference database. This is typically achieved through brute-force summation of pixel-by-pixel differences between the reference and the unknown spectrum. Conventional algorithms have noticeable drawbacks. For example, they tend to work well for identifying pure compounds but less so for mixtures. In addition, limited reference spectra and inaccessible databases, with a large number of classes relative to the number of samples, have been a setback for the widespread usage of Raman spectroscopy in field analysis applications. State-of-the-art deep learning methods (specifically convolutional neural networks, CNNs), as an alternative approach, present a number of advantages over conventional spectral comparison algorithms. With optimization, they are ideal for deployment in handheld spectrometers for field detection of unknown substances. In this study, we present a comprehensive survey of the use of one-dimensional CNNs for Raman spectrum identification. Specifically, we highlight the use of this powerful deep learning technique for handheld Raman spectrometers, taking into consideration the potential limits in power consumption and computation ability of handheld systems.
摘要：拉曼光谱是其应用范围从质量控制到最前沿的生物医学研究的一个强大的分析工具。这已经看到在过去十年中的巨大进步的一个具体领域是强大的手持式拉曼光谱仪的发展。他们已经被急救人员和执法机构的未知物质领域分析广泛采用。场检测和与拉曼未知物质的鉴别光谱很大程度上依赖于手头上的装置的光谱匹配能力。常规光谱匹配算法（如相关性，点积等）已在由未知比较大的参考数据库鉴定未知拉曼光谱中使用。这通常是通过逐个像素的参考和未知谱之间的差别蛮力求和来实现的。传统算法有明显的缺点。例如，它们倾向于与识别纯化合物，但不适合混合物中的化合物很好地工作。例如，限定的参考光谱与大量类相对于样本的数目不可访问数据库已经用于拉曼光谱学的用于现场分析应用的广泛使用了挫折。状态的最先进的深学习方法（具体卷积神经网络细胞神经网络），作为一种替代方法，提出了许多优于常规光谱比较算法的优点。有了优化，他们是理想的手持式光谱仪将被部署用于现场检测不明物质。在这项研究中，我们提出在使用一维细胞神经网络的的拉曼光谱鉴定了全面的调查。具体来说，我们突出了手持式拉曼光谱仪考虑到功耗和手持系统的计算能力的潜在限制使用这种强大的深度学习技术。
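
The conventional matching approach that this abstract contrasts with CNNs, comparing an unknown spectrum against every reference by a dot-product style score, can be sketched as below. The normalised dot product (cosine similarity) is used here as one common choice; the library contents, the three-bin spectra and the substance names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Normalised dot product of two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("spectra must be non-zero")
    return dot / (norm_a * norm_b)


def best_match(unknown, library):
    """Return the name of the reference spectrum most similar to `unknown`.

    `library` maps a substance name to its reference intensity vector.
    Every reference is scored, which is the brute-force search described above.
    """
    return max(library, key=lambda name: cosine_similarity(unknown, library[name]))


# Hypothetical three-bin reference spectra.
library = {"substance_a": [1.0, 0.0, 0.0], "substance_b": [0.0, 1.0, 1.0]}
match = best_match([0.1, 0.9, 1.0], library)
```

A mixture of both substances would score moderately against each reference and match neither cleanly, which illustrates the weakness with mixtures noted in the abstract.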

55. Gradient Amplification: An efficient way to train deep neural networks [PDF]
Sunitha Basodi, Chunyan Ji, Haiping Zhang, Yi Pan
Abstract: Improving the performance of deep learning models and reducing their training times are ongoing challenges in deep neural networks. Several approaches have been proposed to address these challenges, one of which is to increase the depth of the neural networks. Such deeper networks not only increase training times but also suffer from the vanishing gradient problem during training. In this work, we propose a gradient amplification approach for training deep learning models to prevent vanishing gradients, and we also develop a training strategy to enable or disable the gradient amplification method across several epochs with different learning rates. We perform experiments on VGG-19 and ResNet (ResNet-18 and ResNet-34) models, and study the impact of amplification parameters on these models in detail. Our proposed approach improves the performance of these deep learning models even at higher learning rates, thereby allowing these models to achieve higher performance with reduced training time.
摘要：提高深度学习模型的性能，并减少他们的训练时间是在深层神经网络的持续挑战。有建议应对这些挑战其中之一是增加神经网络的深度的几种方法。这种更深层次的网络不仅增加了训练时间，而且从消失的梯度问题而遭受的培训。在这项工作中，我们提出了培养深度学习模式，以防止消失梯度，也制定培训策略启用或禁用梯度扩增方法在若干时期不同的学习率梯度放大的方法。我们对VGG-19和RESNET（RESNET-18和RESNET-34）的模型进行实验，研究的扩增参数对详细这些模型的影响。我们提出的方法即使在较高的学习速率提高这些深层次的学习模型的性能，从而使这些模型来达到降低训练时间更高的性能。

56. XRayGAN: Consistency-preserving Generation of X-ray Images from Radiology Reports [PDF]
Xingyi Yang, Nandiraju Gireesh, Eric Xing, Pengtao Xie
Abstract: To effectively train medical students to become qualified radiologists, a large number of X-ray images collected from patients with diverse medical conditions are needed. However, due to data privacy concerns, such images are typically difficult to obtain. To address this problem, we develop methods to generate view-consistent, high-fidelity, and high-resolution X-ray images from radiology reports to facilitate radiology training of medical students. This task presents several challenges. First, from a single report, images with different views (e.g., frontal, lateral) need to be generated. How can we ensure consistency of these images (i.e., make sure they are about the same patient)? Second, X-ray images are required to have high resolution. Otherwise, many details of diseases would be lost. How can we generate high-resolution images? Third, radiology reports are long and have complicated structure. How can we effectively understand their semantics to generate high-fidelity images that accurately reflect the contents of the reports? To address these three challenges, we propose an XRayGAN composed of three modules: (1) a view consistency network that maximizes the consistency between generated frontal-view and lateral-view images; (2) a multi-scale conditional GAN that progressively generates a cascade of images with increasing resolution; (3) a hierarchical attentional encoder that learns the latent semantics of a radiology report by capturing its hierarchical linguistic structure and various levels of clinical importance of words and sentences. Experiments on two radiology datasets demonstrate the effectiveness of our methods. To the best of our knowledge, this work is the first to generate consistent and high-resolution X-ray images from radiology reports. The code is available at this https URL.
摘要：为了有效地培养医学生成为合格的放射科医生，大量的X射线图像的收集自患者有需要多样化的医疗条件。然而，由于数据隐私问题，例如图像通常很难获得。为了解决这个问题，我们开发方法来生成观点一致，高保真，从放射学报告的高分辨率X射线图像，以促进医学生的放射训练。这个任务呈现几个挑战。首先，从一个单一的报告，（例如，正面，侧面），以产生具有不同视点的图像的需求。如何保证这些图像的一致性（即确保它们是对同一患者）？其次，需要X射线图像具有高分辨率。否则，疾病的许多细节都将丢失。如何生成高分辨率的图像？三，放射报告很长，具有复杂的结构。如何有效地理解它们的语义生成高保真的图像，准确地反映了报告的内容是什么？为了解决这些三个挑战，我们提出的三个模块组成的XRayGAN：（1）的图一致性网络最大化产生的正面视图和横向视图图像之间的一致性; （2）多尺度条件GAN其逐步生成随着分辨率的图像的级联; （3）分级注意力编码器，学习到捕捉其层次语言结构和词汇和句子的临床重要性各级放射学报告的潜在语义。两个放射数据集的实验结果证明了我们方法的有效性。据我们所知，这项工作是第一个从放射报告生成一致且高分辨率的X射线图像。该代码可在此HTTPS URL。

57. ChestX-det10: Chest X-ray Dataset on Detection of Thoracic Abnormalities [PDF]
Jingyu Liu, Jie Lian, Yizhou Yu
Abstract: Instance-level detection of thoracic diseases or abnormalities is crucial for automatic diagnosis in chest X-ray images. Most existing works on chest X-rays focus on disease classification and weakly supervised localization. In order to push forward the research on disease classification and localization on chest X-rays, we provide a new benchmark called ChestX-det10, including box-level annotations of 10 categories of disease/abnormality on $\sim$3,500 images. The annotations are located at this https URL.
摘要：胸部疾病或异常的实例电平检测对于在胸部X射线图像自动诊断至关重要。胸部X射线大多数现有的工作重点放在疾病的分类和弱监督的定位。为了推动胸部X射线对疾病分类与定位研究。我们提供了一个名为ChestX-det10新标杆，包括10类疾病/ $ \ $ SIM 3,500名的图像异常箱级别的注解。注解位于该HTTPS URL。

58. Structure and Design of HoloGen [PDF]
Peter J. Christopher, Timothy D. Wilkinson
Abstract: The increasing popularity of augmented and mixed reality systems has seen a similar increase of interest in 2D and 3D computer generated holography (CGH). Unlike stereoscopic approaches, CGH can fully represent a light field including depth of focus, accommodation and vergence. Along with existing telecommunications, imaging, projection, lithography, beam shaping and optical tweezing applications, CGH is an exciting technique applicable to a wide array of photonic problems including full 3D representation. Traditionally, the primary roadblock to acceptance has been the significant numerical processing required to generate holograms, which demands both significant expertise and significant computational power. This article discusses the structure and design of HoloGen. HoloGen is an MIT licensed application that may be used to generate holograms using a wide array of algorithms without expert guidance. HoloGen uses a CUDA C and C++ backend with a C# and Windows Presentation Foundation graphical user interface. The article begins by introducing HoloGen before providing an in-depth discussion of its design and structure. Particular focus is given to the communication, data transfer and algorithmic aspects.
摘要：提高增强和混合现实系统的普及已经看到在2D和3D计算机生成的全息（CGH）的兴趣类似的增加。不同于立体方法中，CGH可以完全表示光场包括焦点，住宿和聚散度的深度。随着现有的电信，成像，投影，光刻，光束成形和光学捕获应用中，CGH是适用于宽阵列的光子的问题，包括完整的3D表示的令人兴奋的技术。传统上，主要路障接受已以产生既需要显著专长和显著计算能力的全息图所需的显著数值处理。本文讨论HoloGen的结构和设计。 HoloGen是可以被用于产生使用宽的算法阵列而不专家指导全息图的MIT授权的应用程序。 HoloGen使用CUDA C和C ++后端与C＃和Windows演示框架图形用户界面。文章开始于提供其设计和结构的深入讨论之前引入HoloGen。特别关注的是在通信，数据传输和算法方面。

59. SXL: Spatially explicit learning of geographic processes with auxiliary tasks [PDF]
Konstantin Klemmer, Daniel B. Neill
Abstract: From earth system sciences to climate modeling and ecology, many of the greatest empirical modeling challenges are geographic in nature. As these processes are characterized by spatial dynamics, we can exploit their autoregressive nature to inform learning algorithms. We introduce SXL, a method for learning with geospatial data using explicitly spatial auxiliary tasks. We embed the local Moran's I, a well-established measure of local spatial autocorrelation, into the training process, "nudging" the model to learn the direction and magnitude of local autoregressive effects in parallel with the primary task. Further, we propose an expansion of Moran's I to multiple resolutions to capture effects at different spatial granularities and over varying distance scales. We show the superiority of this method for training deep neural networks using experiments with real-world geospatial data in both generative and predictive modeling tasks. Our approach can be used with arbitrary network architectures and, in our experiments, consistently improves their performance. We also outperform appropriate, domain-specific interpolation benchmarks. Our work highlights how integrating the geographic information sciences and spatial statistics into machine learning models can address the specific challenges of spatial data.
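The local Moran's I statistic that the SXL abstract embeds as an auxiliary task can be illustrated with a small NumPy sketch. This is not the authors' code: it uses the standard definition I_i = (z_i / m2) * sum_j w_ij z_j, and the rook-contiguity, row-standardised weight matrix and the helper names `local_morans_i` and `rook_weights` are assumptions made for illustration.

```python
import numpy as np


def local_morans_i(values, weights):
    """Local Moran's I for every location.

    values  : shape (n,) observations
    weights : shape (n, n) spatial weights with a zero diagonal
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()                 # deviations from the global mean
    m2 = (z ** 2).sum() / x.size     # second moment of the deviations
    return z / m2 * (w @ z)          # z_i / m2 * spatial lag of z


def rook_weights(rows, cols):
    """Row-standardised rook-contiguity weights for a rows x cols grid.

    The choice of neighbourhood is an assumption for this example.
    """
    n = rows * cols
    w = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    w[i, rr * cols + cc] = 1.0
    return w / w.sum(axis=1, keepdims=True)


# Toy raster: a low-valued region on the left, a high-valued column on the right.
grid = np.array([[1.0, 1.0, 5.0],
                 [1.0, 1.0, 5.0],
                 [1.0, 1.0, 5.0]])
local_i = local_morans_i(grid.ravel(), rook_weights(3, 3)).reshape(grid.shape)
```

Positive entries mark cells whose neighbours deviate from the mean in the same direction (clusters); negative entries mark spatial outliers. An auxiliary task in the spirit of the abstract would ask the network to predict such a map alongside its primary output.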

60. Automatic Speech Recognition Benchmark for Air-Traffic Communications
Juan Zuluaga-Gomez, Petr Motlicek, Qingran Zhan, Karel Vesely, Rudolf Braun
Abstract: Advances in Automatic Speech Recognition (ASR) over the last decade have opened new areas of speech-based automation, such as the Air-Traffic Control (ATC) environment. Currently, voice communication and data link communications are the only means of contact between pilots and Air-Traffic Controllers (ATCo); the former is the most widely used, and the latter is a non-spoken method that is mandatory for oceanic messages and limited for some domestic issues. ASR systems in ATCo environments face increased complexity due to accents from non-English speakers, cockpit noise, speaker-dependent biases, and small in-domain ATC databases for training. We introduce CleanSky EC-H2020 ATCO2, a project that aims to develop an ASR-based platform to collect, organize and automatically pre-process ATCo speech data from air space. This paper presents an exploratory benchmark of several state-of-the-art ASR models trained on more than 170 hours of ATCo speech data. We demonstrate that the cross-accent flaws due to speakers' accents are minimized by the amount of data, making the system feasible for ATC environments. The developed ASR system achieves an averaged word error rate (WER) of 7.75% across four databases. An additional 35% relative improvement in WER is achieved on one test set when training a TDNNF system with byte-pair encoding.
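For readers unfamiliar with the two metrics quoted in this abstract, the following sketch shows how a word error rate and a relative WER improvement are conventionally computed. It is a generic illustration, not the benchmark's scoring tool; the function names and the example phrase are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical ATC-style utterance: one deletion ("to") and one
# substitution ("two" -> "too") against a seven-word reference.
example_wer = word_error_rate("climb to flight level three two zero",
                              "climb flight level three too zero")
```

A 35% relative improvement therefore means the new WER is 0.65 times the baseline WER, not that the WER dropped by 35 percentage points.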

61. Generating Fundus Fluorescence Angiography Images from Structure Fundus Images Using Generative Adversarial Networks
Wanyue Li, Wen Kong, Yiwei Chen, Jing Wang, Yi He, Guohua Shi, Guohua Deng
Abstract: Fluorescein angiography can provide a map of retinal vascular structure and function and is commonly used in ophthalmology diagnosis. However, this imaging modality may pose risks of harm to patients. To help physicians reduce the potential risks of diagnosis, an image translation method is adopted. In this work, we propose a conditional generative adversarial network (GAN)-based method to directly learn the mapping relationship between structure fundus images and fundus fluorescence angiography images. Moreover, local saliency maps, which define each pixel's importance, are used to define a novel saliency loss in the GAN cost function. This facilitates more accurate learning of small-vessel and fluorescein leakage features.

62. Variational State-Space Models for Localisation and Dense 3D Mapping in 6 DoF
Atanas Mirchev, Baris Kayalibay, Patrick van der Smagt, Justin Bayer
Abstract: We solve the problem of 6-DoF localisation and 3D dense reconstruction in spatial environments as approximate Bayesian inference in a deep generative approach which combines learned with engineered models. This principled treatment of uncertainty and probabilistic inference overcomes a shortcoming of current state-of-the-art solutions, namely their reliance on heavily engineered, heterogeneous pipelines. Variational inference enables us to use neural networks for system identification, while a differentiable raycaster is used for the emission model. This ensures that our model is amenable to end-to-end gradient-based optimisation. We evaluate our approach on realistic unmanned aerial vehicle flight data, nearing the performance of a state-of-the-art visual inertial odometry system. The applicability of the learned model to downstream tasks such as generative prediction and planning is investigated.

63. Are you wearing a mask? Improving mask detection from speech using augmentation by cycle-consistent GANs
Nicolae-Cătălin Ristea, Radu Tudor Ionescu
Abstract: The task of detecting whether a person wears a face mask from speech is useful in modelling speech in forensic investigations, communication between surgeons or people protecting themselves against infectious diseases such as COVID-19. In this paper, we propose a novel data augmentation approach for mask detection from speech. Our approach is based on (i) training Generative Adversarial Networks (GANs) with cycle-consistency loss to translate unpaired utterances between two classes (with mask and without mask), and on (ii) generating new training utterances using the cycle-consistent GANs, assigning opposite labels to each translated utterance. Original and translated utterances are converted into spectrograms which are provided as input to a set of ResNet neural networks with various depths. The networks are combined into an ensemble through a Support Vector Machines (SVM) classifier. With this system, we participated in the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 Computational Paralinguistics Challenge, surpassing the baseline proposed by the organizers by 2.8%. Our data augmentation technique provided a performance boost of 0.9% on the private test set. Furthermore, we show that our data augmentation approach yields better results than other baseline and state-of-the-art augmentation methods.
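Step (ii) of the approach, generating new training utterances and assigning them the opposite label, can be sketched as follows. This is only an outline under stated assumptions: the label encoding (1 = with mask, 0 = without mask), the function name `augment_with_translation`, and the two placeholder translators are hypothetical stand-ins for trained cycle-consistent GAN generators, which the abstract does not specify.

```python
import numpy as np


def augment_with_translation(utterances, labels, to_mask, to_no_mask):
    """Extend a training set with class-translated copies.

    Each utterance is passed through the generator for the opposite class
    and the translated copy receives the flipped label.
    Assumed label encoding: 1 = with mask, 0 = without mask.
    """
    new_x, new_y = [], []
    for x, y in zip(utterances, labels):
        translate = to_no_mask if y == 1 else to_mask
        new_x.append(translate(x))
        new_y.append(1 - y)
    return list(utterances) + new_x, list(labels) + new_y


# Placeholder "generators" (assumptions): real ones would be the trained
# GAN translators. Here they only attenuate or amplify a toy waveform.
fake_to_mask = lambda wav: 0.8 * wav
fake_to_no_mask = lambda wav: 1.25 * wav

rng = np.random.default_rng(0)
utterances = [rng.standard_normal(16) for _ in range(4)]
labels = [1, 0, 1, 0]

aug_x, aug_y = augment_with_translation(
    utterances, labels, fake_to_mask, fake_to_no_mask)
```

Per the abstract, both the original and the translated utterances would then be converted into spectrograms before being fed to the ResNet classifiers.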

64. A Practical Online Method for Distributionally Deep Robust Optimization
Qi Qi, Zhishuai Guo, Yi Xu, Rong Jin, Tianbao Yang
Abstract: In this paper, we propose a practical online method for solving distributionally robust optimization (DRO) for deep learning, which has important applications in machine learning for improving the robustness of neural networks. In the literature, most methods for solving DRO are based on stochastic primal-dual methods. However, primal-dual methods for deep DRO suffer from several drawbacks: (1) manipulating a high-dimensional dual variable corresponding to the size of data is time-consuming; (2) they are not well suited to online learning, where data arrives sequentially. To address these issues, we transform the min-max formulation into a minimization formulation and propose a practical duality-free online stochastic method for solving deep DRO with KL divergence regularization. The proposed online stochastic method resembles, in several respects, the practical stochastic Nesterov's method that is widely used for learning deep neural networks. Under a Polyak-Lojasiewicz (PL) condition, we prove that the proposed method can enjoy an optimal sample complexity and, with a moderate mini-batch size, a better round complexity (the number of gradient evaluations divided by a fixed mini-batch size) than existing algorithms for solving the min-max or min formulation of DRO. Of independent interest, the proposed method can also be used for solving a family of stochastic compositional problems.
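The step from a min-max to a pure minimization formulation can be illustrated with the standard identity for KL-regularised DRO: for per-sample losses l_1..l_n and regularisation weight lam, the inner maximum over sample weights p of sum_i p_i l_i - lam * KL(p || uniform) equals lam * log(mean(exp(l_i / lam))), attained at p_i proportional to exp(l_i / lam). The sketch below checks this numerically. It assumes this common form of the problem; the paper's exact objective may differ, and the function names are hypothetical.

```python
import numpy as np


def kl_dro_objective(losses, lam):
    """Closed form of the inner maximisation:
    lam * log(mean(exp(losses / lam))), computed stably."""
    l = np.asarray(losses, dtype=float)
    m = l.max()
    return float(m + lam * np.log(np.mean(np.exp((l - m) / lam))))


def worst_case_weights(losses, lam):
    """Maximising sample weights: softmax of losses / lam."""
    l = np.asarray(losses, dtype=float)
    e = np.exp((l - l.max()) / lam)
    return e / e.sum()


def regularised_value(losses, p, lam):
    """sum_i p_i l_i - lam * KL(p || uniform) for weights p on the simplex."""
    l = np.asarray(losses, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(p @ l - lam * np.sum(p * np.log(l.size * p)))


losses = np.array([0.2, 1.0, 3.0])
lam = 0.5

closed_form = kl_dro_objective(losses, lam)
p_star = worst_case_weights(losses, lam)
at_optimum = regularised_value(losses, p_star, lam)
at_uniform = regularised_value(losses, np.full(3, 1 / 3), lam)
```

Because the dual variable p is eliminated in closed form, only the model parameters remain to be optimised. The remaining objective is a log of an expectation, which is a compositional structure and is consistent with the abstract's remark about stochastic compositional problems.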

Note: the Chinese summaries in the original post are machine translations.


[arXiv Papers] Computer Vision and Pattern Recognition 2020-06-19

Contents

Abstracts