Abstracts

1. Revisiting Spatial Invariance with Low-Rank Local Connectivity
Gamaleldin F. Elsayed, Prajit Ramachandran, Jonathon Shlens, Simon Kornblith
Abstract: Convolutional neural networks are among the most successful architectures in deep learning. This success is at least partially attributable to the efficacy of spatial invariance as an inductive bias. Locally connected layers, which differ from convolutional layers in their lack of spatial invariance, usually perform poorly in practice. However, these observations still leave open the possibility that some degree of relaxation of spatial invariance may yield a better inductive bias than either convolution or local connectivity. To test this hypothesis, we design a method to relax the spatial invariance of a network layer in a controlled manner. In particular, we create a low-rank locally connected layer, where the filter bank applied at each position is constructed as a linear combination of a basis set of filter banks. By varying the number of filter banks in the basis set, we can control the degree of departure from spatial invariance. In our experiments, we find that relaxing spatial invariance improves classification accuracy over both convolution and locally connected layers across MNIST, CIFAR-10, and CelebA datasets. These results suggest that spatial invariance in convolution layers may be overly restrictive.
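The construction described in this abstract, a position-specific filter built as a linear combination of a small basis set, can be illustrated with a minimal 1-D, single-channel sketch. The function names, the toy signal, and the way the combining weights are supplied directly (rather than learned) are assumptions made for illustration, not the authors' implementation.

```python
def conv1d_valid(signal, kernel):
    """Plain 'valid' cross-correlation: the same kernel at every position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def low_rank_local_1d(signal, basis, mix):
    """Low-rank locally connected layer (1-D, single channel).

    basis: list of K kernels (the basis set), each of length k.
    mix:   mix[i][r] is the weight of basis kernel r at output position i.
    """
    k = len(basis[0])
    out = []
    for i in range(len(signal) - k + 1):
        # Position-specific kernel: a linear combination of the basis kernels.
        kernel = [sum(mix[i][r] * basis[r][j] for r in range(len(basis)))
                  for j in range(k)]
        out.append(sum(signal[i + j] * kernel[j] for j in range(k)))
    return out


signal = [1.0, 2.0, 0.0, -1.0, 3.0]
basis = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]   # K = 2 basis kernels
shared = [[1.0, 0.0]] * 3                      # same weights everywhere
varying = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]] # weights change with position

same_as_conv = low_rank_local_1d(signal, basis, shared)
relaxed = low_rank_local_1d(signal, basis, varying)
```

With identical combining weights at every position the layer reduces to an ordinary convolution; letting the weights vary by position relaxes spatial invariance, and the number of basis kernels bounds how far the layer can depart from it.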

2. $M^3$T: Multi-Modal Continuous Valence-Arousal Estimation in the Wild
Yuan-Hang Zhang, Rulin Huang, Jiabei Zeng, Shiguang Shan, Xilin Chen
Abstract: This report describes a multi-modal multi-task ($M^3$T) approach underlying our submission to the valence-arousal estimation track of the Affective Behavior Analysis in-the-wild (ABAW) Challenge, held in conjunction with the IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2020. In the proposed $M^3$T framework, we fuse both visual features from videos and acoustic features from the audio tracks to estimate the valence and arousal. The spatio-temporal visual features are extracted with a 3D convolutional network and a bidirectional recurrent neural network. Considering the correlations between valence/arousal, emotions, and facial actions, we also explore mechanisms to benefit from other tasks. We evaluated the $M^3$T framework on the validation set provided by ABAW, and it significantly outperforms the baseline method.

3. On the Robustness of Face Recognition Algorithms Against Attacks and Bias
Richa Singh, Akshay Agarwal, Maneet Singh, Shruti Nagpal, Mayank Vatsa
Abstract: Face recognition algorithms have demonstrated very high recognition performance, suggesting suitability for real world applications. Despite the enhanced accuracies, robustness of these algorithms against attacks and bias has been challenged. This paper summarizes different ways in which the robustness of a face recognition algorithm is challenged, which can severely affect its intended working. Different types of attacks such as physical presentation attacks, disguise/makeup, digital adversarial attacks, and morphing/tampering using GANs have been discussed. We also present a discussion on the effect of bias on face recognition models and showcase that factors such as age and gender variations affect the performance of modern algorithms. The paper also presents the potential reasons for these challenges and some of the future research directions for increasing the robustness of face recognition models.

4. How Does Gender Balance In Training Data Affect Face Recognition Accuracy?
Vítor Albiero, Kai Zhang, Kevin W. Bowyer
Abstract: Even though deep learning methods have greatly increased the overall accuracy of face recognition, an old problem still persists: accuracy is higher for men than for women. Previous researchers have speculated that the difference could be due to cosmetics, head pose, or hair covering the face. It is also often speculated that the lower accuracy for women is caused by women being under-represented in the training data. This work aims to investigate if gender imbalance in the training data is actually the cause of lower accuracy for females. Using a state-of-the-art deep CNN, three different loss functions, and two training datasets, we train each on seven subsets with different male/female ratios, totaling forty-two trainings. The trained face matchers are then tested on three different testing datasets. Results show that gender-balancing the dataset has an overall positive effect, with higher accuracy for most of the combinations of loss functions and datasets when a balanced subset is used. However, for the best combination of loss function and dataset, the original training dataset shows better accuracy 3 out of 4 times. We observe that test accuracy for males is higher when the training data is all male. However, test accuracy for females is not maximized when the training data is all female. For a number of combinations of loss function and test dataset, accuracy for females is higher when only 75% of the training data is female than when 100% of the training data is female. This suggests that lower accuracy for females is not a simple result of the fraction of female training data. By clustering face features, we show that in general, male faces are closer to other male faces than female faces, and female faces are closer to other female faces than male faces.

5. SPN-CNN: Boosting Sensor-Based Source Camera Attribution With Deep Learning
Matthias Kirchner, Cameron Johnson
Abstract: We explore means to advance source camera identification based on sensor noise in a data-driven framework. Our focus is on improving the sensor pattern noise (SPN) extraction from a single image at test time. Where existing works suppress nuisance content with denoising filters that are largely agnostic to the specific SPN signal of interest, we demonstrate that a deep learning approach can yield a more suitable extractor that leads to improved source attribution. A series of extensive experiments on various public datasets confirms the feasibility of our approach and its applicability to image manipulation localization and video source attribution. A critical discussion of potential pitfalls completes the text.

6. Subspace Capsule Network
Marzieh Edraki, Nazanin Rahnavard, Mubarak Shah
Abstract: Convolutional neural networks (CNNs) have become a key asset to most fields in AI. Despite their successful performance, CNNs suffer from a major drawback: they fail to capture the hierarchy of spatial relations among different parts of an entity. As a remedy to this problem, the idea of capsules was proposed by Hinton. In this paper, we propose the SubSpace Capsule Network (SCN) that exploits the idea of capsule networks to model possible variations in the appearance or implicitly defined properties of an entity through a group of capsule subspaces instead of simply grouping neurons to create capsules. A capsule is created by projecting an input feature vector from a lower layer onto the capsule subspace using a learnable transformation. This transformation finds the degree of alignment of the input with the properties modeled by the capsule subspace. We show that SCN is a general capsule network that can successfully be applied to both discriminative and generative models without incurring computational overhead compared to CNNs during test time. The effectiveness of SCN is evaluated through a comprehensive set of experiments on supervised image classification, semi-supervised image classification, and high-resolution image generation tasks using the generative adversarial network (GAN) framework. SCN significantly improves the performance of the baseline models in all 3 tasks.

7. Temporal Segmentation of Surgical Sub-tasks through Deep Learning with Multiple Data Sources
Yidan Qin, Sahba Aghajani Pedram, Seyedshams Feyzabadi, Max Allan, A. Jonathan McLeod, Joel W. Burdick, Mahdi Azizian
Abstract: Many tasks in robot-assisted surgeries (RAS) can be represented by finite-state machines (FSMs), where each state represents either an action (such as picking up a needle) or an observation (such as bleeding). A crucial step towards the automation of such surgical tasks is the temporal perception of the current surgical scene, which requires a real-time estimation of the states in the FSMs. The objective of this work is to estimate the current state of the surgical task based on the actions performed or events occurred as the task progresses. We propose Fusion-KVE, a unified surgical state estimation model that incorporates multiple data sources including the Kinematics, Vision, and system Events. Additionally, we examine the strengths and weaknesses of different state estimation models in segmenting states with different representative features or levels of granularity. We evaluate our model on the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS), as well as a more complex dataset involving robotic intra-operative ultrasound (RIOUS) imaging, created using the da Vinci Xi surgical system. Our model achieves a superior frame-wise state estimation accuracy up to 89.4%, which improves the state-of-the-art surgical state estimation models in both JIGSAWS suturing dataset and our RIOUS dataset.

8. iqiyi Submission to ActivityNet Challenge 2019 Kinetics-700 challenge: Hierarchical Group-wise Attention
Qian Liu, Dongyang Cai, Jie Liu, Nan Ding, Tao Wang
Abstract: In this report, the method for the iqiyi submission to the task of ActivityNet 2019 Kinetics-700 challenge is described. Three models are involved in the model ensemble stage: TSN, HG-NL and StNet. We propose the hierarchical group-wise non-local (HG-NL) module for frame-level feature aggregation for video classification. The standard non-local (NL) module is effective in aggregating frame-level features on the task of video classification but presents low parameter efficiency and high computational cost. The HG-NL method involves a hierarchical group-wise structure and generates multiple attention maps to enhance performance. Based on this hierarchical group-wise structure, the proposed method has competitive accuracy, fewer parameters and smaller computational cost than the standard NL. For the task of ActivityNet 2019 Kinetics-700 challenge, after model ensemble, we finally obtain an averaged top-1 and top-5 error of 28.444% on the test set.

9. Data augmentation with Möbius transformations
Sharon Zhou, Jiequan Zhang, Hang Jiang, Torbjörn Lundh, Andrew Y. Ng
Abstract: Data augmentation has led to substantial improvements in the performance and generalization of deep models, and remains a highly adaptable method to evolving model architectures and varying amounts of data, in particular extremely scarce amounts of available training data. In this paper, we present a novel method of applying Möbius transformations to augment input images during training. Möbius transformations are bijective conformal maps that generalize image translation to operate over complex inversion in pixel space. As a result, Möbius transformations can operate on the sample level and preserve data labels. We show that the inclusion of Möbius transformations during training enables improved generalization over prior sample-level data augmentation techniques such as cutout and standard crop-and-flip transformations, most notably in low data regimes.
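For readers unfamiliar with the map itself, a Möbius transformation is f(z) = (az + b) / (cz + d) with ad - bc ≠ 0, where a pixel (x, y) is treated as the complex number z = x + iy. The sketch below only shows the map and its inverse on single coordinates; the parameter values are arbitrary and how the paper samples parameters or resamples whole images is not reproduced here.

```python
def mobius(z, a, b, c, d):
    """Apply f(z) = (a*z + b) / (c*z + d); requires ad - bc != 0."""
    if a * d - b * c == 0:
        raise ValueError("ad - bc must be non-zero for an invertible map")
    # Note: the map has a pole where c*z + d == 0 (ZeroDivisionError there).
    return (a * z + b) / (c * z + d)


def mobius_inverse(w, a, b, c, d):
    """Inverse map: f^{-1}(w) = (d*w - b) / (-c*w + a)."""
    return mobius(w, d, -b, -c, a)


pixel = complex(3, 4)                      # pixel (x=3, y=4) as x + iy

# With c = 0 and a = d = 1 the map is a plain translation by b.
shifted = mobius(pixel, 1, complex(2, 1), 0, 1)

# Arbitrary illustrative parameters with a non-zero c (complex inversion).
params = (complex(1, 0.2), 0.5, complex(0, 0.01), 1)
warped = mobius(pixel, *params)
recovered = mobius_inverse(warped, *params)
```

Because the map is bijective, every warped coordinate can be mapped back to its source pixel, which is what an image warp needs when it fills the output grid by inverse lookup.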

10. Domain Embedded Multi-model Generative Adversarial Networks for Image-based Face Inpainting
Xian Zhang, Xin Wang, Bin Kong, Youbing Yin, Qi Song, Siwei Lyu, Jiancheng Lv, Canghong Shi, Xiaojie Li
Abstract: Prior knowledge of face shape and location plays an important role in face inpainting. However, traditional face inpainting methods mainly focus on the resolution of the generated image for the missing portion without explicitly considering the particularities of the human face, and generally produce discordant facial parts. To solve this problem, we present a stable variational latent generative model for large inpainting of face images. We first represent only face regions with the latent variable space while simultaneously constraining the random vectors to offer control over the distribution of latent variables, and combine them with the textures of the non-face parts to generate a face image with plausible contents. Two adversarial discriminators are finally used to judge whether the generated distribution is close to the real distribution or not. The model can not only synthesize novel image structures but also explicitly utilize the latent space with Eigenfaces to make better predictions. Furthermore, our work better evaluates the side face inpainting problem. Experiments on both CelebA and CelebA-HQ face datasets demonstrate that our proposed approach generates higher quality inpainting results than existing ones.

11. An Auxiliary Task for Learning Nuclei Segmentation in 3D Microscopy Images
Peter Hirsch, Dagmar Kainmueller
Abstract: Segmentation of cell nuclei in microscopy images is a prevalent necessity in cell biology. Especially for three-dimensional datasets, manual segmentation is prohibitively time-consuming, motivating the need for automated methods. Learning-based methods trained on pixel-wise ground-truth segmentations have been shown to yield state-of-the-art results on 2d benchmark image data of nuclei, yet a respective benchmark is missing for 3d image data. In this work, we perform a comparative evaluation of nuclei segmentation algorithms on a database of manually segmented 3d light microscopy volumes. We propose a novel learning strategy that boosts segmentation accuracy by means of a simple auxiliary task, thereby robustly outperforming each of our baselines. Furthermore, we show that one of our baselines, the popular three-label model, when trained with our proposed auxiliary task, outperforms the recent StarDist-3D. As an additional, practical contribution, we benchmark nuclei segmentation against nuclei detection, i.e. the task of merely pinpointing individual nuclei without generating respective pixel-accurate segmentations. For learning nuclei detection, large 3d training datasets of manually annotated nuclei center points are available. However, the impact on detection accuracy caused by training on such sparse ground truth as opposed to dense pixel-wise ground truth has not yet been quantified. To this end, we compare nuclei detection accuracy yielded by training on dense vs. sparse ground truth. Our results suggest that training on sparse ground truth yields competitive nuclei detection rates.

12. Input Dropout for Spatially Aligned Modalities
Sébastien de Blois, Mathieu Garon, Christian Gagné, Jean-François Lalonde
Abstract: Computer vision datasets containing multiple modalities such as color, depth, and thermal properties are now commonly accessible and useful for solving a wide array of challenging tasks. However, deploying multi-sensor heads is not possible in many scenarios. As such many practical solutions tend to be based on simpler sensors, mostly for cost, simplicity and robustness considerations. In this work, we propose a training methodology to take advantage of these additional modalities available in datasets, even if they are not available at test time. By assuming that the modalities have a strong spatial correlation, we propose Input Dropout, a simple technique that consists in stochastic hiding of one or many input modalities at training time, while using only the canonical (e.g. RGB) modalities at test time. We demonstrate that Input Dropout trivially combines with existing deep convolutional architectures, and improves their performance on a wide range of computer vision tasks such as dehazing, 6-DOF object tracking, pedestrian detection and object classification.
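The training-time hiding described in this abstract can be sketched in a few lines. The version below is a simplified reading under stated assumptions: each sample is a dict of spatially aligned modalities, hidden modalities are replaced by zeros, the canonical modality is never hidden, and each extra modality is dropped independently with probability `drop_prob`. None of these specifics are confirmed by the abstract, and the function name is hypothetical.

```python
import random


def input_dropout(sample, canonical, training, rng, drop_prob=0.5):
    """Return a copy of `sample` with some non-canonical modalities zeroed out.

    sample:    dict mapping a modality name to a list of values.
    canonical: name of the modality that is always kept (e.g. "rgb").
    training:  if False, every non-canonical modality is hidden.
    rng:       a random.Random instance, passed in for reproducibility.
    """
    out = {}
    for name, values in sample.items():
        # Hide extras at random during training, and always at test time.
        hide = name != canonical and (not training or rng.random() < drop_prob)
        out[name] = [0.0] * len(values) if hide else list(values)
    return out


sample = {"rgb": [0.2, 0.5, 0.9], "depth": [1.5, 1.7, 2.0]}
train_view = input_dropout(sample, "rgb", training=True, rng=random.Random(0))
test_view = input_dropout(sample, "rgb", training=False, rng=random.Random(0))
```

Keeping the input shape fixed and zeroing the hidden modality means the same network can be fed at test time with only the canonical modality available.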

13. Switchable Precision Neural Networks
Luis Guerra, Bohan Zhuang, Ian Reid, Tom Drummond
Abstract: Instantaneous and on-demand accuracy-efficiency trade-off has been recently explored in the context of neural network slimming. In this paper, we propose a flexible quantization strategy, termed Switchable Precision neural Networks (SP-Nets), to train a shared network capable of operating at multiple quantization levels. At runtime, the network can adjust its precision on the fly according to instant memory, latency, power consumption and accuracy demands. For example, by constraining the network weights to 1-bit with switchable precision activations, our shared network spans from BinaryConnect to Binarized Neural Network, allowing dot products to be performed using only summations or bit operations. In addition, a self-distillation scheme is proposed to increase the performance of the quantized switches. We tested our approach with three different quantizers and demonstrate the performance of SP-Nets against independently trained quantized models in classification accuracy for Tiny ImageNet and ImageNet datasets using ResNet-18 and MobileNet architectures.

14. Fine-Grained Fashion Similarity Learning by Attribute-Specific Embedding Network
Zhe Ma, Jianfeng Dong, Yao Zhang, Zhongzi Long, Yuan He, Hui Xue, Shouling Ji
Abstract: This paper strives to learn fine-grained fashion similarity. In this similarity paradigm, one should pay more attention to the similarity in terms of a specific design/attribute among fashion items, which has potential value in many fashion-related applications such as fashion copyright protection. To this end, we propose an Attribute-Specific Embedding Network (ASEN) to jointly learn multiple attribute-specific embeddings in an end-to-end manner, and thus measure the fine-grained similarity in the corresponding space. With two attention modules, i.e., Attribute-aware Spatial Attention and Attribute-aware Channel Attention, ASEN is able to locate the related regions and capture the essential patterns under the guidance of the specified attribute, thus making the learned attribute-specific embeddings better reflect the fine-grained similarity. Extensive experiments on four fashion-related datasets show the effectiveness of ASEN for fine-grained fashion similarity learning and its potential for fashion reranking.

15. FourierNet: Compact mask representation for instance segmentation using differentiable shape decoders
Nuri Benbarka, Hamd ul Moqeet Riaz, Andreas Zell
Abstract: We present FourierNet, a single-shot, anchor-free, fully convolutional instance segmentation method, which predicts a shape vector that is converted into contour points using a numerical transformation. Compared to previous methods, we introduce a new training technique, where we utilize a differentiable shape decoder, which achieves automatic weight balancing of the shape vector's coefficients. Fourier series was utilized as a shape encoder because of its coefficient interpretability and fast implementation. By using its lower frequencies we were able to retrieve smooth and compact masks. FourierNet shows promising results compared to polygon representation methods, achieving 30.6 mAP on the MS COCO 2017 benchmark. At lower image resolutions, it runs at 26.6 FPS with 24.3 mAP. It achieves 23.3 mAP using just 8 parameters to represent the mask, which is double the number of parameters needed to predict a bounding box. Code will be available at: this http URL.
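To make the idea of decoding a short shape vector into contour points concrete, the sketch below evaluates a truncated real Fourier series for the radius around a center point. The polar parameterization, the coefficient layout, and the function name are assumptions chosen for illustration; the abstract does not specify the authors' decoder.

```python
import math


def decode_contour(center, a0, coeffs, num_points=8):
    """Turn Fourier coefficients into contour points around `center`.

    a0:     mean radius (the zero-frequency term).
    coeffs: list of (a_k, b_k) pairs for frequencies k = 1, 2, ...
    The radius at angle t is a0 + sum_k (a_k*cos(k*t) + b_k*sin(k*t)).
    """
    cx, cy = center
    points = []
    for n in range(num_points):
        theta = 2 * math.pi * n / num_points
        radius = a0 + sum(a * math.cos(k * theta) + b * math.sin(k * theta)
                          for k, (a, b) in enumerate(coeffs, start=1))
        points.append((cx + radius * math.cos(theta),
                       cy + radius * math.sin(theta)))
    return points


# Zero-frequency term only: a circle of radius 2 around the origin.
circle = decode_contour((0.0, 0.0), 2.0, [], num_points=4)

# One extra low-frequency pair bends the circle into a smooth, egg-like shape.
smooth_shape = decode_contour((10.0, 10.0), 5.0, [(1.0, 0.0)], num_points=16)
```

Keeping only the low-frequency coefficients limits how sharply the radius can change between neighboring angles, which is why a handful of parameters still yields a smooth outline.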

16. Deep Robust Multilevel Semantic Cross-Modal Hashing
Ge Song, Jun Zhao, Xiaoyang Tan
Abstract: Hashing-based cross-modal retrieval has recently made significant progress. However, straightforwardly embedding data from different modalities into a joint Hamming space will inevitably produce false codes due to the intrinsic modality discrepancy and noise. We present a novel Robust Multilevel Semantic Hashing (RMSH) for more accurate cross-modal retrieval. It seeks to preserve fine-grained similarity among data with rich semantics, while explicitly requiring distances between dissimilar points to be larger than a specific value for strong robustness. For this, we give an effective bound of this value based on information coding-theoretic analysis, and the above goals are embodied into a margin-adaptive triplet loss. Furthermore, we introduce pseudo-codes via fusing multiple hash codes to explore seldom-seen semantics, alleviating the sparsity problem of similarity information. Experiments on three benchmarks show the validity of the derived bounds, and our method achieves state-of-the-art performance.
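As background for the loss mentioned in this abstract, a generic triplet loss over binary codes penalizes a triplet whenever the dissimilar item is not at least `margin` bits farther from the anchor than the similar item. The sketch uses a fixed, hand-picked margin; the paper's margin-adaptive rule and its derived bound are not reproduced, and the function names are illustrative.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length binary codes differ."""
    if len(x) != len(y):
        raise ValueError("codes must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


def triplet_loss(anchor, positive, negative, margin):
    """Hinge-style triplet loss in Hamming space with a fixed margin."""
    gap = hamming_distance(anchor, positive) - hamming_distance(anchor, negative)
    return max(0, gap + margin)


anchor = [1, 0, 1, 1]
positive = [1, 0, 1, 0]    # 1 bit away from the anchor
negative = [0, 1, 1, 1]    # 2 bits away from the anchor

loose = triplet_loss(anchor, positive, negative, margin=1)   # constraint met
strict = triplet_loss(anchor, positive, negative, margin=2)  # violated by 1 bit
```

A larger margin forces dissimilar codes farther apart, which is the robustness requirement the abstract describes; choosing that margin per triplet is the part the paper adds.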

17. Learning Class Regularized Features for Action Recognition
Alexandros Stergiou, Ronald Poppe, Remco C. Veltkamp
Abstract: Training Deep Convolutional Neural Networks (CNNs) is based on the notion of using multiple kernels and non-linearities in their subsequent activations to extract useful features. The kernels are used as general feature extractors without specific correspondence to the target class. As a result, the extracted features do not correspond to specific classes. Subtle differences between similar classes are modeled in the same way as large differences between dissimilar classes. To overcome the class-agnostic use of kernels in CNNs, we introduce a novel method named Class Regularization that performs class-based regularization of layer activations. We demonstrate that this not only improves feature search during training, but also allows an explicit assignment of features per class during each stage of the feature extraction process. We show that using Class Regularization blocks in state-of-the-art CNN architectures for action recognition leads to systematic improvement gains of 1.8%, 1.2% and 1.4% on the Kinetics, UCF-101 and HMDB-51 datasets, respectively.

18. Statistical Outlier Identification in Multi-robot Visual SLAM using Expectation Maximization
Arman Karimian, Ziqi Yang, Roberto Tron
Abstract: This paper introduces a novel and distributed method for detecting inter-map loop closure outliers in simultaneous localization and mapping (SLAM). The proposed algorithm does not rely on a good initialization and can handle more than two maps at a time. In multi-robot SLAM applications, maps made by different agents have nonidentical spatial frames of reference which makes initialization very difficult in the presence of outliers. This paper presents a probabilistic approach for detecting incorrect orientation measurements prior to pose graph optimization by checking the geometric consistency of rotation measurements. Expectation-Maximization is used to fine-tune the model parameters. As ancillary contributions, a new approximate discrete inference procedure is presented which uses evidence on loops in a graph and is based on optimization (Alternate Direction Method of Multipliers). This method yields superior results compared to Belief Propagation and has convergence guarantees. Simulation and experimental results are presented that evaluate the performance of the outlier detection method and the inference algorithm on synthetic and real-world data.

19. SideInfNet: A Deep Neural Network for Semi-Automatic Semantic Segmentation with Side Information
Jing Yu Koh, Duc Thanh Nguyen, Quang-Trung Truong, Sai-Kit Yeung, Alexander Binder
Abstract: Fully-automatic execution is the ultimate goal for many Computer Vision applications. However, this objective is not always realistic in tasks associated with high failure costs, such as medical applications. For these tasks, a compromise between fully-automatic execution and user interactions is often preferred due to desirable accuracy and performance. Semi-automatic methods require minimal effort from experts by allowing them to provide cues that guide computer algorithms. Inspired by the practicality and applicability of the semi-automatic approach, this paper proposes a novel deep neural network architecture, namely SideInfNet that effectively integrates features learnt from images with side information extracted from user annotations to produce high quality semantic segmentation results. To evaluate our method, we applied the proposed network to three semantic segmentation tasks and conducted extensive experiments on benchmark datasets. Experimental results and comparison with prior work have verified the superiority of our model, suggesting the generality and effectiveness of the model in semi-automatic semantic segmentation.

20. Visual search over billions of aerial and satellite images
Ryan Keisler, Samuel W. Skillman, Sunny Gonnabathula, Justin Poehnelt, Xander Rudelis, Michael S. Warren
Abstract: We present a system for performing visual search over billions of aerial and satellite images. The purpose of visual search is to find images that are visually similar to a query image. We define visual similarity using 512 abstract visual features generated by a convolutional neural network that has been trained on aerial and satellite imagery. The features are converted to binary values to reduce data and compute requirements. We employ a hash-based search using Bigtable, a scalable database service from Google Cloud. Searching the continental United States at 1-meter pixel resolution, corresponding to approximately 2 billion images, takes approximately 0.1 seconds. This system enables real-time visual search over the surface of the earth, and an interactive demo is available at this https URL.
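The step of converting real-valued features to binary values and comparing them can be sketched as follows. This is a toy illustration only: it thresholds at zero (an assumption), uses 4 features instead of 512, and ranks by a brute-force Hamming scan over an in-memory dict rather than the hash-based Bigtable lookup the abstract describes. All names are hypothetical.

```python
def binarize(features, threshold=0.0):
    """Pack a feature vector into an int, one bit per feature."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > threshold else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def search(query_code, index, k=1):
    """Return the k index keys whose codes are closest to the query."""
    return sorted(index, key=lambda name: hamming(query_code, index[name]))[:k]


index = {
    "tile_a": binarize([0.3, -1.2, 0.8, -0.1]),   # bits 1010
    "tile_b": binarize([-0.5, 0.9, -0.2, 0.4]),   # bits 0101
}
query = binarize([0.2, -0.7, 0.9, 0.3])           # bits 1011
nearest = search(query, index, k=1)
```

Storing one bit per feature cuts storage roughly 32-fold relative to 32-bit floats and reduces comparison to XOR plus a bit count, which is the data and compute reduction the abstract refers to.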

21. Image Fine-grained Inpainting
Zheng Hui, Jie Li, Xiumei Wang, Xinbo Gao
Abstract: Image inpainting techniques have recently shown promising improvement with the assistance of generative adversarial networks (GANs). However, most of them often produce completed results with unreasonable structure or blurriness. To mitigate this problem, we present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields. Benefiting from this property of the network, we can more easily recover large regions in an incomplete image. To better train this efficient generator, in addition to the frequently used VGG feature matching loss, we design a novel self-guided regression loss for concentrating on uncertain areas and enhancing the semantic details. Besides, we devise a geometrical alignment constraint item to compensate for the pixel-based distance between prediction features and ground-truth ones. We also employ a discriminator with local and global branches to ensure local-global content consistency. To further improve the quality of generated images, discriminator feature matching on the local branch is introduced, which dynamically minimizes the similarity of intermediate features between synthetic and ground-truth patches. Extensive experiments on several public datasets demonstrate that our approach outperforms current state-of-the-art methods. Code is available at this https URL.

22. Adaptive Deep Metric Embeddings for Person Re-Identification under Occlusions
Wanxiang Yang, Yan Yan, Si Chen
Abstract: Person re-identification (ReID) under occlusions is a challenging problem in video surveillance. Most existing person ReID methods take advantage of local features to deal with occlusions. However, these methods usually extract features independently from the local regions of an image without considering the relationship among different local regions. In this paper, we propose a novel person ReID method, which learns the spatial dependencies between the local regions and extracts the discriminative feature representation of the pedestrian image based on Long Short-Term Memory (LSTM), dealing with the problem of occlusions. In particular, we propose a novel loss (termed the adaptive nearest neighbor loss) based on the classification uncertainty to effectively reduce intra-class variations while enlarging inter-class differences within the adaptive neighborhood of the sample. The proposed loss enables the deep neural network to adaptively learn discriminative metric embeddings, which significantly improve the generalization capability of recognizing unseen person identities. Extensive comparative evaluations on challenging person ReID datasets demonstrate the significantly improved performance of the proposed method compared with several state-of-the-art methods.

23. Object-Adaptive LSTM Network for Real-time Visual Tracking with Adversarial Data Augmentation
Yihan Du, Yan Yan, Si Chen, Yang Hua
Abstract: In recent years, deep-learning-based visual tracking methods have obtained great success owing to the powerful feature representation ability of Convolutional Neural Networks (CNNs). Among these methods, classification-based tracking methods exhibit excellent performance, while their speeds are heavily limited by the expensive computation for massive proposal feature extraction. In contrast, matching-based tracking methods (such as Siamese networks) possess remarkable speed superiority. However, the absence of online updating renders these methods unadaptable to significant object appearance variations. In this paper, we propose a novel real-time visual tracking method, which adopts an object-adaptive LSTM network to effectively capture the video sequential dependencies and adaptively learn the object appearance variations. For high computational efficiency, we also present a fast proposal selection strategy, which utilizes the matching-based tracking method to pre-estimate dense proposals and selects high-quality ones to feed to the LSTM network for classification. This strategy efficiently filters out some irrelevant proposals and avoids the redundant computation for feature extraction, which enables our method to operate faster than conventional classification-based tracking methods. In addition, to handle the problems of sample inadequacy and class imbalance during online tracking, we adopt a data augmentation technique based on the Generative Adversarial Network (GAN) to facilitate the training of the LSTM network. Extensive experiments on four visual tracking benchmarks demonstrate the state-of-the-art performance of our method in terms of both tracking accuracy and speed, showing the great potential of recurrent structures for visual tracking.

24. Poisson Kernel Avoiding Self-Smoothing in Graph Convolutional Networks
Ziqing Yang, Shoudong Han, Jun Zhao
Abstract: Graph convolutional networks (GCNs) are now an effective tool for dealing with non-Euclidean data, such as social networks in social behavior analysis, molecular structure analysis in the field of chemistry, and skeleton-based action recognition. The graph convolutional kernel is one of the most significant factors in a GCN for extracting node features, and some improvements to it have achieved promising performance theoretically and experimentally. However, there is limited research on how exactly different data types and graph structures influence the performance of these kernels. Most existing methods use an adaptive convolutional kernel to deal with a given graph structure, which still does not reveal the underlying reasons. In this paper, we start from a theoretical analysis of the spectral graph and study the properties of existing graph convolutional kernels. Taking into consideration some designed datasets with specific parameters, we reveal the self-smoothing phenomenon of convolutional kernels. We then propose the Poisson kernel, which can avoid self-smoothing without training any adaptive kernel. Experimental results demonstrate that our Poisson kernel not only works well on the benchmark dataset where state-of-the-art methods work fine, but is also evidently superior to them on synthetic datasets.

25. Learning Hyperspectral Feature Extraction and Classification with ResNeXt Network
Divinah Nyasaka, Jing Wang, Haron Tinega
Abstract: Hyperspectral image (HSI) classification is a standard remote sensing task, in which each image pixel is given a label indicating the physical land cover on the earth's surface. The achievements of image semantic segmentation and deep learning approaches on ordinary images have accelerated research on hyperspectral image classification. Moreover, the utilization of both spectral and spatial cues in hyperspectral images has been shown to improve classification accuracy. Using only 3D Convolutional Neural Networks (3D-CNN) to extract both spatial and spectral cues from hyperspectral images results in an explosion of parameters and hence a high computational cost. We propose a network architecture called MixedSN that uses 3D convolutions to model spectral-spatial information in the early layers of the architecture and 2D convolutions in the top layers, which mainly deal with semantic abstraction. We constrain our architecture to ResNeXt blocks because of their performance and simplicity. Our model drastically reduces the number of parameters and achieves classification performance comparable to state-of-the-art methods on the Indian Pine (IP) scene dataset, the Pavia University scene (PU) dataset, the Salinas (SA) scene dataset, and the Botswana (BW) dataset.

26. Impact of ImageNet Model Selection on Domain Adaptation
Youshan Zhang, Brian D. Davison
Abstract: Deep neural networks are widely used in image classification problems. However, little work addresses how features from different deep neural networks affect the domain adaptation problem. Existing methods often extract deep features from one ImageNet model, without exploring other neural networks. In this paper, we investigate how different ImageNet models affect transfer accuracy on domain adaptation problems. We extract features from sixteen distinct pre-trained ImageNet models and examine the performance of twelve benchmarking methods when using the features. Extensive experimental results show that a higher accuracy ImageNet model produces better features, and leads to higher accuracy on domain adaptation problems (with a correlation coefficient of up to 0.95). We also examine the architecture of each neural network to find the best layer for feature extraction. Together, performance from our features exceeds that of the state-of-the-art in three benchmark datasets.

27. Opposite Structure Learning for Semi-supervised Domain Adaptation
Can Qin, Lichen Wang, Qianqian Ma, Yu Yin, Huan Wang, Yun Fu
Abstract: Current adversarial adaptation methods attempt to align the cross-domain features whereas two challenges remain unsolved: 1) conditional distribution mismatch between different domains and 2) the bias of decision boundary towards the source domain. To solve these challenges, we propose a novel framework for semi-supervised domain adaptation by unifying the learning of opposite structures (UODA). UODA consists of a generator and two classifiers (i.e., the source-based and the target-based classifiers respectively) which are trained with opposite forms of losses for a unified object. The target-based classifier attempts to cluster the target features to improve intra-class density and enlarge inter-class divergence. Meanwhile, the source-based classifier is designed to scatter the source features to enhance the smoothness of decision boundary. Through the alternation of source-feature expansion and target-feature clustering procedures, the target features are well-enclosed within the dilated boundary of the corresponding source features. This strategy effectively makes the cross-domain features precisely aligned. To overcome the model collapse through training, we progressively update the measurement of distance and the feature representation on both domains via an adversarial training paradigm. Extensive experiments on the benchmarks of DomainNet and Office-home datasets demonstrate the effectiveness of our approach over the state-of-the-art method.

28. Continuous Geodesic Convolutions for Learning on 3D Shapes
Zhangsihao Yang, Or Litany, Tolga Birdal, Srinath Sridhar, Leonidas Guibas
Abstract: The majority of descriptor-based methods for geometric processing of non-rigid shapes rely on hand-crafted descriptors. Recently, learning-based techniques have been shown to be effective, achieving state-of-the-art results in a variety of tasks. Yet, even though these methods can in principle work directly on raw data, most methods still rely on hand-crafted descriptors at the input layer. In this work, we wish to challenge this practice and use a neural network to learn descriptors directly from the raw mesh. To this end, we introduce two modules into our neural architecture. The first is a local reference frame (LRF) used to explicitly make the features invariant to rigid transformations. The second is continuous convolution kernels that provide robustness to sampling. We show the efficacy of our proposed network in learning on raw meshes using two cornerstone tasks: shape matching and human body parts segmentation. Our method achieves superior results over baseline methods that use hand-crafted descriptors.

29. Activation Density driven Energy-Efficient Pruning in Training
Timothy Foldy-Porto, Priyadarshini Panda
Abstract: The process of neural network pruning with suitable fine-tuning and retraining can yield networks with considerably fewer parameters than the original with comparable degrees of accuracy. Typically, pruning methods require large, pre-trained networks as a starting point from which they perform a time-intensive iterative pruning and retraining algorithm. We propose a novel pruning in-training method that prunes a network in real time during training, reducing the overall training time to achieve an optimal compressed network. To do so, we introduce an activation-density-based analysis that identifies the optimal relative sizing or compression for each layer of the network. Our method removes the need for pre-training and is architecture agnostic, allowing it to be employed on a wide variety of systems. For VGG-19 and ResNet18 on CIFAR-10, CIFAR-100, and TinyImageNet, we obtain exceedingly sparse networks (up to 200x reduction in parameters and >60x reduction in inference compute operations in the best case) with comparable accuracies (up to 2%-3% loss with respect to the baseline network). By reducing the network size periodically during training, we achieve total training times that are shorter than those of previously proposed pruning methods. Furthermore, training compressed networks at different epochs with our proposed method yields considerable reduction in training compute complexity (1.6x to 3.2x lower) at near iso-accuracy as compared to a baseline network trained entirely from scratch.
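One plausible reading of "activation density" is the fraction of non-zero activations a layer produces after ReLU. The sketch below assumes that definition, plus a hypothetical rule that shrinks a layer's width in proportion to its density. Neither the function names nor the resizing rule come from the abstract.

```python
import numpy as np

def activation_density(activations: np.ndarray) -> float:
    """Fraction of non-zero entries in a layer's output (assumed definition)."""
    return float(np.count_nonzero(activations)) / activations.size

def suggested_width(current_width: int, density: float) -> int:
    """Hypothetical resizing rule: scale the layer width by its density."""
    return max(1, int(round(current_width * density)))

rng = np.random.default_rng(0)
pre_act = rng.standard_normal((128, 64))   # batch of 128, layer width 64
post_relu = np.maximum(pre_act, 0.0)       # ReLU zeroes the negative entries
density = activation_density(post_relu)
new_width = suggested_width(64, density)
```

Under this reading, a layer whose units are mostly inactive is a candidate for a narrower width, and the measurement can be repeated periodically during training as the abstract describes.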

30. AnimePose: Multi-person 3D pose estimation and animation
Laxman Kumarapu, Prerana Mukherjee
Abstract: 3D animation of humans in action is quite challenging, as it involves a huge setup with several motion trackers all over the person's body to track the movements of every limb. This is time-consuming and may cause the person discomfort from wearing exoskeleton body suits with motion sensors. In this work, we present a trivial yet effective solution to generate 3D animation of multiple persons from a 2D video using deep learning. Although significant improvement has been achieved recently in 3D human pose estimation, most prior works perform well in the single-person case, while multi-person pose estimation is still a challenging problem. In this work, we first propose a supervised multi-person 3D pose estimation and animation framework, namely AnimePose, for a given input RGB video sequence. The pipeline of the proposed system consists of various modules: i) person detection and segmentation, ii) depth map estimation, iii) lifting 2D to 3D information for person localization, and iv) person trajectory prediction and human pose tracking. Our proposed system produces results comparable to previous state-of-the-art 3D multi-person pose estimation methods on the publicly available MuCo-3DHP and MuPoTS-3D datasets, and it also outperforms previous state-of-the-art human pose tracking methods by a significant margin of 11.7% performance gain in MOTA score on the Posetrack 2018 dataset.

31. Trust Your Model: Iterative Label Improvement and Robust Training by Confidence Based Filtering and Dataset Partitioning
Christian Haase-Schütz, Rainer Stal, Heinz Hertlein, Bernhard Sick
Abstract: State-of-the-art, high-capacity deep neural networks not only require large amounts of labelled training data but are also highly susceptible to label errors in this data, typically resulting in large efforts and costs and therefore limiting the applicability of deep learning. To alleviate this issue, we propose a novel meta training and labelling scheme that is able to use inexpensive unlabelled data by taking advantage of the generalization power of deep neural networks. We show experimentally that by solely relying on one network architecture and our proposed scheme of iterative training and prediction steps, both label quality and resulting model accuracy can be improved significantly. Our method achieves state-of-the-art results, while being architecture agnostic and therefore broadly applicable. Compared to other methods dealing with erroneous labels, our approach neither requires another network to be trained, nor does it necessarily need an additional, highly accurate reference label set. Instead of removing samples from a labelled set, our technique uses additional sensor data without the need for manual labelling.

32. Optimization of Structural Similarity in Mathematical Imaging
D. Otero, D. La Torre, O. Michailovich, E.R. Vrscay
Abstract: It is now generally accepted that Euclidean-based metrics may not always adequately represent the subjective judgement of a human observer. As a result, many image processing methodologies have been recently extended to take advantage of alternative visual quality measures, the most prominent of which is the Structural Similarity Index Measure (SSIM). The superiority of the latter over Euclidean-based metrics has been demonstrated in several studies. However, being focused on specific applications, the findings of such studies often lack generality which, if otherwise acknowledged, could have provided useful guidance for further development of SSIM-based image processing algorithms. Accordingly, instead of focusing on a particular image processing task, in this paper, we introduce a general framework that encompasses a wide range of imaging applications in which the SSIM can be employed as a fidelity measure. Subsequently, we show how the framework can be used to cast some standard as well as original imaging tasks into optimization problems, followed by a discussion of a number of novel numerical strategies for their solution.
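For reference, the SSIM between two image patches x and y is commonly written as below. This is the standard textbook definition, not a formula quoted from the abstract: \mu_x and \mu_y denote the patch means, \sigma_x^2 and \sigma_y^2 the variances, \sigma_{xy} the covariance, and C_1, C_2 small positive stabilizing constants.

```latex
\[
\mathrm{SSIM}(x, y) =
\frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
     {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)}
\]
```

Because SSIM is a ratio rather than a sum of squared differences, problems that use it as a fidelity term generally call for different numerical treatment than least-squares problems.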

33. Quantifying the Value of Lateral Views in Deep Learning for Chest X-rays
Mohammad Hashir, Hadrien Bertrand, Joseph Paul Cohen
Abstract: Most deep learning models in chest X-ray prediction utilize the posteroanterior (PA) view due to the lack of other views available. PadChest is a large-scale chest X-ray dataset that has almost 200 labels and multiple views available. In this work, we use PadChest to explore multiple approaches to merging the PA and lateral views for predicting the radiological labels associated with the X-ray image. We find that different methods of merging the model utilize the lateral view differently. We also find that including the lateral view increases performance for 32 labels in the dataset, while being neutral for the others. The increase in overall performance is comparable to the one obtained by using only the PA view with twice the number of patients in the training set.

34. Closing the Dequantization Gap: PixelCNN as a Single-Layer Flow
Didrik Nielsen, Ole Winther
Abstract: Flow models have recently made great progress at modeling quantized sensor data such as images and audio. Due to the continuous nature of flow models, dequantization is typically applied when using them for such quantized data. In this paper, we propose subset flows, a class of flows which can tractably transform subsets of the input space in one pass. As a result, they can be applied directly to quantized data without the need for dequantization. Based on this class of flows, we present a novel interpretation of several existing autoregressive models, including WaveNet and PixelCNN, as single-layer flow models defined through an invertible transformation between uniform noise and data samples. This interpretation suggests that these existing models, 1) admit a latent representation of data and 2) can be stacked in multiple flow layers. We demonstrate this by exploring the latent space of a PixelCNN and by stacking PixelCNNs in multiple flow layers.



[arXiv papers] Computer Vision and Pattern Recognition 2020-02-10
