
[arXiv Papers] Computer Vision and Pattern Recognition 2020-07-28

Contents

1. Learning Lane Graph Representations for Motion Forecasting [PDF] Abstract
2. Noisy Agents: Self-supervised Exploration by Predicting Auditory Events [PDF] Abstract
3. Associative3D: Volumetric Reconstruction from Sparse Views [PDF] Abstract
4. The Unsupervised Method of Vessel Movement Trajectory Prediction [PDF] Abstract
5. WGANVO: Monocular Visual Odometry based on Generative Adversarial Networks [PDF] Abstract
6. A Closer Look at Art Mediums: The MAMe Image Classification Dataset [PDF] Abstract
7. Ordinary Differential Equation and Complex Matrix Exponential for Multi-resolution Image Registration [PDF] Abstract
8. 3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning [PDF] Abstract
9. Solving Linear Inverse Problems Using the Prior Implicit in a Denoiser [PDF] Abstract
10. Message Passing Least Squares Framework and its Application to Rotation Synchronization [PDF] Abstract
11. Black-Box Face Recovery from Identity Features [PDF] Abstract
12. Towards Accuracy-Fairness Paradox: Adversarial Example-based Data Augmentation for Visual Debiasing [PDF] Abstract
13. SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training [PDF] Abstract
14. MADGAN: unsupervised Medical Anomaly Detection GAN using multiple adjacent brain MRI slice reconstruction [PDF] Abstract
15. Differentiable Manifold Reconstruction for Point Cloud Denoising [PDF] Abstract
16. Reconstruction Regularized Deep Metric Learning for Multi-label Image Classification [PDF] Abstract
17. A Novel adaptive optimization of Dual-Tree Complex Wavelet Transform for Medical Image Fusion [PDF] Abstract
18. The Effect of Wearing a Mask on Face Recognition Performance: an Exploratory Study [PDF] Abstract
19. Identity-Guided Human Semantic Parsing for Person Re-Identification [PDF] Abstract
20. Two-Level Residual Distillation based Triple Network for Incremental Object Detection [PDF] Abstract
21. Contraction Mapping of Feature Norms for Classifier Learning on the Data with Different Quality [PDF] Abstract
22. YOLOpeds: Efficient Real-Time Single-Shot Pedestrian Detection for Smart Camera Applications [PDF] Abstract
23. Ladybird: Quasi-Monte Carlo Sampling for Deep Implicit Field Based 3D Reconstruction with Symmetry [PDF] Abstract
24. NOH-NMS: Improving Pedestrian Detection by Nearby Objects Hallucination [PDF] Abstract
25. Decomposed Generation Networks with Structure Prediction for Recipe Generation from Food Images [PDF] Abstract
26. Part-Aware Data Augmentation for 3D Object Detection in Point Cloud [PDF] Abstract
27. Feature visualization of Raman spectrum analysis with deep convolutional neural network [PDF] Abstract
28. Self-Prediction for Joint Instance and Semantic Segmentation of Point Clouds [PDF] Abstract
29. Few-shot Knowledge Transfer for Fine-grained Cartoon Face Generation [PDF] Abstract
30. Rethinking Generative Zero-Shot Learning: An Ensemble Learning Perspective for Recognising Visual Patches [PDF] Abstract
31. Split Computing for Complex Object Detectors: Challenges and Preliminary Results [PDF] Abstract
32. K-Shot Contrastive Learning of Visual Features with Multiple Instance Augmentations [PDF] Abstract
33. Reconstructing NBA Players [PDF] Abstract
34. Research Progress of Convolutional Neural Network and its Application in Object Detection [PDF] Abstract
35. Representation Learning with Video Deep InfoMax [PDF] Abstract
36. Learning Task-oriented Disentangled Representations for Unsupervised Domain Adaptation [PDF] Abstract
37. REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering [PDF] Abstract
38. Point-to-set distance functions for weakly supervised segmentation [PDF] Abstract
39. OASIS: A Large-Scale Dataset for Single Image 3D in the Wild [PDF] Abstract
40. Learning and aggregating deep local descriptors for instance-level recognition [PDF] Abstract
41. Deep Photometric Stereo for Non-Lambertian Surfaces [PDF] Abstract
42. Challenge-Aware RGBT Tracking [PDF] Abstract
43. Virtual Multi-view Fusion for 3D Semantic Segmentation [PDF] Abstract
44. Contrastive Visual-Linguistic Pretraining [PDF] Abstract
45. GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-aware Supervision [PDF] Abstract
46. Towards End-to-end Video-based Eye-Tracking [PDF] Abstract
47. SADet: Learning An Efficient and Accurate Pedestrian Detector [PDF] Abstract
48. Detection and Annotation of Plant Organs from Digitized Herbarium Scans using Deep Learning [PDF] Abstract
49. Towards Purely Unsupervised Disentanglement of Appearance and Shape for Person Images Generation [PDF] Abstract
50. U2-ONet: A Two-level Nested Octave U-structure with Multiscale Attention Mechanism for Moving Instances Segmentation [PDF] Abstract
51. SMART: Simultaneous Multi-Agent Recurrent Trajectory Prediction [PDF] Abstract
52. Approaches of large-scale images recognition with more than 50,000 categoris [PDF] Abstract
53. A Dual Iterative Refinement Method for Non-rigid Shape Matching [PDF] Abstract
54. Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve [PDF] Abstract
55. Style is a Distribution of Features [PDF] Abstract
56. HATNet: An End-to-End Holistic Attention Network for Diagnosis of Breast Biopsy Images [PDF] Abstract
57. Robust and Generalizable Visual Representation Learning via Random Convolutions [PDF] Abstract
58. GP-Aligner: Unsupervised Non-rigid Groupwise Point Set Registration Based On Optimized Group Latent Descriptor [PDF] Abstract
59. MRGAN: Multi-Rooted 3D Shape Generation with Unsupervised Part Disentanglement [PDF] Abstract
60. Gradient Regularized Contrastive Learning for Continual Domain Adaptation [PDF] Abstract
61. Video Super Resolution Based on Deep Learning: A comprehensive survey [PDF] Abstract
62. Crowdsourced 3D Mapping: A Combined Multi-View Geometry and Self-Supervised Learning Approach [PDF] Abstract
63. Approximated Bilinear Modules for Temporal Modeling [PDF] Abstract
64. Learning Disentangled Representations with Latent Variation Predictability [PDF] Abstract
65. MirrorNet: Bio-Inspired Adversarial Attack for Camouflaged Object Segmentation [PDF] Abstract
66. Applying Semantic Segmentation to Autonomous Cars in the Snowy Environment [PDF] Abstract
67. OpenRooms: An End-to-End Open Framework for Photorealistic Indoor Scene Datasets [PDF] Abstract
68. A Self-Training Approach for Point-Supervised Object Detection and Counting in Crowds [PDF] Abstract
69. Counting Fish and Dolphins in Sonar Images Using Deep Learning [PDF] Abstract
70. Spatiotemporal Bundle Adjustment for Dynamic 3D Human Reconstruction in the Wild [PDF] Abstract
71. Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data [PDF] Abstract
72. Hard negative examples are hard, but useful [PDF] Abstract
73. Point Cloud Based Reinforcement Learning for Sim-to-Real and Partial Observability in Visual Navigation [PDF] Abstract
74. MMDF: Mobile Microscopy Deep Framework [PDF] Abstract
75. Cloud Detection through Wavelet Transforms in Machine Learning and Deep Learning [PDF] Abstract
76. Towards Learning Convolutions from Scratch [PDF] Abstract
77. Orpheus: A New Deep Learning Framework for Easy Deployment and Evaluation of Edge Inference [PDF] Abstract
78. Hardware Implementation of Hyperbolic Tangent Function using Catmull-Rom Spline Interpolation [PDF] Abstract
79. Attention-based Graph ResNet for Motor Intent Detection from Raw EEG signals [PDF] Abstract
80. Image-driven discriminative and generative machine learning algorithms for establishing microstructure-processing relationships [PDF] Abstract
81. XCAT-GAN for Synthesizing 3D Consistent Labeled Cardiac MR Images on Anatomically Variable XCAT Phantoms [PDF] Abstract
82. ALF: Autoencoder-based Low-rank Filter-sharing for Efficient Convolutional Neural Networks [PDF] Abstract
83. Dual Distribution Alignment Network for Generalizable Person Re-Identification [PDF] Abstract
84. Uniformizing Techniques to Process CT scans with 3D CNNs for Tuberculosis Prediction [PDF] Abstract
85. UIAI System for Short-Duration Speaker Verification Challenge 2020 [PDF] Abstract
86. Regularized Flexible Activation Function Combinations for Deep Neural Networks [PDF] Abstract
87. MACU-Net Semantic Segmentation from High-Resolution Remote Sensing Images [PDF] Abstract
88. A Preliminary Exploration into an Alternative CellLineNet: An Evolutionary Approach [PDF] Abstract
89. Tighter risk certificates for neural networks [PDF] Abstract
90. CNN Detection of GAN-Generated Face Images based on Cross-Band Co-occurrences Analysis [PDF] Abstract
91. 3D Neural Network for Lung Cancer Risk Prediction on CT Volumes [PDF] Abstract
92. Modal Uncertainty Estimation via Discrete Latent Representation [PDF] Abstract
93. Joint Featurewise Weighting and Lobal Structure Learning for Multi-view Subspace Clustering [PDF] Abstract
94. All-Optical Information Processing Capacity of Diffractive Surfaces [PDF] Abstract
95. Selection of Proper EEG Channels for Subject Intention Classification Using Deep Learning [PDF] Abstract

Abstracts

1. Learning Lane Graph Representations for Motion Forecasting [PDF] Back to Contents
  Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, Raquel Urtasun
Abstract: We propose a motion forecasting model that exploits a novel structured map representation as well as actor-map interactions. Instead of encoding vectorized maps as raster images, we construct a lane graph from raw map data to explicitly preserve the map structure. To capture the complex topology and long range dependencies of the lane graph, we propose LaneGCN which extends graph convolutions with multiple adjacency matrices and along-lane dilation. To capture the complex interactions between actors and maps, we exploit a fusion network consisting of four types of interactions, actor-to-lane, lane-to-lane, lane-to-actor and actor-to-actor. Powered by LaneGCN and actor-map interactions, our model is able to predict accurate and realistic multi-modal trajectories. Our approach significantly outperforms the state-of-the-art on the large scale Argoverse motion forecasting benchmark.
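
The following is a minimal PyTorch sketch of the layer type the abstract describes: a graph convolution that aggregates lane-node features over several relation-specific adjacency matrices (for example, successor connectivity at multiple along-lane dilation scales). All names and sizes are illustrative, not taken from the authors' released code.

```python
# LaneGCN-style layer sketch: node features are updated by aggregating over
# several adjacency matrices (e.g., successors at different dilation scales).
import torch
import torch.nn as nn

class MultiAdjGraphConv(nn.Module):
    def __init__(self, dim, num_relations):
        super().__init__()
        self.self_proj = nn.Linear(dim, dim)
        self.rel_projs = nn.ModuleList(
            nn.Linear(dim, dim, bias=False) for _ in range(num_relations)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, adjs):
        # x: (N, dim) lane-node features; adjs: list of (N, N) adjacencies
        out = self.self_proj(x)
        for adj, proj in zip(adjs, self.rel_projs):
            out = out + adj @ proj(x)   # aggregate along each relation
        return torch.relu(self.norm(out))

# toy usage: 6 lane nodes, successor adjacency and its 2-hop dilation
N, D = 6, 16
succ = torch.diag(torch.ones(N - 1), 1)
succ2 = (succ @ succ).clamp(max=1.0)    # dilated along-lane connectivity
layer = MultiAdjGraphConv(D, num_relations=2)
feats = layer(torch.randn(N, D), [succ, succ2])
print(feats.shape)  # torch.Size([6, 16])
```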

2. Noisy Agents: Self-supervised Exploration by Predicting Auditory Events [PDF] Back to Contents
  Chuang Gan, Xiaoyu Chen, Phillip Isola, Antonio Torralba, Joshua B. Tenenbaum
Abstract: Humans integrate multiple sensory modalities (e.g. visual and audio) to build a causal understanding of the physical world. In this work, we propose a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions through auditory event prediction. First, we allow the agent to collect a small amount of acoustic data and use K-means to discover underlying auditory event clusters. We then train a neural network to predict the auditory events and use the prediction errors as intrinsic rewards to guide RL exploration. Experimental results on Atari games show that our new intrinsic motivation significantly outperforms several state-of-the-art baselines. We further visualize our noisy agents' behavior in a physics environment and demonstrate that our newly designed intrinsic reward leads to the emergence of physical interaction behaviors (e.g. contact with objects).
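
A small sketch of the intrinsic-reward mechanism, under stated assumptions: K-means over collected audio features defines the auditory events, and the cross-entropy between a predictor's output and the K-means label serves as the exploration bonus. Shapes, names, and the random stand-in features are hypothetical.

```python
# Intrinsic reward from auditory event prediction error (sketch).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(500, 32))        # pre-collected acoustic features
events = KMeans(n_clusters=8, n_init=10, random_state=0).fit(audio_feats)

def intrinsic_reward(predicted_logits, audio_feat):
    """Cross-entropy between the predicted event distribution and the K-means label."""
    label = events.predict(audio_feat.reshape(1, -1))[0]
    logp = predicted_logits - np.log(np.exp(predicted_logits).sum())  # log-softmax
    return -logp[label]                          # high prediction error => high bonus

reward = intrinsic_reward(rng.normal(size=8), audio_feats[0])
print(round(float(reward), 3))
```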

3. Associative3D: Volumetric Reconstruction from Sparse Views [PDF] Back to Contents
  Shengyi Qian, Linyi Jin, David F. Fouhey
Abstract: This paper studies the problem of 3D volumetric reconstruction from two views of a scene with an unknown camera. While seemingly easy for humans, this problem poses many challenges for computers since it requires simultaneously reconstructing objects in the two views while also figuring out their relationship. We propose a new approach that estimates reconstructions, distributions over the camera/object and camera/camera transformations, as well as an inter-view object affinity matrix. This information is then jointly reasoned over to produce the most likely explanation of the scene. We train and test our approach on a dataset of indoor scenes, and rigorously evaluate the merits of our joint reasoning approach. Our experiments show that it is able to recover reasonable scenes from sparse views, while the problem is still challenging. Project site: this https URL

4. The Unsupervised Method of Vessel Movement Trajectory Prediction [PDF] Back to Contents
  Chih-Wei Chen, Charles Harrison, Hsin-Hsiung Huang
Abstract: In real-world application scenarios, it is crucial for marine navigators and security analysts to predict vessel movement trajectories at sea, based on Automated Identification System (AIS) data, over a given time span. This article presents an unsupervised method of ship movement trajectory prediction that represents the data in a three-dimensional space consisting of the time difference between points, the scaled error distance between the tested point and its predicted forward and backward locations, and the space-time angle. The representation feature space reduces the search scope for the next point to a collection of candidates that fit the local path prediction well, and therefore improves accuracy. Unlike most statistical learning or deep learning methods, the proposed clustering-based trajectory reconstruction method does not require computationally expensive model training. This makes real-time, reliable, and accurate prediction feasible without using a training set. Our results show that most of the predicted trajectories accurately follow the true vessel paths.

5. WGANVO: Monocular Visual Odometry based on Generative Adversarial Networks [PDF] Back to Contents
  Javier Cremona, Lucas Uzal, Taihú Pire
Abstract: In this work we present WGANVO, a Deep Learning based monocular Visual Odometry method. In particular, a neural network is trained to regress a pose estimate from an image pair. The training is performed using a semi-supervised approach. Unlike geometry based monocular methods, the proposed method can recover the absolute scale of the scene without neither prior knowledge nor extra information. The evaluation of the system is carried out on the well-known KITTI dataset where it is shown to work in real time and the accuracy obtained is encouraging to continue the development of Deep Learning based methods.

6. A Closer Look at Art Mediums: The MAMe Image Classification Dataset [PDF] Back to Contents
  Ferran Parés, Anna Arias-Duart, Dario Garcia-Gasulla, Gema Campo-Francés, Nina Viladrich, Eduard Ayguadé, Jesús Labarta
Abstract: Art is an expression of human creativity, skill and technology. An exceptionally rich source of visual content. In the context of AI image processing systems, artworks represent one of the most challenging domains conceivable: Properly perceiving art requires attention to detail, a huge generalization capacity, and recognizing both simple and complex visual patterns. To challenge the AI community, this work introduces a novel image classification task focused on museum art mediums, the MAMe dataset. Data is gathered from three different museums, and aggregated by art experts into 29 classes of medium (i.e. materials and techniques). For each class, MAMe provides a minimum of 850 images (700 for training) of high-resolution and variable shape. The combination of volume, resolution and shape allows MAMe to fill a void in current image classification challenges, empowering research in aspects so far overseen by the research community. After reviewing the singularity of MAMe in the context of current image classification tasks, a thorough description of the task is provided, together with dataset statistics. Baseline experiments are conducted using well-known architectures, to highlight both the feasibility and complexity of the task proposed. Finally, these baselines are inspected using explainability methods and expert knowledge, to gain insight on the challenges that remain ahead.

7. Ordinary Differential Equation and Complex Matrix Exponential for Multi-resolution Image Registration [PDF] Back to Contents
  Abhishek Nan, Matthew Tennant, Uriel Rubin, Nilanjan Ray
Abstract: Autograd-based software packages have recently renewed interest in image registration using homography and other geometric models by gradient descent and optimization, e.g., AirLab and DRMIME. In this work, we emphasize on using complex matrix exponential (CME) over real matrix exponential to compute transformation matrices. CME is theoretically more suitable and practically provides faster convergence as our experiments show. Further, we demonstrate that the use of an ordinary differential equation (ODE) as an optimizable dynamical system can adapt the transformation matrix more accurately to the multi-resolution Gaussian pyramid for image registration. Our experiments include four publicly available benchmark datasets, two of them 2D and the other two being 3D. Experiments demonstrate that our proposed method yields significantly better registration compared to a number of off-the-shelf, popular, state-of-the-art image registration toolboxes.
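
A brief sketch of the transformation parameterizations involved, under assumed generator choices: a complex scalar exponential encodes rotation and scale jointly, and scipy.linalg.expm of a weighted sum of generator matrices yields a homography. The generator basis shown is illustrative rather than the paper's exact parameterization.

```python
# Transformations via (complex) matrix exponentials (sketch).
import numpy as np
from scipy.linalg import expm

def similarity_from_cme(log_scale, angle, tx, ty):
    # complex scalar exponential: exp(log_scale + i*angle) = scale * e^{i*angle}
    z = np.exp(complex(log_scale, angle))
    return np.array([[z.real, -z.imag, tx],
                     [z.imag,  z.real, ty],
                     [0.0,     0.0,    1.0]])

def homography_from_expm(params):
    # 8 single-entry generators (an illustrative, not strictly sl(3), basis)
    gens = np.zeros((8, 3, 3))
    for g, (i, j) in zip(gens, [(0, 0), (0, 1), (0, 2), (1, 0),
                                (1, 1), (1, 2), (2, 0), (2, 1)]):
        g[i, j] = 1.0
    return expm(np.tensordot(params, gens, axes=1))

print(similarity_from_cme(0.1, 0.3, 5.0, 2.0).round(3))
print(homography_from_expm(np.array([0.01, 0.1, 2.0, -0.1,
                                     -0.01, 1.0, 1e-4, 1e-4])).round(3))
```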

8. 3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning [PDF] Back to Contents
  Xiangyu Xu, Hao Chen, Francesc Moreno-Noguer, Laszlo A. Jeni, Fernando De la Torre
Abstract: 3D human shape and pose estimation from monocular images has been an active area of research in computer vision, having a substantial impact on the development of new applications, from activity recognition to creating virtual avatars. Existing deep learning methods for 3D human shape and pose estimation rely on relatively high-resolution input images; however, high-resolution visual content is not always available in several practical scenarios such as video surveillance and sports broadcasting. Low-resolution images in real scenarios can vary in a wide range of sizes, and a model trained in one resolution does not typically degrade gracefully across resolutions. Two common approaches to solve the problem of low-resolution input are applying super-resolution techniques to the input images which may result in visual artifacts, or simply training one model for each resolution, which is impractical in many realistic applications. To address the above issues, this paper proposes a novel algorithm called RSC-Net, which consists of a Resolution-aware network, a Self-supervision loss, and a Contrastive learning scheme. The proposed network is able to learn the 3D body shape and pose across different resolutions with a single model. The self-supervision loss encourages scale-consistency of the output, and the contrastive learning scheme enforces scale-consistency of the deep features. We show that both these new training losses provide robustness when learning 3D shape and pose in a weakly-supervised manner. Extensive experiments demonstrate that the RSC-Net can achieve consistently better results than the state-of-the-art methods for challenging low-resolution images.

9. Solving Linear Inverse Problems Using the Prior Implicit in a Denoiser [PDF] Back to Contents
  Zahra Kadkhodaie, Eero P. Simoncelli
Abstract: Prior probability models are a central component of many image processing problems, but density estimation is notoriously difficult for high-dimensional signals such as photographic images. Deep neural networks have provided state-of-the-art solutions for problems such as denoising, which implicitly rely on a prior probability model of natural images. Here, we develop a robust and general methodology for making use of this implicit prior. We rely on a little-known statistical result due to Miyasawa (1961), who showed that the least-squares solution for removing additive Gaussian noise can be written directly in terms of the gradient of the log of the noisy signal density. We use this fact to develop a stochastic coarse-to-fine gradient ascent procedure for drawing high-probability samples from the implicit prior embedded within a CNN trained to perform blind (i.e., unknown noise level) least-squares denoising. A generalization of this algorithm to constrained sampling provides a method for using the implicit prior to solve any linear inverse problem, with no additional training. We demonstrate this general form of transfer learning in multiple applications, using the same algorithm to produce high-quality solutions for deblurring, super-resolution, inpainting, and compressive sensing.
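
A runnable toy of the identity the method rests on (due to Miyasawa, 1961): for a least-squares denoiser D trained at noise variance s^2, D(y) - y = s^2 * grad log p(y), so the denoiser residual is an ascent direction on the log density of noisy signals. The closed-form 1-D Gaussian "denoiser" below is an assumption made only to keep the script self-contained.

```python
# Denoiser-residual gradient ascent (sketch of the sampling loop).
import numpy as np

PRIOR_MEAN, PRIOR_VAR, NOISE_VAR = 2.0, 1.0, 0.25

def denoiser(y):
    # optimal least-squares denoiser under the Gaussian prior (Wiener shrinkage)
    return PRIOR_MEAN + PRIOR_VAR / (PRIOR_VAR + NOISE_VAR) * (y - PRIOR_MEAN)

rng = np.random.default_rng(1)
y = rng.normal(0.0, 3.0)                     # start far from the prior's mass
for t in range(100):
    residual = denoiser(y) - y               # proportional to grad log p(y)
    y += 0.1 * residual + 0.05 * rng.normal()  # ascent step plus small noise
    # (the paper uses a coarse-to-fine schedule for step size and noise)
print(round(y, 2))  # ends near the high-probability region around PRIOR_MEAN
```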

10. Message Passing Least Squares Framework and its Application to Rotation Synchronization [PDF] Back to Contents
  Yunpeng Shi, Gilad Lerman
Abstract: We propose an efficient algorithm for solving group synchronization under high levels of corruption and noise, while we focus on rotation synchronization. We first describe our recent theoretically guaranteed message passing algorithm that estimates the corruption levels of the measured group ratios. We then propose a novel reweighted least squares method to estimate the group elements, where the weights are initialized and iteratively updated using the estimated corruption levels. We demonstrate the superior performance of our algorithm over state-of-the-art methods for rotation synchronization using both synthetic and real data.
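
One way to picture the reweighted least-squares loop, with simple residual-based weights standing in for the paper's message-passing corruption estimates; the toy below is a sketch, not the authors' algorithm.

```python
# Reweighted least squares for rotation synchronization (sketch).
import numpy as np

def project_so3(M):
    U, _, Vt = np.linalg.svd(M)
    return U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt

def irls_rotation_sync(R_rel, R_est, iters=20, eps=1e-3):
    # R_rel: {(i, j): measured R_i R_j^T}; R_est: list of current 3x3 guesses
    for _ in range(iters):
        w = {e: 1.0 / (np.linalg.norm(R - R_est[e[0]] @ R_est[e[1]].T) + eps)
             for e, R in R_rel.items()}        # small residual -> large weight
        for i in range(len(R_est)):
            M = sum(w[(a, b)] * ((R if a == i else R.T) @ R_est[b if a == i else a])
                    for (a, b), R in R_rel.items() if i in (a, b))
            if isinstance(M, np.ndarray):      # skip nodes with no edges
                R_est[i] = project_so3(M)
    return R_est

# toy: three nodes with clean pairwise measurements become mutually consistent
rng = np.random.default_rng(0)
R_true = [project_so3(rng.normal(size=(3, 3))) for _ in range(3)]
rel = {(i, j): R_true[i] @ R_true[j].T for i, j in [(0, 1), (1, 2), (0, 2)]}
est = irls_rotation_sync(rel, [np.eye(3) for _ in range(3)])
print(np.linalg.norm(rel[(0, 1)] - est[0] @ est[1].T))  # should be near zero
```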

11. Black-Box Face Recovery from Identity Features [PDF] Back to Contents
  Anton Razzhigaev, Klim Kireev, Edgar Kaziakhmedov, Nurislam Tursynbek, Aleksandr Petyushko
Abstract: In this work, we present a novel algorithm based on an iterative sampling of random Gaussian blobs for black-box face recovery, given only an output feature vector of deep face recognition systems. We attack the state-of-the-art face recognition system (ArcFace) to test our algorithm. Another network with a different architecture (FaceNet) is used as an independent critic, showing that the target person can be identified with the reconstructed image even with no access to the attacked model. Furthermore, our algorithm requires a significantly smaller number of queries compared to the state-of-the-art solution.

12. Towards Accuracy-Fairness Paradox: Adversarial Example-based Data Augmentation for Visual Debiasing [PDF] Back to Contents
  Yi Zhang, Jitao Sang
Abstract: Machine learning fairness concerns the biases towards certain protected or sensitive groups of people when addressing target tasks. This paper studies the debiasing problem in the context of image classification tasks. Our data analysis on facial attribute recognition demonstrates (1) the attribution of model bias to imbalanced training data distribution and (2) the potential of adversarial examples in balancing the data distribution. We are thus motivated to employ adversarial examples to augment the training data for visual debiasing. Specifically, to ensure adversarial generalization as well as cross-task transferability, we propose to couple the operations of target task classifier training, bias task classifier training, and adversarial example generation. The generated adversarial examples supplement the target task training dataset by balancing the distribution over bias variables in an online fashion. Results on simulated and real-world debiasing experiments demonstrate the effectiveness of the proposed solution in simultaneously improving model accuracy and fairness. A preliminary experiment on few-shot learning further shows the potential of adversarial attack-based pseudo sample generation as an alternative solution to make up for the training data shortage.

13. SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training [PDF] Back to Contents
  Pengcheng Dai, Jianlei Yang, Xucheng Ye, Xingzhou Cheng, Junyu Luo, Linghao Song, Yiran Chen, Weisheng Zhao
Abstract: Training Convolutional Neural Networks (CNNs) usually requires a large amount of computational resources. In this paper, SparseTrain is proposed to accelerate CNN training by fully exploiting sparsity. It mainly involves three levels of innovation: an activation gradients pruning algorithm, a sparse training dataflow, and an accelerator architecture. By applying a stochastic pruning algorithm on each layer, the sparsity of back-propagation gradients can be increased dramatically without degrading training accuracy and convergence rate. Moreover, to utilize both natural sparsity (resulting from ReLU or Pooling layers) and artificial sparsity (brought by the pruning algorithm), a sparse-aware architecture is proposed for training acceleration. This architecture supports forward and back-propagation of CNNs by adopting a 1-Dimensional convolution dataflow. We have built a simple compiler to map CNN topologies onto SparseTrain, and a cycle-accurate architecture simulator to evaluate the performance and efficiency based on the synthesized design with 14nm FinFET technologies. Evaluation results on AlexNet/ResNet show that SparseTrain could achieve about 2.7x speedup and 2.2x energy efficiency improvement on average compared with the original training process.
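
A sketch of the stochastic activation-gradient pruning idea in PyTorch; the threshold and keep probability are handled in an illustrative way, and the paper's per-layer algorithm differs in detail.

```python
# Stochastic pruning of small back-propagation gradients (sketch).
import torch

def stochastic_prune(grad, threshold, keep_prob=0.5):
    small = grad.abs() < threshold
    keep = torch.rand_like(grad) < keep_prob
    out = grad.clone()
    out[small & ~keep] = 0.0            # drop most small gradients
    out[small & keep] /= keep_prob      # rescale survivors: unbiased in expectation
    return out

g = torch.randn(4, 8) * 0.1
print(stochastic_prune(g, threshold=0.05).eq(0).float().mean())  # sparsity ratio
```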

14. MADGAN: unsupervised Medical Anomaly Detection GAN using multiple adjacent brain MRI slice reconstruction [PDF] Back to Contents
  Changhee Han, Leonardo Rundo, Kohei Murao, Tomoyuki Noguchi, Yuki Shimahara, Zoltan Adam Milacski, Saori Koshino, Evis Sala, Hideki Nakayama, Shinichi Satoh
Abstract: Unsupervised learning can discover various unseen diseases, relying on large-scale unannotated medical images of healthy subjects. Towards this, unsupervised methods reconstruct a 2D/3D single medical image to detect outliers either in the learned feature space or from high reconstruction loss. However, without considering continuity between multiple adjacent slices, they cannot directly discriminate diseases composed of the accumulation of subtle anatomical anomalies, such as Alzheimer's Disease (AD). Moreover, no study has shown how unsupervised anomaly detection is associated with either disease stages, various (i.e., more than two types of) diseases, or multi-sequence Magnetic Resonance Imaging (MRI) scans. Therefore, we propose unsupervised Medical Anomaly Detection Generative Adversarial Network (MADGAN), a novel two-step method using GAN-based multiple adjacent brain MRI slice reconstruction to detect various diseases at different stages on multi-sequence structural MRI: (Reconstruction) Wasserstein loss with Gradient Penalty + 100 L1 loss-trained on 3 healthy brain axial MRI slices to reconstruct the next 3 ones-reconstructs unseen healthy/abnormal scans; (Diagnosis) Average L2 loss per scan discriminates them, comparing the ground truth/reconstructed slices. For training, we use 1,133 healthy T1-weighted (T1) and 135 healthy contrast-enhanced T1 (T1c) brain MRI scans. Our Self-Attention MADGAN can detect AD on T1 scans at a very early stage, Mild Cognitive Impairment (MCI), with Area Under the Curve (AUC) 0.727, and AD at a late stage with AUC 0.894, while detecting brain metastases on T1c scans with AUC 0.921.

15. Differentiable Manifold Reconstruction for Point Cloud Denoising [PDF] Back to Contents
  Shitong Luo, Wei Hu
Abstract: 3D point clouds are often perturbed by noise due to the inherent limitation of acquisition equipments, which obstructs downstream tasks such as surface reconstruction, rendering and so on. Previous works mostly infer the displacement of noisy points from the underlying surface, which however are not designated to recover the surface explicitly and may lead to sub-optimal denoising results. To this end, we propose to learn the underlying manifold of a noisy point cloud from differentiably subsampled points with trivial noise perturbation and their embedded neighborhood feature, aiming to capture intrinsic structures in point clouds. Specifically, we present an autoencoder-like neural network. The encoder learns both local and non-local feature representations of each point, and then samples points with low noise via an adaptive differentiable pooling operation. Afterwards, the decoder infers the underlying manifold by transforming each sampled point along with the embedded feature of its neighborhood to a local surface centered around the point. By resampling on the reconstructed manifold, we obtain a denoised point cloud. Further, we design an unsupervised training loss, so that our network can be trained in either an unsupervised or supervised fashion. Experiments show that our method significantly outperforms state-of-the-art denoising methods under both synthetic noise and real world noise. The code and data are available at this https URL

16. Reconstruction Regularized Deep Metric Learning for Multi-label Image Classification [PDF] Back to Contents
  Changsheng Li, Chong Liu, Lixin Duan, Peng Gao, Kai Zheng
Abstract: In this paper, we present a novel deep metric learning method to tackle the multi-label image classification problem. In order to better learn the correlations among images features, as well as labels, we attempt to explore a latent space, where images and labels are embedded via two unique deep neural networks, respectively. To capture the relationships between image features and labels, we aim to learn a \emph{two-way} deep distance metric over the embedding space from two different views, i.e., the distance between one image and its labels is not only smaller than those distances between the image and its labels' nearest neighbors, but also smaller than the distances between the labels and other images corresponding to the labels' nearest neighbors. Moreover, a reconstruction module for recovering correct labels is incorporated into the whole framework as a regularization term, such that the label embedding space is more representative. Our model can be trained in an end-to-end manner. Experimental results on publicly available image datasets corroborate the efficacy of our method compared with the state-of-the-arts.

17. A Novel adaptive optimization of Dual-Tree Complex Wavelet Transform for Medical Image Fusion [PDF] Back to Contents
  T.Deepika, G.Karpaga Kannan
Abstract: In recent years, many research achievements have been made in the medical image fusion field. Fusion is basically the extraction of the best of the inputs and conveying it to the output. Medical image fusion means that information from several images of different modalities is combined to form one image that expresses all of it. The aim of image fusion is to integrate complementary and redundant information. In this paper, a multimodal image fusion algorithm based on the dual-tree complex wavelet transform (DT-CWT) and adaptive particle swarm optimization (APSO) is proposed. Fusion is achieved through the formation of a fused pyramid using the DT-CWT coefficients from the decomposed pyramids of the source images. The coefficients are fused by a pixel-based weighted average method, and the weights are estimated by APSO to obtain optimal fused images. The fused image is obtained through the conventional inverse dual-tree complex wavelet transform reconstruction process. Experiment results show that the proposed method based on the adaptive particle swarm optimization algorithm is remarkably better than the method based on particle swarm optimization. The resulting fused images are compared visually and through benchmarks such as Entropy (E), Peak Signal to Noise Ratio (PSNR), Root Mean Square Error (RMSE), Standard Deviation (SD), and Structural Similarity Index Metric (SSIM) computations.

18. The Effect of Wearing a Mask on Face Recognition Performance: an Exploratory Study [PDF] Back to Contents
  Naser Damer, Jonas Henry Grebe, Cong Chen, Fadi Boutros, Florian Kirchbuchner, Arjan Kuijper
Abstract: Face recognition has become essential in our daily lives as a convenient and contactless method of accurate identity verification. Processes such as identity verification at automatic border control gates or secure login to electronic devices are increasingly dependent on such technologies. The recent COVID-19 pandemic has increased the value of hygienic and contactless identity verification. However, the pandemic also led to the wide use of face masks, essential to keep the pandemic under control. The effect of wearing a mask on face recognition in a collaborative environment is a sensitive yet understudied issue. We address that issue by presenting a specifically collected database containing three sessions, each with three different capture instructions, to simulate realistic use cases. We further study the effect of masked face probes on the behaviour of three top-performing face recognition systems: two academic solutions and one commercial off-the-shelf (COTS) system.

19. Identity-Guided Human Semantic Parsing for Person Re-Identification [PDF] Back to Contents
  Kuan Zhu, Haiyun Guo, Zhiwei Liu, Ming Tang, Jinqiao Wang
Abstract: Existing alignment-based methods have to employ the pretrained human parsing models to achieve the pixel-level alignment, and cannot identify the personal belongings (e.g., backpacks and reticule) which are crucial to person re-ID. In this paper, we propose the identity-guided human semantic parsing approach (ISP) to locate both the human body parts and personal belongings at pixel-level for aligned person re-ID only with person identity labels. We design the cascaded clustering on feature maps to generate the pseudo-labels of human parts. Specifically, for the pixels of all images of a person, we first group them to foreground or background and then group the foreground pixels to human parts. The cluster assignments are subsequently used as pseudo-labels of human parts to supervise the part estimation and ISP iteratively learns the feature maps and groups them. Finally, local features of both human body parts and personal belongings are obtained according to the selflearned part estimation, and only features of visible parts are utilized for the retrieval. Extensive experiments on three widely used datasets validate the superiority of ISP over lots of state-of-the-art methods. Our code is available at this https URL.
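
A sketch of the cascaded clustering that could produce such pixel-level pseudo-labels, using scikit-learn's KMeans; the foreground-cluster choice and the random stand-in features are assumptions.

```python
# Cascaded clustering for part pseudo-labels (sketch).
import numpy as np
from sklearn.cluster import KMeans

def cascaded_pseudo_labels(pixel_feats, num_parts=4):
    # pixel_feats: (N, D) feature vectors of all pixels of one identity
    fg_bg = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixel_feats)
    fg = fg_bg == 0   # assumed foreground cluster; a heuristic would pick it in practice
    labels = np.zeros(len(pixel_feats), dtype=int)          # 0 = background
    parts = KMeans(n_clusters=num_parts, n_init=10,
                   random_state=0).fit_predict(pixel_feats[fg])
    labels[fg] = parts + 1                                  # parts are labels 1..num_parts
    return labels

feats = np.random.default_rng(0).normal(size=(300, 16))
print(np.bincount(cascaded_pseudo_labels(feats)))           # pixels per pseudo-part
```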

20. Two-Level Residual Distillation based Triple Network for Incremental Object Detection [PDF] Back to Contents
  Dongbao Yang, Yu Zhou, Dayan Wu, Can Ma, Fei Yang, Weiping Wang
Abstract: Modern object detection methods based on convolutional neural networks suffer from severe catastrophic forgetting when learning new classes without the original data. Owing to the time consumption, storage burden, and privacy of old data, it is inadvisable to train the model from scratch with both old and new data when new object classes emerge after the model has been trained. In this paper, we propose a novel incremental object detector based on Faster R-CNN that continuously learns from new object classes without using old data. It is a triple network in which an old model and a residual model act as assistants that help the incremental model learn new classes without forgetting previously learned knowledge. To better maintain the discrimination of features between old and new classes, the residual model is jointly trained on the new classes in the incremental learning procedure. In addition, a corresponding distillation scheme is designed to guide the training process, consisting of a two-level residual distillation loss and a joint classification distillation loss. Extensive experiments on VOC2007 and COCO are conducted, and the results demonstrate that the proposed method can effectively learn to incrementally detect objects of new classes, and the problem of catastrophic forgetting is mitigated in this context.
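
One plausible reading of the two distillation terms, sketched below; the exact formulation and placement inside Faster R-CNN may differ from the paper.

```python
# Residual and classification distillation losses (hypothetical sketch).
import torch
import torch.nn.functional as F

def residual_distill_loss(feat_new, feat_old, feat_residual):
    # residual model explains what the incremental model adds on top of the old one
    return F.mse_loss(feat_residual, feat_new - feat_old.detach())

def cls_distill_loss(logits_new, logits_old, T=2.0):
    # KL between temperature-softened class distributions
    p_old = F.softmax(logits_old.detach() / T, dim=1)
    logp_new = F.log_softmax(logits_new / T, dim=1)
    return F.kl_div(logp_new, p_old, reduction="batchmean") * T * T

f_new, f_old, f_res = (torch.randn(2, 256, 7, 7) for _ in range(3))
print(residual_distill_loss(f_new, f_old, f_res).item())
```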

21. Contraction Mapping of Feature Norms for Classifier Learning on the Data with Different Quality [PDF] Back to Contents
  Weihua Liu, Xiabi Liu, Murong Wang, Ling Ma, Yunde Jia
Abstract: The popular softmax loss and its recent extensions have achieved great success in deep learning-based image classification. However, the data for training image classifiers usually has different quality. If this problem is ignored, the correct classification of low-quality data is hard to achieve. In this paper, we discover the positive correlation between the feature norm of an image and its quality through careful experiments on various applications and various deep neural networks. Based on this finding, we propose a contraction mapping function to compress the range of feature norms of training images according to their quality, and embed this contraction mapping function into the softmax loss or its extensions to produce novel learning objectives. The experiments on various classification applications, including handwritten digit recognition, lung nodule classification, face verification, and face recognition, demonstrate that the proposed approach is promising for effectively dealing with the problem of learning on data with different quality, and leads to significant and stable improvements in classification accuracy.
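
A minimal sketch of a norm-contraction step of this kind, assuming g(r) = r^alpha with 0 < alpha < 1 as the contraction mapping (the paper's exact function may differ); features are rescaled to the compressed norm before the softmax logits are formed.

```python
# Compressing the range of feature norms before the softmax loss (sketch).
import torch

def contract_feature_norms(feats, alpha=0.5, eps=1e-8):
    norms = feats.norm(dim=1, keepdim=True).clamp_min(eps)
    return feats / norms * norms.pow(alpha)    # new norm = old norm ** alpha

# features with widely varying norms (a proxy for varying sample quality)
feats = torch.randn(5, 64) * torch.tensor([[0.1], [0.5], [1.0], [2.0], [8.0]])
out = contract_feature_norms(feats)
print(feats.norm(dim=1).round(decimals=2))     # wide range of norms
print(out.norm(dim=1).round(decimals=2))       # compressed range of norms
```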

22. YOLOpeds: Efficient Real-Time Single-Shot Pedestrian Detection for Smart Camera Applications [PDF] Back to Contents
  Christos Kyrkou
Abstract: Deep-learning-based object detectors can enhance the capabilities of smart camera systems in a wide spectrum of machine vision applications including video surveillance, autonomous driving, robots and drones, smart factories, and health monitoring. Pedestrian detection plays a key role in all these applications, and deep learning can be used to construct accurate state-of-the-art detectors. However, such complex paradigms do not scale easily and are not traditionally implemented in resource-constrained smart cameras for on-device processing, which offers significant advantages in situations where real-time monitoring and robustness are vital. Efficient neural networks can not only enable mobile applications and on-device experiences but can also be a key enabler of privacy and security, allowing a user to gain the benefits of neural networks without needing to send their data to the server to be evaluated. This work addresses the challenge of achieving a good trade-off between accuracy and speed for efficient deployment of deep-learning-based pedestrian detection in smart camera applications. A computationally efficient architecture is introduced based on separable convolutions, integrating dense connections across layers and multi-scale feature fusion to improve representational capacity while decreasing the number of parameters and operations. In particular, the contributions of this work are the following: 1) an efficient backbone combining multi-scale feature operations, 2) a more elaborate loss function for improved localization, and 3) an anchor-less approach for detection. The proposed approach, called YOLOpeds, is evaluated using the PETS2009 surveillance dataset on 320x320 images. Overall, YOLOpeds provides real-time sustained operation of over 30 frames per second with detection rates around 86%, outperforming existing deep learning models.
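
A sketch of the named building blocks, depthwise-separable convolutions stacked with dense connections; layer sizes are illustrative, not the actual YOLOpeds configuration.

```python
# Depthwise-separable convolution plus dense connections (sketch).
import torch
import torch.nn as nn

def separable_conv(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, padding=1, groups=cin),  # depthwise
        nn.Conv2d(cin, cout, 1),                         # pointwise
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class DenseStack(nn.Module):
    def __init__(self, cin, growth, layers):
        super().__init__()
        self.blocks = nn.ModuleList(
            separable_conv(cin + i * growth, growth) for i in range(layers)
        )

    def forward(self, x):
        for block in self.blocks:
            x = torch.cat([x, block(x)], dim=1)  # dense connection across layers
        return x

y = DenseStack(cin=16, growth=8, layers=3)(torch.randn(1, 16, 64, 64))
print(y.shape)  # torch.Size([1, 40, 64, 64])
```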

23. Ladybird: Quasi-Monte Carlo Sampling for Deep Implicit Field Based 3D Reconstruction with Symmetry [PDF] Back to Contents
  Yifan Xu, Tianqi Fan, Yi Yuan, Gurprit Singh
Abstract: Deep implicit field regression methods are effective for 3D reconstruction from single-view images. However, the impact of different sampling patterns on the reconstruction quality is not well-understood. In this work, we first study the effect of point set discrepancy on the network training. Based on Farthest Point Sampling algorithm, we propose a sampling scheme that theoretically encourages better generalization performance, and results in fast convergence for SGD-based optimization algorithms. Secondly, based on the reflective symmetry of an object, we propose a feature fusion method that alleviates issues due to self-occlusions which makes it difficult to utilize local image features. Our proposed system Ladybird is able to create high quality 3D object reconstructions from a single input image. We evaluate Ladybird on a large scale 3D dataset (ShapeNet) demonstrating highly competitive results in terms of Chamfer distance, Earth Mover's distance and Intersection Over Union (IoU).

24. NOH-NMS: Improving Pedestrian Detection by Nearby Objects Hallucination [PDF] Back to Contents
  Penghao Zhou, Chong Zhou, Pai Peng, Junlong Du, Xing Sun, Xiaowei Guo, Feiyue Huang
Abstract: Greedy-NMS inherently raises a dilemma, where a lower NMS threshold will potentially lead to a lower recall rate and a higher threshold introduces more false positives. This problem is more severe in pedestrian detection because the instance density varies more intensively. However, previous works on NMS don't consider, or only vaguely consider, the presence of nearby pedestrians. Thus, we propose the Nearby Objects Hallucinator (NOH), which pinpoints the objects nearby each proposal with a Gaussian distribution, together with NOH-NMS, which dynamically eases the suppression for the space that is highly likely to contain other objects. Compared to Greedy-NMS, our method, as the state-of-the-art, improves by 3.9% AP, 5.1% Recall, and 0.8% MR^-2 on CrowdHuman, reaching 89.0% AP, 92.9% Recall, and 43.9% MR^-2, respectively.

25. Decomposed Generation Networks with Structure Prediction for Recipe Generation from Food Images [PDF] Back to Contents
  Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao
Abstract: Recipe generation from food images and ingredients is a challenging task, which requires the interpretation of the information from another modality. Different from the image captioning task, where the captions usually have one sentence, cooking instructions contain multiple sentences and have obvious structures. To help the model capture the recipe structure and avoid missing some cooking details, we propose a novel framework: Decomposed Generation Networks (DGN) with structure prediction, to get more structured and complete recipe generation outputs. To be specific, we split each cooking instruction into several phases, and assign different sub-generators to each phase. Our approach includes two novel ideas: (i) learning the recipe structures with the global structure prediction component and (ii) producing recipe phases in the sub-generator output component based on the predicted structure. Extensive experiments on the challenging large-scale Recipe1M dataset validate the effectiveness of our proposed model DGN, which improves the performance over the state-of-the-art results.

26. Part-Aware Data Augmentation for 3D Object Detection in Point Cloud [PDF] Back to Contents
  Jaeseok Choi, Yeji Song, Nojun Kwak
Abstract: Data augmentation has greatly contributed to improving the performance in image recognition tasks, and a lot of related studies have been conducted. However, data augmentation on 3D point cloud data has not been much explored. 3D labels have more sophisticated and rich structural information than 2D labels, so they enable more diverse and effective data augmentation. In this paper, we propose part-aware data augmentation (PA-AUG) that can better utilize the rich information of 3D labels to enhance the performance of 3D object detectors. PA-AUG divides objects into partitions and stochastically applies five novel augmentation methods to each local region. It is compatible with existing point cloud data augmentation methods and can be used universally regardless of the detector's architecture. PA-AUG has improved the performance of the state-of-the-art 3D object detector for all classes of the KITTI dataset and has the equivalent effect of increasing the training data by about 2.5x. We also show that PA-AUG not only increases performance for a given dataset but also is robust to corrupted data. CODE WILL BE AVAILABLE.
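
A sketch of the part-aware augmentation recipe: points of one object are split into partitions along one box axis, and each partition independently receives a stochastic augmentation. Only two of the five operations (dropout, jitter) are shown, with assumed parameters.

```python
# Part-aware point cloud augmentation (sketch).
import numpy as np

def pa_aug(points, num_partitions=4, drop_p=0.3, sigma=0.01, rng=None):
    # points: (N, 3) points of one object in box-local coordinates
    rng = rng or np.random.default_rng()
    edges = np.linspace(points[:, 0].min(), points[:, 0].max(), num_partitions + 1)
    bins = np.digitize(points[:, 0], edges[1:-1])   # partition index per point
    out = []
    for k in range(num_partitions):
        part = points[bins == k]
        if rng.random() < 0.5:                      # random dropout
            part = part[rng.random(len(part)) > drop_p]
        else:                                       # random jitter
            part = part + rng.normal(0.0, sigma, part.shape)
        out.append(part)
    return np.concatenate(out, axis=0)

aug = pa_aug(np.random.default_rng(0).normal(size=(200, 3)))
print(aug.shape)   # fewer points than the input if any partition was dropped out
```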

27. Feature visualization of Raman spectrum analysis with deep convolutional neural network [PDF] Back to Contents
  Masashi Fukuhara, Kazuhiko Fujiwara, Yoshihiro Maruyama, Hiroyasu Itoh
Abstract: We demonstrate a recognition and feature visualization method that uses a deep convolutional neural network for Raman spectrum analysis. The visualization is achieved by calculating important regions in the spectra from weights in pooling and fully-connected layers. The method is first examined for simple Lorentzian spectra, then applied to the spectra of pharmaceutical compounds and numerically mixed amino acids. We investigate the effects of the size and number of convolution filters on the extracted regions for Raman-peak signals using the Lorentzian spectra. It is confirmed that the Raman peak contributes to the recognition by visualizing the extracted features. A near-zero weight value is obtained at the background level region, which appears to be used for baseline correction. Common component extraction is confirmed by an evaluation of numerically mixed amino acid spectra. High weight values at the common peaks and negative values at the distinctive peaks appear, even though the model is given one-hot vectors as the training labels (without a mix ratio). This proposed method is potentially suitable for applications such as the validation of trained models, ensuring the reliability of common component extraction from compound samples for spectral analysis.

28. Self-Prediction for Joint Instance and Semantic Segmentation of Point Clouds [PDF] Back to Contents
  Jinxian Liu, Minghui Yu, Bingbing Ni, Ye Chen
Abstract: We develop a novel learning scheme named Self-Prediction for 3D instance and semantic segmentation of point clouds. Distinct from most existing methods that focus on designing convolutional operators, our method designs a new learning scheme to enhance point relation exploring for better segmentation. More specifically, we divide a point cloud sample into two subsets and construct a complete graph based on their representations. Then we use label propagation algorithm to predict labels of one subset when given labels of the other subset. By training with this Self-Prediction task, the backbone network is constrained to fully explore relational context/geometric/shape information and learn more discriminative features for segmentation. Moreover, a general associated framework equipped with our Self-Prediction scheme is designed for enhancing instance and semantic segmentation simultaneously, where instance and semantic representations are combined to perform Self-Prediction. Through this way, instance and semantic segmentation are collaborated and mutually reinforced. Significant performance improvements on instance and semantic segmentation compared with baseline are achieved on S3DIS and ShapeNet. Our method achieves state-of-the-art instance segmentation results on S3DIS and comparable semantic segmentation results compared with state-of-the-arts on S3DIS and ShapeNet when we only take PointNet++ as the backbone network.
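
A sketch of the Self-Prediction task with a Gaussian feature affinity (an assumed choice): labels of one subset are propagated to the other, and the propagated soft labels would then supervise the backbone.

```python
# Label propagation from one point subset to the other (sketch).
import numpy as np

def propagate_labels(feats_known, labels_known, feats_query, num_classes, tau=0.1):
    # Gaussian affinity between every query point and every labeled point
    d2 = ((feats_query[:, None, :] - feats_known[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / tau)
    w /= w.sum(axis=1, keepdims=True)       # row-normalize the affinities
    onehot = np.eye(num_classes)[labels_known]
    return w @ onehot                        # (num_query, num_classes) soft labels

rng = np.random.default_rng(0)
soft = propagate_labels(rng.normal(size=(20, 8)), rng.integers(0, 3, 20),
                        rng.normal(size=(10, 8)), num_classes=3)
print(soft.shape, soft.sum(axis=1).round(6))  # each row is a distribution
```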

29. Few-shot Knowledge Transfer for Fine-grained Cartoon Face Generation [PDF] Back to Contents
  Nan Zhuang, Cheng Yang
Abstract: In this paper, we are interested in generating fine-grained cartoon faces for various groups. We assume that one of these groups consists of sufficient training data while the others only contain few samples. Although the cartoon faces of these groups share similar style, the appearances in various groups could still have some specific characteristics, which makes them differ from each other. A major challenge of this task is how to transfer knowledge among groups and learn group-specific characteristics with only few samples. In order to solve this problem, we propose a two-stage training process. First, a basic translation model for the basic group (which consists of sufficient data) is trained. Then, given new samples of other groups, we extend the basic model by creating group-specific branches for each new group. Group-specific branches are updated directly to capture specific appearances for each group while the remaining group-shared parameters are updated indirectly to maintain the distribution of intermediate feature space. In this manner, our approach is capable to generate high-quality cartoon faces for various groups.

30. Rethinking Generative Zero-Shot Learning: An Ensemble Learning Perspective for Recognising Visual Patches [PDF]
  Zhi Chen, Sen Wang, Jingjing Li, Zi Huang
Abstract: Zero-shot learning (ZSL) is commonly used to address the very pervasive problem of predicting unseen classes in fine-grained image classification and other tasks. One family of solutions is to learn synthesised unseen visual samples produced by generative models from auxiliary semantic information, such as natural language descriptions. However, for most of these models, performance suffers from noise in the form of irrelevant image backgrounds. Further, most methods do not allocate a calculated weight to each semantic patch. Yet, in the real world, the discriminative power of features can be quantified and directly leveraged to improve accuracy and reduce computational complexity. To address these issues, we propose a novel framework called multi-patch generative adversarial nets (MPGAN) that synthesises local patch features and labels unseen classes with a novel weighted voting strategy. The process begins by generating discriminative visual features from noisy text descriptions for a set of predefined local patches using multiple specialist generative models. The features synthesised from each patch for unseen classes are then used to construct an ensemble of diverse supervised classifiers, each corresponding to one local patch. A voting strategy averages the probability distributions output from the classifiers and, given that some patches are more discriminative than others, a discrimination-based attention mechanism helps to weight each patch accordingly. Extensive experiments show that MPGAN has significantly greater accuracy than state-of-the-art methods.
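
A minimal sketch of the weighted voting step, assuming softmax-normalized attention scores over the per-patch classifiers (names and normalization are illustrative):

```python
import numpy as np

def weighted_vote(patch_probs, patch_scores):
    """patch_probs: (P, C) class distributions from P patch classifiers;
    patch_scores: (P,) discrimination-based attention scores."""
    w = np.exp(patch_scores) / np.exp(patch_scores).sum()  # normalize to a simplex
    return (w[:, None] * patch_probs).sum(axis=0)          # (C,) fused distribution

probs = np.random.dirichlet(np.ones(10), size=7)  # 7 patches, 10 unseen classes
scores = np.random.rand(7)
print(weighted_vote(probs, scores).argmax())      # predicted unseen class
```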

31. Split Computing for Complex Object Detectors: Challenges and Preliminary Results [PDF]
  Yoshitomo Matsubara, Marco Levorato
Abstract: Following the trends of mobile and edge computing for DNN models, an intermediate option, split computing, has been attracting attention from the research community. Previous studies empirically showed that while mobile and edge computing would often be the best options in terms of total inference time, there are some scenarios where split computing methods can achieve shorter inference time. All the proposed split computing approaches, however, focus on image classification tasks, and most are assessed with small datasets that are far from practical scenarios. In this paper, we discuss the challenges in developing split computing methods for powerful R-CNN object detectors trained on a large dataset, COCO 2017. We extensively analyze the object detectors in terms of layer-wise tensor size and model size, and show that naive split computing methods would not reduce inference time. To the best of our knowledge, this is the first study to inject small bottlenecks into such object detectors and unveil the potential of a split computing approach. The source code and trained models' weights used in this study are available at this https URL.
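
A toy example of what an injected bottleneck might look like: a small on-device encoder shrinks the activation tensor that has to be transmitted, and a server-side decoder restores its shape. The layer choices are assumptions, not the modules used in the paper:

```python
import torch.nn as nn

class SplitBottleneck(nn.Module):
    """Hypothetical bottleneck inserted into an early detector stage."""
    def __init__(self, in_ch, mid_ch):
        super().__init__()
        # runs on the mobile device; its small output is what crosses the network
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, stride=2, padding=1), nn.ReLU())
        # runs on the edge server; restores the shape (for even spatial sizes)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(mid_ch, in_ch, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU())

    def forward(self, x):
        return self.decoder(self.encoder(x))
```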

32. K-Shot Contrastive Learning of Visual Features with Multiple Instance Augmentations [PDF]
  Haohang Xu, Hongkai Xiong, Guo-Jun Qi
Abstract: In this paper, we propose K-Shot Contrastive Learning (KSCL) of visual features by applying multiple augmentations to investigate the sample variations within individual instances. It aims to combine the advantages of inter-instance discrimination, by learning discriminative features to distinguish between different instances, with those of intra-instance variations, by matching queries against the variants of augmented samples over instances. In particular, for each instance it constructs an instance subspace to model how the significant factors of variation in K-shot augmentations can be combined to form the variants of augmentations. Given a query, the most relevant variant of instances is then retrieved by projecting the query onto their subspaces to predict the positive instance class. This generalizes existing contrastive learning, which can be viewed as a special one-shot case. An eigenvalue decomposition is performed to configure instance subspaces, and the embedding network can be trained end-to-end through the differentiable subspace configuration. Experiment results demonstrate that the proposed K-shot contrastive learning achieves superior performance over the state-of-the-art unsupervised methods.
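
The subspace construction can be sketched as an eigendecomposition of the covariance of the K augmented views, with a query scored by the norm of its projection. This is a generic reading of the abstract with illustrative dimensions:

```python
import torch
import torch.nn.functional as F

def subspace_basis(views, k):
    """views: (K, D) L2-normalized features of K augmentations of one instance.
    Returns the top-k eigenvectors of their covariance as the subspace basis."""
    cov = views.t() @ views / views.shape[0]        # (D, D)
    _, eigvecs = torch.linalg.eigh(cov)             # eigenvalues ascending
    return eigvecs[:, -k:]                          # (D, k)

def match_score(query, basis):
    """Length of the query's projection onto the instance subspace."""
    return (basis.t() @ query).norm()

views = F.normalize(torch.randn(8, 128), dim=1)     # K = 8 augmentations
query = F.normalize(torch.randn(128), dim=0)
score = match_score(query, subspace_basis(views, k=4))
```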

33. Reconstructing NBA Players [PDF]
  Luyang Zhu, Konstantinos Rematas, Brian Curless, Steve Seitz, Ira Kemelmacher-Shlizerman
Abstract: Great progress has been made in 3D body pose and shape estimation from a single photo. Yet, state-of-the-art results still suffer from errors due to challenging body poses, modeling clothing, and self occlusions. The domain of basketball games is particularly challenging, as it exhibits all of these challenges. In this paper, we introduce a new approach for reconstruction of basketball players that outperforms the state-of-the-art. Key to our approach is a new method for creating poseable, skinned models of NBA players, and a large database of meshes (derived from the NBA2K19 video game), that we are releasing to the research community. Based on these models, we introduce a new method that takes as input a single photo of a clothed player in any basketball pose and outputs a high resolution mesh and 3D pose for that player. We demonstrate substantial improvement over state-of-the-art, single-image methods for body shape reconstruction.

34. Research Progress of Convolutional Neural Network and its Application in Object Detection [PDF]
  Wei Zhang, Zuoxiang Zeng
Abstract: With the improvement of computer performance and the increase of data volume, object detection based on convolutional neural networks (CNNs) has become the mainstream approach. This paper summarizes the research progress of convolutional neural networks and their applications in object detection, focuses on analyzing and discussing specific ideas and methods of applying convolutional neural networks to object detection, and points out current deficiencies and future development directions.

35. Representation Learning with Video Deep InfoMax [PDF]
  R Devon Hjelm, Philip Bachman
Abstract: Self-supervised learning has made unsupervised pretraining relevant again for difficult computer vision tasks. The most effective self-supervised methods involve prediction tasks based on features extracted from diverse views of the data. DeepInfoMax (DIM) is a self-supervised method which leverages the internal structure of deep networks to construct such views, forming prediction tasks between local features, which depend on small patches in an image, and global features, which depend on the whole image. In this paper, we extend DIM to the video domain by leveraging similar structure in spatio-temporal networks, producing a method we call Video Deep InfoMax (VDIM). We find that drawing views from both natural-rate sequences and temporally-downsampled sequences yields results on Kinetics-pretrained action recognition tasks which match or outperform prior state-of-the-art methods that use more costly large-time-scale transformer models. We also examine the effects of data augmentation and fine-tuning methods, achieving state-of-the-art results by a large margin when training only on the UCF-101 dataset.
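
A hedged sketch of a DIM-style local-global objective in the common InfoNCE form; the actual VDIM view construction and losses are more involved:

```python
import torch
import torch.nn.functional as F

def local_global_infonce(local_feats, global_feats, tau=0.1):
    """local_feats: (B, D) one local (patch/clip) feature per sample;
    global_feats: (B, D) the matching global features. Matching pairs sit on
    the diagonal; every other pairing in the batch acts as a negative."""
    l = F.normalize(local_feats, dim=1)
    g = F.normalize(global_feats, dim=1)
    logits = l @ g.t() / tau                        # (B, B) similarities
    targets = torch.arange(l.shape[0])
    return F.cross_entropy(logits, targets)

loss = local_global_infonce(torch.randn(32, 256), torch.randn(32, 256))
```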

36. Learning Task-oriented Disentangled Representations for Unsupervised Domain Adaptation [PDF]
  Pingyang Dai, Peixian Chen, Qiong Wu, Xiaopeng Hong, Qixiang Ye, Qi Tian, Rongrong Ji
Abstract: Unsupervised domain adaptation (UDA) aims to address the domain-shift problem between a labeled source domain and an unlabeled target domain. Many efforts have been made to eliminate the mismatch between the distributions of training and testing data by learning domain-invariant representations. However, the learned representations are usually not task-oriented, i.e., simultaneously class-discriminative and domain-transferable; they ignore the task-oriented information across domains and are too inflexible to perform well in complicated open-set scenarios. This drawback limits the flexibility of UDA in complicated open-set tasks where no labels are shared between domains. In this paper, we break the concept of task-orientation into task-relevance and task-irrelevance, and propose a dynamic task-oriented disentangling network (DTDN) to learn disentangled representations in an end-to-end fashion for UDA. The dynamic disentangling network effectively disentangles data representations into two components: the task-relevant ones, embedding critical information associated with the task across domains, and the task-irrelevant ones, carrying the remaining non-transferable or disturbing information. These two components are regularized by a group of task-specific objective functions across domains. Such regularization explicitly encourages disentangling and avoids the use of generative models or decoders. Experiments in complicated, open-set scenarios (retrieval tasks) and on empirical benchmarks (classification tasks) demonstrate that the proposed method captures rich disentangled information and achieves superior performance.

37. REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering [PDF]
  Siwen Luo, Soyeon Caren Han, Kaiyuan Sun, Josiah Poon
Abstract: Visual question answering (VQA) is a challenging multi-modal task that requires not only the semantic understanding of both images and questions, but also the sound perception of a step-by-step reasoning process that would lead to the correct answer. So far, most successful attempts in VQA have been focused on only one aspect: either the interaction of visual pixel features of images and word features of questions, or the reasoning process of answering the question in an image with simple objects. In this paper, we propose a deep reasoning VQA model with explicit visual structure-aware textual information, and it works well in capturing step-by-step reasoning processes and detecting complex object relationships in photo-realistic images. The REXUP network consists of two branches, one image-object-oriented and one scene-graph-oriented, which jointly work with a super-diagonal fusion compositional attention network. We quantitatively and qualitatively evaluate REXUP on the GQA dataset and conduct extensive ablation studies to explore the reasons behind REXUP's effectiveness. Our best model significantly outperforms the previous state of the art, delivering 92.7% on the validation set and 73.1% on the test-dev set.

38. Point-to-set distance functions for weakly supervised segmentation [PDF]
  Bas Peters
Abstract: When pixel-level masks or partial annotations are not available for training neural networks for semantic segmentation, it is possible to use higher-level information in the form of bounding boxes or image tags. In the imaging sciences, many applications do not have an object-background structure and bounding boxes are not available. Any available annotation typically comes from ground truth or domain experts. A direct way to train without masks is to use prior knowledge on the size of objects/classes in the segmentation. We present a new algorithm to include such information via constraints on the network output, implemented via projection-based point-to-set distance functions. This type of distance function always has the same functional form of the derivative, and avoids the need to adapt penalty functions to different constraints, as well as issues related to constraining properties typically associated with non-differentiable functions. Whereas object size information is known to enable object segmentation from bounding boxes in datasets with many general and medical images, we show that the applications extend to the imaging sciences where data represent indirect measurements, even in the case of single examples. We illustrate the capabilities in cases where a) one or more classes lack any annotation; b) there is no annotation at all; or c) bounding boxes are available. We use data from hyperspectral time-lapse imaging, object segmentation in corrupted images, and sub-surface aquifer mapping from airborne-geophysical remote-sensing data. The examples verify that the developed methodology alleviates difficulties in annotating non-visual imagery for a range of experimental settings.
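
For the object-size prior, a projection-based point-to-set penalty takes only a few lines: the distance is zero whenever the predicted size falls inside the admissible interval, and its derivative has the same functional form everywhere. The quadratic form and interval constraint are illustrative assumptions:

```python
import torch

def size_prior_penalty(probs, a, b):
    """probs: (B, H, W) predicted foreground probabilities. Penalizes the
    squared distance between the soft object size and its projection onto
    the constraint set {size in [a, b]}."""
    size = probs.sum(dim=(1, 2))          # soft object size per image
    proj = size.clamp(min=a, max=b)       # projection onto the interval [a, b]
    return ((size - proj) ** 2).mean()

logits = torch.randn(4, 64, 64)
loss = size_prior_penalty(torch.sigmoid(logits), a=200.0, b=1200.0)
```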

39. OASIS: A Large-Scale Dataset for Single Image 3D in the Wild [PDF]
  Weifeng Chen, Shengyi Qian, David Fan, Noriyuki Kojima, Max Hamilton, Jia Deng
Abstract: Single-view 3D is the task of recovering 3D properties such as depth and surface normals from a single image. We hypothesize that a major obstacle to single-image 3D is data. We address this issue by presenting Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images. We train and evaluate leading models on a variety of single-image 3D tasks. We expect OASIS to be a useful resource for 3D vision research. Project site: this https URL.

40. Learning and aggregating deep local descriptors for instance-level recognition [PDF]
  Giorgos Tolias, Tomas Jenicek, Ondřej Chum
Abstract: We propose an efficient method to learn deep local descriptors for instance-level recognition. The training only requires examples of positive and negative image pairs and is performed as metric learning of sum-pooled global image descriptors. At inference, the local descriptors are provided by the activations of internal components of the network. We demonstrate why such an approach learns local descriptors that work well for image similarity estimation with classical efficient match kernel methods. The experimental validation studies the trade-off between performance and memory requirements of the state-of-the-art image search approach based on match kernels. Compared to existing local descriptors, the proposed ones perform better in two instance-level recognition tasks and keep memory requirements lower. We experimentally show that global descriptors are not effective enough at large scale and that local descriptors are essential. We achieve state-of-the-art performance, in some cases even with a backbone network as small as ResNet18.

41. Deep Photometric Stereo for Non-Lambertian Surfaces [PDF]
  Guanying Chen, Kai Han, Boxin Shi, Yasuyuki Matsushita, Kwan-Yee K. Wong
Abstract: This paper addresses the problem of photometric stereo, in both calibrated and uncalibrated scenarios, for non-Lambertian surfaces based on deep learning. We first introduce a fully convolutional deep network for calibrated photometric stereo, which we call PS-FCN. Unlike traditional approaches that adopt simplified reflectance models to make the problem tractable, our method directly learns the mapping from reflectance observations to surface normal, and is able to handle surfaces with general and unknown isotropic reflectance. At test time, PS-FCN takes an arbitrary number of images and their associated light directions as input and predicts a surface normal map of the scene in a fast feed-forward pass. To deal with the uncalibrated scenario where light directions are unknown, we introduce a new convolutional network, named LCNet, to estimate light directions from input images. The estimated light directions and the input images are then fed to PS-FCN to determine the surface normals. Our method does not require a pre-defined set of light directions and can handle multiple images in an order-agnostic manner. Thorough evaluation of our approach on both synthetic and real datasets shows that it outperforms state-of-the-art methods in both calibrated and uncalibrated scenarios.

42. Challenge-Aware RGBT Tracking [PDF]
  Chenglong Li, Lei Liu, Andong Lu, Qing Ji, Jin Tang
Abstract: RGB and thermal source data suffer from both shared and specific challenges, and how to explore and exploit them plays a critical role in representing the target appearance in RGBT tracking. In this paper, we propose a novel challenge-aware neural network to handle the modality-shared challenges (e.g., fast motion, scale variation and occlusion) and the modality-specific ones (e.g., illumination variation and thermal crossover) for RGBT tracking. In particular, we design several parameter-shared branches in each layer to model the target appearance under the modality-shared challenges, and several parameter-independent branches under the modality-specific ones. Based on the observation that the modality-specific cues of different modalities usually contain complementary advantages, we propose a guidance module to transfer discriminative features from one modality to another, which can enhance the discriminative ability of a weaker modality. Moreover, all branches are aggregated together in an adaptive manner and embedded in parallel in the backbone network to efficiently form more discriminative target representations. These challenge-aware branches are able to model the target appearance under certain challenges, so that the target representations can be learnt with only a few parameters even when training data are insufficient. Experimental results show that our method operates at real-time speed while performing well against state-of-the-art methods on three benchmark datasets.

43. Virtual Multi-view Fusion for 3D Semantic Segmentation [PDF]
  Abhijit Kundu, Xiaoqi Yin, Alireza Fathi, David Ross, Brian Brewington, Thomas Funkhouser, Caroline Pantofaru
Abstract: Semantic segmentation of 3D meshes is an important problem for 3D scene understanding. In this paper we revisit the classic multiview representation of 3D meshes and study several techniques that make them effective for 3D semantic segmentation of meshes. Given a 3D mesh reconstructed from RGBD sensors, our method effectively chooses different virtual views of the 3D mesh and renders multiple 2D channels for training an effective 2D semantic segmentation model. Features from multiple per-view predictions are finally fused on the 3D mesh vertices to predict mesh semantic segmentation labels. Using the large-scale indoor 3D semantic segmentation benchmark ScanNet, we show that our virtual views enable more effective training of 2D semantic segmentation networks than previous multiview approaches. When the 2D per-pixel predictions are aggregated on 3D surfaces, our virtual multiview fusion method achieves significantly better 3D semantic segmentation results than all prior multiview approaches, and results competitive with recent 3D convolution approaches.
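
The per-vertex fusion step can be approximated by averaging the projected per-view predictions for each mesh vertex; the scatter-mean below is a simple stand-in for the paper's fusion:

```python
import torch

def fuse_views_on_vertices(view_logits, vertex_ids, num_vertices, num_classes):
    """view_logits: (N, C) class scores for N pixels gathered across all
    rendered views; vertex_ids: (N,) long tensor giving the mesh vertex each
    pixel projects to. Returns per-vertex averaged predictions."""
    fused = torch.zeros(num_vertices, num_classes)
    counts = torch.zeros(num_vertices, 1)
    fused.index_add_(0, vertex_ids, view_logits)
    counts.index_add_(0, vertex_ids, torch.ones(len(vertex_ids), 1))
    return fused / counts.clamp(min=1)

logits = torch.randn(10000, 20)                  # pixels from many views
vids = torch.randint(0, 2500, (10000,))          # vertex each pixel hits
vertex_pred = fuse_views_on_vertices(logits, vids, 2500, 20)
```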

44. Contrastive Visual-Linguistic Pretraining [PDF]
  Lei Shi, Kai Shuang, Shijie Geng, Peng Su, Zhengkai Jiang, Peng Gao, Zuohui Fu, Gerard de Melo, Sen Su
Abstract: Several multi-modality representation learning approaches such as LXMERT and ViLBERT have been proposed recently. Such approaches can achieve superior performance due to the high-level semantic information captured during large-scale multimodal pretraining. However, as ViLBERT and LXMERT adopt visual region regression and classification loss, they often suffer from domain gap and noisy label problems, based on the visual features having been pretrained on the Visual Genome dataset. To overcome these issues, we propose unbiased Contrastive Visual-Linguistic Pretraining (CVLP), which constructs a visual self-supervised loss built upon contrastive learning. We evaluate CVLP on several down-stream tasks, including VQA, GQA and NLVR2 to validate the superiority of contrastive learning on multi-modality representation learning. Our code is available at: this https URL.

45. GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-aware Supervision [PDF]
  Lei Ke, Shichao Li, Yanan Sun, Yu-Wing Tai, Chi-Keung Tang
Abstract: We present a novel end-to-end framework named as GSNet (Geometric and Scene-aware Network), which jointly estimates 6DoF poses and reconstructs detailed 3D car shapes from single urban street view. GSNet utilizes a unique four-way feature extraction and fusion scheme and directly regresses 6DoF poses and shapes in a single forward pass. Extensive experiments show that our diverse feature extraction and fusion scheme can greatly improve model performance. Based on a divide-and-conquer 3D shape representation strategy, GSNet reconstructs 3D vehicle shape with great detail (1352 vertices and 2700 faces). This dense mesh representation further leads us to consider geometrical consistency and scene context, and inspires a new multi-objective loss function to regularize network training, which in turn improves the accuracy of 6D pose estimation and validates the merit of jointly performing both tasks. We evaluate GSNet on the largest multi-task ApolloCar3D benchmark and achieve state-of-the-art performance both quantitatively and qualitatively. Project page is available at this https URL.

46. Towards End-to-end Video-based Eye-Tracking [PDF]
  Seonwook Park, Emre Aksan, Xucong Zhang, Otmar Hilliges
Abstract: Estimating eye-gaze from images alone is a challenging task, in large part due to unobservable person-specific factors. Achieving high accuracy typically requires labeled data from test users, which may not be attainable in real applications. We observe that there exists a strong relationship between what users are looking at and the appearance of the user's eyes. In response to this understanding, we propose a novel dataset and accompanying method which aims to explicitly learn these semantic and temporal relationships. Our video dataset consists of time-synchronized screen recordings, user-facing camera views, and eye gaze data, which allows for new benchmarks in temporal gaze tracking as well as label-free refinement of gaze. Importantly, we demonstrate that the fusion of information from visual stimuli as well as eye images can lead to performance similar to literature-reported figures acquired through supervised personalization. Our final method yields significant performance improvements on our proposed EVE dataset, with up to a 28 percent improvement in point-of-gaze estimates (resulting in 2.49 degrees of angular error), paving the path towards high-accuracy screen-based eye tracking purely from webcam sensors. The dataset and reference source code are available at this https URL

47. SADet: Learning An Efficient and Accurate Pedestrian Detector [PDF]
  Chubin Zhuang, Zhen Lei, Stan Z. Li
Abstract: Although anchor-based detectors have taken a big step forward in pedestrian detection, the overall performance of such algorithms still needs further improvement for practical applications, e.g., a good trade-off between accuracy and efficiency. To this end, this paper proposes a series of systematic optimization strategies for the detection pipeline of a one-stage detector, forming a single-shot anchor-based detector (SADet) for efficient and accurate pedestrian detection, which includes three main improvements. Firstly, we optimize the sample generation process by assigning soft tags to outlier samples to generate semi-positive samples with continuous tag values between 0 and 1, which not only produces more valid samples but also strengthens the robustness of the model. Secondly, a novel Center-IoU loss is applied as a new regression loss for bounding box regression, which not only retains the good characteristics of the IoU loss but also remedies some of its defects. Thirdly, we design Cosine-NMS for the post-processing of predicted bounding boxes, and further propose adaptive anchor matching to let the model adaptively match anchor boxes to full or visible bounding boxes according to the degree of occlusion, making the NMS and anchor matching algorithms more suitable for occluded pedestrian detection. Though structurally simple, SADet achieves state-of-the-art results and a real-time speed of 20 FPS on VGA-resolution images (640×480) on challenging pedestrian detection benchmarks, i.e., CityPersons, Caltech, and the human detection benchmark CrowdHuman, yielding a new, attractive pedestrian detector.
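
The abstract does not spell out the Center-IoU loss; the sketch below is one plausible form that keeps the IoU term and adds a normalized center-distance penalty, in the spirit of distance-IoU. Treat it as an assumption rather than the authors' definition:

```python
import torch

def center_iou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    # intersection over union
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # squared center distance, normalized by the enclosing box diagonal
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    ct = (target[:, :2] + target[:, 2:]) / 2
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps
    center2 = ((cp - ct) ** 2).sum(dim=1)
    return (1 - iou + center2 / diag2).mean()
```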

48. Detection and Annotation of Plant Organs from Digitized Herbarium Scans using Deep Learning [PDF]
  Sohaib Younis, Marco Schmidt, Claus Weiland, Steffan Dressler, Bernhard Seeger, Thomas Hickler
Abstract: As herbarium specimens are increasingly becoming digitized and accessible in online repositories, advanced computer vision techniques are being used to extract information from them. The presence of certain plant organs on herbarium sheets is useful information in various scientific contexts and automatic recognition of these organs will help mobilize such information. In our study we use deep learning to detect plant organs on digitized herbarium specimens with Faster R-CNN. For our experiment we manually annotated hundreds of herbarium scans with thousands of bounding boxes for six types of plant organs and used them for training and evaluating the plant organ detection model. The model worked particularly well on leaves and stems, while flowers were also present in large numbers in the sheets, but not equally well recognized.
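
Fine-tuning a torchvision Faster R-CNN for a fixed number of organ classes follows the standard head-replacement recipe; the class count below matches the paper's six organ types, and everything else is the stock torchvision workflow rather than the authors' exact training setup:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_ORGAN_CLASSES = 6  # the paper detects six types of plant organs

# COCO-pretrained detector with the classification head swapped out
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, NUM_ORGAN_CLASSES + 1)  # +1 background
# ... then train on (image, {boxes, labels}) pairs from the annotated scans
```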

49. Towards Purely Unsupervised Disentanglement of Appearance and Shape for Person Images Generation [PDF]
  Hongtao Yang, Tong Zhang, Wenbing Huang, Xuming He
Abstract: There has been a fair amount of research interest in exploring the disentanglement of appearance and shape from human images. Most existing endeavours pursue this goal either by using training images with annotations or by regulating the training process with external clues such as human skeletons, body segmentations or cloth patches. In this paper, we aim to address this challenge in a more unsupervised manner: we require neither annotations nor any external task-specific clues. To this end, we formulate an encoder-decoder-like network to extract both the shape and appearance features from input images at the same time, and train the parameters with three losses: a feature adversarial loss, a color consistency loss and a reconstruction loss. The feature adversarial loss imposes little to no mutual information between the extracted shape and appearance features, while the color consistency loss encourages the invariance of person appearance conditioned on different shapes. More importantly, our unsupervised framework (unsupervised learning has many interpretations in different tasks; to be clear, in this paper we refer to unsupervised learning as learning without task-specific human annotations, pairs or any form of weak supervision) utilizes learned shape features as masks which are applied to the input itself in order to obtain clean appearance features. Without using a fixed input human skeleton, our network better preserves the conditional human posture while requiring less supervision. Experimental results on DeepFashion and Market1501 demonstrate that the proposed method achieves clean disentanglement and is able to synthesize novel images of comparable quality with state-of-the-art weakly-supervised or even supervised methods.

50. U2-ONet: A Two-level Nested Octave U-structure with Multiscale Attention Mechanism for Moving Instances Segmentation [PDF]
  Chenjie Wang, Chengyuan Li, Bin Luo
Abstract: Most scenes in practical applications are dynamic scenes containing moving objects, so accurately segmenting moving objects is crucial for many computer vision applications. In order to efficiently segment all the moving objects in a scene, regardless of whether an object has a predefined semantic label, we propose a two-level nested Octave U-structure network with a multiscale attention mechanism, called U2-ONet. Each stage of U2-ONet is filled with our newly designed Octave ReSidual U-block (ORSU) to enhance the ability to obtain more contextual information at different scales while reducing the spatial redundancy of feature maps. In order to efficiently train our multi-scale deep network, we introduce a hierarchical training supervision strategy that calculates the loss at each level while adding a knowledge-matching loss to keep the optimization consistent. Experimental results show that our method achieves state-of-the-art performance on several general moving-object segmentation datasets.

51. SMART: Simultaneous Multi-Agent Recurrent Trajectory Prediction [PDF]
  Sriram N N, Buyu Liu, Francesco Pittaluga, Manmohan Chandraker
Abstract: We propose advances that address two key challenges in future trajectory prediction: (i) multimodality in both training data and predictions and (ii) constant-time inference regardless of the number of agents. Existing trajectory prediction methods are fundamentally limited by the lack of diversity in training data, which is difficult to acquire with sufficient coverage of possible modes. Our first contribution is an automatic method to simulate diverse trajectories in the top view. It uses pre-existing datasets and maps as initialization, mines existing trajectories to represent realistic driving behaviors, and uses a multi-agent vehicle dynamics simulator to generate diverse new trajectories that cover various modes and are consistent with scene-layout constraints. Our second contribution is a novel method that generates diverse predictions while accounting for scene semantics and multi-agent interactions, with constant-time inference independent of the number of agents. We propose a convLSTM with novel state pooling operations and losses to predict scene-consistent states of multiple agents in a single forward pass, along with a CVAE for diversity. We validate our proposed multi-agent trajectory prediction approach by training and testing on the proposed simulated dataset and existing real datasets of traffic scenes. In both cases, our approach outperforms SOTA methods by a large margin, highlighting the benefits of both our diverse dataset simulation and our constant-time diverse trajectory prediction method.

52. Approaches of large-scale images recognition with more than 50,000 categories [PDF]
  Wanhong Huang, Rui Geng
Abstract: Though current CV models have been able to achieve high accuracy on small-scale image classification datasets with hundreds or thousands of categories, many models become infeasible in computation or space consumption when it comes to large-scale datasets with more than 50,000 categories. In this paper, we provide a viable solution for classifying large-scale species datasets using traditional CV techniques such as feature extraction and processing, BOVW (Bag of Visual Words), and statistical learning techniques like Mini-Batch K-Means and SVM, which are then combined with a neural network model. When applying these techniques, we optimize time and memory consumption so that the approach remains feasible for large-scale datasets. We also use several techniques to reduce the impact of mislabeled data. We use a dataset with more than 50,000 categories, and all operations are done on a common computer with 16GB RAM and a 3.0GHz CPU. Our contributions are: 1) an analysis of the problems that may arise in the training process, together with several feasible ways to solve them; 2) a combination of traditional CV models with neural network models that provides feasible schemes for training large-scale classification datasets within the constraints of time and space resources.
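
A compact sketch of the BOVW + Mini-Batch K-Means + linear SVM portion of such a pipeline, with synthetic descriptors standing in for real local features (all names here are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

def bovw_features(descriptors_per_image, kmeans):
    """Turn each image's (n_i, D) local descriptors into a normalized
    visual-word histogram."""
    hists = []
    for d in descriptors_per_image:
        words = kmeans.predict(d)
        h, _ = np.histogram(words, bins=np.arange(kmeans.n_clusters + 1))
        hists.append(h / max(h.sum(), 1))
    return np.stack(hists)

rng = np.random.default_rng(0)
descs = [rng.normal(size=(50, 32)).astype(np.float32) for _ in range(20)]
labels = rng.integers(0, 4, size=20)

kmeans = MiniBatchKMeans(n_clusters=64, batch_size=256, n_init=3).fit(np.vstack(descs))
clf = LinearSVC().fit(bovw_features(descs, kmeans), labels)
```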

53. A Dual Iterative Refinement Method for Non-rigid Shape Matching [PDF]
  Rui Xiang, Rongjie Lai, Hongkai Zhao
Abstract: In this work, a simple and efficient dual iterative refinement (DIR) method is proposed for dense correspondence between two nearly isometric shapes. The key idea is to use dual information, such as spatial and spectral, or local and global features, in a complementary and effective way, and to extract more accurate information from the current iteration to use in the next one. In each DIR iteration, starting from the current correspondence, a zoom-in process at each point is used to select well-matched anchor pairs by a local mapping distortion criterion. These selected anchor pairs are then used to align spectral features (or other appropriate global features) whose dimension adaptively matches the capacity of the selected anchor pairs. Thanks to the effective combination of complementary information in a data-adaptive way, DIR is not only efficient but also robust, rendering accurate results within a few iterations. By choosing appropriate dual features, DIR has the flexibility to handle patch and partial matching as well. Extensive experiments on various datasets demonstrate the superiority of DIR over other state-of-the-art methods in terms of both accuracy and efficiency.

54. Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve [PDF]
  Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai
Abstract: Object recognition has seen significant progress in the image domain, with focus primarily on 2D perception. We propose to leverage existing large-scale datasets of 3D models to understand the underlying 3D structure of objects seen in an image by constructing a CAD-based representation of the objects and their poses. We present Mask2CAD, which jointly detects objects in real-world images and for each detected object, optimizes for the most similar CAD model and its pose. We construct a joint embedding space between the detected regions of an image corresponding to an object and 3D CAD models, enabling retrieval of CAD models for an input RGB image. This produces a clean, lightweight representation of the objects in an image; this CAD-based representation ensures a valid, efficient shape representation for applications such as content creation or interactive scenarios, and makes a step towards understanding the transformation of real-world imagery to a synthetic domain. Experiments on real-world images from Pix3D demonstrate the advantage of our approach in comparison to state of the art. To facilitate future research, we additionally propose a new image-to-3D baseline on ScanNet which features larger shape diversity, real-world occlusions, and challenging image views.

55. Style is a Distribution of Features [PDF]
  Eddie Huang, Sahil Gupta
Abstract: Neural style transfer (NST) is a powerful image generation technique that uses a convolutional neural network (CNN) to merge the content of one image with the style of another. Contemporary methods of NST use first- or second-order statistics of the CNN's features to achieve transfers with relatively little computational cost. However, these methods cannot fully extract the style from the CNN's features. We present a new algorithm for style transfer that fully extracts the style from the features by redefining the style loss as the Wasserstein distance between the distributions of features. Thus, we set a new standard in style transfer quality. In addition, we state two important interpretations of NST. The first is a re-emphasis from Li et al., which states that style is simply the distribution of features. The second states that NST is a type of generative adversarial network (GAN) problem.
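
Computing an exact Wasserstein distance between large feature sets is expensive; a common tractable stand-in is the sliced Wasserstein distance, sketched below under the assumption that both feature sets contain the same number of samples (subsample otherwise). The paper's exact estimator may differ:

```python
import torch
import torch.nn.functional as F

def sliced_wasserstein(feat_a, feat_b, n_proj=64):
    """feat_a, feat_b: (N, C) features, e.g. CNN activations flattened over
    spatial positions. Averages squared 1D Wasserstein distances along
    random projection directions."""
    dirs = F.normalize(torch.randn(feat_a.shape[1], n_proj), dim=0)
    pa, _ = torch.sort(feat_a @ dirs, dim=0)   # sorted 1D projections
    pb, _ = torch.sort(feat_b @ dirs, dim=0)
    return ((pa - pb) ** 2).mean()

style_loss = sliced_wasserstein(torch.randn(1024, 256), torch.randn(1024, 256))
```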

56. HATNet: An End-to-End Holistic Attention Network for Diagnosis of Breast Biopsy Images [PDF]
  Sachin Mehta, Ximing Lu, Donald Weaver, Joann G. Elmore, Hannaneh Hajishirzi, Linda Shapiro
Abstract: Training end-to-end networks for classifying gigapixel size histopathological images is computationally intractable. Most approaches are patch-based and first learn local representations (patch-wise) before combining these local representations to produce image-level decisions. However, dividing large tissue structures into patches limits the context available to these networks, which may reduce their ability to learn representations from clinically relevant structures. In this paper, we introduce a novel attention-based network, the Holistic ATtention Network (HATNet) to classify breast biopsy images. We streamline the histopathological image classification pipeline and show how to learn representations from gigapixel size images end-to-end. HATNet extends the bag-of-words approach and uses self-attention to encode global information, allowing it to learn representations from clinically relevant tissue structures without any explicit supervision. It outperforms the previous best network, Y-Net, which uses supervision in the form of tissue-level segmentation masks, by 8%. Importantly, our analysis reveals that HATNet learns representations from clinically relevant structures, and it matches the classification accuracy of human pathologists for this challenging test set. Our source code is available at this https URL

57. Robust and Generalizable Visual Representation Learning via Random Convolutions [PDF]
  Zhenlin Xu, Deyi Liu, Junlin Yang, Marc Niethammer
Abstract: While successful for various computer vision tasks, deep neural networks have been shown to be vulnerable to texture style shifts and small perturbations to which humans are robust. Hence, our goal is to train models in a way that improves their robustness to these perturbations. We are motivated by the approximately shape-preserving property of randomized convolutions, which is due to distance preservation under random linear transforms. Intuitively, randomized convolutions create an infinite number of new domains with similar object shapes but random local texture. Therefore, we explore using the outputs of multi-scale random convolutions as new images, or mixing them with the original images, during training. When applying a network trained with our approach to unseen domains, our method consistently improves performance on domain generalization benchmarks and is scalable to ImageNet. Especially for the challenging scenario of generalizing to the sketch domain in PACS and to ImageNet-Sketch, our method outperforms state-of-the-art methods by a large margin. More interestingly, our method can benefit downstream tasks by providing a more robust pretrained visual representation.
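
The augmentation itself is easy to sketch: filter each batch with a freshly sampled random convolution, so shapes are roughly preserved while local texture is randomized. The kernel-size range and weight scaling below are assumptions:

```python
import torch
import torch.nn.functional as F

def random_conv_augment(images, max_kernel=7):
    """images: (B, 3, H, W). Each call samples a new random filter bank,
    effectively drawing a new texture 'domain'."""
    k = int(torch.randint(0, (max_kernel + 1) // 2, (1,))) * 2 + 1  # odd size in {1, 3, 5, 7}
    weight = torch.randn(3, 3, k, k) / (3 * k * k) ** 0.5
    return F.conv2d(images, weight, padding=k // 2)

augmented = random_conv_augment(torch.rand(8, 3, 224, 224))
```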

58. GP-Aligner: Unsupervised Non-rigid Groupwise Point Set Registration Based On Optimized Group Latent Descriptor [PDF]
  Lingjing Wang, Xiang Li, Yi Fang
Abstract: In this paper, we propose a novel method named GP-Aligner to deal with the problem of non-rigid groupwise point set registration. Compared to previous non-learning approaches, our proposed method gains competitive advantages by leveraging the power of deep neural networks to effectively and efficiently learn to align a large number of highly deformed 3D shapes with superior performance. Unlike most learning-based methods that use an explicit feature encoding network to extract per-shape features and their correlations, our model leverages a model-free learnable latent descriptor to characterize the group relationship. More specifically, for a given group we first define an optimizable Group Latent Descriptor (GLD) to characterize the groupwise relationship among a group of point sets. Each GLD is randomly initialized from a Gaussian distribution and then concatenated with the coordinates of each point of the associated point sets in the group. A neural network-based decoder is further constructed to predict the coherent drifts as the desired transformation from input groups of shapes to aligned groups of shapes. During the optimization process, GP-Aligner jointly updates all GLDs and the weight parameters of the decoder network towards the minimization of an unsupervised groupwise alignment loss. After optimization, for each group our model coherently drives each point set towards a middle, common position (shape) without specifying one as the target. GP-Aligner does not require large-scale training data for network training and can directly align groups of point sets in a one-stage optimization process. It shows improvements in both accuracy and computational efficiency in comparison with state-of-the-art methods for groupwise point set registration, and great efficiency in aligning a large number of groups of real-world 3D shapes.

59. MRGAN: Multi-Rooted 3D Shape Generation with Unsupervised Part Disentanglement [PDF]
  Rinon Gal, Amit Bermano, Hao Zhang, Daniel Cohen-Or
Abstract: We present MRGAN, a multi-rooted adversarial network which generates part-disentangled 3D point-cloud shapes without part-based shape supervision. The network fuses multiple branches of tree-structured graph convolution layers which produce point clouds, with learnable constant inputs at the tree roots. Each branch learns to grow a different shape part, offering control over the shape generation at the part level. Our network encourages disentangled generation of semantic parts via two key ingredients: a root-mixing training strategy which helps decorrelate the different branches to facilitate disentanglement, and a set of loss terms designed with part disentanglement and shape semantics in mind. Of these, a novel convexity loss incentivizes the generation of parts that are more convex, as semantic parts tend to be. In addition, a root-dropping loss further ensures that each root seeds a single part, preventing the degeneration or over-growth of the point-producing branches. We evaluate the performance of our network on a number of 3D shape classes, and offer qualitative and quantitative comparisons to previous works and baseline approaches. We demonstrate the controllability offered by our part-disentangled generation through two applications for shape modeling: part mixing and individual part variation, without receiving segmented shapes as input.

60. Gradient Regularized Contrastive Learning for Continual Domain Adaptation [PDF]
  Peng Su, Shixiang Tang, Peng Gao, Di Qiu, Ni Zhao, Xiaogang Wang
Abstract: Human beings can quickly adapt to environmental changes by leveraging learning experience. However, the poor ability to adapt to dynamic environments remains a major challenge for AI models. To better understand this issue, we study the problem of continual domain adaptation, where the model is presented with a labeled source domain and a sequence of unlabeled target domains. There are two major obstacles in this problem: domain shifts and catastrophic forgetting. In this work, we propose Gradient Regularized Contrastive Learning to solve the above obstacles. At the core of our method, gradient regularization plays two key roles: (1) it enforces the gradient of the contrastive loss not to increase the supervised training loss on the source domain, which maintains the discriminative power of learned features; (2) it regularizes the gradient update on the new domain not to increase the classification loss on the old target domains, which enables the model to adapt to an incoming target domain while preserving the performance on previously observed domains. Hence our method can jointly learn semantically discriminative and domain-invariant features with a labeled source domain and unlabeled target domains. Experiments on the Digits, DomainNet and Office-Caltech benchmarks demonstrate the strong performance of our approach when compared to the state of the art.
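
A generic gradient-projection step in the spirit of this regularization: if the new task's gradient conflicts with the gradient of a loss that must not increase, the conflicting component is removed. This illustrates the mechanism, not the paper's exact update rule:

```python
import torch

def project_gradient(g_new, g_ref):
    """g_new, g_ref: flattened gradients. Taking a step along -g_new would
    increase the reference loss when dot(g_new, g_ref) < 0, so in that case
    we subtract the component of g_new that opposes g_ref."""
    dot = torch.dot(g_new, g_ref)
    if dot < 0:
        g_new = g_new - dot / torch.dot(g_ref, g_ref).clamp(min=1e-12) * g_ref
    return g_new

g = project_gradient(torch.randn(1000), torch.randn(1000))
```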

61. Video Super Resolution Based on Deep Learning: A comprehensive survey [PDF]
  Hongying Liu, Zhubo Ruan, Peng Zhao, Fanhua Shang, Linlin Yang, Yuanyuan Liu
Abstract: In recent years, deep learning has made great progress in the fields of image recognition, video analysis, natural language processing and speech recognition, including video super-resolution tasks. In this survey, we comprehensively investigate 28 state-of-the-art video super-resolution methods based on deep learning. It is well known that leveraging information within video frames is important for video super-resolution. Hence we propose a taxonomy and classify the methods into six sub-categories according to the ways of utilizing inter-frame information. Moreover, the architectures and implementation details (including input and output, loss function and learning rate) of all the methods are depicted in detail. Finally, we summarize and compare their performance on some benchmark datasets under different magnification factors. We also discuss some challenges, which need to be further addressed by researchers in the video super-resolution community. Therefore, this work is expected to make a contribution to the future development of research in video super-resolution, and to improve the understandability and transferability of existing and future techniques into practice.

62. Crowdsourced 3D Mapping: A Combined Multi-View Geometry and Self-Supervised Learning Approach [PDF]
  Hemang Chawla, Matti Jukola, Terence Brouns, Elahe Arani, Bahram Zonooz
Abstract: The ability to efficiently utilize crowdsourced visual data carries immense potential for the domains of large scale dynamic mapping and autonomous driving. However, state-of-the-art methods for crowdsourced 3D mapping assume prior knowledge of camera intrinsics. In this work, we propose a framework that estimates the 3D positions of semantically meaningful landmarks such as traffic signs without assuming known camera intrinsics, using only monocular color camera and GPS. We utilize multi-view geometry as well as deep learning based self-calibration, depth, and ego-motion estimation for traffic sign positioning, and show that combining their strengths is important for increasing the map coverage. To facilitate research on this task, we construct and make available a KITTI based 3D traffic sign ground truth positioning dataset. Using our proposed framework, we achieve an average single-journey relative and absolute positioning accuracy of 39cm and 1.26m respectively, on this dataset.

63. Approximated Bilinear Modules for Temporal Modeling [PDF]
  Xinqi Zhu, Chang Xu, Langwen Hui, Cewu Lu, Dacheng Tao
Abstract: We consider two less-emphasized temporal properties of video: 1. Temporal cues are fine-grained; 2. Temporal modeling needs reasoning. To tackle both problems at once, we exploit approximated bilinear modules (ABMs) for temporal modeling. Two main points make the modules effective: two-layer MLPs can be seen as a constrained approximation of bilinear operations, and thus can be used to construct deep ABMs in existing CNNs while reusing pretrained parameters; frame features can be divided into static and dynamic parts because of visual repetition in adjacent frames, which enables temporal modeling to be more efficient. Multiple ABM variants and implementations are investigated, from high performance to high efficiency. Specifically, we show how two-layer subnets in CNNs can be converted to temporal bilinear modules by adding an auxiliary branch. Besides, we introduce snippet sampling and shifting inference to boost sparse-frame video classification performance. Extensive ablation studies are conducted to show the effectiveness of the proposed techniques. Our models can outperform most state-of-the-art methods on the Something-Something v1 and v2 datasets without Kinetics pretraining, and are also competitive on other YouTube-like action recognition datasets. Our code is available on this https URL.
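
A full bilinear interaction x^T W y between frame features is expensive, so it is commonly approximated in low rank by two linear maps followed by an elementwise product. The sketch below illustrates that factorization between features of adjacent frames; it is an assumed, generic module, not the paper's exact ABM.

```python
import torch
import torch.nn as nn

class LowRankTemporalBilinear(nn.Module):
    """Rank-r approximation of a bilinear form between two frames' features:
    x^T W y ~ sum_r (Ux)_r * (Vy)_r, computed as an elementwise product."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.U = nn.Linear(dim, rank, bias=False)
        self.V = nn.Linear(dim, rank, bias=False)

    def forward(self, x_t: torch.Tensor, x_next: torch.Tensor) -> torch.Tensor:
        return self.U(x_t) * self.V(x_next)
```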

64. Learning Disentangled Representations with Latent Variation Predictability [PDF]
  Xinqi Zhu, Chang Xu, Dacheng Tao
Abstract: Latent traversal is a popular approach to visualize the disentangled latent representations. Given a bunch of variations in a single unit of the latent representation, it is expected that there is a change in a single factor of variation of the data while others are fixed. However, this impressive experimental observation is rarely explicitly encoded in the objective function of learning disentangled representations. This paper defines the variation predictability of latent disentangled representations. Given image pairs generated by latent codes varying in a single dimension, this varied dimension could be closely correlated with these image pairs if the representation is well disentangled. Within an adversarial generation process, we encourage variation predictability by maximizing the mutual information between latent variations and corresponding image pairs. We further develop an evaluation metric that does not rely on the ground-truth generative factors to measure the disentanglement of latent representations. The proposed variation predictability is a general constraint that is applicable to the VAE and GAN frameworks for boosting disentanglement of latent representations. Experiments show that the proposed variation predictability correlates well with existing ground-truth-required metrics and the proposed algorithm is effective for disentanglement learning.
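
The mutual-information term can be lower-bounded InfoGAN-style with an auxiliary classifier that recovers which latent dimension was varied. The following hedged sketch assumes hypothetical `generator` and `predictor` modules and is only one plausible instantiation of the training signal described above.

```python
import torch
import torch.nn.functional as F

def variation_predictability_loss(generator, predictor, batch_size, z_dim, scale=1.0):
    """Perturb one randomly chosen latent dimension per sample, generate the
    image pair, and ask a predictor to recover the varied dimension; the
    cross-entropy is a surrogate for the mutual-information objective."""
    z = torch.randn(batch_size, z_dim)
    dims = torch.randint(0, z_dim, (batch_size,))
    z_var = z.clone()
    z_var[torch.arange(batch_size), dims] += scale * torch.randn(batch_size)
    pairs = torch.cat([generator(z), generator(z_var)], dim=1)  # channel concat
    logits = predictor(pairs)                                   # (batch, z_dim)
    return F.cross_entropy(logits, dims)
```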

65. MirrorNet: Bio-Inspired Adversarial Attack for Camouflaged Object Segmentation [PDF]
  Jinnan Yan, Trung-Nghia Le, Khanh-Duy Nguyen, Minh-Triet Tran, Thanh-Toan Do, Tam V. Nguyen
Abstract: Camouflaged objects are generally difficult to detect in their natural environment, even for human beings. In this paper, we propose a novel bio-inspired network, named MirrorNet, that leverages both instance segmentation and adversarial attack for camouflaged object segmentation. Differently from existing segmentation networks, our proposed network possesses two segmentation streams: the main stream and the adversarial stream, corresponding to the original image and its flipped image, respectively. The output of the adversarial stream is then fused into the main stream's result to form the final camouflage map and boost segmentation accuracy. Extensive experiments conducted on the public CAMO dataset demonstrate the effectiveness of our proposed network. Our proposed method achieves 89% accuracy, outperforming the state of the art. Project Page: this https URL
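
The two-stream inference can be summarized in a few lines. This is a hedged sketch with placeholder `main_stream`, `adversarial_stream` and `fuse` modules, assuming horizontal flipping; it is not the authors' implementation.

```python
import torch

def mirror_forward(main_stream, adversarial_stream, fuse, image):
    """The adversarial stream sees the horizontally flipped image; its map is
    flipped back to the original orientation and fused with the main stream's
    map to produce the final camouflage map."""
    main_map = main_stream(image)
    flipped = torch.flip(image, dims=[-1])                    # horizontal flip
    adv_map = torch.flip(adversarial_stream(flipped), dims=[-1])
    return fuse(main_map, adv_map)
```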

66. Applying Semantic Segmentation to Autonomous Cars in the Snowy Environment [PDF]
  Zhaoyu Pan, Takanori Emaru, Ankit Ravankar, Yukinori Kobayashi
Abstract: This paper mainly focuses on environment perception in snowy situations, which forms the backbone of autonomous driving technology. For this purpose, semantic segmentation is employed to classify objects while the vehicle drives autonomously. We train Fully Convolutional Networks (FCN) on our own dataset and present the experimental results. Finally, the outcomes are analyzed and conclusions are drawn. We conclude that the database still needs to be optimized and that a better-suited algorithm should be proposed to obtain better results.

67. OpenRooms: An End-to-End Open Framework for Photorealistic Indoor Scene Datasets [PDF]
  Zhengqin Li, Ting-Wei Yu, Shen Sang, Sarah Wang, Sai Bi, Zexiang Xu, Hong-Xing Yu, Kalyan Sunkavalli, Miloš Hašan, Ravi Ramamoorthi, Manmohan Chandraker
Abstract: Large-scale photorealistic datasets of indoor scenes, with ground truth geometry, materials and lighting, are important for deep learning applications in scene reconstruction and augmented reality. The associated shape, material and lighting assets can be scanned or artist-created, both of which are expensive; the resulting data is usually proprietary. We aim to make the dataset creation process for indoor scenes widely accessible, allowing researchers to transform casually acquired scans to large-scale datasets with high-quality ground truth. We achieve this by estimating consistent furniture and scene layout, ascribing high quality materials to all surfaces and rendering images with spatially-varying lighting consisting of area lights and environment maps. We demonstrate an instantiation of our approach on the publicly available ScanNet dataset. Deep networks trained on our proposed dataset achieve competitive performance for shape, material and lighting estimation on real images and can be used for photorealistic augmented reality applications, such as object insertion and material editing. Importantly, the dataset and all the tools to create such datasets from scans will be released, enabling others in the community to easily build large-scale datasets of their own. All code, models, data, dataset creation tool will be publicly released on our project page.

68. A Self-Training Approach for Point-Supervised Object Detection and Counting in Crowds [PDF]
  Yi Wang, Junhui Hou, Xinyu Hou, Lap-Pui Chau
Abstract: In this paper, we propose a novel self-training approach which enables a typical object detector trained only with point-level annotations (i.e., objects are labeled with points) to estimate both the center points and sizes of crowded objects. Specifically, during training we utilize the available point annotations to directly supervise the estimation of the center points of objects. Based on a locally-uniform distribution assumption, we initialize pseudo object sizes from the point-level supervisory information, which are then leveraged to guide the regression of object sizes via a crowdedness-aware loss. Meanwhile, we propose a confidence and order-aware refinement scheme to continuously refine the initial pseudo object sizes such that the ability of the detector is increasingly boosted to simultaneously detect and count objects in crowds. Moreover, to address extremely crowded scenes, we propose an effective decoding method to improve the representation ability of the detector. Experimental results on the WiderFace benchmark show that our approach significantly outperforms state-of-the-art point-supervised methods under both detection and counting tasks, i.e., our method improves the average precision by more than 10% and reduces the counting error by 31.2%. In addition, our method obtains the best results on the dense crowd counting dataset (i.e., ShanghaiTech) and vehicle counting datasets (i.e., CARPK and PUCPR+) when compared with state-of-the-art counting-by-detection methods. We will make the code publicly available to facilitate future research.
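
One plausible reading of the locally-uniform assumption is that an object's initial pseudo size is proportional to the spacing of nearby annotations. The sketch below implements that reading with a nearest-neighbour rule; the helper name and the ratio are hypothetical, and the paper's exact rule may differ.

```python
import numpy as np

def init_pseudo_sizes(points: np.ndarray, ratio: float = 0.5) -> np.ndarray:
    """Initialize the pseudo size at each annotated point as a fraction of the
    distance to its nearest neighbouring annotation. points: (N, 2) array."""
    diff = points[:, None, :] - points[None, :, :]   # (N, N, 2) displacements
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)                   # exclude self-distance
    return ratio * dist.min(axis=1)
```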

69. Counting Fish and Dolphins in Sonar Images Using Deep Learning [PDF]
  Stefan Schneider, Alex Zhuang
Abstract: Deep learning provides an opportunity to resolve conflicting reports on the relationship between the Amazon river's fish and dolphin abundance and canopy cover reduced by deforestation. Current fish and dolphin abundance estimates are obtained by on-site sampling using visual and capture/release strategies. We propose a novel approach to estimating fish abundance using deep learning on sonar images taken from the back of a trolling boat. We consider a dataset of 143 images containing 0-34 fish and 0-3 dolphins, provided by the Fund Amazonia research group. To overcome the data limitation, we test the capabilities of data augmentation on an unconventional 15/85 training/testing split. Using 20 training images, we simulate a gradient of data up to 25,000 images using augmented backgrounds and randomly placed, rotated fish and dolphin crops taken from the training set. We then train four multitask network architectures: DenseNet201, InceptionResNetV2, Xception, and MobileNetV2 to predict fish and dolphin counts using two function approximation methods: regression and classification. For regression, DenseNet201 performed best for fish and Xception best for dolphins, with mean squared errors of 2.11 and 0.133 respectively. For classification, InceptionResNetV2 performed best for fish and MobileNetV2 best for dolphins, with mean errors of 2.07 and 0.245 respectively. Considering the 123 testing images, our results show the success of data simulation for limited sonar data sets. We find DenseNet201 is able to identify dolphins after approximately 5,000 training images, while fish required the full 25,000. Our method can be used to lower costs and expedite the data analysis of fish and dolphin abundance in real time along the Amazon river and river systems worldwide.
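
The paste-based augmentation can be sketched simply. This is an assumed, simplified version (90-degree rotations only; arbitrary-angle rotation would need resampling, and the crop is assumed to fit inside the background); the function name is hypothetical.

```python
import numpy as np

def paste_random_crop(background: np.ndarray, crop: np.ndarray,
                      rng: np.random.Generator) -> np.ndarray:
    """Simulate a new sonar training image by pasting a randomly rotated
    fish/dolphin crop at a random position on an (augmented) background."""
    crop = np.rot90(crop, k=int(rng.integers(0, 4)))
    h, w = crop.shape[:2]
    H, W = background.shape[:2]
    y = int(rng.integers(0, H - h + 1))
    x = int(rng.integers(0, W - w + 1))
    out = background.copy()
    out[y:y + h, x:x + w] = crop
    return out
```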

70. Spatiotemporal Bundle Adjustment for Dynamic 3D Human Reconstruction in the Wild [PDF]
  Minh Vo, Yaser Sheikh, Srinivasa G. Narasimhan
Abstract: Bundle adjustment jointly optimizes camera intrinsics and extrinsics and 3D point triangulation to reconstruct a static scene. The triangulation constraint, however, is invalid for moving points captured in multiple unsynchronized videos and bundle adjustment is not designed to estimate the temporal alignment between cameras. We present a spatiotemporal bundle adjustment framework that jointly optimizes four coupled sub-problems: estimating camera intrinsics and extrinsics, triangulating static 3D points, as well as sub-frame temporal alignment between cameras and computing 3D trajectories of dynamic points. Key to our joint optimization is the careful integration of physics-based motion priors within the reconstruction pipeline, validated on a large motion capture corpus of human subjects. We devise an incremental reconstruction and alignment algorithm to strictly enforce the motion prior during the spatiotemporal bundle adjustment. This algorithm is further made more efficient by a divide and conquer scheme while still maintaining high accuracy. We apply this algorithm to reconstruct 3D motion trajectories of human bodies in dynamic events captured by multiple uncalibrated and unsynchronized video cameras in the wild. To make the reconstruction visually more interpretable, we fit a statistical 3D human body model to the asynchronous video streams.Compared to the baseline, the fitting significantly benefits from the proposed spatiotemporal bundle adjustment procedure. Because the videos are aligned with sub-frame precision, we reconstruct 3D motion at much higher temporal resolution than the input videos.

71. Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data [PDF]
  Michael Cogswell, Jiasen Lu, Rishabh Jain, Stefan Lee, Devi Parikh, Dhruv Batra
Abstract: Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to new tasks, minimizing expensive data collection and annotation. In this work, we study a setting we call "Dialog without Dialog", which requires agents to develop visually grounded dialog models that can adapt to new tasks without language level supervision. By factorizing intention and language, our model minimizes linguistic drift after fine-tuning for new tasks. We present qualitative results, automated metrics, and human studies that all show our model can adapt to new tasks and maintain language quality. Baselines either fail to perform well at new tasks or experience language drift, becoming unintelligible to humans. Code has been made available at this https URL

72. Hard negative examples are hard, but useful [PDF]
  Hong Xuan, Abby Stylianou, Xiaotong Liu, Robert Pless
Abstract: Triplet loss is an extremely common approach to distance metric learning. Representations of images from the same class are optimized to be mapped closer together in an embedding space than representations of images from different classes. Much work on triplet losses focuses on selecting the most useful triplets of images to consider, with strategies that select dissimilar examples from the same class or similar examples from different classes. The consensus of previous research is that optimizing with the hardest negative examples leads to bad training behavior. That's a problem -- these hardest negatives are literally the cases where the distance metric fails to capture semantic similarity. In this paper, we characterize the space of triplets and derive why hard negatives make triplet loss training fail. We offer a simple fix to the loss function and show that, with this fix, optimizing with hard negative examples becomes feasible. This leads to more generalizable features, and image retrieval results that outperform the state of the art for datasets with high intra-class variance.
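
For reference, the batch-hard mining setup the paper analyzes looks roughly like the standard sketch below; the paper's actual fix to the loss is not reproduced here.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """For each anchor, take the farthest same-class example and the closest
    (hardest) different-class example, then apply the triplet margin."""
    dist = torch.cdist(embeddings, embeddings)          # pairwise L2 distances
    same = labels[:, None] == labels[None, :]
    hardest_pos = dist.masked_fill(~same, 0).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()
```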

73. Point Cloud Based Reinforcement Learning for Sim-to-Real and Partial Observability in Visual Navigation [PDF]
  Kenzo Lobos-Tsunekawa, Tatsuya Harada
Abstract: Reinforcement Learning (RL), among other learning-based methods, represents a powerful tool to solve complex robotic tasks (e.g., actuation, manipulation, navigation, etc.), with the need for real-world data to train these systems being one of its most important limitations. The use of simulators is one way to address this issue, yet knowledge acquired in simulations does not work directly in the real world, which is known as the sim-to-real transfer problem. While previous works focus on the nature of the images used as observations (e.g., textures and lighting), which has proven useful for sim-to-sim transfer, they neglect other concerns regarding said observations, such as precise geometrical meanings, failing at robot-to-robot, and thus at sim-to-real transfers. We propose a method that learns on an observation space constructed by point clouds and environment randomization, generalizing among robots and simulators to achieve sim-to-real, while also addressing partial observability. We demonstrate the benefits of our methodology on the point goal navigation task, in which our method proves to be highly unaffected by unseen scenarios produced by robot-to-robot transfer, outperforms image-based baselines in robot-randomized experiments, and presents high performance in sim-to-sim conditions. Finally, we perform several experiments to validate the sim-to-real transfer to a physical domestic robot platform, confirming the out-of-the-box performance of our system.

74. MMDF: Mobile Microscopy Deep Framework [PDF]
  Anatasiia Kornilova, Mikhail Salnikov, Olga Novitskaya, Maria Begicheva, Egor Sevriugov, Kirill Shcherbakov
Abstract: In the last decade, huge steps were made in the field of mobile microscope development, as well as in the application of mobile microscopy to real-life disease diagnostics and many other important areas (air/water quality monitoring, education, agriculture). In the current study we applied image processing techniques from Deep Learning (in-focus/out-of-focus classification, image deblurring and denoising, multi-focus image fusion) to data obtained from a mobile microscope. An overview of significant works for every task is presented, and the most suitable approaches are highlighted. The chosen approaches were implemented, and their performance was compared with classical computer vision techniques.

75. Cloud Detection through Wavelet Transforms in Machine Learning and Deep Learning [PDF]
  Philippe Reiter
Abstract: Cloud detection is a specialized application of image recognition and object detection using remotely sensed data. The task presents a number of challenges, including analyzing images obtained in visible, infrared and multi-spectral frequencies, usually without ground truth data for comparison. Moreover, machine learning and deep learning (MLDL) algorithms applied to this task are required to be computationally efficient, as they are typically deployed in low-power devices and called to operate in real-time. This paper explains Wavelet Transform (WT) theory, comparing it to more widely used image and signal processing transforms, and explores the use of WT as a powerful signal compressor and feature extractor for MLDL classifiers.
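
A common way to use the WT as a feature extractor for an MLDL classifier is to summarize subband energies of a multi-level 2-D DWT. The sketch below is a generic example of this pattern (not the paper's pipeline), assuming a single-channel image array.

```python
import numpy as np
import pywt

def wavelet_energy_features(image: np.ndarray, wavelet: str = "db2", level: int = 3):
    """Compress a 2-D grayscale image into a small vector of subband energies
    from a multi-level discrete wavelet transform."""
    coeffs = pywt.wavedec2(image, wavelet=wavelet, level=level)
    feats = [float(np.mean(coeffs[0] ** 2))]            # approximation energy
    for c_h, c_v, c_d in coeffs[1:]:                    # detail subbands
        feats += [float(np.mean(c_h ** 2)),
                  float(np.mean(c_v ** 2)),
                  float(np.mean(c_d ** 2))]
    return np.asarray(feats)
```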

76. Towards Learning Convolutions from Scratch [PDF]
  Behnam Neyshabur
Abstract: Convolution is one of the most essential components of architectures used in computer vision. As machine learning moves towards reducing the expert bias and learning it from data, a natural next step seems to be learning convolution-like structures from scratch. This, however, has proven elusive. For example, current state-of-the-art architecture search algorithms use convolution as one of the existing modules rather than learning it from data. In an attempt to understand the inductive bias that gives rise to convolutions, we investigate minimum description length as a guiding principle and show that in some settings, it can indeed be indicative of the performance of architectures. To find architectures with small description length, we propose β-LASSO, a simple variant of the LASSO algorithm that, when applied to fully-connected networks for image classification tasks, learns architectures with local connections and achieves state-of-the-art accuracies for training fully-connected nets on CIFAR-10 (85.19%), CIFAR-100 (59.56%) and SVHN (94.07%), bridging the gap between fully-connected and convolutional nets.
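
One natural reading of β-LASSO is an ISTA-style proximal update whose soft threshold is scaled by β, which drives most weights to exactly zero and leaves a sparse, local connectivity pattern. The sketch below follows that reading; the update rule and hyperparameter values are assumptions, not the paper's exact algorithm.

```python
import torch

@torch.no_grad()
def beta_lasso_step(w: torch.Tensor, grad: torch.Tensor,
                    lr: float = 0.1, lam: float = 1e-5, beta: float = 50.0):
    """Gradient step followed by soft-thresholding with an aggressively
    scaled threshold beta * lam (illustrative values)."""
    w = w - lr * grad                                     # plain gradient step
    return torch.sign(w) * torch.clamp(w.abs() - beta * lam, min=0.0)
```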

77. Orpheus: A New Deep Learning Framework for Easy Deployment and Evaluation of Edge Inference [PDF]
  Perry Gibson, José Cano
Abstract: Optimising deep learning inference across edge devices and optimisation targets such as inference time, memory footprint and power consumption is a key challenge due to the ubiquity of neural networks. Today, production deep learning frameworks provide useful abstractions to aid machine learning engineers and systems researchers. However, in exchange they can suffer from compatibility challenges (especially on constrained platforms), inaccessible code complexity, or design choices that otherwise limit research from a systems perspective. This paper presents Orpheus, a new deep learning framework for easy prototyping, deployment and evaluation of inference optimisations. Orpheus features a small codebase, minimal dependencies, and a simple process for integrating other third party systems. We present some preliminary evaluation results.

78. Hardware Implementation of Hyperbolic Tangent Function using Catmull-Rom Spline Interpolation [PDF]
  Mahesh Chandra
Abstract: Deep neural networks yield state-of-the-art results in many computer vision and human-machine interface tasks such as object recognition, speech recognition, etc. Since these networks are computationally expensive, customized accelerators are designed to achieve the required performance at lower cost and power. One of the key building blocks of these neural networks is the non-linear activation function, such as sigmoid, hyperbolic tangent (tanh), and ReLU. A low-complexity, accurate hardware implementation of the activation function is required to meet the performance and area targets of neural network accelerators. This paper presents an implementation of the tanh function using Catmull-Rom spline interpolation. State-of-the-art results are achieved using this method with comparatively smaller logic area.
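
Catmull-Rom interpolation evaluates each segment from four neighbouring table samples. Below is a floating-point reference model of the idea; the table range, step size and saturation points are illustrative assumptions, and a hardware version would use fixed-point arithmetic with a power-of-two step.

```python
import numpy as np

STEP = 0.25
XS = np.arange(-4.5, 4.5 + 1e-9, STEP)   # lookup table with guard samples at both ends
YS = np.tanh(XS)

def tanh_catmull_rom(x: float) -> float:
    """Approximate tanh via Catmull-Rom spline interpolation over a small
    table, saturating to +/-1 outside [-4, 4] where tanh is nearly flat."""
    if x <= -4.0:
        return -1.0
    if x >= 4.0:
        return 1.0
    i = int(np.floor((x - XS[0]) / STEP))      # segment [XS[i], XS[i+1]]
    t = (x - XS[i]) / STEP
    p0, p1, p2, p3 = YS[i - 1], YS[i], YS[i + 1], YS[i + 2]
    return 0.5 * ((2 * p1) + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)
```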

79. Attention-based Graph ResNet for Motor Intent Detection from Raw EEG signals [PDF]
  Shuyue Jia, Yimin Hou, Yan Shi, Yang Li
Abstract: Previous studies on decoding electroencephalography (EEG) signals have not considered the topological relationship of EEG electrodes. However, recent neuroscience has suggested brain network connectivity; thus, the interaction exhibited between EEG channels might not be appropriately measured via Euclidean distance. To fill the gap, an attention-based graph residual network, a novel structure of Graph Convolutional Neural Network (GCN), is presented to detect human motor intents from raw EEG signals, where the topological structure of EEG electrodes is built as a graph. Meanwhile, deep residual learning with a full-attention architecture is introduced to address the degradation problem concerning deeper networks in raw EEG motor imagery (MI) data. Individual variability, the critical and longstanding challenge underlying EEG signals, has been successfully handled with state-of-the-art performance: 98.08% accuracy at the subject level and 94.28% for 20 subjects. Numerical results are promising, indicating that the graph-structured topology is superior for decoding raw EEG data. The innovative deep learning approach is expected to offer a general method for both neuroscience research and real-world EEG-based practical applications, e.g., seizure prediction.

80. Image-driven discriminative and generative machine learning algorithms for establishing microstructure-processing relationships [PDF]
  Wufei Ma, Elizabeth Kautz, Arun Baskaran, Aritra Chowdhury, Vineet Joshi, Bülent Yener, Daniel Lewis
Abstract: We investigate methods of microstructure representation for the purpose of predicting processing condition from microstructure image data. A binary alloy (uranium-molybdenum) that is currently under development as a nuclear fuel was studied for the purpose of developing an improved machine learning approach to image recognition, characterization, and building predictive capabilities linking microstructure to processing conditions. Here, we test different microstructure representations and evaluate model performance based on the F1 score. A F1 score of 95.1% was achieved for distinguishing between micrographs corresponding to ten different thermo-mechanical material processing conditions. We find that our newly developed microstructure representation describes image data well, and the traditional approach of utilizing area fractions of different phases is insufficient for distinguishing between multiple classes using a relatively small, imbalanced original data set of 272 images. To explore the applicability of generative methods for supplementing such limited data sets, generative adversarial networks were trained to generate artificial microstructure images. Two different generative networks were trained and tested to assess performance. Challenges and best practices associated with applying machine learning to limited microstructure image data sets is also discussed. Our work has implications for quantitative microstructure analysis, and development of microstructure-processing relationships in limited data sets typical of metallurgical process design studies.

81. XCAT-GAN for Synthesizing 3D Consistent Labeled Cardiac MR Images on Anatomically Variable XCAT Phantoms [PDF]
  Sina Amirrajab, Samaneh Abbasi-Sureshjani, Yasmina Al Khalil, Cristian Lorenz, Juergen Weese, Josien Pluim, Marcel Breeuwer
Abstract: Generative adversarial networks (GANs) have provided promising data enrichment solutions by synthesizing high-fidelity images. However, generating large sets of labeled images with new anatomical variations remains unexplored. We propose a novel method for synthesizing cardiac magnetic resonance (CMR) images on a population of virtual subjects with a large anatomical variation, introduced using the 4D eXtended Cardiac and Torso (XCAT) computerized human phantom. We investigate two conditional image synthesis approaches grounded on a semantically-consistent mask-guided image generation technique: 4-class and 8-class XCAT-GANs. The 4-class technique relies on only the annotations of the heart; while the 8-class technique employs a predicted multi-tissue label map of the heart-surrounding organs and provides better guidance for our conditional image synthesis. For both techniques, we train our conditional XCAT-GAN with real images paired with corresponding labels and subsequently at the inference time, we substitute the labels with the XCAT derived ones. Therefore, the trained network accurately transfers the tissue-specific textures to the new label maps. By creating 33 virtual subjects of synthetic CMR images at the end-diastolic and end-systolic phases, we evaluate the usefulness of such data in the downstream cardiac cavity segmentation task under different augmentation strategies. Results demonstrate that even with only 20% of real images (40 volumes) seen during training, segmentation performance is retained with the addition of synthetic CMR images. Moreover, the improvement in utilizing synthetic images for augmenting the real data is evident through the reduction of Hausdorff distance up to 28% and an increase in the Dice score up to 5%, indicating a higher similarity to the ground truth in all dimensions.

82. ALF: Autoencoder-based Low-rank Filter-sharing for Efficient Convolutional Neural Networks [PDF]
  Alexander Frickenstein, Manoj-Rohit Vemparala, Nael Fasfous, Laura Hauenschild, Naveen-Shankar Nagaraja, Christian Unger, Walter Stechele
Abstract: Closing the gap between the hardware requirements of state-of-the-art convolutional neural networks and the limited resources constraining embedded applications is the next big challenge in deep learning research. The computational complexity and memory footprint of such neural networks are typically daunting for deployment in resource constrained environments. Model compression techniques, such as pruning, are emphasized among other optimization methods for solving this problem. Most existing techniques require domain expertise or result in irregular sparse representations, which increase the burden of deploying deep learning applications on embedded hardware accelerators. In this paper, we propose the autoencoder-based low-rank filter-sharing technique technique (ALF). When applied to various networks, ALF is compared to state-of-the-art pruning methods, demonstrating its efficient compression capabilities on theoretical metrics as well as on an accurate, deterministic hardware-model. In our experiments, ALF showed a reduction of 70\% in network parameters, 61\% in operations and 41\% in execution time, with minimal loss in accuracy.

83. Dual Distribution Alignment Network for Generalizable Person Re-Identification [PDF]
  Peixian Chen, Pingyang Dai, Jianzhuang Liu, Feng Zheng, Qi Tian, Rongrong Ji
Abstract: Domain generalization (DG) serves as a promising solution to handle person Re-Identification (Re-ID), which trains the model using labels from the source domain alone, and then directly adopts the trained model to the target domain without model updating. However, existing DG approaches are usually disturbed by serious domain variations due to significant dataset variations. Subsequently, DG highly relies on designing domain-invariant features, which is however not well exploited, since most existing approaches directly mix multiple datasets to train DG based models without considering the local dataset similarities, i.e., examples that are very similar but from different domains. In this paper, we present a Dual Distribution Alignment Network (DDAN), which handles this challenge by mapping images into a domain-invariant feature space by selectively aligning distributions of multiple source domains. Such an alignment is conducted by dual-level constraints, i.e., the domain-wise adversarial feature learning and the identity-wise similarity enhancement. We evaluate our DDAN on a large-scale Domain Generalization Re-ID (DG Re-ID) benchmark. Quantitative results demonstrate that the proposed DDAN can well align the distributions of various source domains, and significantly outperforms all existing domain generalization approaches.

84. Uniformizing Techniques to Process CT scans with 3D CNNs for Tuberculosis Prediction [PDF]
  Hasib Zunair, Aimon Rahman, Nabeel Mohammed, Joseph Paul Cohen
Abstract: A common approach to medical image analysis on volumetric data uses deep 2D convolutional neural networks (CNNs). This is largely attributed to the challenges imposed by the nature of the 3D data: variable volume sizes and GPU memory exhaustion during optimization. However, dealing with the individual slices independently in 2D CNNs deliberately discards the depth information, which results in poor performance for the intended task. Therefore, it is important to develop methods that not only overcome the heavy memory and computation requirements but also leverage the 3D information. To this end, we evaluate a set of volume uniformizing methods to address the aforementioned issues. The first method involves sampling information evenly from a subset of the volume. Another method exploits the full geometry of the 3D volume by interpolating over the z-axis. We demonstrate performance improvements using controlled ablation studies as well as put this approach to the test on the ImageCLEF Tuberculosis Severity Assessment 2019 benchmark. We report 73% area under curve (AUC) and binary classification accuracy (ACC) of 67.5% on the test set, beating all methods which leveraged only image information (without using clinical meta-data) and achieving 5th position overall. All codes and models are made available at this https URL.
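
Both uniformizing strategies are a few lines each. The sketch below shows z-axis interpolation (the function name is hypothetical), with even subset sampling noted in the comment as the alternative.

```python
import numpy as np
from scipy import ndimage

def uniformize_depth(volume: np.ndarray, target_slices: int) -> np.ndarray:
    """Resample a CT volume of shape (depth, H, W) to a fixed slice count by
    linear interpolation along the z-axis only. Even subset sampling would
    instead be: volume[np.linspace(0, depth - 1, target_slices).astype(int)]."""
    factor = target_slices / volume.shape[0]
    return ndimage.zoom(volume, zoom=(factor, 1.0, 1.0), order=1)
```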

85. UIAI System for Short-Duration Speaker Verification Challenge 2020 [PDF]
  Md Sahidullah, Achintya Kumar Sarkar, Ville Vestman, Xuechen Liu, Romain Serizel, Tomi Kinnunen, Zheng-Hua Tan, Emmanuel Vincent
Abstract: In this work, we present the system description of the UIAI entry for the short-duration speaker verification (SdSV) challenge 2020. Our focus is on Task 1 dedicated to text-dependent speaker verification. We investigate different feature extraction and modeling approaches for automatic speaker verification (ASV) and utterance verification (UV). We have also studied different fusion strategies for combining UV and ASV modules. Our primary submission to the challenge is the fusion of seven subsystems which yields a normalized minimum detection cost function (minDCF) of 0.072 and an equal error rate (EER) of 2.14% on the evaluation set. The single system consisting of a pass-phrase identification based model with phone-discriminative bottleneck features gives a normalized minDCF of 0.118 and achieves 19% relative improvement over the state-of-the-art challenge baseline.

86. Regularized Flexible Activation Function Combinations for Deep Neural Networks [PDF]
  Renlong Jie, Junbin Gao, Andrey Vasnev, Min-ngoc Tran
Abstract: Activation in deep neural networks is fundamental to achieving non-linear mappings. Traditional studies mainly focus on finding fixed activations for a particular set of learning tasks or model architectures. Research on flexible activation is quite limited in both design philosophy and application scenarios. In this study, three principles for choosing flexible activation components are proposed, and a general combined form of flexible activation functions is implemented. Based on this, a novel family of flexible activation functions that can replace sigmoid or tanh in LSTM cells is implemented, as well as a new family combining ReLU and ELUs. Also, two new regularisation terms based on assumptions encoding prior knowledge are introduced. It has been shown that LSTM models with the proposed flexible activations P-Sig-Ramp provide significant improvements in time series forecasting, while the proposed P-E2-ReLU achieves better and more stable performance on lossy image compression tasks with convolutional auto-encoders. In addition, the proposed regularization terms improve the convergence, performance and stability of the models with flexible activation functions.
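
One reading of the combined form is a learnable mixture of fixed activations with a regularizer pulling the mixing weights toward a prior. The sketch below combines ReLU and ELU under that reading; the class, the uniform prior and the penalty are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlexibleActivation(nn.Module):
    """Learnable softmax-weighted combination of ReLU and ELU."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))   # mixing weights (softmaxed)

    def forward(self, x):
        w = torch.softmax(self.logits, dim=0)
        return w[0] * F.relu(x) + w[1] * F.elu(x)

    def regularizer(self):
        # Example prior-knowledge penalty: keep the mixture close to uniform.
        w = torch.softmax(self.logits, dim=0)
        return ((w - 0.5) ** 2).sum()
```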

87. MACU-Net Semantic Segmentation from High-Resolution Remote Sensing Images [PDF]
  Rui Li, Chenxi Duan, Shunyi Zheng
Abstract: Semantic segmentation of remote sensing images plays an important role in land resource management, yield estimation, and economic assessment. U-Net is a sophisticated encoder-decoder architecture that has been frequently used in medical image segmentation and has attained prominent performance. An asymmetric convolution block can strengthen square convolution kernels using parallel asymmetric convolutions. In this paper, based on U-Net and the asymmetric convolution block, we incorporate multi-scale features generated by different layers of U-Net and design a multi-scale skip-connected architecture, MACU-Net, for semantic segmentation of high-resolution remote sensing images. Our design has the following advantages: (1) the multi-scale skip connections combine and realign semantic features contained in both low-level and high-level feature maps at different scales; (2) the asymmetric convolution block strengthens the representational capacity of a standard convolution layer. Experiments conducted on two remote sensing image datasets captured by separate satellites demonstrate that the performance of our MACU-Net exceeds that of U-Net, SegNet, DeepLab V3+, and other baseline algorithms.
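
The asymmetric convolution idea (as in ACNet) sums a square k x k branch with 1 x k and k x 1 branches. The sketch below is a generic instance of that block, not necessarily MACU-Net's exact variant.

```python
import torch.nn as nn

class AsymmetricConvBlock(nn.Module):
    """Strengthens a square k x k convolution with parallel 1 x k and k x 1
    branches whose outputs are summed elementwise."""
    def __init__(self, cin: int, cout: int, k: int = 3):
        super().__init__()
        p = k // 2
        self.square = nn.Conv2d(cin, cout, (k, k), padding=(p, p))
        self.hor = nn.Conv2d(cin, cout, (1, k), padding=(0, p))
        self.ver = nn.Conv2d(cin, cout, (k, 1), padding=(p, 0))

    def forward(self, x):
        return self.square(x) + self.hor(x) + self.ver(x)
```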

88. A Preliminary Exploration into an Alternative CellLineNet: An Evolutionary Approach [PDF]
  Akwarandu Ugo Nwachuku, Xavier Lewis-Palmer, Darlington Ahiale Akogo
Abstract: This paper presents the exploration of an evolutionary approach to an alternative CellLineNet: a convolutional neural network adept at classifying epithelial breast cancer cell lines. The evolutionary algorithm introduces control variables that guide the search for architectures in a search space of inverted residual blocks, bottleneck blocks, residual blocks and a basic 2x2 convolutional block. The promise of EvoCELL is predicting which combination or arrangement of feature-extracting blocks produces the best model architecture for a given task. We show how the fittest model performs after each generation of evolution. The final evolved model, CellLineNet V2, classifies 5 types of epithelial breast cell lines consisting of two human cancer lines, 2 normal immortalized lines, and 1 immortalized mouse line (MDA-MB-468, MCF7, 10A, 12A and HC11). The multiclass cell line classification convolutional neural network extends our earlier work on a binary breast cancer cell line classification model. This paper presents an ongoing exploratory approach to neural network architecture design for further study.

89. Tighter risk certificates for neural networks [PDF]
  María Pérez-Ortiz, Omar Rivasplata, John Shawe-Taylor, Csaba Szepesvári
Abstract: This paper presents empirical studies regarding training probabilistic neural networks using training objectives derived from PAC-Bayes bounds. In the context of probabilistic neural networks, the output of training is a probability distribution over network weights. We present two training objectives, used here for the first time in connection with training neural networks. These two training objectives are derived from tight PAC-Bayes bounds, one of which is new. We also re-implement a previously used training objective based on a classical PAC-Bayes bound, to compare the properties of the predictors learned using the different training objectives. We compute risk certificates that are valid on any unseen examples for the learnt predictors. We further experiment with different types of priors on the weights (both data-free and data-dependent priors) and neural network architectures. Our experiments on MNIST and CIFAR-10 show that our training methods produce competitive test set errors and non-vacuous risk bounds with much tighter values than previous results in the literature, showing promise not only to guide the learning algorithm through bounding the risk but also for model selection. These observations suggest that the methods studied here might be good candidates for self-bounding learning.
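
Risk certificates of this kind typically come from PAC-Bayes bounds of the kl form: with probability at least 1 - delta, kl(empirical risk || true risk) <= (KL(Q||P) + ln(2*sqrt(n)/delta)) / n. The sketch below numerically inverts such a bound by bisection; it illustrates the classical bound, not necessarily the paper's exact objectives.

```python
import math

def kl_bernoulli(q: float, p: float) -> float:
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    return q * math.log((q + eps) / p) + (1 - q) * math.log((1 - q + eps) / (1 - p))

def risk_certificate(emp_risk: float, kl_qp: float, n: int, delta: float = 0.05) -> float:
    """Largest p with kl(emp_risk || p) <= rhs; since kl is increasing in p
    for p >= emp_risk, bisection gives the certified upper bound on the risk."""
    rhs = (kl_qp + math.log(2 * math.sqrt(n) / delta)) / n
    lo, hi = emp_risk, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if kl_bernoulli(emp_risk, mid) > rhs:
            hi = mid
        else:
            lo = mid
    return hi
```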

90. CNN Detection of GAN-Generated Face Images based on Cross-Band Co-occurrences Analysis [PDF]
  Mauro Barni, Kassem Kallas, Ehsan Nowroozi, Benedetta Tondi
Abstract: Last-generation GAN models can generate synthetic images that are visually indistinguishable from natural ones, raising the need to develop tools to distinguish fake and natural images, thus contributing to preserving the trustworthiness of digital images. While modern GAN models can generate very high-quality images with no visible spatial artifacts, reconstruction of consistent relationships among colour channels is expected to be more difficult. In this paper, we propose a method for distinguishing GAN-generated images from natural ones by exploiting inconsistencies among spectral bands, with specific focus on the generation of synthetic face images. Specifically, we use cross-band co-occurrence matrices, in addition to spatial co-occurrence matrices, as input to a CNN model, which is trained to distinguish between real and synthetic faces. The results of our experiments confirm the effectiveness of our approach, which outperforms a similar detection technique based on intra-band spatial co-occurrences only. The performance gain is particularly significant with regard to robustness against post-processing, such as geometric transformations, filtering and contrast manipulations.
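
A cross-band co-occurrence matrix counts joint value pairs between two colour channels at a fixed spatial offset. The sketch below is a plain NumPy illustration of that feature (bands assumed to be uint8 arrays of equal shape; the offset and function name are illustrative).

```python
import numpy as np

def cross_band_cooccurrence(band_a: np.ndarray, band_b: np.ndarray,
                            levels: int = 256, offset=(0, 1)) -> np.ndarray:
    """Count how often value i in band A co-occurs with value j in band B at
    the given (dy, dx) spatial offset; returns a (levels, levels) matrix."""
    dy, dx = offset
    a = band_a[:band_a.shape[0] - dy, :band_a.shape[1] - dx].ravel()
    b = band_b[dy:, dx:].ravel()
    mat = np.zeros((levels, levels), dtype=np.int64)
    np.add.at(mat, (a, b), 1)   # unbuffered scatter-add of joint counts
    return mat
```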

91. 3D Neural Network for Lung Cancer Risk Prediction on CT Volumes [PDF]
  Daniel Korat
Abstract: With an estimated 160,000 deaths in 2018, lung cancer is the most common cause of cancer death in the United States. Lung cancer CT screening has been shown to reduce mortality by up to 40% and is now included in US screening guidelines. Reducing the high error rates in lung cancer screening is imperative because of the high clinical and financial costs caused by diagnosis mistakes. Despite the use of standards for radiological diagnosis, persistent inter-grader variability and incomplete characterization of comprehensive imaging findings remain as limitations of current methods. These limitations suggest opportunities for more sophisticated systems to improve performance and inter-reader consistency. In this report, we reproduce a state-of-the-art deep learning algorithm for lung cancer risk prediction. Our model predicts malignancy probability and risk bucket classification from lung CT studies. This allows for risk categorization of patients being screened and suggests the most appropriate surveillance and management. Combining our solution high accuracy, consistency and fully automated nature, our approach may enable highly efficient screening procedures and accelerate the adoption of lung cancer screening.

92. Modal Uncertainty Estimation via Discrete Latent Representation [PDF]
  Di Qiu, Lok Ming Lui
Abstract: Many important problems in the real world don't have unique solutions. It is thus important for machine learning models to be capable of proposing different plausible solutions with meaningful probability measures. In this work we introduce such a deep learning framework that learns the one-to-many mappings between the inputs and outputs, together with faithful uncertainty measures. We call our framework modal uncertainty estimation, since we model the one-to-many mappings to be generated through a set of discrete latent variables, each representing a latent mode hypothesis that explains the corresponding type of input-output relationship. The discrete nature of the latent representations thus allows us to estimate for any input the conditional probability distribution of the outputs very effectively. Both the discrete latent space and its uncertainty estimation are jointly learned during training. We motivate our use of the discrete latent space through the multi-modal posterior collapse problem in current conditional generative models, then develop the theoretical background, and extensively validate our method on both synthetic and realistic tasks. Our framework demonstrates significantly more accurate uncertainty estimation than the current state-of-the-art methods, and is informative and convenient for practical use.

93. Joint Featurewise Weighting and Local Structure Learning for Multi-view Subspace Clustering [PDF]
  Shi-Xun Lin, Guo Zhong, Ting Shu
Abstract: Multi-view clustering integrates multiple feature sets, which reveal distinct aspects of the data and provide complementary information to each other, to improve the clustering performance. It remains challenging to effectively exploit complementary information across multiple views since the original data often contain noise and are highly redundant. Moreover, most existing multi-view clustering methods only aim to explore the consistency of all views while ignoring the local structure of each view. However, it is necessary to take the local structure of each view into consideration, because different views would present different geometric structures while admitting the same cluster structure. To address the above issues, we propose a novel multi-view subspace clustering method via simultaneously assigning weights for different features and capturing local information of data in view-specific self-representation feature spaces. Especially, a common cluster structure regularization is adopted to guarantee consistency among different views. An efficient algorithm based on an augmented Lagrangian multiplier is also developed to solve the associated optimization problem. Experiments conducted on several benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance. We provide the Matlab code on this https URL.

94. All-Optical Information Processing Capacity of Diffractive Surfaces [PDF]
  Onur Kulce, Deniz Mengu, Yair Rivenson, Aydogan Ozcan
Abstract: Precise engineering of materials and surfaces has been at the heart of some of the recent advances in optics and photonics. These advances around the engineering of materials with new functionalities have also opened up exciting avenues for designing trainable surfaces that can perform computation and machine learning tasks through light-matter interaction and diffraction. Here, we analyze the information processing capacity of coherent optical networks formed by diffractive surfaces that are trained to perform an all-optical computational task between a given input and output field-of-view. We prove that the dimensionality of the all-optical solution space covering the complex-valued transformations between the input and output fields-of-view is linearly proportional to the number of diffractive surfaces within the optical network, up to a limit that is dictated by the extent of the input and output fields-of-view. Deeper diffractive networks that are composed of larger numbers of trainable surfaces can cover a higher dimensional subspace of the complex-valued linear transformations between a larger input field-of-view and a larger output field-of-view, and exhibit depth advantages in terms of their statistical inference, learning and generalization capabilities for different image classification tasks, when compared with a single trainable diffractive surface. These analyses and conclusions are broadly applicable to various forms of diffractive surfaces, including e.g., plasmonic and/or dielectric-based metasurfaces and flat optics that can be used to form all-optical processors.

95. Selection of Proper EEG Channels for Subject Intention Classification Using Deep Learning [PDF]
  Ghazale Ghorbanzade, Zahra Nabizadeh-ShahreBabak, Shadrokh Samavi, Nader Karimi, Ali Emami, Pejman Khadivi
Abstract: Brain signals could be used to control devices to assist individuals with disabilities. Signals such as electroencephalograms are complicated and hard to interpret. A set of signals are collected and should be classified to identify the intention of the subject. Different approaches have tried to reduce the number of channels before sending them to a classifier. We are proposing a deep learning-based method for selecting an informative subset of channels that produce high classification accuracy. The proposed network could be trained for an individual subject for the selection of an appropriate set of channels. Reduction of the number of channels could reduce the complexity of brain-computer-interface devices. Our method could find a subset of channels. The accuracy of our approach is comparable with a model trained on all channels. Hence, our model's temporal and power costs are low, while its accuracy is kept high.