
[arXiv Papers] Computer Vision and Pattern Recognition 2020-06-16

Contents

1. Coherent Reconstruction of Multiple Humans from a Single Image [PDF] Abstract
2. Now that I can see, I can improve: Enabling data-driven finetuning of CNNs on the edge [PDF] Abstract
3. Visibility Guided NMS: Efficient Boosting of Amodal Object Detection in Crowded Traffic Scenes [PDF] Abstract
4. Go-CaRD -- Generic, Optical Car Part Recognition and Detection: Collection, Insights, and Applications [PDF] Abstract
5. Towards Incorporating Contextual Knowledge into the Prediction of Driving Behavior [PDF] Abstract
6. 3D-ZeF: A 3D Zebrafish Tracking Benchmark Dataset [PDF] Abstract
7. SD-RSIC: Summarization Driven Deep Remote Sensing Image Captioning [PDF] Abstract
8. Pixel Invisibility: Detecting Objects Invisible in Color Images [PDF] Abstract
9. Generating Master Faces for Use in Performing Wolf Attacks on Face Recognition Systems [PDF] Abstract
10. Tamil Vowel Recognition With Augmented MNIST-like Data Set [PDF] Abstract
11. CoDeNet: Algorithm-hardware Co-design for Deformable Convolution [PDF] Abstract
12. ORD: Object Relationship Discovery for Visual Dialogue Generation [PDF] Abstract
13. On the Preservation of Spatio-temporal Information in Machine Learning Applications [PDF] Abstract
14. Mitigating Gender Bias in Captioning Systems [PDF] Abstract
15. Deep-CAPTCHA: a deep learning based CAPTCHA solver for vulnerability assessment [PDF] Abstract
16. AMENet: Attentive Maps Encoder Network for Trajectory Prediction [PDF] Abstract
17. Dermatologist vs Neural Network [PDF] Abstract
18. Learn to cycle: Time-consistent feature discovery for action recognition [PDF] Abstract
19. AutoGAN-Distiller: Searching to Compress Generative Adversarial Networks [PDF] Abstract
20. Infinite Feature Selection: A Graph-based Feature Filtering Approach [PDF] Abstract
21. Binary DAD-Net: Binarized Driveable Area Detection Network for Autonomous Driving [PDF] Abstract
22. Neural gradients are lognormally distributed: understanding sparse and quantized training [PDF] Abstract
23. Filter design for small target detection on infrared imagery using normalized-cross-correlation layer [PDF] Abstract
24. Survey on Deep Multi-modal Data Analytics: Collaboration, Rivalry and Fusion [PDF] Abstract
25. Classifying degraded images over various levels of degradation [PDF] Abstract
26. Anomalous Motion Detection on Highway Using Deep Learning [PDF] Abstract
27. Geo-PIFu: Geometry and Pixel Aligned Implicit Functions for Single-view Human Reconstruction [PDF] Abstract
28. Multiple Video Frame Interpolation via Enhanced Deformable Separable Convolution [PDF] Abstract
29. RasterNet: Modeling Free-Flow Speed using LiDAR and Overhead Imagery [PDF] Abstract
30. BatVision with GCC-PHAT Features for Better Sound to Vision Predictions [PDF] Abstract
31. Road Mapping in Low Data Environments with OpenStreetMap [PDF] Abstract
32. Emergent Properties of Foveated Perceptual Systems [PDF] Abstract
33. GradAug: A New Regularization Method for Deep Neural Networks [PDF] Abstract
34. ShapeFlow: Learnable Deformations Among 3D Shapes [PDF] Abstract
35. Geodesic-HOF: 3D Reconstruction Without Cutting Corners [PDF] Abstract
36. Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization [PDF] Abstract
37. Meta Approach to Data Augmentation Optimization [PDF] Abstract
38. Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [PDF] Abstract
39. Optical Music Recognition: State of the Art and Major Challenges [PDF] Abstract
40. FenceMask: A Data Augmentation Approach for Pre-extracted Image Features [PDF] Abstract
41. Explicitly Modeled Attention Maps for Image Classification [PDF] Abstract
42. Cityscapes 3D: Dataset and Benchmark for 9 DoF Vehicle Detection [PDF] Abstract
43. An adversarial learning algorithm for mitigating gender bias in face recognition [PDF] Abstract
44. A Generalized Asymmetric Dual-front Model for Active Contours and Image Segmentation [PDF] Abstract
45. Multi-Miner: Object-Adaptive Region Mining for Weakly-Supervised Semantic Segmentation [PDF] Abstract
46. On Saliency Maps and Adversarial Robustness [PDF] Abstract
47. PCAAE: Principal Component Analysis Autoencoder for organising the latent space of generative networks [PDF] Abstract
48. Few-shot Object Detection on Remote Sensing Images [PDF] Abstract
49. Working with scale: 2nd place solution to Product Detection in Densely Packed Scenes [Technical Report] [PDF] Abstract
50. Adaptively Meshed Video Stabilization [PDF] Abstract
51. Alternating ConvLSTM: Learning Force Propagation with Alternate State Updates [PDF] Abstract
52. 2D Image Relighting with Image-to-Image Translation [PDF] Abstract
53. Disentanglement for Discriminative Visual Recognition [PDF] Abstract
54. ReLGAN: Generalization of Consistency for GAN with Disjoint Constraints and Relative Learning of Generative Processes for Multiple Transformation Learning [PDF] Abstract
55. Relative Pose Estimation for Stereo Rolling Shutter Cameras [PDF] Abstract
56. Geometry-Aware Instance Segmentation with Disparity Maps [PDF] Abstract
57. Hyper RPCA: Joint Maximum Correntropy Criterion and Laplacian Scale Mixture Modeling On-the-Fly for Moving Object Detection [PDF] Abstract
58. Generative 3D Part Assembly via Dynamic Graph Learning [PDF] Abstract
59. PrimA6D: Rotational Primitive Reconstruction for Enhanced and Robust 6D Pose Estimation [PDF] Abstract
60. Cascaded deep monocular 3D human pose estimation with evolutionary training data [PDF] Abstract
61. Domain Adaptation and Image Classification via Deep Conditional Adaptation Network [PDF] Abstract
62. Recurrent Distillation based Crowd Counting [PDF] Abstract
63. 3D Reconstruction of Novel Object Shapes from Single Images [PDF] Abstract
64. Exploiting the ConvLSTM: Human Action Recognition using Raw Depth Video-Based Recurrent Neural Networks [PDF] Abstract
65. 3DFCNN: Real-Time Action Recognition using 3D Deep Neural Networks with Raw Depth Information [PDF] Abstract
66. Split-Merge Pooling [PDF] Abstract
67. V2E: From video frames to realistic DVS event camera streams [PDF] Abstract
68. Sensorless Freehand 3D Ultrasound Reconstruction via Deep Contextual Learning [PDF] Abstract
69. Uncertainty-aware Score Distribution Learning for Action Quality Assessment [PDF] Abstract
70. Convolutional Generation of Textured 3D Meshes [PDF] Abstract
71. DeepRhythm: Exposing DeepFakes with Attentional Visual Heartbeat Rhythms [PDF] Abstract
72. Equivariant Neural Rendering [PDF] Abstract
73. DTG-Net: Differentiated Teachers Guided Self-Supervised Video Action Recognition [PDF] Abstract
74. HRDNet: High-resolution Detection Network for Small Objects [PDF] Abstract
75. Faces à la Carte: Text-to-Face Generation via Attribute Disentanglement [PDF] Abstract
76. Dynamic gesture retrieval: searching videos by human pose sequence [PDF] Abstract
77. NoPeopleAllowed: The Three-Step Approach to Weakly Supervised Semantic Segmentation [PDF] Abstract
78. Attribute-aware Identity-hard Triplet Loss for Video-based Person Re-identification [PDF] Abstract
79. Semantic-driven Colorization [PDF] Abstract
80. Learning from the Scene and Borrowing from the Rich: Tackling the Long Tail in Scene Graph Generation [PDF] Abstract
81. Mitigating Face Recognition Bias via Group Adaptive Classifier [PDF] Abstract
82. Unbiased Auxiliary Classifier GANs with MINE [PDF] Abstract
83. Accurate Anchor Free Tracking [PDF] Abstract
84. GAN Memory with No Forgetting [PDF] Abstract
85. FakePolisher: Making DeepFakes More Detection-Evasive by Shallow Reconstruction [PDF] Abstract
86. CBR-Net: Cascade Boundary Refinement Network for Action Detection: Submission to ActivityNet Challenge 2020 (Task 1) [PDF] Abstract
87. Self-Supervised Discovery of Anatomical Shape Landmarks [PDF] Abstract
88. Temporal Fusion Network for Temporal Action Localization: Submission to ActivityNet Challenge 2020 (Task E) [PDF] Abstract
89. Weakly-supervised Any-shot Object Detection [PDF] Abstract
90. Multi-Modal Fingerprint Presentation Attack Detection: Evaluation On A New Dataset [PDF] Abstract
91. OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold [PDF] Abstract
92. Multispectral Biometrics System Framework: Application to Presentation Attack Detection [PDF] Abstract
93. Early Blindness Detection Based on Retinal Images Using Ensemble Learning [PDF] Abstract
94. Learning-to-Learn Personalised Human Activity Recognition Models [PDF] Abstract
95. Defending against GAN-based Deepfake Attacks via Transformation-aware Adversarial Faces [PDF] Abstract
96. The DeepFake Detection Challenge Dataset [PDF] Abstract
97. Spectral DiffuserCam: lensless snapshot hyperspectral imaging with a spectral filter array [PDF] Abstract
98. Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction [PDF] Abstract
99. Efficient Black-Box Adversarial Attack Guided by the Distribution of Adversarial Perturbations [PDF] Abstract
100. Improved Conditional Flow Models for Molecule to Image Synthesis [PDF] Abstract
101. The Limit of the Batch Size [PDF] Abstract
102. APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [PDF] Abstract
103. Deep learning mediated single time-point image-based prediction of embryo developmental outcome at the cleavage stage [PDF] Abstract
104. A Dataset and Benchmarks for Multimedia Social Analysis [PDF] Abstract
105. Differentiable Neural Architecture Transformation for Reproducible Architecture Improvement [PDF] Abstract
106. Slowing Down the Weight Norm Increase in Momentum-based Optimizers [PDF] Abstract
107. Dissimilarity Mixture Autoencoder for Deep Clustering [PDF] Abstract
108. Emotion Recognition in Audio and Video Using Deep Neural Networks [PDF] Abstract
109. Generalized Adversarially Learned Inference [PDF] Abstract
110. CompressNet: Generative Compression at Extremely Low Bitrates [PDF] Abstract
111. Leveraging Multimodal Behavioral Analytics for Automated Job Interview Performance Assessment and Feedback [PDF] Abstract
112. Continual General Chunking Problem and SyncMap [PDF] Abstract
113. Structural Autoencoders Improve Representations for Generation and Transfer [PDF] Abstract
114. DeeperGCN: All You Need to Train Deeper GCNs [PDF] Abstract
115. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning [PDF] Abstract
116. RoadNet-RT: High Throughput CNN Architecture and SoC Design for Real-Time Road Segmentation [PDF] Abstract
117. Adversarial Self-Supervised Contrastive Learning [PDF] Abstract
118. Sparse Separable Nonnegative Matrix Factorization [PDF] Abstract
119. Rethinking the Value of Labels for Improving Class-Imbalanced Learning [PDF] Abstract
120. TURB-Rot. A large database of 3d and 2d snapshots from turbulent rotating flows [PDF] Abstract
121. BI-MAML: Balanced Incremental Approach for Meta Learning [PDF] Abstract

Abstracts

1. Coherent Reconstruction of Multiple Humans from a Single Image [PDF] Back to Contents
  Wen Jiang, Nikos Kolotouros, Georgios Pavlakos, Xiaowei Zhou, Kostas Daniilidis
Abstract: In this work, we address the problem of multi-person 3D pose estimation from a single image. A typical regression approach in the top-down setting of this problem would first detect all humans and then reconstruct each one of them independently. However, this type of prediction suffers from incoherent results, e.g., interpenetration and inconsistent depth ordering between the people in the scene. Our goal is to train a single network that learns to avoid these problems and generate a coherent 3D reconstruction of all the humans in the scene. To this end, a key design choice is the incorporation of the SMPL parametric body model in our top-down framework, which enables the use of two novel losses. First, a distance field-based collision loss penalizes interpenetration among the reconstructed people. Second, a depth ordering-aware loss reasons about occlusions and promotes a depth ordering of people that leads to a rendering which is consistent with the annotated instance segmentation. This provides depth supervision signals to the network, even if the image has no explicit 3D annotations. The experiments show that our approach outperforms previous methods on standard 3D pose benchmarks, while our proposed losses enable more coherent reconstruction in natural images. The project website with videos, results, and code can be found at: this https URL

2. Now that I can see, I can improve: Enabling data-driven finetuning of CNNs on the edge [PDF] Back to Contents
  Aditya Rajagopal, Christos-Savvas Bouganis
Abstract: In today's world, a vast amount of data is being generated by edge devices that can be used as valuable training data to improve the performance of machine learning algorithms in terms of the achieved accuracy or to reduce the compute requirements of the model. However, due to user data privacy concerns as well as storage and communication bandwidth limitations, this data cannot be moved from the device to the data centre for further improvement of the model and subsequent deployment. As such there is a need for increased edge intelligence, where the deployed models can be fine-tuned on the edge, leading to improved accuracy and/or reducing the model's workload as well as its memory and power footprint. In the case of Convolutional Neural Networks (CNNs), both the weights of the network as well as its topology can be tuned to adapt to the data that it processes. This paper provides a first step towards enabling CNN finetuning on an edge device based on structured pruning. It explores the performance gains and costs of doing so and presents an extensible open-source framework that allows the deployment of such approaches on a wide range of network architectures and devices. The results show that on average, data-aware pruning with retraining can provide 10.2pp increased accuracy over a wide range of subsets, networks and pruning levels with a maximum improvement of 42.0pp over pruning and retraining in a manner agnostic to the data being processed by the network.
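As context for the structured-pruning step the abstract describes, the following is a minimal PyTorch sketch of magnitude-based channel pruning; it is a generic illustration under assumed layer sizes and keep ratio, not the paper's exact procedure.

```python
# Minimal sketch of magnitude-based structured (channel) pruning.
# Layer sizes and the keep ratio are illustrative assumptions.
import torch
import torch.nn as nn

def prune_channels(conv: nn.Conv2d, keep_ratio: float = 0.5) -> torch.Tensor:
    """Zero out the output channels of `conv` with the smallest L1 norms."""
    with torch.no_grad():
        norms = conv.weight.abs().sum(dim=(1, 2, 3))   # one L1 norm per filter
        n_keep = max(1, int(keep_ratio * norms.numel()))
        mask = torch.zeros_like(norms, dtype=torch.bool)
        mask[torch.topk(norms, n_keep).indices] = True
        conv.weight[~mask] = 0.0                        # structured: whole filters
        if conv.bias is not None:
            conv.bias[~mask] = 0.0
    return mask  # reusable to keep pruned filters at zero while finetuning

conv = nn.Conv2d(16, 32, 3)
mask = prune_channels(conv, keep_ratio=0.5)
print(mask.sum().item())  # 16 of the 32 filters kept
```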

3. Visibility Guided NMS: Efficient Boosting of Amodal Object Detection in Crowded Traffic Scenes [PDF] Back to Contents
  Nils Gählert, Niklas Hanselmann, Uwe Franke, Joachim Denzler
Abstract: Object detection is an important task in environment perception for autonomous driving. Modern 2D object detection frameworks such as Yolo, SSD or Faster R-CNN predict multiple bounding boxes per object that are refined using Non-Maximum-Suppression (NMS) to suppress all but one bounding box. While object detection itself is fully end-to-end learnable and does not require any manual parameter selection, standard NMS is parametrized by an overlap threshold that has to be chosen by hand. In practice, this often leads to an inability of standard NMS strategies to distinguish different objects in crowded scenes in the presence of high mutual occlusion, e.g. for parked cars or crowds of pedestrians. Our novel Visibility Guided NMS (vg-NMS) leverages both pixel-based as well as amodal object detection paradigms and improves the detection performance especially for highly occluded objects with little computational overhead. We evaluate vg-NMS using KITTI, VIPER as well as the Synscapes dataset and show that it outperforms current state-of-the-art NMS.
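For reference, the sketch below shows the standard greedy IoU-based NMS that vg-NMS builds on; the visibility-guided suppression rule itself is not reproduced here, and all values are illustrative.

```python
# Minimal NumPy sketch of standard greedy NMS, the baseline that vg-NMS
# modifies with per-box visibility estimates.
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float = 0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the top box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thr]   # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # keeps boxes 0 and 2; the near-duplicate box 1 is suppressed
```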

4. Go-CaRD -- Generic, Optical Car Part Recognition and Detection: Collection, Insights, and Applications [PDF] Back to Contents
  Lukas Stappen, Xinchen Du, Vincent Karas, Stefan Müller, Björn W. Schuller
Abstract: Systems for the automatic recognition and detection of automotive parts are crucial in several emerging research areas in the development of intelligent vehicles. They enable, for example, the detection and modelling of interactions between humans and the vehicle. In this paper, we present three suitable datasets as well as quantitatively and qualitatively explore the efficacy of state-of-the-art deep learning architectures for the localisation of 29 interior and exterior vehicle regions, independent of brand, model, and environment. A ResNet50 model achieved an F1 score of 93.67 % for recognition, while our best Darknet model achieved an mAP of 58.20 % for detection. We also experiment with joint and transfer learning approaches and point out potential applications of our systems.

5. Towards Incorporating Contextual Knowledge into the Prediction of Driving Behavior [PDF] Back to Contents
  Florian Wirthmüller, Julian Schlechtriemen, Jochen Hipp, Manfred Reichert
Abstract: Predicting the behavior of surrounding traffic participants is crucial for advanced driver assistance systems and autonomous driving. Most researchers however do not consider contextual knowledge when predicting vehicle motion. Extending former studies, we investigate how predictions are affected by external conditions. To do so, we categorize different kinds of contextual information and provide a carefully chosen definition as well as examples for external conditions. More precisely, we investigate how a state-of-the-art approach for lateral motion prediction is influenced by one selected external condition, namely the traffic density. Our investigations demonstrate that this kind of information is highly relevant in order to improve the performance of prediction algorithms. Therefore, this study constitutes the first step towards the integration of such information into automated vehicles. Moreover, our motion prediction approach is evaluated based on the public highD data set showing a maneuver prediction performance with areas under the ROC curve above 97% and a median lateral prediction error of only 0.18m on a prediction horizon of 5s.

6. 3D-ZeF: A 3D Zebrafish Tracking Benchmark Dataset [PDF] Back to Contents
  Malte Pedersen, Joakim Bruslund Haurum, Stefan Hein Bengtson, Thomas B. Moeslund
Abstract: In this work we present a novel publicly available stereo based 3D RGB dataset for multi-object zebrafish tracking, called 3D-ZeF. Zebrafish is an increasingly popular model organism used for studying neurological disorders, drug addiction, and more. Behavioral analysis is often a critical part of such research. However, visual similarity, occlusion, and erratic movement of the zebrafish makes robust 3D tracking a challenging and unsolved problem. The proposed dataset consists of eight sequences with durations between 15 and 120 seconds and 1-10 freely moving zebrafish. The videos have been annotated with a total of 86,400 points and bounding boxes. Furthermore, we present a complexity score and a novel open-source modular baseline system for 3D tracking of zebrafish. The performance of the system is measured with respect to two detectors: a naive approach and a Faster R-CNN based fish head detector. The system reaches a MOTA of up to 77.6%. Links to the code and dataset are available at the project page: this https URL
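The MOTA score quoted above follows the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT. A one-line reference implementation, with made-up counts chosen to reproduce 77.6%:

```python
# CLEAR MOT multi-object tracking accuracy; the example counts are made up.
def mota(false_negatives: int, false_positives: int, id_switches: int,
         num_gt_objects: int) -> float:
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_objects

print(mota(false_negatives=120, false_positives=80, id_switches=24,
           num_gt_objects=1000))  # 0.776, i.e. 77.6%
```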

7. SD-RSIC: Summarization Driven Deep Remote Sensing Image Captioning [PDF] Back to Contents
  Gencer Sumbul, Sonali Nayak, Begüm Demir
Abstract: Deep neural networks (DNNs) have been recently found popular for image captioning problems in remote sensing (RS). Existing DNN based approaches rely on the availability of a training set made up of a high number of RS images with their captions. However, captions of training images may contain redundant information (they can be repetitive or semantically similar to each other), resulting in information deficiency while learning a mapping from image domain to language domain. To overcome this limitation, in this paper we present a novel Summarization Driven Remote Sensing Image Captioning (SD-RSIC) approach. The proposed approach consists of three main steps. The first step obtains the standard image captions by jointly exploiting convolutional neural networks (CNNs) with long short-term memory (LSTM) networks. The second step, unlike the existing RS image captioning methods, summarizes the ground-truth captions of each training image into a single caption by exploiting sequence to sequence neural networks and eliminates the redundancy present in the training set. The third step automatically defines the adaptive weights associated to each RS image to combine the standard captions with the summarized captions based on the semantic content of the image. This is achieved by a novel adaptive weighting strategy defined in the context of LSTM networks. Experimental results obtained on the RSCID, UCM-Captions and Sydney-Captions datasets show the effectiveness of the proposed approach compared to the state-of-the-art RS image captioning approaches.

8. Pixel Invisibility: Detecting Objects Invisible in Color Images [PDF] Back to Contents
  Yongxin Wang, Duminda Wijesekera
Abstract: Despite recent success of object detectors using deep neural networks, their deployment on safety-critical applications such as self-driving cars remains questionable. This is partly due to the absence of reliable estimation for detectors' failure under operational conditions such as night, fog, dusk, dawn and glare. Such unquantifiable failures could lead to safety violations. In order to solve this problem, we created an algorithm that predicts a pixel-level invisibility map for color images that does not require manual labeling - that computes the probability that a pixel/region contains objects that are invisible in color domain, during various lighting conditions such as day, night and fog. We propose a novel use of cross modal knowledge distillation from color to infra-red domain using weakly-aligned image pairs from the day and construct indicators for the pixel-level invisibility based on the distances of their intermediate-level features. Quantitative experiments show the great performance of our pixel-level invisibility mask and also the effectiveness of distilled mid-level features on object detection in infra-red imagery.

9. Generating Master Faces for Use in Performing Wolf Attacks on Face Recognition Systems [PDF] Back to Contents
  Huy H. Nguyen, Junichi Yamagishi, Isao Echizen, Sébastien Marcel
Abstract: Due to its convenience, biometric authentication, especial face authentication, has become increasingly mainstream and thus is now a prime target for attackers. Presentation attacks and face morphing are typical types of attack. Previous research has shown that finger-vein- and fingerprint-based authentication methods are susceptible to wolf attacks, in which a wolf sample matches many enrolled user templates. In this work, we demonstrated that wolf (generic) faces, which we call "master faces," can also compromise face recognition systems and that the master face concept can be generalized in some cases. Motivated by recent similar work in the fingerprint domain, we generated high-quality master faces by using the state-of-the-art face generator StyleGAN in a process called latent variable evolution. Experiments demonstrated that even attackers with limited resources using only pre-trained models available on the Internet can initiate master face attacks. The results, in addition to demonstrating performance from the attacker's point of view, can also be used to clarify and improve the performance of face recognition systems and harden face authentication systems.

10. Tamil Vowel Recognition With Augmented MNIST-like Data Set [PDF] Back to Contents
  Muthiah Annamalai
Abstract: We report generation of a MNIST [4] compatible data set [1] for Tamil vowels to enable building a classification DNN or other such ML/AI deep learning [2] models for Tamil OCR/Handwriting applications. We report the capability of the 60,000 grayscale, 28x28 pixel dataset to build a 4-layer CNN with 100,000+ parameters in TensorFlow, reaching 92% training accuracy and 82% cross-validation accuracy. We also report a top-1 classification accuracy of 70% and a top-2 classification accuracy of 92% on handwritten vowels for the same network.
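The abstract does not spell out the architecture; below is an assumed minimal 4-layer Keras CNN for 28x28 grayscale inputs in roughly the stated parameter range. The class count of 12 Tamil vowels is also an assumption.

```python
# Assumed minimal 4-layer CNN (2 conv + 2 dense) for 28x28 grayscale images;
# not the paper's exact architecture.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(12, activation="softmax"),  # assumed 12 vowel classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # roughly 120k parameters, in line with the quoted "100,000+"
```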

11. CoDeNet: Algorithm-hardware Co-design for Deformable Convolution [PDF] Back to Contents
  Zhen Dong, Dequan Wang, Qijing Huang, Yizhao Gao, Yaohui Cai, Bichen Wu, Kurt Keutzer, John Wawrzynek
Abstract: Deploying deep learning models on embedded systems for computer vision tasks has been challenging due to limited compute resources and strict energy budgets. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, such as object detection, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the spatial variance of objects, and therefore, require specialized convolutions to aggregate spatial information. To address this, recent work proposes dynamic deformable convolution to augment regular convolutions. Regular convolutions process a fixed grid of pixels across all the spatial locations in an image, while dynamic deformable convolution may access arbitrary pixels in the image and the access pattern is input-dependent and varies per spatial location. These properties lead to inefficient memory accesses of inputs with existing hardware. In this work, we first investigate the overhead of the deformable convolution on embedded FPGA SoCs, and introduce a depthwise deformable convolution to reduce the total number of operations required. We then show the speed-accuracy tradeoffs for a set of algorithm modifications including irregular-access versus limited-range and fixed-shape. We evaluate these algorithmic changes with corresponding hardware optimizations. Results show a 1.36x and 9.76x speedup respectively for the full and depthwise deformable convolution on the embedded FPGA accelerator with minor accuracy loss on the object detection task. We then co-design an efficient network CoDeNet with the modified deformable convolution for object detection and quantize the network to 4-bit weights and 8-bit activations. Results show that our designs lie on the Pareto-optimal front of the latency-accuracy tradeoff for the object detection task on embedded FPGAs.
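As a point of reference for the operation being co-designed, here is a minimal sketch using torchvision's built-in deformable convolution; the paper's depthwise variant and hardware mapping are not reproduced, and all shapes are illustrative.

```python
# Minimal deformable convolution via torchvision; with all-zero offsets it
# reduces to a regular convolution.
import torch
from torchvision.ops import deform_conv2d

x = torch.randn(1, 8, 16, 16)
weight = torch.randn(16, 8, 3, 3)            # out_ch, in_ch, kH, kW
# one (dy, dx) offset per kernel tap and spatial location: 2*kH*kW channels
offset = torch.zeros(1, 2 * 3 * 3, 16, 16)   # zero offsets = fixed sampling grid
out = deform_conv2d(x, offset, weight, padding=1)
print(out.shape)  # torch.Size([1, 16, 16, 16])
```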

12. ORD: Object Relationship Discovery for Visual Dialogue Generation [PDF] Back to Contents
  Ziwei Wang, Zi Huang, Yadan Luo, Huimin Lu
Abstract: With the rapid advancement of image captioning and visual question answering at single-round level, the question of how to generate multi-round dialogue about visual content has not yet been well explored. Existing visual dialogue methods encode the image into a fixed feature vector directly, concatenated with the question and history embeddings to predict the response. Some recent methods tackle the co-reference resolution problem using a co-attention mechanism to cross-refer relevant elements from the image, history, and the target question. However, it remains challenging to reason about visual relationships, since the fine-grained object-level information is omitted before co-attentive reasoning. In this paper, we propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation. Specifically, a hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refines the object-object connections globally to obtain the final graph embeddings. A graph attention is further incorporated to dynamically attend to this graph-structured representation at the response reasoning stage. Extensive experiments have proved that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships. The model achieves superior performance over the state-of-the-art methods on the Visual Dialog dataset, increasing MRR from 0.6222 to 0.6447, and recall@1 from 48.48% to 51.22%.
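As background for the graph-convolutional building block behind networks like HierGCN, the following is a generic one-layer sketch of H' = ReLU(Â H W) with symmetric normalization; the node count, feature sizes, and relation graph are illustrative assumptions, not the paper's design.

```python
# Generic single graph-convolution layer over object nodes.
import torch

def gcn_layer(H: torch.Tensor, A: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    A_hat = A + torch.eye(A.size(0))                       # add self-loops
    d = A_hat.sum(dim=1)
    A_norm = A_hat / torch.sqrt(d[:, None] * d[None, :])   # D^-1/2 Â D^-1/2
    return torch.relu(A_norm @ H @ W)

H = torch.randn(6, 16)                 # 6 object nodes, 16-dim features
A = (torch.rand(6, 6) > 0.5).float()
A = ((A + A.T) > 0).float()            # make the relation graph undirected
W = torch.randn(16, 16)
print(gcn_layer(H, A, W).shape)        # torch.Size([6, 16])
```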

13. On the Preservation of Spatio-temporal Information in Machine Learning Applications [PDF] Back to Contents
  Yigit Oktar, Mehmet Turkan
Abstract: In conventional machine learning applications, each data attribute is assumed to be orthogonal to others. Namely, every pair of dimension is orthogonal to each other and thus there is no distinction of in-between relations of dimensions. However, this is certainly not the case in real world signals which naturally originate from a spatio-temporal configuration. As a result, the conventional vectorization process disrupts all of the spatio-temporal information about the order/place of data whether it be $1$D, $2$D, $3$D, or $4$D. In this paper, the problem of orthogonality is first investigated through conventional $k$-means of images, where images are to be processed as vectors. As a solution, shift-invariant $k$-means is proposed in a novel framework with the help of sparse representations. A generalization of shift-invariant $k$-means, convolutional dictionary learning, is then utilized as an unsupervised feature extraction method for classification. Experiments suggest that Gabor feature extraction as a simulation of shallow convolutional neural networks provides a little better performance compared to convolutional dictionary learning. Many alternatives of convolutional-logic are also discussed for spatio-temporal information preservation, including a spatio-temporal hypercomplex encoding scheme.

14. Mitigating Gender Bias in Captioning Systems [PDF] Back to Contents
  Ruixiang Tang, Mengnan Du, Yuening Li, Zirui Liu, Xia Hu
Abstract: Recent studies have shown that captioning datasets, such as the COCO dataset, may contain severe social bias which could potentially lead to unintentional discrimination in learning models. In this work, we specifically focus on the gender bias problem. The existing dataset fails to quantify bias because models that intrinsically memorize gender bias from training data could still achieve a competitive performance on the biased test dataset. To bridge the gap, we create two new splits: COCO-GB v1 and v2 to quantify the inherent gender bias which could be learned by models. Several widely used baselines are evaluated on our new settings, and experimental results indicate that most models learn gender bias from the training data, leading to an undesirable gender prediction error towards women. To overcome the unwanted bias, we propose a novel Guided Attention Image Captioning model (GAIC) which provides self-guidance on visual attention to encourage the model to explore correct gender visual evidence. Experimental results validate that GAIC can significantly reduce gender prediction error, with a competitive caption quality. Our codes and the designed benchmark datasets are available at this https URL.

15. Deep-CAPTCHA: a deep learning based CAPTCHA solver for vulnerability assessment [PDF] Back to Contents
  Zahra Noury, Mahdi Rezaei
Abstract: CAPTCHA is a human-centred test to distinguish a human operator from bots, attacking programs, or any other computerised agent that tries to imitate human intelligence. In this research, we investigate a way to crack visual CAPTCHA tests by an automated deep learning based solution. The goal of the cracking is to investigate the weaknesses and vulnerabilities of the CAPTCHA generators and to develop more robust CAPTCHAs, without taking the risks of manual try and error efforts. We have developed a Convolutional Neural Network called Deep-CAPTCHA to achieve this goal. We propose a platform to investigate both numerical and alphanumerical image CAPTCHAs. To train and develop an efficient model, we have generated 500,000 CAPTCHAs using the Python Image-Captcha Library. In this paper, we present our customised deep neural network model, the research gaps and the existing challenges, and the solutions to overcome the issues. Our network's cracking accuracy reaches 98.94% and 98.31% for the numerical and the alphanumerical test datasets, respectively. That means more work needs to be done to develop robust CAPTCHAs that are non-crackable against bot attacks and artificial agents. As the outcome of this research, we identify some efficient techniques to improve the CAPTCHA generators, based on the performance analysis conducted on the Deep-CAPTCHA model.
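A sketch of generating a numeric-CAPTCHA training set with the Python captcha package the abstract refers to; the digit length, image size, and image count here are illustrative, not the paper's exact settings.

```python
# Generate labeled numeric CAPTCHA images; the label is embedded in the filename.
import random
from captcha.image import ImageCaptcha

generator = ImageCaptcha(width=160, height=60)
for i in range(10):                                  # the paper generates 500,000
    label = "".join(random.choices("0123456789", k=4))
    generator.write(label, f"captcha_{i}_{label}.png")
```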

16. AMENet: Attentive Maps Encoder Network for Trajectory Prediction [PDF] Back to Contents
  Hao Cheng, Wentong Liao, Michael Ying Yang, Bodo Rosenhahn, Monika Sester
Abstract: Trajectory prediction is a crucial task in different communities, such as intelligent transportation systems, photogrammetry, computer vision, and mobile robot applications. However, there are many challenges to predict the trajectories of heterogeneous road agents (e.g. pedestrians, cyclists and vehicles) at a microscopical level. For example, an agent might be able to choose multiple plausible paths in complex interactions with other agents in varying environments, and the behavior of each agent is affected by the various behaviors of its neighboring agents. To this end, we propose an end-to-end generative model named Attentive Maps Encoder Network (AMENet) for accurate and realistic multi-path trajectory prediction. Our method leverages the target road user's motion information (i.e. movement in xy-axis in a Cartesian space) and the interaction information with the neighboring road users at each time step, which is encoded as dynamic maps that are centralized on the target road user. A conditional variational auto-encoder module is trained to learn the latent space of possible future paths based on the dynamic maps and then used to predict multiple plausible future trajectories conditioned on the observed past trajectories. Our method reports the new state-of-the-art performance (final/mean average displacement (FDE/MDE) errors 1.183/0.356 meters) on benchmark datasets and wins the first place in the open challenge of Trajnet.

17. Dermatologist vs Neural Network [PDF] Back to Contents
  Kaushil Mangaroliya, Mitt Shah
Abstract: Cancer, in general, is very deadly, and timely treatment of any cancer is the key to saving a life. Skin cancer is no exception. There have been thousands of skin cancer cases registered per year all over the world, and 123,000 deadly melanoma cases detected in a single year. This huge number is attributed to the high amount of UV rays present in sunlight due to the degradation of the ozone layer. If not detected at an early stage, skin cancer can lead to the death of the patient. The unavailability of proper resources such as expert dermatologists, state-of-the-art testing facilities, and quick biopsy results has led researchers to develop technology that can solve the above problem. Deep learning is one such method that has offered extraordinary results. The convolutional neural network proposed in this study outperforms every pretrained model. We trained our model on the HAM10000 dataset, which offers 10015 images belonging to 7 classes of skin disease. The model we proposed achieved an accuracy of 89% and can predict deadly melanoma skin cancer with great accuracy. Hopefully, this study can help save people's lives where proper dermatological resources are unavailable, by bridging the gap using our proposed approach.

18. Learn to cycle: Time-consistent feature discovery for action recognition [PDF] Back to Contents
  Alexandros Stergiou, Ronald Poppe
Abstract: Temporal motion has been one of the essential components for effectively recognizing actions in videos. Both time information and features are primarily extracted hierarchically through small sequences of few frames, with the use of 3D convolutions. In this paper, we propose a method that can learn general feature changes across time, making activations unbounded to a temporal locality, by additionally including a general notion of their learned features. Through this recalibration of temporal feature cues across multiple frames, 3D-CNN models are capable of using features that are prevalent over different time segments, while being less constrained by their temporal receptive fields. We present improvements on both high and low capacity models, with the largest benefits being observed in low-memory models, as most of their current drawbacks rely on their poor generalization capabilities because of the low number and feature complexity. We present average improvements, over both corresponding and state-of-the-art models, in the range of 3.67% on Kinetics-700 (K-700), 2.75% on Moments in Time (MiT), 2.57% on Human Actions Clips and Segments (HACS), 3.195% on HMDB-51 and 3.30% on UCF-101.

19. AutoGAN-Distiller: Searching to Compress Generative Adversarial Networks [PDF] Back to Contents
  Yonggan Fu, Wuyang Chen, Haotao Wang, Haoran Li, Yingyan Lin, Zhangyang Wang
Abstract: The compression of Generative Adversarial Networks (GANs) has lately drawn attention, due to the increasing demand for deploying GANs into mobile devices for numerous applications such as image translation, enhancement and editing. However, compared to the substantial efforts to compressing other deep models, the research on compressing GANs (usually the generators) remains at its infancy stage. Existing GAN compression algorithms are limited to handling specific GAN architectures and losses. Inspired by the recent success of AutoML in deep compression, we introduce AutoML to GAN compression and develop an AutoGAN-Distiller (AGD) framework. Starting with a specifically designed efficient search space, AGD performs an end-to-end discovery for new efficient generators, given the target computational resource constraints. The search is guided by the original GAN model via knowledge distillation, therefore fulfilling the compression. AGD is fully automatic, standalone (i.e., needing no trained discriminators), and generically applicable to various GAN models. We evaluate AGD in two representative GAN tasks: image translation and super resolution. Without bells and whistles, AGD yields remarkably lightweight yet more competitive compressed models, that largely outperform existing alternatives.

20. Infinite Feature Selection: A Graph-based Feature Filtering Approach [PDF] Back to Contents
  Giorgio Roffo, Simone Melzi, Umberto Castellani, Alessandro Vinciarelli, Marco Cristani
Abstract: We propose a filtering feature selection framework that considers subsets of features as paths in a graph, where a node is a feature and an edge indicates pairwise (customizable) relations among features, dealing with relevance and redundancy principles. By two different interpretations (exploiting properties of power series of matrices and relying on Markov chain fundamentals) we can evaluate the values of paths (i.e., feature subsets) of arbitrary length, eventually going to infinity, from which we dub our framework Infinite Feature Selection (Inf-FS). Going to infinity allows us to constrain the computational complexity of the selection process, and to rank the features in an elegant way, that is, considering the value of any path (subset) containing a particular feature. We also propose a simple unsupervised strategy to cut the ranking, thus providing the subset of features to keep. In the experiments, we analyze diverse settings with heterogeneous features, for a total of 11 benchmarks, comparing against 18 widely-known comparative approaches. The results show that Inf-FS behaves better in almost any situation, that is, when the number of features to keep is fixed a priori, or when the decision of the subset cardinality is part of the process.
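The power-series interpretation admits a closed form: for a scaled relation matrix rA with spectral radius below one, the sum over path lengths l >= 1 of (rA)^l equals (I - rA)^{-1} - I. A minimal NumPy sketch of scoring features this way follows; the relation matrix A is a random placeholder, whereas the paper builds it from relevance and redundancy criteria.

```python
# Score features by the total value of paths of all lengths passing through them.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 5))                            # A[i, j]: relation between features i and j
r = 0.9 / np.max(np.abs(np.linalg.eigvals(A)))    # scale so the series converges
S = np.linalg.inv(np.eye(5) - r * A) - np.eye(5)  # sum_{l>=1} (rA)^l in closed form
scores = S.sum(axis=1)                            # per-feature "infinite path" score
print(np.argsort(scores)[::-1])                   # feature ranking, best first
```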

21. Binary DAD-Net: Binarized Driveable Area Detection Network for Autonomous Driving [PDF] Back to Contents
  Alexander Frickenstein, Manoj Rohit Vemparala, Jakob Mayr, Naveen Shankar Nagaraja, Christian Unger, Federico Tombari, Walter Stechele
Abstract: Driveable area detection is a key component for various applications in the field of autonomous driving (AD), such as ground-plane detection, obstacle detection and maneuver planning. Additionally, bulky and over-parameterized networks can be easily forgone and replaced with smaller networks for faster inference on embedded systems. The driveable area detection, posed as a two class segmentation task, can be efficiently modeled with slim binary networks. This paper proposes a novel binarized driveable area detection network (binary DAD-Net), which uses only binary weights and activations in the encoder, the bottleneck, and the decoder part. The latent space of the bottleneck is efficiently increased (x32 -> x16 downsampling) through binary dilated convolutions, learning more complex features. Along with automatically generated training data, the binary DAD-Net outperforms state-of-the-art semantic segmentation networks on public datasets. In comparison to a full-precision model, our approach has a x14.3 reduced compute complexity on an FPGA and it requires only 0.9MB memory resources. Therefore, commodity SIMD-based AD-hardware is capable of accelerating the binary DAD-Net.
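For background on the binarization that networks like Binary DAD-Net rely on, here is a generic sketch of weight binarization with a straight-through estimator (STE); this is the standard trick, not the paper's exact layer.

```python
# Binarize weights to {-1, +1} in the forward pass; pass gradients through
# unchanged inside [-1, 1] in the backward pass (straight-through estimator).
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1)   # clip gradients outside [-1, 1]

w = torch.randn(4, requires_grad=True)
BinarizeSTE.apply(w).sum().backward()
print(w.grad)  # 1.0 where |w| <= 1, else 0.0
```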

22. Neural gradients are lognormally distributed: understanding sparse and quantized training [PDF] Back to Contents
  Brian Chmiel, Liad Ben-Uri, Moran Shkolnik, Elad Hoffer, Ron Banner, Daniel Soudry
Abstract: Neural gradient compression remains a main bottleneck in improving training efficiency, as most existing neural network compression methods (e.g., pruning or quantization) focus on weights, activations, and weight gradients. However, these methods are not suitable for compressing neural gradients, which have a very different distribution. Specifically, we find that the neural gradients follow a lognormal distribution. Taking this into account, we suggest two methods to reduce the computational and memory burdens of neural gradients. The first one is stochastic gradient pruning, which can accurately set the sparsity level -- up to 85% gradient sparsity without hurting accuracy (ResNet18 on ImageNet). The second method determines the floating-point format for low numerical precision gradients (e.g., FP8). Our results shed light on previous findings related to local scaling, the optimal bit-allocation for the mantissa and exponent, and challenging workloads for which low-precision floating-point arithmetic has reported to fail. Reference implementation accompanies the paper.
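A minimal sketch of the unbiased stochastic pruning idea: gradient entries below a threshold are dropped or rescaled with probability proportional to their magnitude, so the pruned gradient matches the original in expectation. The percentile-based threshold here is an illustrative assumption, not the paper's exact rule.

```python
# Unbiased stochastic gradient pruning: for |g| < thr, keep with probability
# |g|/thr and rescale survivors to thr*sign(g), so E[output] = g.
import torch

def stochastic_prune(grad: torch.Tensor, sparsity: float = 0.85) -> torch.Tensor:
    thr = torch.quantile(grad.abs(), sparsity)
    small = grad.abs() < thr
    keep = torch.rand_like(grad) < (grad.abs() / thr).clamp(max=1.0)
    out = torch.where(small & ~keep, torch.zeros_like(grad), grad)
    out = torch.where(small & keep, thr * grad.sign(), out)  # rescale survivors
    return out

g = torch.randn(10000)
print((stochastic_prune(g) == 0).float().mean())  # realized sparsity
```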

23. Filter design for small target detection on infrared imagery using normalized-cross-correlation layer [PDF] Back to Contents
  H. Seçkin Demir, Erdem Akagunduz
Abstract: In this paper, we introduce a machine learning approach to the problem of infrared small target detection filter design. For this purpose, similarly to a convolutional layer of a neural network, the normalized-cross-correlational (NCC) layer, which we utilize for designing a target detection/recognition filter bank, is proposed. By employing the NCC layer in a neural network structure, we introduce a framework, in which supervised training is used to calculate the optimal filter shape and the optimum number of filters required for a specific target detection/recognition task on infrared images. We also propose the mean-absolute-deviation NCC (MAD-NCC) layer, an efficient implementation of the proposed NCC layer, designed especially for FPGA systems, in which square root operations are avoided for real-time computation. As a case study we work on dim-target detection on mid-wave infrared imagery and obtain the filters that can discriminate a dim target from various types of background clutter, specific to our operational concept.
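For reference, the normalized cross-correlation such a layer computes at a single alignment is sketched below; sliding it over an image yields the correlation map used for detection. The patch sizes are illustrative.

```python
# Normalized cross-correlation between a patch and a template, in [-1, 1].
import numpy as np

def ncc(patch: np.ndarray, template: np.ndarray, eps: float = 1e-8) -> float:
    p = patch - patch.mean()
    t = template - template.mean()
    return float((p * t).sum() / (np.linalg.norm(p) * np.linalg.norm(t) + eps))

t = np.random.rand(8, 8)
print(ncc(t, t))    # ~1.0: perfect match
print(ncc(-t, t))   # ~-1.0: inverted contrast
```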

24. Survey on Deep Multi-modal Data Analytics: Collaboration, Rivalry and Fusion [PDF] Back to Contents
  Yang Wang
Abstract: With the development of web technology, multi-modal or multi-view data has surged as a major stream for big data, where each modal/view encodes an individual property of data objects. Often, different modalities are complementary to each other. This fact motivated a lot of research attention on fusing the multi-modal feature spaces to comprehensively characterize the data objects. Most of the existing state-of-the-art focused on how to fuse the energy or information from multi-modal spaces to deliver a superior performance over their counterparts with a single modality. Recently, deep neural networks have emerged as a powerful architecture to well capture the nonlinear distribution of high-dimensional multimedia data, and this naturally extends to multi-modal data. Substantial empirical studies are carried out to demonstrate the advantages that are gained from deep multi-modal methods, which can essentially deepen the fusion from multi-modal deep feature spaces. In this paper, we provide a substantial overview of the existing state-of-the-art on the field of multi-modal data analytics from shallow to deep spaces. Throughout this survey, we further indicate that the critical components for this field go to collaboration, adversarial competition and fusion over multi-modal spaces. Finally, we share our viewpoints regarding some future directions in this field.

25. Classifying degraded images over various levels of degradation [PDF] Back to Contents
  Kazuki Endo, Masayuki Tanaka, Masatoshi Okutomi
Abstract: Classification of degraded images having various levels of degradation is very important in practical applications. This paper proposes a convolutional neural network to classify degraded images by using a restoration network and ensemble learning. The results demonstrate that the proposed network can classify degraded images well over various levels of degradation. This paper also reveals how the image quality of training data for a classification network affects the classification performance on degraded images.

26. Anomalous Motion Detection on Highway Using Deep Learning [PDF] Back to Contents
  Harpreet Singh, Emily M. Hand, Kostas Alexis
Abstract: Research in visual anomaly detection draws much interest due to its applications in surveillance. Common datasets for evaluation are constructed using a stationary camera overlooking a region of interest. Previous research has shown promising results in detecting spatial as well as temporal anomalies in these settings. The advent of self-driving cars provides an opportunity to apply visual anomaly detection in a more dynamic application yet no dataset exists in this type of environment. This paper presents a new anomaly detection dataset - the Highway Traffic Anomaly (HTA) dataset - for the problem of detecting anomalous traffic patterns from dash cam videos of vehicles on highways. We evaluate state-of-the-art deep learning anomaly detection models and propose novel variations to these methods. Our results show that state-of-the-art models built for settings with a stationary camera do not translate well to a more dynamic environment. The proposed variations to these SoTA methods show promising results on the new HTA dataset.

27. Geo-PIFu: Geometry and Pixel Aligned Implicit Functions for Single-view Human Reconstruction [PDF] Back to Contents
  Tong He, John Collomosse, Hailin Jin, Stefano Soatto
Abstract: We propose Geo-PIFu, a method to recover a 3D mesh from a monocular color image of a clothed person. Our method is based on a deep implicit function-based representation to learn latent voxel features using a structure-aware 3D U-Net, to constrain the model in two ways: first, to resolve feature ambiguities in query point encoding, second, to serve as a coarse human shape proxy to regularize the high-resolution mesh and encourage global shape regularity. We show that, by both encoding query points and constraining global shape using latent voxel features, the reconstruction we obtain for clothed human meshes exhibits less shape distortion and improved surface details compared to competing methods. We evaluate Geo-PIFu on a recent human mesh public dataset that is $10 \times$ larger than the private commercial dataset used in PIFu and previous derivative work. On average, we exceed the state of the art by $42.7\%$ reduction in Chamfer and Point-to-Surface Distances, and $19.4\%$ reduction in normal estimation errors.
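The Chamfer distance used in the evaluation is the symmetric average nearest-neighbor distance between two point sets; a minimal NumPy sketch with illustrative point counts:

```python
# Symmetric Chamfer distance between point clouds P and Q.
import numpy as np

def chamfer(P: np.ndarray, Q: np.ndarray) -> float:
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise distances
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

P, Q = np.random.rand(100, 3), np.random.rand(120, 3)
print(chamfer(P, Q))
```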

28. Multiple Video Frame Interpolation via Enhanced Deformable Separable Convolution [PDF] Back to Contents
  Xianhang Cheng, Zhenzhong Chen
Abstract: Generating non-existing frames from a consecutive video sequence has been an interesting and challenging problem in the video processing field. Recent kernel-based interpolation methods predict pixels with a single convolution process that convolves source frames with spatially adaptive local kernels. However, when scene motion is larger than the pre-defined kernel size, these methods are prone to yield less plausible results and they cannot directly generate a frame at an arbitrary temporal position because the learned kernels are tied to the midpoint in time between the input frames. In this paper, we try to solve these problems and propose a novel approach that we refer to as enhanced deformable separable convolution (EDSC) to estimate not only adaptive kernels, but also offsets, masks and biases to make the network obtain information from a non-local neighborhood. During the learning process, different intermediate time steps can be involved as a control variable by means of the coord-conv trick, allowing the estimated components to vary with different input temporal information. This makes our method capable of producing multiple in-between frames. Furthermore, we investigate the relationships between our method and other typical kernel- and flow-based methods. Experimental results show that our method performs favorably against the state-of-the-art methods across a broad range of datasets. Code will be publicly available at: this https URL
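A minimal sketch of the coord-conv-style time conditioning the abstract mentions: the target intermediate time step t is appended to the network input as an extra constant channel, so a single model can synthesize frames at arbitrary t in (0, 1). The shapes are illustrative assumptions.

```python
# Append the target time step as a constant extra channel (coord-conv trick).
import torch

def add_time_channel(frames: torch.Tensor, t: float) -> torch.Tensor:
    b, _, h, w = frames.shape
    t_map = torch.full((b, 1, h, w), t)       # constant map holding t
    return torch.cat([frames, t_map], dim=1)

x = torch.randn(2, 6, 64, 64)                 # two stacked RGB input frames
print(add_time_channel(x, t=0.25).shape)      # torch.Size([2, 7, 64, 64])
```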

29. RasterNet: Modeling Free-Flow Speed using LiDAR and Overhead Imagery [PDF] 返回目录
  Armin Hadzic, Hunter Blanton, Weilian Song, Mei Chen, Scott Workman, Nathan Jacobs
Abstract: Roadway free-flow speed captures the typical vehicle speed in low traffic conditions. Modeling free-flow speed is an important problem in transportation engineering with applications to a variety of design, operation, planning, and policy decisions of highway systems. Unfortunately, collecting large-scale historical traffic speed data is expensive and time consuming. Traditional approaches for estimating free-flow speed use geometric properties of the underlying road segment, such as grade, curvature, lane width, lateral clearance and access point density, but for many roads such features are unavailable. We propose a fully automated approach, RasterNet, for estimating free-flow speed without the need for explicit geometric features. RasterNet is a neural network that fuses large-scale overhead imagery and aerial LiDAR point clouds using a geospatially consistent raster structure. To support training and evaluation, we introduce a novel dataset combining free-flow speeds of road segments, overhead imagery, and LiDAR point clouds across the state of Kentucky. Our method achieves state-of-the-art results on a benchmark dataset.

30. BatVision with GCC-PHAT Features for Better Sound to Vision Predictions [PDF] 返回目录
  Jesper Haahr Christensen, Sascha Hornauer, Stella Yu
Abstract: Inspired by the sophisticated echolocation abilities found in nature, we train a generative adversarial network to predict plausible depth maps and grayscale layouts from sound. To achieve this, our sound-to-vision model processes binaural echo-returns from chirping sounds. We build upon previous work on BatVision, which consists of a sound-to-vision model and a dataset self-collected using our mobile robot and low-cost hardware. We improve on the previous model by introducing several changes that lead to better depth and grayscale estimation and increased perceptual quality. Rather than using raw binaural waveforms as input, we generate generalized cross-correlation (GCC) features and use these as input instead. In addition, we base the generator on residual learning and use spectral normalization in the discriminator. We compare and present both quantitative and qualitative improvements over our previous BatVision model.
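GCC-PHAT itself is standard signal processing: the cross-power spectrum of the two ear signals is whitened so that only phase (time-delay) information remains. A minimal NumPy version follows; the feature layout actually fed to the GAN is the paper's design and is not reproduced here.

```python
import numpy as np

def gcc_phat(sig, ref, fs=1.0):
    """Generalized cross-correlation with the phase transform (PHAT)."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                     # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2 + 1]))  # center zero lag
    delay = (np.argmax(np.abs(cc)) - n // 2) / fs
    return cc, delay

# Example: a 40-sample delay between two noise signals is recovered.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
y = np.roll(x, 40)
_, d = gcc_phat(y, x)
print(d)   # 40.0
```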

31. Road Mapping in Low Data Environments with OpenStreetMap [PDF] 返回目录
  John Kamalu, Benjamin Choi
Abstract: Roads are among the most essential components of any country's infrastructure. By facilitating the movement and exchange of people, ideas, and goods, they support economic and cultural activity both within and across local and international borders. A comprehensive, up-to-date mapping of the geographical distribution of roads and their quality thus has the potential to act as an indicator for broader economic development. Such an indicator has a variety of high-impact applications, particularly in the planning of rural development projects where up-to-date infrastructure information is not available. This work investigates the viability of high resolution satellite imagery and crowd-sourced resources like OpenStreetMap in the construction of such a mapping. We experiment with state-of-the-art deep learning methods to explore the utility of OpenStreetMap data in road classification and segmentation tasks. We also compare the performance of models in different mask occlusion scenarios as well as out-of-country domains. Our comparison raises important pitfalls to consider in image-based infrastructure classification tasks, and shows the need for local training data specific to regions of interest for reliable performance.

32. Emergent Properties of Foveated Perceptual Systems [PDF] 返回目录
  Arturo Deza, Talia Konkle
Abstract: We introduce foveated perceptual systems, inspired by human biological systems, and examine the impact that this foveation stage has on the nature and robustness of subsequently learned visual representations. Specifically, these \textit{two-stage} perceptual systems first foveate an image, inducing a texture-like encoding of peripheral information, which is then inputted to a convolutional neural network (CNN) trained to perform scene categorization. We find that: (1) systems trained on foveated inputs (Foveation-Nets) generalize similarly to matched-resource networks trained without foveated input (Standard-Nets), yet show greater cross-generalization; (2) Foveation-Nets show higher robustness than Standard-Nets to scotoma (fovea removed) occlusions, driven by the first foveation stage; (3) the subsequent representations learned in the CNN of Foveation-Nets weigh center information more strongly than Standard-Nets; (4) Foveation-Nets show less sensitivity to low-spatial-frequency information than Standard-Nets. Furthermore, when we added biological and artificial augmentation mechanisms to each system through simulated eye-movements or random cropping and mirroring respectively, we found that these effects were amplified. Taken together, we find evidence that foveated perceptual systems learn a visual representation that is distinct from non-foveated perceptual systems, with implications for generalization, robustness, and perceptual sensitivity. These results provide computational support for the idea that the foveated nature of the human visual system might confer a functional advantage for scene representation.

33. GradAug: A New Regularization Method for Deep Neural Networks [PDF] 返回目录
  Taojiannan Yang, Sijie Zhu, Chen Chen
Abstract: We propose a new regularization method to alleviate over-fitting in deep neural networks. The key idea is to utilize randomly transformed training samples to regularize a set of sub-networks, which originate from sampling the width of the original network, in the training process. As such, the proposed method introduces self-guided disturbances to the raw gradients of the network and is therefore termed Gradient Augmentation (GradAug). We demonstrate that GradAug can help the network learn well-generalized and more diverse representations. Moreover, it is easy to implement and can be applied to various structures and applications. GradAug improves ResNet-50 to 78.79% accuracy on ImageNet classification, which is a new state-of-the-art result. By combining with CutMix, it further boosts the performance to 79.58%, which outperforms an ensemble of advanced training tricks. The generalization ability is evaluated on COCO object detection and instance segmentation, where GradAug significantly surpasses other state-of-the-art methods. GradAug is also robust to image distortions and adversarial attacks and is highly effective in low-data regimes.
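A toy sketch of our reading of the GradAug recipe: the full-width network is trained on the raw input, while randomly sampled sub-width networks see a randomly transformed input and are pulled toward the full network's predictions. The width-slicing model, the noise "transformation", and the KL distillation target are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableMLP(nn.Module):
    """Toy network whose hidden width can be sub-sampled at run time."""
    def __init__(self, d_in=32, d_hid=64, n_cls=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hid)
        self.fc2 = nn.Linear(d_hid, n_cls)

    def forward(self, x, width=1.0):
        k = max(1, int(self.fc1.out_features * width))
        h = F.relu(F.linear(x, self.fc1.weight[:k], self.fc1.bias[:k]))
        return F.linear(h, self.fc2.weight[:, :k], self.fc2.bias)

def gradaug_step(model, x, y, widths=(0.8, 0.6), n_sub=2):
    """One training step: full network on the raw input, sub-networks on
    a transformed input, regularized toward the full net's predictions."""
    logits_full = model(x, width=1.0)
    loss = F.cross_entropy(logits_full, y)
    soft = F.softmax(logits_full.detach(), dim=1)
    for w in torch.tensor(widths)[torch.randperm(len(widths))][:n_sub]:
        x_aug = x + 0.1 * torch.randn_like(x)   # stand-in transformation
        logits_sub = model(x_aug, width=float(w))
        loss = loss + F.kl_div(F.log_softmax(logits_sub, dim=1), soft,
                               reduction="batchmean")
    return loss

model = SlimmableMLP()
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
gradaug_step(model, x, y).backward()
```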

34. ShapeFlow: Learnable Deformations Among 3D Shapes [PDF] 返回目录
  Chiyu "Max" Jiang, Jingwei Huang, Andrea Tagliasacchi, Leonidas Guibas
Abstract: We present ShapeFlow, a flow-based model for learning a deformation space for entire classes of 3D shapes with large intra-class variations. ShapeFlow allows learning a multi-template deformation space that is agnostic to shape topology, yet preserves fine geometric details. Different from a generative space where a latent vector is directly decoded into a shape, a deformation space decodes a vector into a continuous flow that can advect a source shape towards a target. Such a space naturally allows the disentanglement of geometric style (coming from the source) and structural pose (conforming to the target). We parametrize the deformation between geometries as a learned continuous flow field via a neural network and show that such deformations can be guaranteed to have desirable properties, such as bijectivity, freedom from self-intersections, or volume preservation. We illustrate the effectiveness of this learned deformation space for various downstream applications, including shape generation via deformation, geometric style transfer, unsupervised learning of a consistent parameterization for entire classes of shapes, and shape interpolation.

35. Geodesic-HOF: 3D Reconstruction Without Cutting Corners [PDF] 返回目录
  Ziyun Wang, Eric A. Mitchell, Volkan Isler, Daniel D. Lee
Abstract: Single-view 3D object reconstruction is a challenging fundamental problem in computer vision, largely due to the morphological diversity of objects in the natural world. In particular, high curvature regions are not always captured effectively by methods trained using only set-based loss functions, resulting in reconstructions that short-circuit the surface or cut corners. To address this issue, we propose learning an image-conditioned mapping function from a canonical sampling domain to a high dimensional space where the Euclidean distance is equal to the geodesic distance on the object. The first three dimensions of a mapped sample correspond to its 3D coordinates, while the additional lifted components contain information about the underlying geodesic structure. Our results show that taking advantage of these learned lifted coordinates yields better performance for estimating surface normals and generating surfaces than using point cloud reconstructions alone. Further, we find that this learned geodesic embedding space provides useful information for applications such as unsupervised object decomposition.
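A worked form of the lifted-embedding idea, under our reading of the abstract (the paper's exact losses and sampling are not reproduced):

```latex
% A map f sends image-conditioned samples q_i into R^{3+k} so that
% Euclidean distances in the lifted space match on-surface geodesics:
\[
  \mathcal{L} \;=\; \sum_{i,j}
  \Big( \big\| f(q_i) - f(q_j) \big\|_2 - d_{\mathrm{geo}}(x_i, x_j) \Big)^{2},
  \qquad
  x_i \;=\; \big[f(q_i)\big]_{1:3},
\]
% where the first three output dimensions are read off as the 3D
% coordinates and the remaining k "lifted" components carry the
% geodesic structure used, e.g., for unsupervised decomposition.
```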

36. Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization [PDF] 返回目录
  Junting Pan, Siyu Chen, Zheng Shou, Jing Shao, Hongsheng Li
Abstract: Localizing persons and recognizing their actions from videos is a challenging task towards high-level video understanding. Recent advances have been achieved by modeling either 'actor-actor' or 'actor-context' relations. However, such direct first-order relations are not sufficient for localizing actions in complicated scenes. Some actors might be indirectly related via objects or background context in the scene. Such indirect relations are crucial for determining the action labels but are mostly ignored by existing work. In this paper, we propose to explicitly model the Actor-Context-Actor Relation, which can capture indirect high-order supportive information for effectively reasoning actors' actions in complex scenes. To this end, we design an Actor-Context-Actor Relation Network (ACAR-Net) which builds upon a novel High-order Relation Reasoning Operator to model indirect relations for spatio-temporal action localization. Moreover, to allow utilizing more temporal contexts, we extend our framework with an Actor-Context Feature Bank for reasoning long-range high-order relations. Extensive experiments on AVA dataset validate the effectiveness of our ACAR-Net. Ablation studies show the advantages of modeling high-order relations over existing first-order relation reasoning methods. The proposed ACAR-Net is also the core module of our 1st place solution in AVA-Kinetics Crossover Challenge 2020. Training code and models will be available at this https URL.

37. Meta Approach to Data Augmentation Optimization [PDF] 返回目录
  Ryuichiro Hataya, Jan Zdenek, Kazuki Yoshizoe, Hideki Nakayama
Abstract: Data augmentation policies drastically improve the performance of image recognition tasks, especially when the policies are optimized for the target data and tasks. In this paper, we propose to optimize image recognition models and data augmentation policies simultaneously, improving performance using gradient descent. Unlike prior methods, our approach avoids using proxy tasks or reducing the search space, and can directly improve validation performance. Our method achieves efficient and scalable training by approximating the gradient of the policies with an implicit gradient computed via a Neumann series approximation. We demonstrate that our approach can improve the performance of various image classification tasks, including ImageNet classification and fine-grained recognition, without using dataset-specific hyperparameter tuning.
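For readers unfamiliar with the machinery, the standard implicit hypergradient with a Neumann-series inverse looks as follows (a generic form; the paper's exact objective and regularizers are omitted):

```latex
% Generic implicit hypergradient for augmentation parameters \phi with
% inner model weights \theta^*(\phi) = \arg\min_\theta L_{tr}(\theta, \phi):
\[
  \nabla_{\phi} L_{val}
  \;=\; -\,\frac{\partial^{2} L_{tr}}{\partial \phi\, \partial \theta}
  \left[ \frac{\partial^{2} L_{tr}}{\partial \theta\, \partial \theta} \right]^{-1}
  \nabla_{\theta} L_{val},
\]
% where the inverse-Hessian-vector product is approximated by a
% truncated Neumann series (valid when \|I - \alpha H\| < 1):
\[
  H^{-1} v \;\approx\; \alpha \sum_{k=0}^{K} (I - \alpha H)^{k} v ,
\]
% requiring only K Hessian-vector products instead of an explicit inverse.
```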

38. Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [PDF] 返回目录
  Yuqing Song, Shizhe Chen, Yida Zhao, Qin Jin
Abstract: Detecting meaningful events in an untrimmed video is essential for dense video captioning. In this work, we propose a novel and simple model for event sequence generation and explore the temporal relationships of the event sequence in the video. The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass. Experimental results show that the proposed event sequence generation model can generate more accurate and diverse events within a small number of proposals. For event captioning, we follow our previous work and employ intra-event captioning models in our pipeline system. The overall system achieves state-of-the-art performance on the dense-captioning events in video task, with a 9.894 METEOR score on the challenge testing set.

39. Optical Music Recognition: State of the Art and Major Challenges [PDF] 返回目录
  Elona Shatri, György Fazekas
Abstract: Optical Music Recognition (OMR) is concerned with transcribing sheet music into a machine-readable format. The transcribed copy should allow musicians to compose, play and edit music by taking a picture of a music sheet. Complete transcription of sheet music would also enable more efficient archival. OMR facilitates examining sheet music statistically or searching for patterns of notations, thus helping use cases in digital musicology too. Recently, there has been a shift in OMR from using conventional computer vision techniques towards a deep learning approach. In this paper, we review relevant works in OMR, including fundamental methods and significant outcomes, and highlight different stages of the OMR pipeline. These stages often lack standard input and output representation and standardised evaluation. Therefore, comparing different approaches and evaluating the impact of different processing methods can become rather complex. This paper provides recommendations for future work, addressing some of the highlighted issues and represents a position in furthering this important field of research.

40. FenceMask: A Data Augmentation Approach for Pre-extracted Image Features [PDF] 返回目录
  Pu Li, Xiangyang Li, Xiang Long
Abstract: We propose a novel data augmentation method named 'FenceMask' that exhibits outstanding performance in various computer vision tasks. It is based on an 'object occlusion simulation' strategy, which aims to balance object occlusion against information retention of the input data. By enhancing the sparsity and regularity of the occlusion block, our augmentation method overcomes the difficulty of augmenting small objects and notably improves performance over baselines. Extensive experiments show that our method performs better than other object-occlusion simulation approaches. We tested it on the CIFAR10, CIFAR100 and ImageNet datasets for coarse-grained classification, the COCO2017 and VisDrone datasets for detection, and the Oxford Flowers, Cornel Leaf and Stanford Dogs datasets for fine-grained visual categorization. Our method achieved significant performance improvement on the fine-grained visual categorization task and the VisDrone dataset.
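The abstract does not spell out the mask geometry, so the following is only a hypothetical "fence-like" occlusion pattern illustrating the sparse-and-regular occlusion idea; the exact geometry is ours, not the paper's released implementation.

```python
import numpy as np

def fence_mask(h, w, period=32, bar=4, rng=None):
    """Hypothetical fence-like occlusion mask: thin horizontal and
    vertical bars on a regular grid with a random phase, occluding a
    sparse, regular fraction of the image."""
    rng = rng or np.random.default_rng()
    oy, ox = rng.integers(0, period, size=2)
    mask = np.ones((h, w), dtype=np.float32)
    for y in range(oy, h, period):
        mask[y:y + bar, :] = 0.0      # horizontal bar
    for x in range(ox, w, period):
        mask[:, x:x + bar] = 0.0      # vertical bar
    return mask

img = np.random.rand(224, 224, 3)
aug = img * fence_mask(224, 224)[:, :, None]   # occluded training sample
```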

41. Explicitly Modeled Attention Maps for Image Classification [PDF] 返回目录
  Andong Tan, Duc Tam Nguyen, Maximilian Dax, Matthias Niessner, Thomas Brox
Abstract: Self-attention networks have shown remarkable progress in computer vision tasks such as image classification. The main benefit of the self-attention mechanism is the ability to capture long-range feature interactions in attention-maps. However, the computation of attention-maps requires a learnable key, query, and positional encoding, whose usage is often not intuitive and is computationally expensive. To mitigate this problem, we propose a novel self-attention module with explicitly modeled attention-maps using only a single learnable parameter for low computational overhead. The design of explicitly modeled attention-maps using a geometric prior is based on the observation that the spatial context for a given pixel within an image is mostly dominated by its neighbors, while more distant pixels have a minor contribution. Concretely, the attention-maps are parametrized via simple functions (e.g., a Gaussian kernel) with a learnable radius, which is modeled independently of the input content. Our evaluation shows that our method achieves an accuracy improvement of up to 2.2% over the ResNet baselines on ImageNet ILSVRC and outperforms other self-attention methods such as AA-ResNet152 (Bello et al., 2019) in accuracy by 0.9% with 6.4% fewer parameters and 6.7% fewer GFLOPs.
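A minimal sketch of such a content-independent attention map, assuming the Gaussian-kernel instantiation mentioned in the abstract (a dense NumPy version for clarity; a real implementation would exploit the kernel's locality):

```python
import numpy as np

def gaussian_attention(values, radius):
    """Attention map modeled explicitly as an isotropic Gaussian of the
    pixel-to-pixel distance, with `radius` the single learnable
    parameter; no keys or queries are computed."""
    H, W, C = values.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)   # (HW, HW)
    attn = np.exp(-d2 / (2.0 * radius ** 2))
    attn /= attn.sum(axis=1, keepdims=True)   # normalize each query row
    out = attn @ values.reshape(H * W, C)     # content-independent mixing
    return out.reshape(H, W, C)

out = gaussian_attention(np.random.rand(16, 16, 8), radius=3.0)
```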

42. Cityscapes 3D: Dataset and Benchmark for 9 DoF Vehicle Detection [PDF] 返回目录
  Nils Gählert, Nicolas Jourdan, Marius Cordts, Uwe Franke, Joachim Denzler
Abstract: Detecting vehicles and representing their position and orientation in the three dimensional space is a key technology for autonomous driving. Recently, methods for 3D vehicle detection solely based on monocular RGB images gained popularity. In order to facilitate this task as well as to compare and drive state-of-the-art methods, several new datasets and benchmarks have been published. Ground truth annotations of vehicles are usually obtained using lidar point clouds, which often induces errors due to imperfect calibration or synchronization between both sensors. To this end, we propose Cityscapes 3D, extending the original Cityscapes dataset with 3D bounding box annotations for all types of vehicles. In contrast to existing datasets, our 3D annotations were labeled using stereo RGB images only and capture all nine degrees of freedom. This leads to a pixel-accurate reprojection in the RGB image and a higher range of annotations compared to lidar-based approaches. In order to ease multitask learning, we provide a pairing of 2D instance segments with 3D bounding boxes. In addition, we complement the Cityscapes benchmark suite with 3D vehicle detection based on the new annotations as well as metrics presented in this work. Dataset and benchmark are available online.

43. An adversarial learning algorithm for mitigating gender bias in face recognition [PDF] 返回目录
  Prithviraj Dhar, Joshua Gleason, Hossein Souri, Carlos D. Castillo, Rama Chellappa
Abstract: State-of-the-art face recognition networks implicitly encode gender information while being trained for identity classification. Gender is often viewed as an important face attribute for recognizing humans. However, the expression of gender information in deep facial features appears to contribute to gender bias in face recognition, i.e. we find a significant difference in the recognition accuracy of DCNNs on male and female faces. We hypothesize that reducing implicitly encoded gender information will help reduce this gender bias. Therefore, we present a novel approach called `Adversarial Gender De-biasing (AGD)' to reduce the strength of gender information in face recognition features. We accomplish this by introducing a bias-reducing classification loss $L_{br}$. We show that AGD significantly reduces bias, while achieving reasonable recognition performance. The results of our approach are presented on two state-of-the-art networks.
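The abstract does not define $L_{br}$ precisely, so here is one plausible instantiation of a bias-reducing loss, for illustration only: push an auxiliary gender classifier's predictions on face features toward the uniform distribution, so the features carry less gender information.

```python
import torch
import torch.nn.functional as F

def bias_reducing_loss(gender_logits):
    """Hypothetical bias-reducing term: KL divergence from the gender
    classifier's predictions to the uniform distribution. The auxiliary
    classifier itself would be trained adversarially with an ordinary
    cross-entropy loss."""
    log_p = F.log_softmax(gender_logits, dim=1)
    uniform = torch.full_like(log_p, 1.0 / log_p.size(1))
    return F.kl_div(log_p, uniform, reduction="batchmean")

logits = torch.randn(8, 2, requires_grad=True)   # 2 gender classes
bias_reducing_loss(logits).backward()
```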

44. A Generalized Asymmetric Dual-front Model for Active Contours and Image Segmentation [PDF] 返回目录
  Da Chen, Jack Spencer, Jean-Marie Mirebeau, Ke Chen, Ming-Lei Shu, Laurent D. Cohen
Abstract: The geodesic distance-based dual-front curve evolution model is a powerful and efficient solution to active contour and image segmentation problems. In its basic formulation, the dual-front model regards the meeting interfaces of two adjacent Voronoi regions as the evolving curves in the course of curve evolution. One of the most crucial ingredients in the construction of Voronoi regions or a Voronoi diagram is the geodesic metric and the corresponding geodesic distance. In this paper, we introduce a new type of geodesic metric that encodes edge-based anisotropy features, region-based homogeneity penalization and asymmetric enhancement. In contrast to the original isotropic dual-front model, the use of the asymmetric enhancement can reduce the risk of shortcut or leakage problems, especially when the initial curves are far away from the target boundaries. Moreover, the proposed dual-front model can be applied to image segmentation in conjunction with various region-based homogeneity terms, whereas the original model only makes use of the piecewise constant case. Numerical experiments on both synthetic and real images show that the proposed model indeed achieves encouraging results.
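For reference, the standard definitions the dual-front machinery builds on, in generic notation (a possibly asymmetric metric $F$ and geodesic Voronoi regions); the paper's specific metric construction is not reproduced here:

```latex
% A metric F(x, u), possibly asymmetric in its direction argument u,
% induces the geodesic distance
\[
  d_{F}(s, x) \;=\; \inf_{\gamma(0) = s,\; \gamma(1) = x}
  \int_{0}^{1} F\big(\gamma(t), \gamma'(t)\big)\, dt ,
\]
% and the evolving dual-front curve is the interface between the
% geodesic Voronoi regions of the seed sets \{s_i\}:
\[
  \mathcal{V}_{i} \;=\; \big\{\, x : d_{F}(s_{i}, x) \le d_{F}(s_{j}, x)
  \;\; \forall j \,\big\}.
\]
```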

45. Multi-Miner: Object-Adaptive Region Mining for Weakly-Supervised Semantic Segmentation [PDF] 返回目录
  Kuangqi Zhou, Qibin Hou, Zun Li, Jiashi Feng
Abstract: Object region mining is a critical step for weakly-supervised semantic segmentation. Most recent methods mine object regions by expanding the seed regions localized by class activation maps. They generally do not consider the sizes of objects and apply a uniform procedure to mine all object regions. Thus their mined regions are often insufficient in number and scale for large objects, and on the other hand are easily contaminated by surrounding backgrounds for small objects. In this paper, we propose a novel multi-miner framework to perform a region mining process that adapts to diverse object sizes and is thus able to mine more complete and finer object regions. Specifically, our multi-miner leverages a parallel modulator to check whether there are remaining object regions for each single object, and guides a category-aware generator to mine the regions of each object independently. In this way, the multi-miner adaptively takes more steps for large objects and fewer steps for small objects. Experimental results demonstrate that the multi-miner offers better region mining results and helps achieve better segmentation performance than state-of-the-art weakly-supervised semantic segmentation methods.

46. On Saliency Maps and Adversarial Robustness [PDF] 返回目录
  Puneet Mangla, Vedant Singh, Vineeth N Balasubramanian
Abstract: A very recent trend has emerged that couples the notions of interpretability and adversarial robustness, unlike earlier efforts which focused solely on good interpretations or on robustness against adversaries. Works have shown that adversarially trained models exhibit more interpretable saliency maps than their non-robust counterparts, and that this behavior can be quantified by considering the alignment between the input image and the saliency map. In this work, we provide a different perspective on this coupling and propose a method, Saliency-based Adversarial training (SAT), that uses saliency maps to improve the adversarial robustness of a model. In particular, we show that using annotations such as bounding boxes and segmentation masks, already provided with a dataset, as weak saliency maps suffices to improve adversarial robustness with no additional effort to generate the perturbations themselves. Our empirical results on the CIFAR-10, CIFAR-100, Tiny ImageNet and Flower-17 datasets consistently corroborate our claim by showing improved adversarial robustness using our method. We also show how using finer and stronger saliency maps leads to more robust models, and how integrating SAT with existing adversarial training methods further boosts the performance of these existing methods.

47. PCAAE: Principal Component Analysis Autoencoder for organising the latent space of generative networks [PDF] 返回目录
  Chi-Hieu Pham, Saïd Ladjal, Alasdair Newson
Abstract: Autoencoders and generative models produce some of the most spectacular deep learning results to date. However, understanding and controlling the latent space of these models presents a considerable challenge. Drawing inspiration from principal component analysis and autoencoders, we propose the Principal Component Analysis Autoencoder (PCAAE). This is a novel autoencoder whose latent space satisfies two properties. Firstly, the dimensions are organised in decreasing importance with respect to the data at hand. Secondly, the components of the latent space are statistically independent. We achieve this by progressively increasing the latent space during training, and with a covariance loss applied to the latent codes. The resulting autoencoder produces a latent space which separates the intrinsic attributes of the data into different components of the latent space, in a completely unsupervised manner. We also describe an extension of our approach to the case of powerful, pre-trained GANs. We show results both on synthetic examples of shapes and on a state-of-the-art GAN. For example, we are able to separate the color shade scale of hair and skin, the pose of faces and the gender in CelebA, without accessing any labels. We compare the PCAAE with other state-of-the-art approaches, in particular with respect to the ability to disentangle attributes in the latent space. We hope that this approach will contribute to better understanding of the intrinsic latent spaces of powerful deep generative models.
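The covariance loss is straightforward to sketch: drive the off-diagonal entries of the batch covariance of the latent codes to zero. A minimal version (ours, for illustration):

```python
import torch

def covariance_loss(z):
    """Penalize statistical dependence between latent dimensions by
    driving off-diagonal entries of the batch covariance of the codes
    z (shape: batch x dim) to zero; a minimal sketch of the covariance
    loss described in the abstract."""
    zc = z - z.mean(dim=0, keepdim=True)
    cov = zc.t() @ zc / (z.size(0) - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum()

z = torch.randn(128, 8, requires_grad=True)
covariance_loss(z).backward()
```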

48. Few-shot Object Detection on Remote Sensing Images [PDF] 返回目录
  Jingyu Deng, Xiang Li, Yi Fang
Abstract: In this paper, we deal with the problem of object detection on remote sensing images. Previous work has developed numerous deep CNN-based methods for object detection on remote sensing images and reported remarkable achievements in both detection performance and efficiency. However, current CNN-based methods mostly require a large number of annotated samples to train deep neural networks and tend to have limited generalization abilities for unseen object categories. In this paper, we introduce a few-shot learning-based method for object detection on remote sensing images where only a few annotated samples are provided for the unseen categories. More specifically, our model contains three main components: a meta feature extractor that learns to extract feature representations from input images, a reweighting module that learns to adaptively assign different weights to each feature representation from the support images, and a bounding box prediction module that carries out object detection on the reweighted feature maps. We build our few-shot object detection model upon the YOLOv3 architecture and develop a multi-scale object detection framework. Experiments on two benchmark datasets demonstrate that with only a few annotated samples our model can still achieve satisfying detection performance on remote sensing images, and that its performance is significantly better than well-established baseline models.
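A shape-level sketch of the reweighting idea: class-specific vectors derived from the support images scale the query feature map channel-wise, producing one reweighted map per unseen class. Names and shapes are ours for illustration, not the paper's code.

```python
import torch

def reweight_features(query_feats, support_vectors):
    """Channel-wise reweighting of query features by support-derived
    class vectors.

    query_feats: (B, C, H, W); support_vectors: (N_cls, C).
    Returns (B, N_cls, C, H, W), one reweighted map per class.
    """
    w = support_vectors[None, :, :, None, None]   # (1, N_cls, C, 1, 1)
    return query_feats[:, None] * w               # broadcast multiply

maps = reweight_features(torch.randn(2, 256, 13, 13), torch.rand(3, 256))
print(maps.shape)   # torch.Size([2, 3, 256, 13, 13])
```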

49. Working with scale: 2nd place solution to Product Detection in Densely Packed Scenes [Technical Report] [PDF] 返回目录
  Artem Kozlov
Abstract: This report describes the 2nd place solution of the detection challenge held within the CVPR 2020 Retail-Vision workshop. Rather than building further on previous results, this work mainly aims to verify previously observed takeaways by re-experimenting. The reliability and reproducibility of the results are achieved by incorporating the popular object detection toolbox MMDetection. In this report, I first present the results obtained for the Faster-RCNN and RetinaNet models, which were used for comparison in the original work. I then describe the experimental results with more advanced models. The final section reviews two simple tricks for the Faster-RCNN model that were used for my final submission: changing the default anchor scale parameter and train-time image tiling. The source code is available at this https URL.

50. Adaptively Meshed Video Stabilization [PDF] 返回目录
  Minda Zhao, Qiang Ling
Abstract: Video stabilization is essential for improving visual quality of shaky videos. The current video stabilization methods usually take feature trajectories in the background to estimate one global transformation matrix or several transformation matrices based on a fixed mesh, and warp shaky frames into their stabilized views. However, these methods may not model the shaky camera motion well in complicated scenes, such as scenes containing large foreground objects or strong parallax, and may result in notable visual artifacts in the stabilized videos. To resolve the above issues, this paper proposes an adaptively meshed method to stabilize a shaky video based on all of its feature trajectories and an adaptive blocking strategy. More specifically, we first extract feature trajectories of the shaky video and then generate a triangle mesh according to the distribution of the feature trajectories in each frame. Then transformations between shaky frames and their stabilized views over all triangular grids of the mesh are calculated to stabilize the shaky video. Since more feature trajectories can usually be extracted from all regions, including both background and foreground regions, a finer mesh will be obtained and provided for camera motion estimation and frame warping. We estimate the mesh-based transformations of each frame by solving a two-stage optimization problem. Moreover, foreground and background feature trajectories are no longer distinguished and both contribute to the estimation of the camera motion in the proposed optimization problem, which yields better estimation performance than previous works, particularly in challenging videos with large foreground objects or strong parallax.

51. Alternating ConvLSTM: Learning Force Propagation with Alternate State Updates [PDF] 返回目录
  Congyue Deng, Tai-Jiang Mu, Shi-Min Hu
Abstract: Data-driven simulation is an important step forward in computational physics when traditional numerical methods meet their limits. Learning-based simulators have been widely studied in past years; however, most previous works view simulation as a general spatial-temporal prediction problem and take little physical guidance into account when designing their neural network architectures. In this paper, we introduce the alternating convolutional Long Short-Term Memory (Alt-ConvLSTM), which models the force propagation mechanisms in a deformable object with near-uniform material properties. Specifically, we propose an accumulation state, and let the network update its cell state and the accumulation state alternately. We demonstrate how this novel scheme imitates the alternate updates of the first and second-order terms in the forward Euler method of numerical PDE solvers. Benefiting from this, our network only requires a small number of parameters, independent of the number of simulated particles, and also retains the essential features of ConvLSTM, making it naturally applicable to sequential data with spatial inputs and outputs. We validate our Alt-ConvLSTM on human soft tissue simulation with thousands of particles and consistent body pose changes. Experimental results show that Alt-ConvLSTM efficiently models the material kinetic features and greatly outperforms vanilla ConvLSTM with only a single state update.
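The Euler analogy can be written out in one line: with alternating updates, a second-order system is advanced by updating the two state variables in turn. This is our reading of the correspondence suggested by the abstract, with the accumulation state playing the role of the first-order term:

```latex
\[
  v_{t+1} \;=\; v_t + \Delta t \, a(x_t),
  \qquad
  x_{t+1} \;=\; x_t + \Delta t \, v_{t+1},
\]
% Updating v first and then reusing v_{t+1} in the x update is exactly
% an "alternate update" of the first- and second-order terms.
```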

52. 2D Image Relighting with Image-to-Image Translation [PDF] 返回目录
  Paul Gafton, Erick Maraz
Abstract: With the advent of Generative Adversarial Networks (GANs), a finer level of control in manipulating various features of an image has become possible. One example of such fine manipulation is changing the position of the light source in a scene. This is fundamentally an ill-posed problem, since it requires understanding the scene geometry to generate proper lighting effects. This problem is not a trivial one and can become even more complicated if we want to change the direction of the light source from any direction to a specific one. Here we provide our attempt to solve this problem using GANs. Specifically, we use pix2pix [arXiv:1611.07004] trained with the dataset VIDIT [arXiv:2005.05460], which contains images of the same scene with different types of light temperature and 8 different light source positions (N, NE, E, SE, S, SW, W, NW). The result is 8 neural networks, each trained to change the direction of the light source from any direction to one of the 8 positions previously mentioned. Additionally, we provide, as a tool, a simple CNN trained to identify the direction of the light source in an image.

53. Disentanglement for Discriminative Visual Recognition [PDF] 返回目录
  Xiaofeng Liu
Abstract: Recent successes of deep learning-based recognition rely on maintaining the content related to the main-task label. However, how to explicitly dispel noisy signals for better generalization in a controllable manner remains an open issue. For instance, various factors such as identity-specific attributes, pose, illumination and expression affect the appearance of face images. Disentangling the identity-specific factors is potentially beneficial for facial expression recognition (FER). This chapter systematically summarizes the detrimental factors as task-relevant/irrelevant semantic variations and unspecified latent variation. These problems are cast as either a deep metric learning problem or an adversarial minimax game in the latent space. For the former choice, a generalized adaptive (N+M)-tuplet clusters loss function, together with an identity-aware hard-negative mining and online positive mining scheme, can be used for identity-invariant FER. Better FER performance can be achieved by combining the deep metric loss and the softmax loss in a unified framework with two fully connected layer branches via joint optimization. For the latter solution, it is possible to equip an end-to-end conditional adversarial network with the ability to decompose an input sample into three complementary parts. The discriminative representation inherits the desired invariance property guided by prior knowledge of the task, being marginally independent of the task-relevant/irrelevant semantic and latent variations. The framework achieves top performance on a series of tasks, including lighting-, makeup- and disguise-tolerant face recognition and facial attribute recognition. This chapter systematically summarizes the popular and practical solutions for disentanglement to achieve more discriminative visual recognition.

54. ReLGAN: Generalization of Consistency for GAN with Disjoint Constraints and Relative Learning of Generative Processes for Multiple Transformation Learning [PDF] 返回目录
  Chiranjib Sur
Abstract: Image-to-image transformation has gained popularity among different research communities due to its enormous impact on a range of applications, including medical ones. In this work, we introduce a generalized consistency scheme for GAN architectures with two new concepts, Transformation Learning (TL) and Relative Learning (ReL), for enhanced learning of image transformations. Consistency for GAN architectures has suffered from inadequate constraints and has failed to learn multiple and multi-modal transformations, which are inevitable in many medical applications. The main drawback is the focus on creating an intermediate, workable hybrid, which is not permissible for medical applications that focus on minute details. Another drawback is the weak interrelation between the two learning phases; TL and ReL introduce improved coordination between them. We have demonstrated the capability of the novel network framework on public datasets. We emphasize that our novel architecture produces an improved neural image transformation version of the image, which is more acceptable to the medical community. Experiments and results demonstrate the effectiveness of our framework, with enhancement compared to previous works.

55. Relative Pose Estimation for Stereo Rolling Shutter Cameras [PDF] 返回目录
  Ke Wang, Bin Fan, Yuchao Dai
Abstract: In this paper, we present a novel linear algorithm to estimate the 6 DoF relative pose from consecutive frames of stereo rolling shutter (RS) cameras. Our method is derived based on the assumption that stereo cameras undergo motion with constant velocity around the center of the baseline, which needs 9 pairs of correspondences on both left and right consecutive frames. The stereo RS images enable the recovery of depth maps from the semi-global matching (SGM) algorithm. With the estimated camera motion and depth map, we can correct the RS images to get the undistorted images without any scene structure assumption. Experiments on both simulated points and synthetic RS images demonstrate the effectiveness of our algorithm in relative pose estimation.

56. Geometry-Aware Instance Segmentation with Disparity Maps [PDF] 返回目录
  Cho-Ying Wu, Xiaoyan Hu, Michael Happold, Qiangeng Xu, Ulrich Neumann
Abstract: Most previous work on outdoor instance segmentation for images uses only color information. We explore a novel direction of sensor fusion to exploit stereo cameras. Geometric information from disparities helps separate overlapping objects of the same or different classes. Moreover, geometric information penalizes region proposals with unlikely 3D shapes, thus suppressing false positive detections. Mask regression is based on 2D, 2.5D, and 3D ROIs using pseudo-lidar and image-based representations. These mask predictions are fused by a mask scoring process. However, public datasets only adopt stereo systems with a shorter baseline and focal length, which limits the measuring range of stereo cameras. We collect and utilize the High-Quality Driving Stereo (HQDS) dataset, using a much longer baseline and focal length with higher resolution. Our performance attains the state of the art. Please refer to our project page. The full paper is available here.

57. Hyper RPCA: Joint Maximum Correntropy Criterion and Laplacian Scale Mixture Modeling On-the-Fly for Moving Object Detection [PDF] 返回目录
  Zerui Shao, Yifei Pu, Jiliu Zhou, Bihan Wen, Yi Zhang
Abstract: Moving object detection is critical for automated video analysis in many vision-related tasks, such as surveillance tracking and video compression coding. Robust Principal Component Analysis (RPCA), as one of the most popular moving object modelling methods, aims to separate the temporally varying (i.e., moving) foreground objects from the static background in video, assuming the background frames to be low-rank while the foreground is spatially sparse. Classic RPCA imposes sparsity on the foreground component using the l1-norm, and minimizes the modeling error via the 2-norm. We show that such assumptions can be too restrictive in practice, which limits the effectiveness of classic RPCA, especially when processing videos with dynamic background, camera jitter, camouflaged moving objects, etc. In this paper, we propose a novel RPCA-based model, called Hyper RPCA, to detect moving objects on the fly. Different from classic RPCA, the proposed Hyper RPCA jointly applies the maximum correntropy criterion (MCC) for the modeling error, and a Laplacian scale mixture (LSM) model for foreground objects. Extensive experiments have been conducted, and the results demonstrate that the proposed Hyper RPCA has competitive performance for foreground detection compared to state-of-the-art algorithms on several well-known benchmark datasets.
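For context, the classic RPCA program that Hyper RPCA departs from, followed by the standard correntropy measure behind MCC (generic forms; the LSM prior's hyperparameters are paper-specific):

```latex
\[
  \min_{L,\,S}\; \|L\|_{*} + \lambda \|S\|_{1}
  \quad \text{s.t.} \quad M = L + S,
\]
% splitting the video matrix M into a low-rank background L and a
% sparse foreground S. Hyper RPCA replaces the fidelity term with the
% (maximized) correntropy of the residual e = M - L - S,
\[
  \hat{V}_{\sigma}(e) \;=\; \frac{1}{n} \sum_{i=1}^{n}
  \exp\!\big( -e_{i}^{2} / 2\sigma^{2} \big),
\]
% and the l1 foreground prior with a Laplacian scale mixture model.
```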

58. Generative 3D Part Assembly via Dynamic Graph Learning [PDF] 返回目录
  Jialei Huang, Guanqi Zhan, Qingnan Fan, Kaichun Mo, Lin Shao, Baoquan Chen, Leonidas Guibas, Hao Dong
Abstract: Autonomous part assembly is a challenging yet crucial task in 3D computer vision and robotics. Analogous to buying and assembling IKEA furniture, given a set of 3D parts that can assemble into a single shape, an intelligent agent needs to perceive the 3D part geometry, reason in order to propose pose estimates for the input parts, and finally call robotic planning and control routines for actuation. In this paper, we focus on the pose estimation subproblem from the vision side, involving geometric and relational reasoning over the input part geometry. Essentially, the task of generative 3D part assembly is to predict a 6-DoF part pose, including a rigid rotation and translation, for each input part, such that the parts assemble into a single 3D shape as the final output. To tackle this problem, we propose an assembly-oriented dynamic graph learning framework that leverages an iterative graph neural network as a backbone. It explicitly conducts sequential part assembly refinements in a coarse-to-fine manner, and exploits a pair of modules, a part relation reasoning module and a part aggregation module, for dynamically adjusting both part features and their relations in the part graph. We conduct extensive experiments and quantitative comparisons against three strong baseline methods, demonstrating the effectiveness of the proposed approach.

59. PrimA6D: Rotational Primitive Reconstruction for Enhanced and Robust 6D Pose Estimation [PDF] 返回目录
  MyungHwan Jeon, Ayoung Kim
Abstract: In this paper, we introduce a rotational-primitive-prediction-based 6D object pose estimation method that uses a single image as input. We solve for the 6D pose of a known object relative to the camera using a single image with occlusion. Many recent state-of-the-art (SOTA) two-step approaches have exploited image keypoint extraction followed by PnP regression for pose estimation. Instead of relying on bounding boxes or keypoints on the object, we propose to learn an orientation-induced primitive so as to achieve pose estimation accuracy regardless of the object size. We leverage a Variational AutoEncoder (VAE) to learn this underlying primitive and its associated keypoints. The keypoints inferred from the reconstructed primitive image are then used to regress the rotation using PnP. Lastly, we compute the translation in a separate localization module to complete the entire 6D pose estimation. When evaluated over public datasets, the proposed method yields a notable improvement on the LINEMOD, Occlusion LINEMOD, and YCB-Video datasets. We further provide a synthetic-only trained case presenting comparable performance to existing methods which require real images in the training phase.

60. Cascaded deep monocular 3D human pose estimation with evolutionary training data [PDF] 返回目录
  Shichao Li, Lei Ke, Kevin Pratama, Yu-Wing Tai, Chi-Keung Tang, Kwang-Ting Cheng
Abstract: End-to-end deep representation learning has achieved remarkable accuracy for monocular 3D human pose estimation, yet these models may fail for unseen poses when trained on limited and fixed data. This paper proposes a novel data augmentation method that: (1) is scalable for synthesizing a massive amount of training data (over 8 million valid 3D human poses with corresponding 2D projections) for training 2D-to-3D networks, and (2) can effectively reduce dataset bias. Our method evolves a limited dataset to synthesize unseen 3D human skeletons based on a hierarchical human representation and heuristics inspired by prior knowledge. Extensive experiments show that our approach not only achieves state-of-the-art accuracy on the largest public benchmark, but also generalizes significantly better to unseen and rare poses. Relevant files and tools are available at the project website.

61. Domain Adaptation and Image Classification via Deep Conditional Adaptation Network [PDF] 返回目录
  Pengfei Ge, Chuan-Xian Ren, Dao-Qing Dai, Hong Yan
Abstract: Unsupervised domain adaptation aims to generalize a supervised model trained on a source domain to an unlabeled target domain. Marginal distribution alignment of feature spaces is widely used to reduce the domain discrepancy between the source and target domains. However, it assumes that the source and target domains share the same label distribution, which limits its application scope. In this paper, we consider a more general application scenario where the label distributions of the source and target domains are not the same. In this scenario, marginal distribution alignment-based methods are vulnerable to negative transfer. To address this issue, we propose a novel unsupervised domain adaptation method, the Deep Conditional Adaptation Network (DCAN), based on conditional distribution alignment of feature spaces. To be specific, we reduce the domain discrepancy by minimizing the Conditional Maximum Mean Discrepancy between the conditional distributions of deep features on the source and target domains, and extract discriminant information from the target domain by maximizing the mutual information between samples and their prediction labels. In addition, DCAN can be used to address a special scenario, partial unsupervised domain adaptation, where the target domain categories are a subset of the source domain categories. Experiments on both unsupervised domain adaptation and partial unsupervised domain adaptation show that DCAN achieves superior classification performance over state-of-the-art methods. In particular, DCAN achieves great improvement in tasks with large differences in label distributions (6.1\% on SVHN to MNIST, 5.4\% in UDA tasks on Office-Home and 4.5\% in partial UDA tasks on Office-Home).
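For reference, the (unconditional) squared maximum mean discrepancy in its standard kernel form; DCAN applies a conditional refinement of this quantity between source and target feature distributions:

```latex
\[
  \mathrm{MMD}^{2}(p, q)
  \;=\; \mathbb{E}_{x, x' \sim p}\, k(x, x')
  \;-\; 2\,\mathbb{E}_{x \sim p,\, y \sim q}\, k(x, y)
  \;+\; \mathbb{E}_{y, y' \sim q}\, k(y, y'),
\]
% with k a characteristic kernel such as the Gaussian kernel, so that
% MMD vanishes if and only if the two distributions coincide.
```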

62. Recurrent Distillation based Crowd Counting [PDF] 返回目录
  Yue Gu, Wenxi Liu
Abstract: In recent years, with the progress of deep learning technologies, crowd counting has developed rapidly. In this work, we propose a simple yet effective crowd counting framework that achieves state-of-the-art performance on various crowded scenes. In particular, we first introduce a perspective-aware density map generation method that produces ground-truth density maps from point annotations; crowd counting models trained on these maps achieve superior performance compared with prior density map generation techniques. Besides, leveraging our density map generation method, we propose an iterative distillation algorithm to progressively enhance our model with identical network structures, without significantly sacrificing the dimension of the output density maps. In experiments, we demonstrate that, with our simple convolutional neural network architecture strengthened by our proposed training algorithm, our model is able to outperform or be comparable with the state-of-the-art methods. Furthermore, we also evaluate our density map generation approach and distillation algorithm in ablation studies.

63. 3D Reconstruction of Novel Object Shapes from Single Images [PDF] 返回目录
  Anh Thai, Stefan Stojanov, Vijay Upadhya, James M. Rehg
Abstract: The key challenge in single image 3D shape reconstruction is to ensure that deep models can generalize to shapes which were not part of the training set. This is difficult because the algorithm must infer the occluded portion of the surface by leveraging the shape characteristics of the training data, and can therefore be vulnerable to overfitting. Such generalization to unseen categories of objects is a function of architecture design and training approaches. This paper introduces SDFNet, a novel shape prediction architecture and training approach which supports effective generalization. We provide an extensive investigation of the factors which influence generalization accuracy and its measurement, ranging from the consistent use of 3D shape metrics to the choice of rendering approach and the large-scale evaluation on unseen shapes using ShapeNetCore.v2 and ABC. We show that SDFNet provides state-of-the-art performance on seen and unseen shapes relative to existing baseline methods GenRe and OccNet. We provide the first large-scale experimental evaluation of generalization performance. The codebase released with this article will allow for the consistent evaluation and comparison of methods for single image shape reconstruction.

64. Exploiting the ConvLSTM: Human Action Recognition using Raw Depth Video-Based Recurrent Neural Networks [PDF] 返回目录
  Adrian Sanchez-Caballero, David Fuentes-Jimenez, Cristina Losada-Gutiérrez
Abstract: As in many other fields, deep learning has become the main approach in most computer vision applications, such as scene understanding, object recognition, computer-human interaction or human action recognition (HAR). Research efforts within HAR have mainly focused on how to efficiently extract and process both spatial and temporal dependencies of video sequences. In this paper, we propose and compare two neural networks based on the convolutional long short-term memory unit, namely ConvLSTM, with differences in the architecture and the long-term learning strategy. The former uses a video-length adaptive input data generator (stateless) whereas the latter explores the stateful ability of general recurrent neural networks, applied here to the particular case of HAR. This stateful property allows the model to accumulate discriminative patterns from previous frames without compromising computer memory. Experimental results on the large-scale NTU RGB+D dataset show that the proposed models achieve competitive recognition accuracies with lower computational cost compared with state-of-the-art methods and prove that, in the particular case of videos, the rarely-used stateful mode of recurrent neural networks significantly improves the accuracy obtained with the standard mode. The recognition accuracies obtained are 75.26% (CS) and 75.45% (CV) for the stateless model, with an average time consumption per video of 0.21 s, and 80.43% (CS) and 79.91% (CV) with 0.89 s for the stateful version.
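The stateless/stateful distinction maps directly onto the stateful flag of Keras recurrent layers. A toy sketch (TensorFlow 2.x Keras; the layer sizes and the 60-class output are assumptions, not the paper's architecture):

    import tensorflow as tf

    # With stateful=True the cell state persists across successive calls,
    # so a long depth video can be fed chunk by chunk without truncating
    # temporal context; batch size must be fixed for stateful RNNs.
    model = tf.keras.Sequential([
        tf.keras.layers.ConvLSTM2D(
            32, kernel_size=3, padding="same", stateful=True,
            batch_input_shape=(1, 8, 64, 64, 1)),  # (batch, time, H, W, C)
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(60, activation="softmax"),  # e.g. NTU's 60 classes
    ])

    # for chunk in video_chunks:
    #     probs = model(chunk)     # states carry over between chunks
    # model.reset_states()         # clear states between different videos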

65. 3DFCNN: Real-Time Action Recognition using 3D Deep Neural Networks with Raw Depth Information [PDF] 返回目录
  Adrian Sanchez-Caballero, Sergio de López-Diz, David Fuentes-Jimenez, Cristina Losada-Gutiérrez, Marta Marrón-Romera, David Casillas-Perez, Mohammad Ibrahim Sarker
Abstract: Human action recognition is a fundamental task in artificial vision that has gained great importance in recent years due to its multiple applications in different areas, such as the study of human behavior, security or video surveillance. In this context, this paper describes an approach for real-time human action recognition from raw depth image-sequences provided by an RGB-D camera. The proposal is based on a 3D fully convolutional neural network, named 3DFCNN, which automatically encodes spatio-temporal patterns from depth sequences without any costly pre-processing. Furthermore, the described 3D-CNN allows automatic feature extraction and action classification from the spatial and temporal encoded information of depth sequences. The use of depth data ensures that action recognition is carried out while protecting people's privacy, since their identities cannot be recognized from these data. 3DFCNN has been evaluated and its results compared to those from other state-of-the-art methods within three widely used datasets with different characteristics (resolution, sensor type, number of views, camera location, etc.), including the large-scale NTU RGB+D dataset. The obtained results validate the proposal, concluding that it outperforms several state-of-the-art approaches based on classical computer vision techniques. Furthermore, it achieves action recognition accuracy comparable to deep learning based state-of-the-art methods with a lower computational cost, which allows its use in real-time applications.

66. Split-Merge Pooling [PDF] 返回目录
  Omid Hosseini Jafari, Carsten Rother
Abstract: There are a variety of approaches to obtain a vast receptive field with convolutional neural networks (CNNs), such as pooling or striding convolutions. Most of these approaches were initially designed for image classification and later adapted to dense prediction tasks, such as semantic segmentation. However, the major drawback of this adaptation is the loss of spatial information. Even the popular dilated convolution approach, which in theory is able to operate with full spatial resolution, needs to subsample features for large image sizes in order to make the training and inference tractable. In this work, we introduce Split-Merge pooling to fully preserve the spatial information without any subsampling. By applying Split-Merge pooling to deep networks, we achieve, at the same time, a very large receptive field. We evaluate our approach for dense semantic segmentation of large image sizes taken from the Cityscapes and GTA-5 datasets. We demonstrate that by replacing max-pooling and striding convolutions with our split-merge pooling, we are able to improve the accuracy of different variations of ResNet significantly.
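One way to read Split-Merge pooling is as a lossless rearrangement: each k x k neighbourhood is split into separate sub-images instead of being discarded. A PyTorch sketch of that interpretation (the paper's exact operator may differ):

    import torch

    def split_pool(x, k=2):
        # Move each k x k neighbourhood into the batch dimension, so the
        # resolution drops by k while every pixel is preserved.
        b, c, h, w = x.shape
        x = x.view(b, c, h // k, k, w // k, k)
        x = x.permute(0, 3, 5, 1, 2, 4).reshape(b * k * k, c, h // k, w // k)
        return x

    def merge_unpool(x, k=2):
        # Inverse: fold the k*k sub-images back to full resolution.
        bk, c, h, w = x.shape
        b = bk // (k * k)
        x = x.view(b, k, k, c, h, w).permute(0, 3, 4, 1, 5, 2)
        return x.reshape(b, c, h * k, w * k)

    x = torch.randn(1, 8, 64, 64)
    assert torch.equal(merge_unpool(split_pool(x)), x)  # lossless round trip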

67. V2E: From video frames to realistic DVS event camera streams [PDF] 返回目录
  Tobi Delbruck, Yuhuang Hu, Zhe He
Abstract: To help meet the increasing need for dynamic vision sensor (DVS) event camera data, we developed the v2e toolbox, which generates synthetic DVS event streams from intensity frame videos. Videos can be of any type, either real or synthetic. v2e optionally uses synthetic slow motion to upsample the video frame rate and then generates DVS events from these frames using a realistic pixel model that includes event threshold mismatch, finite illumination-dependent bandwidth, and several types of noise. v2e includes an algorithm that determines the DVS thresholds and bandwidth so that the synthetic event stream statistics match a given reference DVS recording. v2e is the first toolbox that can synthesize realistic low light DVS data. This paper also clarifies misleading claims about DVS characteristics in some of the computer vision literature. The v2e website is this https URL and code is hosted at this https URL.
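The core mechanism of DVS simulation, thresholded log-intensity changes, can be sketched in a few lines of NumPy; this toy version omits v2e's slow-motion upsampling, per-pixel threshold mismatch, bandwidth and noise models:

    import numpy as np

    def frames_to_events(frames, timestamps, threshold=0.2):
        # Compare log intensity against a per-pixel reference and emit
        # +/- events on threshold crossings, in the spirit of v2e.
        log_f = np.log(frames.astype(np.float64) + 1e-3)
        ref = log_f[0].copy()
        events = []                               # (t, x, y, polarity)
        for i in range(1, len(log_f)):
            diff = log_f[i] - ref
            for polarity in (1, -1):
                ys, xs = np.where(polarity * diff >= threshold)
                events.extend((timestamps[i], x, y, polarity)
                              for y, x in zip(ys, xs))
                ref[ys, xs] = log_f[i, ys, xs]    # reset fired pixels
        return events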

68. Sensorless Freehand 3D Ultrasound Reconstruction via Deep Contextual Learning [PDF] 返回目录
  Hengtao Guo, Sheng Xu, Bradford Wood, Pingkun Yan
Abstract: Transrectal ultrasound (US) is the most commonly used imaging modality to guide prostate biopsy and its 3D volume provides even richer context information. Current methods for 3D volume reconstruction from freehand US scans require external tracking devices to provide spatial position for every frame. In this paper, we propose a deep contextual learning network (DCL-Net), which can efficiently exploit the image feature relationship between US frames and reconstruct 3D US volumes without any tracking device. The proposed DCL-Net utilizes 3D convolutions over a US video segment for feature extraction. An embedded self-attention module makes the network focus on the speckle-rich areas for better spatial movement prediction. We also propose a novel case-wise correlation loss to stabilize the training process for improved accuracy. Highly promising results have been obtained by using the developed method. The experiments with ablation studies demonstrate superior performance of the proposed method by comparing against other state-of-the-art methods. Source code of this work is publicly available at this https URL.

69. Uncertainty-aware Score Distribution Learning for Action Quality Assessment [PDF] 返回目录
  Yansong Tang, Zanlin Ni, Jiahuan Zhou, Danyang Zhang, Jiwen Lu, Ying Wu, Jie Zhou
Abstract: Assessing action quality from videos has attracted growing attention in recent years. Most existing approaches usually tackle this problem based on regression algorithms, which ignore the intrinsic ambiguity in the score labels caused by multiple judges or their subjective appraisals. To address this issue, we propose an uncertainty-aware score distribution learning (USDL) approach for action quality assessment (AQA). Specifically, we regard an action as an instance associated with a score distribution, which describes the probability of different evaluated scores. Moreover, under the circumstance where fine-grained score labels are available (e.g., difficulty degree of an action or multiple scores from different judges), we further devise a multi-path uncertainty-aware score distributions learning (MUSDL) method to explore the disentangled components of a score. We conduct experiments on three AQA datasets containing various Olympic actions and surgical activities, where our approaches set new state-of-the-arts under the Spearman's Rank Correlation.
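The idea of replacing a scalar regression target with a score distribution can be sketched as follows; the score range, bin count and sigma are assumptions, and the paper's multi-path MUSDL variant is not shown:

    import torch
    import torch.nn.functional as F

    BINS = torch.linspace(0, 100, 101)   # assumed discretised score range

    def score_to_distribution(score, sigma=5.0):
        # Soft label: a Gaussian over score bins instead of one scalar,
        # modelling the ambiguity among judges.
        p = torch.exp(-(BINS - score) ** 2 / (2 * sigma ** 2))
        return p / p.sum()

    def usdl_loss(pred_logits, gt_score):
        # KL divergence between predicted and target score distributions.
        return F.kl_div(F.log_softmax(pred_logits, dim=-1),
                        score_to_distribution(gt_score), reduction="sum")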

70. Convolutional Generation of Textured 3D Meshes [PDF] 返回目录
  Dario Pavllo, Graham Spinks, Thomas Hofmann, Marie-Francine Moens, Aurelien Lucchi
Abstract: Recent generative models for 2D images achieve impressive visual results, but clearly lack the ability to perform 3D reasoning. This heavily restricts the degree of control over generated objects as well as the possible applications of such models. In this work, we leverage recent advances in differentiable rendering to design a framework that can generate triangle meshes and associated high-resolution texture maps, using only 2D supervision from single-view natural images. A key contribution of our work is the encoding of the mesh and texture as 2D representations, which are semantically aligned and can be easily modeled by a 2D convolutional GAN. We demonstrate the efficacy of our method on Pascal3D+ Cars and the CUB birds dataset, both in an unconditional setting and in settings where the model is conditioned on class labels, attributes, and text. Finally, we propose an evaluation methodology that assesses the mesh and texture quality separately.

71. DeepRhythm: Exposing DeepFakes with Attentional Visual Heartbeat Rhythms [PDF] 返回目录
  Hua Qi, Qing Guo, Felix Juefei-Xu, Xiaofei Xie, Lei Ma, Wei Feng, Yang Liu, Jianjun Zhao
Abstract: As the GAN-based face image and video generation techniques, widely known as DeepFakes, have become more and more matured and realistic, the need for an effective DeepFakes detector has become imperative. Motivated by the fact that remote visual photoplethysmography (PPG) is made possible by monitoring the minuscule periodic changes of skin color due to blood pumping through the face, we conjecture that normal heartbeat rhythms found in real face videos will be diminished or even disrupted entirely in a DeepFake video, making it a powerful indicator for detecting DeepFakes. In this work, we show that our conjecture holds true and the proposed method indeed can very effectively expose DeepFakes by monitoring the heartbeat rhythms, which is termed as DeepRhythm. DeepRhythm utilizes dual-spatial-temporal attention to adapt to dynamically changing faces and fake types. Extensive experiments on FaceForensics++ and DFDC-preview datasets have demonstrated not only the effectiveness of our proposed method, but also how it can generalize over different datasets with various DeepFakes generation techniques and multifarious challenging degradations.

72. Equivariant Neural Rendering [PDF] 返回目录
  Emilien Dupont, Miguel Angel Bautista, Alex Colburn, Aditya Sankar, Carlos Guestrin, Josh Susskind, Qi Shan
Abstract: We propose a framework for learning neural scene representations directly from images, without 3D supervision. Our key insight is that 3D structure can be imposed by ensuring that the learned representation transforms like a real 3D scene. Specifically, we introduce a loss which enforces equivariance of the scene representation with respect to 3D transformations. Our formulation allows us to infer and render scenes in real time while achieving comparable results to models requiring minutes for inference. In addition, we introduce two challenging new datasets for scene representation and neural rendering, including scenes with complex lighting and backgrounds. Through experiments, we show that our model achieves compelling results on these datasets as well as on standard ShapeNet benchmarks.

73. DTG-Net: Differentiated Teachers Guided Self-Supervised Video Action Recognition [PDF] 返回目录
  Ziming Liu, Guangyu Gao, A. K. Qin, Jinyang Li
Abstract: State-of-the-art video action recognition models with complex network architectures have achieved significant improvements, but these models heavily depend on large-scale well-labeled datasets. To reduce such dependency, we propose a self-supervised teacher-student architecture, i.e., the Differentiated Teachers Guided self-supervised Network (DTG-Net). In DTG-Net, except for reducing labeled data dependency by self-supervised learning (SSL), pre-trained action related models are used as teacher guidance providing prior knowledge to alleviate the demand for a large number of unlabeled videos in SSL. Specifically, leveraging the years of effort in action-related tasks, e.g., image classification, image-based action recognition, the DTG-Net learns the self-supervised video representation under various teacher guidance, i.e., those well-trained models of action-related tasks. Meanwhile, the DTG-Net is optimized in the way of contrastive self-supervised learning. When two image sequences are randomly sampled from the same video or different videos as the positive or negative pairs, respectively, they are then sent to the teacher and student networks for feature embedding. After that, the contrastive feature consistency is defined between features embedding of each pair, i.e., consistent for positive pairs and inconsistent for negative pairs. Meanwhile, to reflect various teacher tasks' different guidance, we also explore different weighted guidance on teacher tasks. Finally, the DTG-Net is evaluated in two ways: (i) the self-supervised DTG-Net to pre-train the supervised action recognition models with only unlabeled videos; (ii) the supervised DTG-Net to be jointly trained with the supervised action networks in an end-to-end way. Its performance is better than most pre-training methods and is also highly competitive with supervised action recognition methods.
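A contrastive teacher-student objective of the kind described can be sketched with a standard InfoNCE loss; this is a stand-in for DTG-Net's exact formulation, which additionally weights several teachers:

    import torch
    import torch.nn.functional as F

    def teacher_guided_nce(student_emb, teacher_emb, temperature=0.07):
        # The i-th student clip embedding should match the i-th teacher
        # embedding (positive) against the other clips in the batch
        # (negatives).
        s = F.normalize(student_emb, dim=1)
        t = F.normalize(teacher_emb, dim=1)
        logits = s @ t.T / temperature
        labels = torch.arange(len(s), device=s.device)
        return F.cross_entropy(logits, labels)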

74. HRDNet: High-resolution Detection Network for Small Objects [PDF] 返回目录
  Ziming Liu, Guangyu Gao, Lin Sun, Zhiyuan Fang
Abstract: Small object detection is challenging because small objects do not contain detailed information and may even disappear in the deep network. Usually, feeding high-resolution images into a network can alleviate this issue. However, simply enlarging the resolution will cause more problems; for example, it aggravates the large variance in object scale and introduces an unbearable computation cost. To keep the benefits of high-resolution images without bringing up new problems, we proposed the High-Resolution Detection Network (HRDNet). HRDNet takes multiple resolution inputs using multi-depth backbones. To fully take advantage of multiple features, we proposed the Multi-Depth Image Pyramid Network (MD-IPN) and Multi-Scale Feature Pyramid Network (MS-FPN) in HRDNet. MD-IPN maintains multiple position information using multiple depth backbones. Specifically, high-resolution input will be fed into a shallow network to reserve more positional information and reduce the computational cost, while low-resolution input will be fed into a deep network to extract more semantics. By extracting various features from high to low resolutions, the MD-IPN is able to improve the performance of small object detection as well as maintaining the performance of middle and large objects. MS-FPN is proposed to align and fuse multi-scale feature groups generated by MD-IPN to reduce the information imbalance between these multi-scale multi-level features. Extensive experiments and ablation studies are conducted on the standard benchmark datasets MS COCO2017, Pascal VOC2007/2012 and a typical small object dataset, VisDrone 2019. Notably, our proposed HRDNet achieves state-of-the-art results on these datasets and performs particularly well on small objects.

75. Faces à la Carte: Text-to-Face Generation via Attribute Disentanglement [PDF] 返回目录
  Tianren Wang, Teng Zhang, Brian Lovell
Abstract: Text-to-Face (TTF) synthesis is a challenging task with great potential for diverse computer vision applications. Compared to Text-to-Image (TTI) synthesis tasks, the textual description of faces can be much more complicated and detailed due to the variety of facial attributes and the parsing of high dimensional abstract natural language. In this paper, we propose a Text-to-Face model that not only produces images in high resolution (1024x1024) with text-to-image consistency, but also outputs multiple diverse faces to cover a wide range of unspecified facial features in a natural way. By fine-tuning the multi-label classifier and image encoder, our model obtains the vectors and image embeddings which are used to transform the input noise vector sampled from the normal distribution. Afterwards, the transformed noise vector is fed into a pre-trained high-resolution image generator to produce a set of faces with the desired facial attributes. We refer to our model as TTF-HD. Experimental results show that TTF-HD generates high-quality faces with state-of-the-art performance.

76. Dynamic gesture retrieval: searching videos by human pose sequence [PDF] 返回目录
  Cheng Zhang
Abstract: The number of static human poses is limited, so it is hard to retrieve a specific video using a single pose as the clue. However, with a pose sequence or a dynamic gesture as the keyword, retrieving specific videos becomes more feasible. We propose a novel method for querying videos containing a designated sequence of human poses, whereas previous works only designate a single static pose. The proposed method takes continuous 3D human poses from the keyword gesture video and video candidates, then converts each pose in individual frames into bone direction descriptors, which describe the direction of each natural connection in the articulated pose. A temporal pyramid sliding window is then applied to find matches between the designated gesture and video candidates, which ensures that the same gesture with different durations can be matched.
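A rough sketch of the two ingredients, bone-direction descriptors and a temporal-pyramid sliding window, assuming hypothetical joint connectivity and (T, J, 3) pose arrays:

    import numpy as np

    BONES = [(0, 1), (1, 2), (2, 3)]  # hypothetical joint connectivity

    def bone_directions(pose):
        # pose: (J, 3) joints -> unit direction vector per natural bone.
        vecs = np.stack([pose[b] - pose[a] for a, b in BONES])
        return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8)

    def match_score(query_poses, video_poses, scales=(1.0, 1.5, 2.0)):
        # Resample the query gesture to several durations and keep the best
        # average cosine similarity over all windows, so the same gesture
        # matches regardless of speed.
        q = np.stack([bone_directions(p) for p in query_poses])
        v = np.stack([bone_directions(p) for p in video_poses])
        best = -np.inf
        for s in scales:
            n = max(2, int(len(q) * s))
            idx = np.linspace(0, len(q) - 1, n).round().astype(int)
            qs = q[idx].reshape(n, -1)
            for start in range(len(v) - n + 1):
                w = v[start:start + n].reshape(n, -1)
                cos = np.sum(qs * w, axis=1) / (
                    np.linalg.norm(qs, axis=1) * np.linalg.norm(w, axis=1) + 1e-8)
                best = max(best, cos.mean())
        return best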

77. NoPeopleAllowed: The Three-Step Approach to Weakly Supervised Semantic Segmentation [PDF] 返回目录
  Mariia Dobko, Ostap Viniavskyi, Oles Dobosevych
Abstract: We propose a novel approach to weakly supervised semantic segmentation, which consists of three consecutive steps. The first two steps extract high-quality pseudo masks from image-level annotated data, which are then used to train a segmentation model on the third step. The presented approach also addresses two problems in the data: class imbalance and missing labels. Using only image-level annotations as supervision, our method is capable of segmenting various classes and complex objects. It achieves 37.34 mean IoU on the test set, placing 3rd at the LID Challenge in the task of weakly supervised semantic segmentation.

78. Attribute-aware Identity-hard Triplet Loss for Video-based Person Re-identification [PDF] 返回目录
  Zhiyuan Chen, Annan Li, Shilu Jiang, Yunhong Wang
Abstract: Video-based person re-identification (Re-ID) is an important computer vision task. The batch-hard triplet loss frequently used in video-based person Re-ID suffers from the Distance Variance among Different Positives (DVDP) problem. In this paper, we address this issue by introducing a new metric learning method called Attribute-aware Identity-hard Triplet Loss (AITL), which reduces the intra-class variation among positive samples via calculating attribute distance. To achieve a complete model of video-based person Re-ID, a multi-task framework with an Attribute-driven Spatio-Temporal Attention (ASTA) mechanism is also proposed. Extensive experiments on the MARS and DukeMTMC-VID datasets show that both the AITL and ASTA are very effective. Enhanced by them, even a simple lightweight video-based person Re-ID baseline can outperform existing state-of-the-art approaches. The code has been published at this https URL.
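One plausible reading of an attribute-aware triplet objective, sketched in PyTorch; the weighting scheme here is an illustrative guess, not necessarily AITL's exact formula:

    import torch
    import torch.nn.functional as F

    def aitl_loss(anchor, positive, negative, attr_dist, margin=0.3, alpha=0.5):
        # Positives whose attributes differ strongly from the anchor
        # (attr_dist in [0, 1]) are pulled harder, shrinking intra-class
        # variation and easing the DVDP issue.
        d_ap = F.pairwise_distance(anchor, positive)
        d_an = F.pairwise_distance(anchor, negative)
        weighted_ap = d_ap * (1.0 + alpha * attr_dist)
        return F.relu(weighted_ap - d_an + margin).mean()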

79. Semantic-driven Colorization [PDF] 返回目录
  Man M. Ho, Lu Zhang, TU Ilmenau, Jinjia Zhou
Abstract: Recent deep colorization works predict semantic information implicitly while learning to colorize black-and-white photographic images. As a consequence, the generated colors tend to overflow object boundaries, and the resulting semantic faults are hard to detect. When humans color an image, they first recognize the objects and their locations in the photo, imagine which colors are plausible for them in real life, and then colorize the image. In this study, we simulate that human-like behavior: we first let our network learn to segment what is in the photo, then colorize it. Therefore, our network can choose a plausible color under semantic constraints for specific objects, and assign discriminative colors between them. Moreover, the segmentation map becomes understandable and interactable for the user. Our models are trained on PASCAL-Context and evaluated on selected images from the public domain and COCO-Stuff, which contains several categories unseen in the training data. As the experimental results show, our colorization system can provide plausible colors for specific objects and generate harmonious colors competitive with state-of-the-art methods.

80. Learning from the Scene and Borrowing from the Rich: Tackling the Long Tail in Scene Graph Generation [PDF] 返回目录
  Tao He, Lianli Gao, Jingkuan Song, Jianfei Cai, Yuan-Fang Li
Abstract: Despite the huge progress in scene graph generation in recent years, the long-tail distribution of object relationships remains a challenging and persistent issue. Existing methods largely rely on either external knowledge or statistical bias information to alleviate this problem. In this paper, we tackle this issue from another two aspects: (1) scene-object interaction, aiming at learning specific knowledge from a scene via an additive attention mechanism; and (2) long-tail knowledge transfer, which tries to transfer the rich knowledge learned from the head of the distribution into its tail. Extensive experiments on the benchmark dataset Visual Genome on three tasks demonstrate that our method outperforms current state-of-the-art competitors.

81. Mitigating Face Recognition Bias via Group Adaptive Classifier [PDF] 返回目录
  Sixue Gong, Xiaoming Liu, Anil K. Jain
Abstract: Face recognition is known to exhibit bias: subjects in certain demographic groups can be better recognized than other groups. This work aims to learn a fair face representation, where faces of every group could be equally well-represented. Our proposed group adaptive classifier, GAC, learns to mitigate bias by using adaptive convolution kernels and attention mechanisms on faces based on their demographic attributes. The adaptive module comprises kernel masks and channel-wise attention maps for each demographic group so as to activate different facial regions for identification, leading to more discriminative features pertinent to their demographics. We also introduce an automated adaptation strategy which determines whether to apply adaptation to a certain layer by iteratively computing the dissimilarity among demographic-adaptive parameters, thereby increasing the efficiency of the adaptation learning. Experiments on benchmark face datasets (RFW, LFW, IJB-A, and IJB-C) show that our framework is able to mitigate face recognition bias on various demographic groups as well as maintain competitive performance.

82. Unbiased Auxiliary Classifier GANs with MINE [PDF] 返回目录
  Ligong Han, Anastasis Stathopoulos, Tao Xue, Dimitris Metaxas
Abstract: Auxiliary Classifier GANs (AC-GANs) are widely used conditional generative models and are capable of generating high-quality images. Previous work has pointed out that AC-GAN learns a biased distribution. To remedy this, Twin Auxiliary Classifier GAN (TAC-GAN) introduces a twin classifier to the min-max game. However, it has been reported that using a twin auxiliary classifier may cause instability in training. To this end, we propose an Unbiased Auxiliary GANs (UAC-GAN) that utilizes the Mutual Information Neural Estimator (MINE) to estimate the mutual information between the generated data distribution and labels. To further improve the performance, we also propose a novel projection-based statistics network architecture for MINE. Experimental results on three datasets, including Mixture of Gaussian (MoG), MNIST and CIFAR10 datasets, show that our UAC-GAN performs better than AC-GAN and TAC-GAN. Code can be found on the project website.
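MINE estimates mutual information with the Donsker-Varadhan lower bound; a minimal sketch, where stat_net is any small network mapping an (x, y) pair to a scalar:

    import torch

    def mine_lower_bound(stat_net, x, y):
        # Donsker-Varadhan bound used by MINE:
        #   I(X;Y) >= E_joint[T(x, y)] - log E_marginal[exp(T(x, y'))]
        # where y' is shuffled to break the pairing. Training maximises
        # this bound; UAC-GAN uses such an estimate between generated
        # images and their labels to debias the generator.
        joint = stat_net(x, y).mean()
        y_shuf = y[torch.randperm(len(y))]
        marginal = torch.exp(stat_net(x, y_shuf)).mean()
        return joint - torch.log(marginal)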

83. Accurate Anchor Free Tracking [PDF] 返回目录
  Shengyun Peng, Yunxuan Yu, Kun Wang, Lei He
Abstract: Visual object tracking is an important application of computer vision. Recently, Siamese-based trackers have achieved good accuracy. However, most Siamese-based trackers are not efficient, as they exhaustively search potential object locations to define anchors and then classify each anchor (i.e., a bounding box). This paper develops the first Anchor Free Siamese Network (AFSN). Specifically, a target object is defined by a bounding box center, tracking offset, and object size. All three are regressed by the Siamese network with no additional classification or regional proposal, and performed once for each frame. We also tune the stride and receptive field for the Siamese network, and further perform ablation experiments to quantitatively illustrate the effectiveness of our AFSN. We evaluate AFSN using the five most commonly used benchmarks and compare to the best anchor-based trackers with source codes available for each benchmark. AFSN is 3-425 times faster than these best anchor-based trackers. AFSN is also 5.97% to 12.4% more accurate in terms of all metrics for the benchmark sets OTB2015, VOT2015, VOT2016, VOT2018 and TrackingNet, except that SiamRPN++ is 4% better than AFSN in terms of Expected Average Overlap (EAO) on VOT2018 (but SiamRPN++ is 3.9 times slower).

84. GAN Memory with No Forgetting [PDF] 返回目录
  Yulai Cong, Miaoyun Zhao, Jianqiao Li, Sijia Wang, Lawrence Carin
Abstract: Seeking to address the fundamental issue of memory in lifelong learning, we propose a GAN memory that is capable of realistically remembering a stream of generative processes with no forgetting. Our GAN memory is based on recognizing that one can modulate the "style" of a GAN model to form perceptually-distant targeted generation. Accordingly, we propose to do sequential style modulations atop a well-behaved base GAN model, to form sequential targeted generative models, while simultaneously benefiting from the transferred base knowledge. Experiments demonstrate the superiority of our method over existing approaches and its effectiveness in alleviating catastrophic forgetting for lifelong classification problems.

85. FakePolisher: Making DeepFakes More Detection-Evasive by Shallow Reconstruction [PDF] 返回目录
  Yihao Huang, Felix Juefei-Xu, Run Wang, Qing Guo, Lei Ma, Xiaofei Xie, Jianwen Li, Weikai Miao, Yang Liu, Geguang Pu
Abstract: The recent rapid advances of generative adversarial networks (GANs) in synthesizing realistic and natural DeepFake information (e.g., images, video) cause severe concerns and threats to our society. At this moment, GAN-based image generation methods are still imperfect, whose upsampling design has limitations in leaving some certain artifact patterns in the synthesized image. Such artifact patterns can be easily exploited (by recent methods) for difference detection of real and GAN-synthesized images. To reduce the artifacts in the synthesized images, deep reconstruction techniques are usually futile because the process itself can leave traces of artifacts. In this paper, we devise a simple yet powerful approach termed FakePolisher that performs shallow reconstruction of fake images through a learned linear dictionary, intending to effectively and efficiently reduce the artifacts introduced during image synthesis. A comprehensive evaluation on 3 state-of-the-art DeepFake detection methods and fake images generated by 16 popular GAN-based fake image generation techniques demonstrates the effectiveness of our technique.
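The shallow-reconstruction idea can be imitated with any linear basis learned from real images; a sketch using PCA as a stand-in for the paper's learned dictionary, with random arrays as placeholders for data:

    import numpy as np
    from sklearn.decomposition import PCA

    # Project a fake image onto a basis learned from real images; the
    # artifacts that the basis cannot represent are discarded.
    real_images = np.random.rand(500, 64 * 64)   # flattened real faces
    fake_image = np.random.rand(1, 64 * 64)      # flattened GAN output

    basis = PCA(n_components=100).fit(real_images)
    polished = basis.inverse_transform(basis.transform(fake_image))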

86. CBR-Net: Cascade Boundary Refinement Network for Action Detection: Submission to ActivityNet Challenge 2020 (Task 1) [PDF] 返回目录
  Xiang Wang, Baiteng Ma, Zhiwu Qing, Yongpeng Sang, Changxin Gao, Shiwei Zhang, Nong Sang
Abstract: In this report, we present our solution for the task of temporal action localization (detection) (task 1) in ActivityNet Challenge 2020. The purpose of this task is to temporally localize intervals where actions of interest occur and predict the action categories in a long untrimmed video. Our solution mainly includes three components: 1) feature encoding: we apply three kinds of backbones, including TSN [7], SlowFast [3] and I3D [1], all pretrained on the Kinetics dataset [2]. Applying these models, we can extract snippet-level video representations; 2) proposal generation: we choose BMN [5] as our baseline, based on which we design a Cascade Boundary Refinement Network (CBR-Net) to conduct proposal detection. The CBR-Net mainly contains two modules: temporal feature encoding, which applies a BiLSTM to encode long-term temporal information; and the CBR module, which aims to refine the proposal precision under different parameter settings; 3) action localization: in this stage, we combine the video-level classification results obtained by the fine-tuned networks to predict the category of each proposal. Moreover, we also apply different ensemble strategies to improve the performance of the designed solution, by which we achieve 42.788% on the testing set of the ActivityNet v1.3 dataset in terms of mean Average Precision.

87. Self-Supervised Discovery of Anatomical Shape Landmarks [PDF] 返回目录
  Riddhish Bhalodia, Ladislav Kavan, Ross Whitaker
Abstract: Statistical shape analysis is a very useful tool in a wide range of medical and biological applications. However, it typically relies on the ability to produce a relatively small number of features that can capture the relevant variability in a population. State-of-the-art methods for obtaining such anatomical features rely on either extensive preprocessing or segmentation and/or significant tuning and post-processing. These shortcomings limit the widespread use of shape statistics. We propose that effective shape representations should provide sufficient information to align/register images. Using this assumption we propose a self-supervised, neural network approach for automatically positioning and detecting landmarks in images that can be used for subsequent analysis. The network discovers the landmarks corresponding to anatomical shape features that promote good image registration in the context of a particular class of transformations. In addition, we also propose a regularization for the proposed network which allows for a uniform distribution of these discovered landmarks. In this paper, we present a complete framework, which only takes a set of input images and produces landmarks that are immediately usable for statistical shape analysis. We evaluate the performance on a phantom dataset as well as 2D and 3D images.

88. Temporal Fusion Network for Temporal Action Localization:Submission to ActivityNet Challenge 2020 (Task E) [PDF] 返回目录
  Zhiwu Qing, Xiang Wang, Yongpeng Sang, Changxin Gao, Shiwei Zhang, Nong Sang
Abstract: This technical report analyzes a temporal action localization method we used in the HACS competition, which is hosted in the ActivityNet Challenge 2020. The goal of our task is to locate the start time and end time of the action in an untrimmed video, and to predict the action category. Firstly, we utilize video-level feature information to train multiple video-level action classification models. In this way, we can get the categories of actions in the video. Secondly, we focus on generating high quality temporal proposals. For this purpose, we apply BMN to generate a large number of proposals to obtain high recall rates. We then refine these proposals by employing a cascade structure network called Refine Network, which can predict position offsets and new IoU under the supervision of ground truth. To make the proposals more accurate, we use bidirectional LSTM, Nonlocal and Transformer to capture temporal relationships between local features of each proposal and global features of the video data. Finally, by fusing the results of multiple models, our method obtains 40.55% on the validation set and 40.53% on the test set in terms of mAP, and achieves Rank 1 in this challenge.

89. Weakly-supervised Any-shot Object Detection [PDF] 返回目录
  Siddhesh Khandelwal, Raghav Goyal, Leonid Sigal
Abstract: Methods for object detection and segmentation rely on large scale instance-level annotations for training, which are difficult and time-consuming to collect. Efforts to alleviate this look at varying degrees and quality of supervision. Weakly-supervised approaches draw on image-level labels to build detectors/segmentors, while zero/few-shot methods assume abundant instance-level data for a set of base classes, and none to a few examples for novel classes. This taxonomy has largely siloed algorithmic designs. In this work, we aim to bridge this divide by proposing an intuitive weakly-supervised model that is applicable to a range of supervision: from zero to a few instance-level samples per novel class. For base classes, our model learns a mapping from weakly-supervised to fully-supervised detectors/segmentors. By learning and leveraging visual and lingual similarities between the novel and base classes, we transfer those mappings to obtain detectors/segmentors for novel classes; refining them with a few novel class instance-level annotated samples, if available. The overall model is end-to-end trainable and highly flexible. Through extensive experiments on MS-COCO and Pascal VOC benchmark datasets we show improved performance in a variety of settings.

90. Multi-Modal Fingerprint Presentation Attack Detection: Evaluation On A New Dataset [PDF] 返回目录
  Leonidas Spinoulas, Hengameh Mirzaalian, Mohamed Hussein, Wael AbdAlmageed
Abstract: Fingerprint presentation attack detection is becoming an increasingly challenging problem due to the continuous advancement of attack preparation techniques, which generate realistic-looking fake fingerprint presentations. In this work, rather than relying on legacy fingerprint images, which are widely used in the community, we study the usefulness of multiple recently introduced sensing modalities. Our study covers front-illumination imaging using short-wave-infrared, near-infrared, and laser illumination; and back-illumination imaging using near-infrared light. Toward studying the effectiveness of our data, we conducted a comprehensive analysis using a fully convolutional deep neural network framework. We performed our evaluations on two large datasets collected at different sites. For examining the effects of changing the training and testing sets, a 3-fold cross-validation evaluation was performed. Moreover, the effect of the presence of unseen attacks is studied by using a leave-one-out cross-validation evaluation over different attributes of the utilized presentation attack instruments. To assess the effectiveness of the studied sensing modalities compared to legacy data, we applied the same classification framework on legacy images collected for the same participants in one of our collection sites and achieved improved performance. Furthermore, the proposed classification framework was applied on the LivDet2015 dataset and outperformed existing state-of-the-art algorithms. This indicates that the power of our approach stems from the nature of the captured data rather than just the employed classification framework. Therefore, the extra cost for hardware-based (or hybrid) solutions is justifiable by its superior performance, especially for high security applications. One of the dataset collections used in this study will be publicly released upon the acceptance of this manuscript.

91. OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold [PDF] 返回目录
  Mohamed Yousef, Tom E. Bishop
Abstract: Text recognition is a major computer vision task with a big set of associated challenges. One of those traditional challenges is the coupled nature of text recognition and segmentation. This problem has been progressively solved over the past decades, going from segmentation based recognition to segmentation free approaches, which proved more accurate and much cheaper to annotate data for. We take a step from segmentation-free single line recognition towards segmentation-free multi-line / full page recognition. We propose a novel and simple neural network module, termed OrigamiNet, that can augment any CTC-trained, fully convolutional single line text recognizer, to convert it into a multi-line version by providing the model with enough spatial capacity to be able to properly collapse a 2D input signal into 1D without losing information. Such modified networks can be trained using exactly their same simple original procedure, and using only unsegmented image and text pairs. We carry out a set of interpretability experiments that show that our trained models learn an accurate implicit line segmentation. We achieve state-of-the-art character error rate on both IAM & ICDAR 2017 HTR benchmarks for handwriting recognition, surpassing all other methods in the literature. On IAM we even surpass single line methods that use accurate localization information during training. Our code is available online at this https URL.
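The "unfolding" can be thought of as reshaping a 2D page feature map into one long 1D line that a standard CTC head can read. A PyTorch sketch of the shape change only; OrigamiNet learns this mapping with convolutional and resampling stages rather than plain interpolation:

    import torch
    import torch.nn.functional as F

    def origami_unfold(feat, out_width=2048):
        # Squash a (B, C, H, W) page feature map into a single height-1
        # "line" so a standard CTC head can decode the whole page as one
        # character sequence.
        line = F.interpolate(feat, size=(1, out_width), mode="bilinear",
                             align_corners=False)
        return line.squeeze(2).permute(0, 2, 1)  # (B, out_width, C) for CTC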

92. Multispectral Biometrics System Framework: Application to Presentation Attack Detection [PDF] 返回目录
  Leonidas Spinoulas, Mohamed Hussein, David Geissbühler, Joe Mathai, Oswin G.Almeida, Guillaume Clivaz, Sébastien Marcel, Wael AbdAlmageed
Abstract: In this work, we present a general framework for building a biometrics system capable of capturing multispectral data from a series of sensors synchronized with active illumination sources. The framework unifies the system design for different biometric modalities and its realization on face, finger and iris data is described in detail. To the best of our knowledge, the presented design is the first to employ such a diverse set of electromagnetic spectrum bands, ranging from visible to long-wave-infrared wavelengths, and is capable of acquiring large volumes of data in seconds. Having performed a series of data collections, we run a comprehensive analysis on the captured data using a deep-learning classifier for presentation attack detection. Our study follows a data-centric approach attempting to highlight the strengths and weaknesses of each spectral band at distinguishing live from fake samples.

93. Early Blindness Detection Based on Retinal Images Using Ensemble Learning [PDF] 返回目录
  Niloy Sikder, Md. Sanaullah Chowdhury, Abu Shamim Mohammad Arif, Abdullah-Al Nahid
Abstract: Diabetic retinopathy (DR) is the primary cause of vision loss among adults around the world. In four out of five cases, having diabetes for a prolonged period leads to DR. If detected early, more than 90 percent of new DR occurrences can be prevented from turning into blindness through proper treatment. Despite the availability of multiple treatment procedures that are well capable of dealing with DR, negligence and failure of early detection cost most DR patients their precious eyesight. The recent developments in the field of Digital Image Processing (DIP) and Machine Learning (ML) have paved the way to use machines in this regard. Contemporary technologies allow us to develop devices capable of automatically detecting the condition of a person's eyes based on their retinal images. However, in practice, several factors hinder the quality of the captured images and impede the detection outcome. In this study, a novel early blindness detection method is proposed based on the color information extracted from retinal images using an ensemble learning algorithm. The method has been tested on a set of retinal images collected from people living in the rural areas of South Asia, resulting in a 91 percent classification accuracy.
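An ensemble classifier over colour features is straightforward to set up with scikit-learn; a sketch with random stand-ins for the per-image retinal colour features (the paper's exact ensemble and features may differ):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder colour features, e.g. 16-bin histograms per RGB channel.
    X = np.random.rand(200, 48)
    y = np.random.randint(0, 2, 200)   # 0 = healthy, 1 = DR

    clf = VotingClassifier([
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("knn", KNeighborsClassifier()),
    ], voting="soft")
    clf.fit(X[:150], y[:150])
    print("held-out accuracy:", clf.score(X[150:], y[150:]))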

94. Learning-to-Learn Personalised Human Activity Recognition Models [PDF] 返回目录
  Anjana Wijekoon, Nirmalie Wiratunga
Abstract: Human Activity Recognition (HAR) is the classification of human movement, captured using one or more sensors either as wearables or embedded in the environment (e.g., depth cameras, pressure mats). State-of-the-art methods of HAR rely on having access to a considerable amount of labelled data to train deep architectures with many trainable parameters. This becomes prohibitive when tasked with creating models that are sensitive to personal nuances in human movement, explicitly present when performing exercises. In addition, it is not possible to collect training data to cover all possible subjects in the target population. Accordingly, learning personalised models with few data remains an interesting challenge for HAR research. We present a meta-learning methodology for learning to learn personalised HAR models, with the expectation that the end-user need only provide a few labelled examples but can benefit from the rapid adaptation of a generic meta-model. We introduce two algorithms, Personalised MAML and Personalised Relation Networks, inspired by existing meta-learning algorithms but optimised for learning HAR models that are adaptable to any person in health and well-being applications. A comparative study shows significant performance improvements against state-of-the-art deep learning algorithms and few-shot meta-learning algorithms in multiple HAR domains.
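The inner/outer structure of a MAML-style personalisation step can be sketched as follows (requires torch >= 2.0 for torch.func.functional_call; the HAR model, loss and data are placeholders, not the paper's exact algorithm):

    import torch

    def maml_personalisation_step(model, loss_fn, support, query, inner_lr=0.01):
        # Adapt on a user's few labelled examples (support set), then
        # evaluate on their held-out data (query set). Backpropagating the
        # returned meta-loss trains the shared initialisation.
        (xs, ys), (xq, yq) = support, query
        fast = dict(model.named_parameters())
        inner_loss = loss_fn(torch.func.functional_call(model, fast, (xs,)), ys)
        grads = torch.autograd.grad(inner_loss, fast.values(), create_graph=True)
        fast = {n: p - inner_lr * g for (n, p), g in zip(fast.items(), grads)}
        return loss_fn(torch.func.functional_call(model, fast, (xq,)), yq)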

95. Defending against GAN-based Deepfake Attacks via Transformation-aware Adversarial Faces [PDF] 返回目录
  Chaofei Yang, Lei Ding, Yiran Chen, Hai Li
Abstract: Deepfake represents a category of face-swapping attacks that leverage machine learning models such as autoencoders or generative adversarial networks. Although the concept of the face-swapping is not new, its recent technical advances make fake content (e.g., images, videos) more realistic and imperceptible to Humans. Various detection techniques for Deepfake attacks have been explored. These methods, however, are passive measures against Deepfakes as they are mitigation strategies after the high-quality fake content is generated. More importantly, we would like to think ahead of the attackers with robust defenses. This work aims to take an offensive measure to impede the generation of high-quality fake images or videos. Specifically, we propose to use novel transformation-aware adversarially perturbed faces as a defense against GAN-based Deepfake attacks. Different from the naive adversarial faces, our proposed approach leverages differentiable random image transformations during the generation. We also propose to use an ensemble-based approach to enhance the defense robustness against GAN-based Deepfake variants under the black-box setting. We show that training a Deepfake model with adversarial faces can lead to a significant degradation in the quality of synthesized faces. This degradation is twofold. On the one hand, the quality of the synthesized faces is reduced with more visual artifacts such that the synthesized faces are more obviously fake or less convincing to human observers. On the other hand, the synthesized faces can easily be detected based on various metrics.
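A transformation-aware perturbation in this spirit can be sketched as PGD over randomly transformed inputs (an expectation-over-transformation flavour; encoder stands in for the attacker's face autoencoder and is assumed differentiable):

    import torch
    import torch.nn.functional as F

    def random_resize(x):
        # One simple differentiable transform; a real defence would sample
        # crops, rotations, compression, etc.
        s = float(torch.empty(1).uniform_(0.8, 1.2))
        y = F.interpolate(x, scale_factor=s, mode="bilinear",
                          align_corners=False)
        return F.interpolate(y, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)

    def adversarial_face(face, encoder, steps=10, eps=8 / 255, lr=2 / 255):
        # Push encoder features of randomly transformed copies of the face
        # away from the clean features, so the perturbation survives the
        # warps inside a face-swap pipeline.
        delta = torch.zeros_like(face, requires_grad=True)
        clean = encoder(face).detach()
        for _ in range(steps):
            loss = -torch.norm(encoder(random_resize(face + delta)) - clean)
            loss.backward()
            with torch.no_grad():
                delta -= lr * delta.grad.sign()
                delta.clamp_(-eps, eps)
                delta.grad.zero_()
        return (face + delta).clamp(0, 1)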

96. The DeepFake Detection Challenge Dataset [PDF] 返回目录
  Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, Cristian Canton Ferrer
Abstract: Deepfakes are a recent off-the-shelf manipulation technique that allows anyone to swap two identities in a single video. In addition to Deepfakes, a variety of GAN-based face swapping methods have also been published with accompanying code. To counter this emerging threat, we have constructed an extremely large face swap video dataset to enable the training of detection models, and organized the accompanying DeepFake Detection Challenge (DFDC) Kaggle competition. Importantly, all recorded subjects agreed to participate in and have their likenesses modified during the construction of the face-swapped dataset. The DFDC dataset is by far the largest currently and publicly available face swap video dataset, with over 100,000 total clips sourced from 3,426 paid actors, produced with several Deepfake, GAN-based, and non-learned methods. In addition to describing the methods used to construct the dataset, we provide a detailed analysis of the top submissions from the Kaggle contest. We show that although Deepfake detection is extremely difficult and still an unsolved problem, a Deepfake detection model trained only on the DFDC can generalize to real "in-the-wild" Deepfake videos, and such a model can be a valuable analysis tool when analyzing potentially Deepfaked videos. Training, validation and testing corpora can be downloaded from this http URL (URL to be updated).

97. Spectral DiffuserCam: lensless snapshot hyperspectral imaging with a spectral filter array [PDF] 返回目录
  Kristina Monakhova, Kyrollos Yanny, Neerja Aggarwal, Laura Waller
Abstract: Hyperspectral imaging is useful for applications ranging from medical diagnostics to crop monitoring; however, traditional scanning hyperspectral imagers are prohibitively slow and expensive for widespread adoption. Snapshot techniques exist but are often confined to bulky benchtop setups or have low spatio-spectral resolution. In this paper, we propose a novel, compact, and inexpensive computational camera for snapshot hyperspectral imaging. Our system consists of a repeated spectral filter array placed directly on the image sensor and a diffuser placed close to the sensor. Each point in the world maps to a unique pseudorandom pattern on the spectral filter array, which encodes multiplexed spatio-spectral information. A sparsity-constrained inverse problem solver then recovers the hyperspectral volume with good spatio-spectral resolution. By using a spectral filter array, our hyperspectral imaging framework is flexible and can be designed with contiguous or non-contiguous spectral filters that can be chosen for a given application. We provide theory for system design, demonstrate a prototype device, and present experimental results with high spatio-spectral resolution.
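In generic form (our notation; the paper specifies the exact forward model), the recovery step solves a sparsity-regularized nonnegative least-squares problem over the hyperspectral cube $\mathbf{x}$, where $\mathbf{b}$ is the sensor measurement, $\mathbf{A}$ composes the diffuser convolution with the spectral filter mask, and $\Psi$ is a sparsifying transform:

$$ \hat{\mathbf{x}} = \arg\min_{\mathbf{x} \ge 0} \; \tfrac{1}{2} \| \mathbf{b} - \mathbf{A}\mathbf{x} \|_2^2 + \tau \| \Psi \mathbf{x} \|_1 . $$

Proximal solvers such as FISTA or ADMM are the usual choice for problems of this shape.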

98. Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction [PDF] 返回目录
  Yaodong Yu, Kwan Ho Ryan Chan, Chong You, Chaobing Song, Yi Ma
Abstract: To learn intrinsic low-dimensional structures from high-dimensional data that most discriminate between classes, we propose the principle of Maximal Coding Rate Reduction ($\text{MCR}^2$), an information-theoretic measure that maximizes the coding rate difference between the whole dataset and the sum of each individual class. We clarify its relationships with most existing frameworks such as cross-entropy, information bottleneck, information gain, contractive and contrastive learning, and provide theoretical guarantees for learning diverse and discriminative features. The coding rate can be accurately computed from finite samples of degenerate subspace-like distributions and can learn intrinsic representations in supervised, self-supervised, and unsupervised settings in a unified manner. Empirically, the representations learned using this principle alone are significantly more robust to label corruptions in classification than those using cross-entropy, and can lead to state-of-the-art results in clustering mixed data from self-learned invariant features.
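For reference, the objective can be written out: with features $Z = [z_1, \ldots, z_n] \in \mathbb{R}^{d \times n}$ and diagonal membership matrices $\Pi = \{\Pi_j\}$ encoding the classes, the rate terms take the form (up to notation; see the paper for the precise constraints on $Z$)

$$ R(Z, \epsilon) = \frac{1}{2} \log\det\Big( I + \frac{d}{n\epsilon^2} Z Z^\top \Big), \qquad R_c(Z, \epsilon \mid \Pi) = \sum_j \frac{\mathrm{tr}(\Pi_j)}{2n} \log\det\Big( I + \frac{d}{\mathrm{tr}(\Pi_j)\,\epsilon^2} Z \Pi_j Z^\top \Big), $$

and training maximizes the difference $\Delta R = R - R_c$ over the network parameters producing $Z$.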

99. Efficient Black-Box Adversarial Attack Guided by the Distribution of Adversarial Perturbations [PDF] 返回目录
  Yan Feng, Baoyuan Wu, Yanbo Fan, Zhifeng Li, Shutao Xia
Abstract: This work studies the score-based black-box adversarial attack problem, where only a continuous score is returned for each query, while the structure and parameters of the attacked model are unknown. A promising approach to solve this problem is evolution strategies (ES), which introduce a search distribution to sample perturbations that are likely to be adversarial. The Gaussian distribution is widely adopted as the search distribution in the standard ES algorithm. However, it may not be flexible enough to capture the diverse distributions of adversarial perturbations around different benign examples. In this work, we propose to transform the Gaussian-distributed variable to another space through a conditional flow-based model, to enhance the capability and flexibility of capturing the intrinsic distribution of adversarial perturbations conditioned on the benign example. Besides, to further enhance the query efficiency, we propose to pre-train the conditional flow model based on some white-box surrogate models, utilizing the transferability of adversarial perturbations across different models, which has been widely observed in the literature of adversarial examples. Consequently, the proposed method can take advantage of both query-based and transfer-based attack methods, achieving satisfying attack performance in terms of both effectiveness and efficiency. Extensive experiments attacking four target models on CIFAR-10 and Tiny-ImageNet verify the superior performance of the proposed method over state-of-the-art methods.
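For intuition, the sketch below is a plain NES/ES-style score-based attack with a Gaussian search distribution; the paper's contribution would, in spirit, replace the Gaussian sampler with a conditional normalizing flow pre-trained on white-box surrogates (not shown). The score function and success test are hypothetical.

```python
# ES-style score-based black-box attack with a Gaussian search distribution
# (sketch; the paper swaps the Gaussian for a conditional flow model)
import numpy as np

def es_attack(x, score_fn, steps=500, pop=20, sigma=0.1, lr=0.01, eps=0.05):
    delta = np.zeros_like(x)
    for _ in range(steps):
        grad = np.zeros_like(x)
        for _ in range(pop):
            n = np.random.randn(*x.shape)
            # query-only estimate of the gradient of the continuous score
            grad += score_fn(x + delta + sigma * n) * n
        grad /= pop * sigma
        delta = np.clip(delta + lr * grad, -eps, eps)  # ascend the score
        if score_fn(x + delta) > 0:  # hypothetical success criterion
            break
    return x + delta
```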

100. Improved Conditional Flow Models for Molecule to Image Synthesis [PDF] 返回目录
  Karren Yang, Samuel Goldman, Wengong Jin, Alex Lu, Regina Barzilay, Tommi Jaakkola, Caroline Uhler
Abstract: In this paper, we aim to synthesize cell microscopy images under different molecular interventions, motivated by practical applications to drug development. Building on the recent success of graph neural networks for learning molecular embeddings and flow-based models for image generation, we propose Mol2Image: a flow-based generative model for molecule to cell image synthesis. To generate cell features at different resolutions and scale to high-resolution images, we develop a novel multi-scale flow architecture based on a Haar wavelet image pyramid. To maximize the mutual information between the generated images and the molecular interventions, we devise a training strategy based on contrastive learning. To evaluate our model, we propose a new set of metrics for biological image generation that are robust, interpretable, and relevant to practitioners. We show quantitatively that our method learns a meaningful embedding of the molecular intervention, which is translated into an image representation reflecting the biological effects of the intervention.
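A Haar wavelet pyramid splits an image into a half-resolution average plus three detail bands, and recursing on the average yields the multi-scale levels that each flow stage can model. A minimal single-level orthonormal 2D Haar decomposition in NumPy is shown below; the paper's multi-scale flow architecture itself is not reproduced here.

```python
# Single-level orthonormal 2D Haar decomposition (H and W must be even)
import numpy as np

def haar2d(x):
    tl, tr = x[0::2, 0::2], x[0::2, 1::2]  # 2x2 blocks: top-left, top-right
    bl, br = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    a = (tl + tr + bl + br) / 2.0  # low-pass average (next pyramid level)
    h = (tl - tr + bl - br) / 2.0  # horizontal detail
    v = (tl + tr - bl - br) / 2.0  # vertical detail
    d = (tl - tr - bl + br) / 2.0  # diagonal detail
    return a, (h, v, d)

# recursing on `a` produces the coarser levels of the pyramid
a, details = haar2d(np.random.rand(64, 64))
```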

101. The Limit of the Batch Size [PDF] 返回目录
  Yang You, Yuhui Wang, Huan Zhang, Zhao Zhang, James Demmel, Cho-Jui Hsieh
Abstract: Large-batch training is an efficient approach for current distributed deep learning systems. It has enabled researchers to reduce ImageNet/ResNet-50 training from 29 hours to around 1 minute. In this paper, we focus on studying the limit of the batch size. We think it may provide guidance to AI supercomputer and algorithm designers. We provide detailed numerical optimization instructions for step-by-step comparison. Moreover, it is important to understand the generalization and optimization performance of huge-batch training. Hoffer et al. introduced the "ultra-slow diffusion" theory for large-batch training. However, our experiments show results that contradict the conclusion of Hoffer et al. We provide comprehensive experimental results and detailed analysis to study the limitations of batch size scaling and the "ultra-slow diffusion" theory. For the first time, we scale the batch size on ImageNet to at least an order of magnitude larger than all previous work, and provide detailed studies on the performance of many state-of-the-art optimization schemes under this setting. We propose an optimization recipe that is able to improve the top-1 test accuracy by 18% compared to the baseline.

102. APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [PDF] 返回目录
  Tianzhe Wang, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Song Han
Abstract: We present APQ for efficient deep learning inference on resource-constrained hardware. Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner. To deal with the larger design space this brings, a promising approach is to train a quantization-aware accuracy predictor to quickly get the accuracy of the quantized model and feed it to the search engine to select the best fit. However, training this quantization-aware accuracy predictor requires collecting a large number of quantized <model, accuracy> pairs, which involves quantization-aware finetuning and thus is highly time-consuming. To tackle this challenge, we propose to transfer the knowledge from a full-precision (i.e., fp32) accuracy predictor to the quantization-aware (i.e., int8) accuracy predictor, which greatly improves the sample efficiency. Besides, collecting the dataset for the fp32 accuracy predictor only requires evaluating neural networks without any training cost, by sampling from a pretrained once-for-all network, which is highly efficient. Extensive experiments on ImageNet demonstrate the benefits of our joint optimization approach. With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ. Compared to the separate optimization approach (ProxylessNAS+AMC+HAQ), APQ achieves 2.3% higher ImageNet accuracy while reducing GPU hours and CO2 emission by orders of magnitude, pushing the frontier for green AI that is environmentally friendly. The code and video are publicly available.
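The predictor-transfer step can be pictured as pre-training a small regressor on plentiful fp32 (architecture, accuracy) samples and then fine-tuning it on the few expensive quantized <model, accuracy> pairs. In the sketch below, the architecture encodings, dataset sizes, and hyperparameters are all placeholders, not the paper's.

```python
# Accuracy-predictor transfer: pre-train on fp32 pairs, fine-tune on int8 pairs
# (sketch with random placeholder data)
import torch
import torch.nn as nn

predictor = nn.Sequential(nn.Linear(128, 256), nn.ReLU(),
                          nn.Linear(256, 256), nn.ReLU(),
                          nn.Linear(256, 1))

def fit(model, encodings, accuracies, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(model(encodings).squeeze(-1), accuracies)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 1: many cheap fp32 samples (no quantization-aware finetuning needed)
fp32_enc, fp32_acc = torch.randn(5000, 128), torch.rand(5000)
fit(predictor, fp32_enc, fp32_acc, epochs=100, lr=1e-3)

# Stage 2: few expensive quantized <model, accuracy> pairs; weights transfer
int8_enc, int8_acc = torch.randn(200, 128), torch.rand(200)
fit(predictor, int8_enc, int8_acc, epochs=50, lr=1e-4)
```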

103. Deep learning mediated single time-point image-based prediction of embryo developmental outcome at the cleavage stage [PDF] 返回目录
  Manoj Kumar Kanakasabapathy, Prudhvi Thirumalaraju, Charles L Bormann, Raghav Gupta, Rohan Pooniwala, Hemanth Kandula, Irene Souter, Irene Dimitriadis, Hadi Shafiee
Abstract: In conventional clinical in-vitro fertilization practices, embryos are transferred either at the cleavage or blastocyst stages of development. Cleavage stage transfers, particularly, are beneficial for patients with relatively poor prognosis and at fertility centers in resource-limited settings where there is a higher chance of developmental failure in embryos in-vitro. However, one of the major limitations of embryo selection at the cleavage stage is the very low number of manually discernible features available to predict developmental outcomes. Although time-lapse imaging systems have been proposed as possible solutions, they are cost-prohibitive, require bulky and expensive hardware, and are labor-intensive. Advances in convolutional neural networks (CNNs) have been utilized to provide accurate classifications across many medical and non-medical object categories. Here, we report an automated system for classification and selection of human embryos at the cleavage stage using a trained CNN combined with a genetic algorithm. The system selected the cleavage stage embryos at 70 hours post insemination (hpi) that ultimately developed into top-quality blastocysts at 70 hpi, with 64% accuracy, outperforming embryologists in identifying embryos with the highest developmental potential. Such systems can have a significant impact on IVF procedures by empowering embryologists for accurate and consistent embryo assessment in both resource-poor and resource-rich settings.

104. A Dataset and Benchmarks for Multimedia Social Analysis [PDF] 返回目录
  Bofan Xue, David Chan, John Canny
Abstract: We present a new publicly available dataset with the goal of advancing multi-modality learning by offering vision and language data within the same context. This is achieved by obtaining data from a social media website, with posts containing multiple paired images/videos and text, along with comment trees containing images/videos and/or text. With a total of 677k posts, 2.9 million post images, 488k post videos, 1.4 million comment images, 4.6 million comment videos, and 96.9 million comments, data from different modalities can be jointly used to improve performance on a variety of tasks such as image captioning, image classification, next-frame prediction, sentiment analysis, and language modeling. We present a wide range of statistics for our dataset. Finally, we provide baseline performance analysis for one of the regression tasks using pre-trained models and several fully connected networks.

105. Differentiable Neural Architecture Transformation for Reproducible Architecture Improvement [PDF] 返回目录
  Do-Guk Kim, Heung-Chang Lee
Abstract: Recently, Neural Architecture Search (NAS) methods have been introduced and show impressive performance on many benchmarks. Among these NAS studies, Neural Architecture Transformer (NAT) aims to improve the given neural architecture for better performance while maintaining computational costs. However, NAT suffers from a lack of reproducibility. In this paper, we propose a differentiable neural architecture transformation that is reproducible and efficient. The proposed method shows stable performance on various architectures. Extensive reproducibility experiments on two datasets, i.e., CIFAR-10 and Tiny ImageNet, show that the proposed method clearly outperforms NAT and is applicable to other models and datasets.

106. Slowing Down the Weight Norm Increase in Momentum-based Optimizers [PDF] 返回目录
  Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Youngjung Uh, Jung-Woo Ha
Abstract: Normalization techniques, such as batch normalization (BN), have led to significant improvements in deep neural network performances. Prior studies have analyzed the benefits of the resulting scale invariance of the weights for the gradient descent (GD) optimizers: it leads to a stabilized training due to the auto-tuning of step sizes. However, we show that, combined with the momentum-based algorithms, the scale invariance tends to induce an excessive growth of the weight norms. This in turn overly suppresses the effective step sizes during training, potentially leading to sub-optimal performances in deep neural networks. We analyze this phenomenon both theoretically and empirically. We propose a simple and effective solution: at each iteration of momentum-based GD optimizers (e.g. SGD or Adam) applied on scale-invariant weights (e.g. Conv weights preceding a BN layer), we remove the radial component (i.e. parallel to the weight vector) from the update vector. Intuitively, this operation prevents the unnecessary update along the radial direction that only increases the weight norm without contributing to the loss minimization. We verify that the modified optimizers SGDP and AdamP successfully regularize the norm growth and improve the performance of a broad set of models. Our experiments cover tasks including image classification and retrieval, object detection, robustness benchmarks, and audio classification. Source code is available at this https URL.
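The core operation is easy to state: for a scale-invariant weight $w$ (e.g. a conv kernel feeding a BN layer), project the optimizer's update onto the directions orthogonal to $w$, so the radial component, which only grows $\|w\|$ without reducing the loss, is discarded. A minimal sketch, simplified relative to the released SGDP/AdamP optimizers:

```python
# Remove the radial component of an update for a scale-invariant weight
# (sketch of the projection idea; not the full SGDP/AdamP optimizers)
import torch

def project_update(w, update, eps=1e-8):
    w_flat, u_flat = w.reshape(-1), update.reshape(-1)
    # component of the update parallel to w (the radial direction)
    radial = torch.dot(u_flat, w_flat) / (w_flat.norm() ** 2 + eps) * w_flat
    return (u_flat - radial).reshape(w.shape)
```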

107. Dissimilarity Mixture Autoencoder for Deep Clustering [PDF] 返回目录
  Juan S. Lara, Fabio A. González
Abstract: In this paper, we introduce the Dissimilarity Mixture Autoencoder (DMAE), a novel neural network model that uses a dissimilarity function to generalize a family of density estimation and clustering methods. It is formulated in such a way that it internally estimates the parameters of a probability distribution through gradient-based optimization. Moreover, the proposed model can leverage deep representation learning, since it is straightforward to incorporate into deep learning architectures: it consists of an encoder-decoder network that computes a probabilistic representation. Experimental evaluation was performed on image and text clustering benchmark datasets, showing that the method is competitive in terms of unsupervised classification accuracy and normalized mutual information. The source code to replicate the experiments is publicly available at this https URL

108. Emotion Recognition in Audio and Video Using Deep Neural Networks [PDF] 返回目录
  Mandeep Singh, Yuan Fang
Abstract: Humans are able to comprehend information from multiple domains, e.g. speech, text, and vision. With the advancement of deep learning technology there has been significant improvement in speech recognition. Recognizing emotion from speech is an important aspect, and with deep learning technology emotion recognition has improved in accuracy and latency. There are still many challenges to improving accuracy. In this work, we attempt to explore different neural networks to improve the accuracy of emotion recognition. Among the architectures explored, we find a (CNN+RNN) + 3DCNN multi-model architecture, which processes audio spectrograms and corresponding video frames, giving emotion prediction accuracy of 54.0% among 4 emotions and 71.75% among 3 emotions on the IEMOCAP [2] dataset.

109. Generalized Adversarially Learned Inference [PDF] 返回目录
  Yatin Dandi, Homanga Bharadhwaj, Abhishek Kumar, Piyush Rai
Abstract: Allowing effective inference of latent vectors while training GANs can greatly increase their applicability in various downstream tasks. Recent approaches, such as ALI and BiGAN frameworks, develop methods of inference of latent variables in GANs by adversarially training an image generator along with an encoder to match two joint distributions of image and latent vector pairs. We generalize these approaches to incorporate multiple layers of feedback on reconstructions, self-supervision, and other forms of supervision based on prior or learned knowledge about the desired solutions. We achieve this by modifying the discriminator's objective to correctly identify more than two joint distributions of tuples of an arbitrary number of random variables consisting of images, latent vectors, and other variables generated through auxiliary tasks, such as reconstruction and inpainting or as outputs of suitable pre-trained models. We design a non-saturating maximization objective for the generator-encoder pair and prove that the resulting adversarial game corresponds to a global optimum that simultaneously matches all the distributions. Within our proposed framework, we introduce a novel set of techniques for providing self-supervised feedback to the model based on properties, such as patch-level correspondence and cycle consistency of reconstructions. Through comprehensive experiments, we demonstrate the efficacy, scalability, and flexibility of the proposed approach for a variety of tasks.

110. CompressNet: Generative Compression at Extremely Low Bitrates [PDF] 返回目录
  Suraj Kiran Raman, Aditya Ramesh, Vijayakrishna Naganoor, Shubham Dash, Giridharan Kumaravelu, Honglak Lee
Abstract: Compressing images at extremely low bitrates (< 0.1 bpp) has always been a challenging task since the quality of reconstruction significantly reduces due to the strong imposed constraint on the number of bits allocated for the compressed data. With the increasing need to transfer large amounts of images with limited bandwidth, compressing images to very low sizes is a crucial task. However, the existing methods are not effective at extremely low bitrates. To address this need, we propose a novel network called CompressNet which augments a Stacked Autoencoder with a Switch Prediction Network (SAE-SPN). This helps in the reconstruction of visually pleasing images at these low bitrates (< 0.1 bpp). We benchmark the performance of our proposed method on the Cityscapes dataset, evaluating over different metrics at extremely low bitrates to show that our method outperforms the other state-of-the-art. In particular, at a bitrate of 0.07, CompressNet achieves 22% lower Perceptual Loss and 55% lower Frechet Inception Distance (FID) compared to the deep learning SOTA methods.

111. Leveraging Multimodal Behavioral Analytics for Automated Job Interview Performance Assessment and Feedback [PDF] 返回目录
  Anumeha Agrawal, Rosa Anil George, Selvan Sunitha Ravi, Sowmya Kamath S, Anand Kumar M
Abstract: Behavioral cues play a significant part in human communication and cognitive perception. In most professional domains, employee recruitment policies are framed such that both professional skills and personality traits are adequately assessed. Hiring interviews are structured to evaluate expansively a potential employee's suitability for the position - their professional qualifications, interpersonal skills, ability to perform in critical and stressful situations, in the presence of time and resource constraints, etc. Therefore, candidates need to be aware of their positive and negative attributes and be mindful of behavioral cues that might have adverse effects on their success. We propose a multimodal analytical framework that analyzes the candidate in an interview scenario and provides feedback for predefined labels such as engagement, speaking rate, eye contact, etc. We perform a comprehensive analysis that includes the interviewee's facial expressions, speech, and prosodic information, using the video, audio, and text transcripts obtained from the recorded interview. We use these multimodal data sources to construct a composite representation, which is used for training machine learning classifiers to predict the class labels. Such analysis is then used to provide constructive feedback to the interviewee for their behavioral cues and body language. Experimental validation showed that the proposed methodology achieved promising results.

112. Continual General Chunking Problem and SyncMap [PDF] 返回目录
  Danilo Vasconcellos Vargas, Toshitake Asabuki
Abstract: Humans possess an inherent ability to chunk sequences into their constituent parts. In fact, this ability is thought to bootstrap language skills to the learning of image patterns, which might be a key to a more animal-like type of intelligence. Here, we propose a continual generalization of the chunking problem (an unsupervised problem), encompassing fixed and probabilistic chunks, discovery of temporal and causal structures, and their continual variations. Additionally, we propose an algorithm called SyncMap that can learn and adapt to changes in the problem by creating a dynamic map which preserves the correlation between variables. Results suggest that SyncMap learns near-optimal solutions, despite the presence of many types of structures and their continual variation. When compared to Word2vec, PARSER and MRIL, SyncMap surpasses or ties with the best algorithm on $77\%$ of the scenarios while being the second best in the remaining $33\%$.

113. Structural Autoencoders Improve Representations for Generation and Transfer [PDF] 返回目录
  Felix Leeb, Yashas Annadani, Stefan Bauer, Bernhard Schölkopf
Abstract: We study the problem of structuring a learned representation to significantly improve performance without supervision. Unlike most methods which focus on using side information like weak supervision or defining new regularization objectives, we focus on improving the learned representation by structuring the architecture of the model. We propose a self-attention based architecture to make the encoder explicitly associate parts of the representation with parts of the input observation. Meanwhile, our structural decoder architecture encourages a hierarchical structure in the latent space, akin to structural causal models, and learns a natural ordering of the latent mechanisms. We demonstrate how these models learn a representation which improves results in a variety of downstream tasks including generation, disentanglement, and transfer using several challenging and natural image datasets.

114. DeeperGCN: All You Need to Train Deeper GCNs [PDF] 返回目录
  Guohao Li, Chenxin Xiong, Ali Thabet, Bernard Ghanem
Abstract: Graph Convolutional Networks (GCNs) have been drawing significant attention with the power of representation learning on graphs. Unlike Convolutional Neural Networks (CNNs), which are able to take advantage of stacking very deep layers, GCNs suffer from vanishing gradient, over-smoothing and over-fitting issues when going deeper. These challenges limit the representation power of GCNs on large-scale graphs. This paper proposes DeeperGCN that is capable of successfully and reliably training very deep GCNs. We define differentiable generalized aggregation functions to unify different message aggregation operations (e.g. mean, max). We also propose a novel normalization layer namely MsgNorm and a pre-activation version of residual connections for GCNs. Extensive experiments on Open Graph Benchmark (OGB) show DeeperGCN significantly boosts performance over the state-of-the-art on the large scale graph learning tasks of node property prediction and graph property prediction. Please visit this https URL for more information.
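From the abstract's description, MsgNorm normalizes the aggregated message to unit length and rescales it by the node feature norm before the residual combination. A minimal sketch under that reading (the learnable scale and the MLP that follows in the full layer are simplified away):

```python
# MsgNorm-style message rescaling (sketch; `s` is learnable in the paper)
import torch
import torch.nn.functional as F

def msg_norm(x, msg, s=1.0, eps=1e-7):
    msg = F.normalize(msg, p=2, dim=-1, eps=eps)       # unit-norm message
    return x + s * x.norm(p=2, dim=-1, keepdim=True) * msg
```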

115. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning [PDF] 返回目录
  Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
Abstract: We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the art methods intrinsically rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches $74.3\%$ top-1 classification accuracy on ImageNet using the standard linear evaluation protocol with a ResNet-50 architecture and $79.6\%$ with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks.
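A condensed sketch of the training step described above: the online network plus a predictor regresses the target network's projection of the other augmented view, and the target is then updated as a slow-moving (EMA) average of the online weights. The networks, optimizer, and momentum value below are placeholders for the paper's ResNet + MLP setup.

```python
# BYOL-style update step (sketch with placeholder networks)
import copy
import torch
import torch.nn.functional as F

def byol_loss(p, z):
    # equivalent to MSE between L2-normalized vectors; target is stop-gradient
    return 2 - 2 * F.cosine_similarity(p, z.detach(), dim=-1).mean()

def byol_step(online, predictor, target, opt, v1, v2, tau=0.996):
    loss = byol_loss(predictor(online(v1)), target(v2)) \
         + byol_loss(predictor(online(v2)), target(v1))  # symmetrized
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():  # slow-moving average of the online network
        for pt, po in zip(target.parameters(), online.parameters()):
            pt.mul_(tau).add_((1 - tau) * po)
    return loss.item()

# usage: `opt` optimizes online + predictor; the target starts as a copy, e.g.
#   target = copy.deepcopy(online).requires_grad_(False)
```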

116. RoadNet-RT: High Throughput CNN Architecture and SoC Design for Real-Time Road Segmentation [PDF] 返回目录
  Lin Bai, Yecheng Lyu, Xinming Huang
Abstract: In recent years, the convolutional neural network has gained popularity in many engineering applications, especially for computer vision. In order to achieve better performance, more complex structures and advanced operations are often incorporated into the neural networks, which results in very long inference time. For time-critical tasks such as autonomous driving and virtual reality, real-time processing is fundamental. In order to reach real-time processing speed, a light-weight, high-throughput CNN architecture, namely RoadNet-RT, is proposed for road segmentation in this paper. It achieves a 90.33% MaxF score on the test set of the KITTI road segmentation task and 8 ms per frame when running on a GTX 1080 GPU. Compared to the state-of-the-art network, RoadNet-RT speeds up the inference time by a factor of 20 at the cost of only 6.2% accuracy loss. For hardware design optimization, several techniques such as depthwise separable convolution and non-uniform kernel size convolution are custom designed to further reduce the processing time. The proposed CNN architecture has been successfully implemented on an FPGA ZCU102 MPSoC platform that achieves a computation capability of 83.05 GOPS. The system throughput reaches 327.9 frames per second with an image size of 1216x176.

117. Adversarial Self-Supervised Contrastive Learning [PDF] 返回目录
  Minseon Kim, Jihoon Tack, Sung Ju Hwang
Abstract: Existing adversarial learning approaches mostly use class labels to generate adversarial samples that lead to incorrect predictions, which are then used to augment the training of the model for improved robustness. While some recent works propose semi-supervised adversarial learning methods that utilize unlabeled data, they still require class labels. However, do we really need class labels at all for adversarially robust training of deep neural networks? In this paper, we propose a novel adversarial attack for unlabeled data, which makes the model confuse the instance-level identities of the perturbed data samples. Further, we present a self-supervised contrastive learning framework to adversarially train a robust neural network without labeled data, which aims to maximize the similarity between a random augmentation of a data sample and its instance-wise adversarial perturbation. We validate our method, Robust Contrastive Learning (RoCL), on multiple benchmark datasets, on which it obtains robust accuracy comparable to state-of-the-art supervised adversarial learning methods, and significantly improved robustness against black-box and unseen types of attacks. Moreover, with further joint fine-tuning with a supervised adversarial loss, RoCL obtains even higher robust accuracy than using self-supervised learning alone. Notably, RoCL also demonstrates impressive results in robust transfer learning.
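The label-free attack can be sketched as PGD that maximizes an instance-discrimination (contrastive) loss between two augmented views of the same batch, so that perturbed samples no longer match their own identity. This is a simplified reading of the paper's instance-wise attack, with hypothetical shapes and hyperparameters.

```python
# Instance-wise adversarial attack on a contrastive objective (sketch)
import torch
import torch.nn.functional as F

def instance_attack(encoder, v1, v2, eps=8/255, alpha=2/255, steps=7, temp=0.5):
    delta = torch.zeros_like(v1, requires_grad=True)
    for _ in range(steps):
        z1 = F.normalize(encoder(v1 + delta), dim=-1)
        z2 = F.normalize(encoder(v2), dim=-1)
        logits = z1 @ z2.t() / temp             # instance-discrimination logits
        labels = torch.arange(len(v1), device=v1.device)
        loss = F.cross_entropy(logits, labels)  # positives on the diagonal
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # maximize identity confusion
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (v1 + delta).detach()
```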

118. Sparse Separable Nonnegative Matrix Factorization [PDF] 返回目录
  Nicolas Nadisic, Arnaud Vandaele, Jeremy E. Cohen, Nicolas Gillis
Abstract: We propose a new variant of nonnegative matrix factorization (NMF), combining separability and sparsity assumptions. Separability requires that the columns of the first NMF factor are equal to columns of the input matrix, while sparsity requires that the columns of the second NMF factor are sparse. We call this variant sparse separable NMF (SSNMF), which we prove to be NP-complete, as opposed to separable NMF which can be solved in polynomial time. The main motivation to consider this new model is to handle underdetermined blind source separation problems, such as multispectral image unmixing. We introduce an algorithm to solve SSNMF, based on the successive nonnegative projection algorithm (SNPA, an effective algorithm for separable NMF), and an exact sparse nonnegative least squares solver. We prove that, in noiseless settings and under mild assumptions, our algorithm recovers the true underlying sources. This is illustrated by experiments on synthetic data sets and the unmixing of a multispectral image.
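Putting the two assumptions together, SSNMF can be stated (in our notation) as choosing an index set $\mathcal{K}$ of $r$ columns of the input $X$ and a sparse nonnegative coefficient matrix $H$:

$$ \min_{\mathcal{K},\; H \ge 0} \; \big\| X - X(:,\mathcal{K})\, H \big\|_F^2 \quad \text{s.t.} \quad |\mathcal{K}| = r, \quad \| H(:,j) \|_0 \le k \ \text{for all } j, $$

which matches the algorithm outline: SNPA selects $\mathcal{K}$, and an exact sparse nonnegative least squares solver then fits $H$.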

119. Rethinking the Value of Labels for Improving Class-Imbalanced Learning [PDF] 返回目录
  Yuzhe Yang, Zhi Xu
Abstract: Real-world data often exhibits long-tailed distributions with heavy class imbalance, posing great challenges for deep recognition models. We identify a persisting dilemma on the value of labels in the context of imbalanced learning: on the one hand, supervision from labels typically leads to better results than its unsupervised counterparts; on the other hand, heavily imbalanced data naturally incurs "label bias" in the classifier, where the decision boundary can be drastically altered by the majority classes. In this work, we systematically investigate these two facets of labels. We demonstrate, theoretically and empirically, that class-imbalanced learning can significantly benefit in both semi-supervised and self-supervised manners. Specifically, we confirm that (1) positively, imbalanced labels are valuable: given more unlabeled data, the original labels can be leveraged with the extra data to reduce label bias in a semi-supervised manner, which greatly improves the final classifier; (2) negatively however, we argue that imbalanced labels are not useful always: classifiers that are first pre-trained in a self-supervised manner consistently outperform their corresponding baselines. Extensive experiments on large-scale imbalanced datasets verify our theoretically grounded strategies, showing superior performance over the previous state-of-the-arts. Our intriguing findings highlight the need to rethink the usage of imbalanced labels in realistic long-tailed tasks.

120. TURB-Rot. A large database of 3d and 2d snapshots from turbulent rotating flows [PDF] 返回目录
  L. Biferale, F. Bonaccorso, M. Buzzicotti, P. Clark di Leoni
Abstract: We present TURB-Rot, a new open database of 3d and 2d snapshots of turbulent velocity fields, obtained by Direct Numerical Simulations (DNS) of the original Navier-Stokes equations in the presence of rotation. The aim is to provide the community interested in data-assimilation and/or computer vision with a new testing ground made of roughly 300K complex images and fields. TURB-Rot data are characterized by multi-scale, strongly non-Gaussian features and rough, non-differentiable fields over almost two decades of scales. In addition, coming from fully resolved numerical simulations of the original partial differential equations, they offer the possibility to apply a wide range of approaches, from equation-free to physics-based models. TURB-Rot data are reachable at this http URL

121. BI-MAML: Balanced Incremental Approach for Meta Learning [PDF] 返回目录
  Yang Zheng, Jinlin Xiang, Kun Su, Eli Shlizerman
Abstract: We present a novel Balanced Incremental Model Agnostic Meta Learning system (BI-MAML) for learning multiple tasks. Our method implements a meta-update rule to incrementally adapt its model to new tasks without forgetting old tasks. Such a capability is not possible in current state-of-the-art MAML approaches. These methods effectively adapt to new tasks, but suffer from 'catastrophic forgetting' phenomena, in which new tasks that are streamed into the model degrade the performance of the model on previously learned tasks. Our system performs the meta-updates with only a few shots and can successfully accomplish them. Our key idea for achieving this is the design of a balanced learning strategy for the baseline model. The strategy sets the baseline model to perform equally well on various tasks and incorporates time efficiency. The balanced learning strategy enables BI-MAML both to outperform other state-of-the-art models in terms of classification accuracy on existing tasks and to accomplish efficient adaptation to similar new tasks with fewer required shots. We evaluate BI-MAML by conducting comparisons on two common benchmark datasets with multiple image classification tasks. BI-MAML's performance demonstrates advantages in both accuracy and efficiency.
