Contents
4. Learning Unsupervised Hierarchical Part Decomposition of 3D Objects from a Single RGB Image [PDF] Abstract
6. Bodies at Rest: 3D Human Pose and Shape Estimation from a Pressure Image using Synthetic Data [PDF] Abstract
11. MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model [PDF] Abstract
16. Face Quality Estimation and Its Correlation to Demographic and Non-Demographic Bias in Face Recognition [PDF] Abstract
27. Alleviating Semantic-level Shift: A Semi-supervised Domain Adaptation Method for Semantic Segmentation [PDF] Abstract
41. Introducing Anisotropic Minkowski Functionals for Local Structure Analysis and Prediction of Biomechanical Strength of Proximal Femur Specimens [PDF] Abstract
46. End-To-End Convolutional Neural Network for 3D Reconstruction of Knee Bones From Bi-Planar X-Ray Images [PDF] Abstract
47. Image Denoising Using Sparsifying Transform Learning and Weighted Singular Values Minimization [PDF] Abstract
Abstracts
1. Learning to See Through Obstructions [PDF] Back to contents
Yu-Lun Liu, Wei-Sheng Lai, Ming-Hsuan Yang, Yung-Yu Chuang, Jia-Bin Huang
Abstract: We present a learning-based approach for removing unwanted obstructions, such as window reflections, fence occlusions or raindrops, from a short sequence of images captured by a moving camera. Our method leverages the motion differences between the background and the obstructing elements to recover both layers. Specifically, we alternate between estimating dense optical flow fields of the two layers and reconstructing each layer from the flow-warped images via a deep convolutional neural network. The learning-based layer reconstruction allows us to accommodate potential errors in the flow estimation and brittle assumptions such as brightness consistency. We show that training on synthetically generated data transfers well to real images. Our results on numerous challenging scenarios of reflection and fence removal demonstrate the effectiveness of the proposed method.
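The alternating scheme described above lends itself to a short sketch. The loop below is a hypothetical illustration, not the authors' released code: `flow_net`, `bg_net`, `obs_net`, and `warp` are assumed stand-ins for the paper's flow estimator, the two layer-reconstruction CNNs, and a backward-warping operator.

```python
import torch

def decompose(frames, flow_net, bg_net, obs_net, warp, n_iters=3):
    """frames: (T, 3, H, W) tensor captured by a moving camera."""
    keyframe = frames[len(frames) // 2]
    background, obstruction = keyframe.clone(), keyframe.clone()
    for _ in range(n_iters):
        # 1) Estimate a dense flow field per layer against every frame.
        bg_flows = [flow_net(background, f) for f in frames]
        obs_flows = [flow_net(obstruction, f) for f in frames]
        # 2) Warp all frames toward the keyframe using each layer's motion.
        bg_aligned = torch.stack([warp(f, fl) for f, fl in zip(frames, bg_flows)])
        obs_aligned = torch.stack([warp(f, fl) for f, fl in zip(frames, obs_flows)])
        # 3) Learned CNNs fuse the flow-warped stacks; this step is what
        #    absorbs flow errors instead of relying on brightness constancy.
        background = bg_net(bg_aligned)
        obstruction = obs_net(obs_aligned)
    return background, obstruction
```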
2. Unsupervised Real-world Image Super Resolution via Domain-distance Aware Training [PDF] Back to contents
Yunxuan Wei, Shuhang Gu, Yawei Li, Longcun Jin
Abstract: These days, unsupervised super-resolution (SR) has been soaring due to its practical and promising potential in real scenarios. The philosophy of off-the-shelf approaches lies in the augmentation of unpaired data, i.e. first generating synthetic low-resolution (LR) images $\mathcal{Y}^g$ corresponding to real-world high-resolution (HR) images $\mathcal{X}^r$ in the real-world LR domain $\mathcal{Y}^r$, and then utilizing the pseudo pairs $\{\mathcal{Y}^g, \mathcal{X}^r\}$ for training in a supervised manner. Unfortunately, since image translation itself is an extremely challenging task, the SR performance of these approaches is severely limited by the domain gap between generated synthetic LR images and real LR images. In this paper, we propose a novel domain-distance aware super-resolution (DASR) approach for unsupervised real-world image SR. The domain gap between training data (e.g. $\mathcal{Y}^g$) and testing data (e.g. $\mathcal{Y}^r$) is addressed with our domain-gap aware training and domain-distance weighted supervision strategies. Domain-gap aware training takes additional benefit from real data in the target domain while domain-distance weighted supervision brings forward the more rational use of labeled source domain data. The proposed method is validated on synthetic and real datasets and the experimental results show that DASR consistently outperforms state-of-the-art unsupervised SR approaches in generating SR outputs with more realistic and natural textures.
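The domain-distance weighted supervision can be pictured as a per-sample reweighting of the pseudo-pair loss: the closer a synthetic LR image $\mathcal{Y}^g$ lies to the real LR domain, the more its label is trusted. The sketch below is an illustration under assumed names (`sr_net`, and `domain_disc` as a domain discriminator), not DASR's exact formulation.

```python
import torch
import torch.nn.functional as F

def domain_distance_weighted_loss(sr_net, domain_disc, y_g, x_r):
    """y_g: synthetic LR batch; x_r: corresponding real HR batch."""
    sr = sr_net(y_g)                                  # super-resolved output
    with torch.no_grad():
        # Higher score -> y_g looks more like a real-world LR image,
        # so its pseudo-label x_r is weighted more heavily.
        w = torch.sigmoid(domain_disc(y_g)).view(-1, 1, 1, 1)
    return (w * F.l1_loss(sr, x_r, reduction="none")).mean()
```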
3. Tracking Objects as Points [PDF] Back to contents
Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
Abstract: Tracking has traditionally been the art of following interest points through space and time. This changed with the rise of powerful deep networks. Nowadays, tracking is dominated by pipelines that perform object detection followed by temporal association, also known as tracking-by-detection. In this paper, we present a simultaneous detection and tracking algorithm that is simpler, faster, and more accurate than the state of the art. Our tracker, CenterTrack, applies a detection model to a pair of images and detections from the prior frame. Given this minimal input, CenterTrack localizes objects and predicts their associations with the previous frame. That's it. CenterTrack is simple, online (no peeking into the future), and real-time. It achieves 67.3% MOTA on the MOT17 challenge at 22 FPS and 89.4% MOTA on the KITTI tracking benchmark at 15 FPS, setting a new state of the art on both datasets. CenterTrack is easily extended to monocular 3D tracking by regressing additional 3D attributes. Using monocular video input, it achieves 28.3% AMOTA@0.2 on the newly released nuScenes 3D tracking benchmark, substantially outperforming the monocular baseline on this benchmark while running at 28 FPS.
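The "minimal input" design reduces tracking to one association step: each detection regresses a 2D offset to its center in the previous frame, and tracks are linked greedily by distance. The sketch below illustrates that step with assumed variable names; it is not the released CenterTrack code.

```python
import numpy as np

def associate(curr_centers, offsets, prev_centers, prev_ids, max_dist=32.0):
    """curr_centers, offsets: (N, 2) arrays; prev_centers: (M, 2); prev_ids: (M,)."""
    ids, used = [], set()
    next_id = (max(prev_ids) + 1) if len(prev_ids) else 0
    for c, off in zip(curr_centers, offsets):
        proj = c + off                            # predicted previous-frame position
        d = (np.linalg.norm(prev_centers - proj, axis=1)
             if len(prev_centers) else np.array([]))
        j = int(np.argmin(d)) if d.size else -1
        if j >= 0 and d[j] < max_dist and j not in used:
            used.add(j)
            ids.append(int(prev_ids[j]))          # continue an existing track
        else:
            ids.append(next_id)                   # spawn a new track
            next_id += 1
    return ids
```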
4. Learning Unsupervised Hierarchical Part Decomposition of 3D Objects from a Single RGB Image [PDF] Back to contents
Despoina Paschalidou, Luc van Gool, Andreas Geiger
Abstract: Humans perceive the 3D world as a set of distinct objects that are characterized by various low-level (geometry, reflectance) and high-level (connectivity, adjacency, symmetry) properties. Recent methods based on convolutional neural networks (CNNs) demonstrated impressive progress in 3D reconstruction, even when using a single 2D image as input. However, the majority of these methods focuses on recovering the local 3D geometry of an object without considering its part-based decomposition or relations between parts. We address this challenging problem by proposing a novel formulation that allows to jointly recover the geometry of a 3D object as a set of primitives as well as their latent hierarchical structure without part-level supervision. Our model recovers the higher level structural decomposition of various objects in the form of a binary tree of primitives, where simple parts are represented with fewer primitives and more complex parts are modeled with more components. Our experiments on the ShapeNet and D-FAUST datasets demonstrate that considering the organization of parts indeed facilitates reasoning about 3D geometry.
5. DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes [PDF] Back to contents
Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Tom Funkhouser, Caroline Pantofaru, David Ross, Larry S. Davis, Alireza Fathi
Abstract: We propose DOPS, a fast single-stage 3D object detection method for LIDAR data. Previous methods often make domain-specific design decisions, for example projecting points into a bird-eye view image in autonomous driving scenarios. In contrast, we propose a general-purpose method that works on both indoor and outdoor scenes. The core novelty of our method is a fast, single-pass architecture that both detects objects in 3D and estimates their shapes. 3D bounding box parameters are estimated in one pass for every point, aggregated through graph convolutions, and fed into a branch of the network that predicts latent codes representing the shape of each detected object. The latent shape space and shape decoder are learned on a synthetic dataset and then used as supervision for the end-to-end training of the 3D object detection pipeline. Thus our model is able to extract shapes without access to ground-truth shape information in the target dataset. During experiments, we find that our proposed method achieves state-of-the-art results by ~5% on object detection in ScanNet scenes, and it gets top results by 3.4% in the Waymo Open Dataset, while reproducing the shapes of detected cars.
6. Bodies at Rest: 3D Human Pose and Shape Estimation from a Pressure Image using Synthetic Data [PDF] Back to contents
Henry M. Clever, Zackory Erickson, Ariel Kapusta, Greg Turk, C. Karen Liu, Charles C. Kemp
Abstract: People spend a substantial part of their lives at rest in bed. 3D human pose and shape estimation for this activity would have numerous beneficial applications, yet line-of-sight perception is complicated by occlusion from bedding. Pressure sensing mats are a promising alternative, but training data is challenging to collect at scale. We describe a physics-based method that simulates human bodies at rest in a bed with a pressure sensing mat, and present PressurePose, a synthetic dataset with 206K pressure images with 3D human poses and shapes. We also present PressureNet, a deep learning model that estimates human pose and shape given a pressure image and gender. PressureNet incorporates a pressure map reconstruction (PMR) network that models pressure image generation to promote consistency between estimated 3D body models and pressure image input. In our evaluations, PressureNet performed well with real data from participants in diverse poses, even though it had only been trained with synthetic data. When we ablated the PMR network, performance dropped substantially.
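The PMR consistency term admits a compact sketch, assuming hypothetical `pose_net` and `render_pressure` callables standing in for PressureNet's pose/shape estimator and its body-to-pressure-image renderer.

```python
import torch.nn.functional as F

def pressurenet_loss(pose_net, render_pressure, pressure_img, gender, gt_params):
    params = pose_net(pressure_img, gender)       # estimated 3D pose/shape parameters
    loss_pose = F.mse_loss(params, gt_params)     # supervised term (synthetic labels)
    recon = render_pressure(params)               # PMR: estimated body -> pressure image
    loss_pmr = F.mse_loss(recon, pressure_img)    # consistency with the observed mat
    return loss_pose + loss_pmr
```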
7. BUDA: Boundless Unsupervised Domain Adaptation in Semantic Segmentation [PDF] Back to contents
Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, Patrick Pérez
Abstract: In this work, we define and address "Boundless Unsupervised Domain Adaptation" (BUDA), a novel problem in semantic segmentation. The BUDA set-up pictures a realistic scenario where the unsupervised target domain not only exhibits a data distribution shift w.r.t. the supervised source domain but also includes classes that are absent from the latter. Different to "open-set" and "universal domain adaptation", which both regard never-seen objects as "unknown", BUDA aims at explicit test-time prediction for these never-seen classes. To reach this goal, we propose a novel framework leveraging domain adaptation and zero-shot learning techniques to enable "boundless" adaptation on the target domain. Performance is further improved using self-training on target pseudo-labels. For validation, we consider different domain adaptation set-ups, namely synthetic-2-real, country-2-country and dataset-2-dataset. Our framework outperforms the baselines by significant margins, setting competitive standards on all benchmarks for the new task. Code and models are available at: this https URL.
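The self-training step on target pseudo-labels can be sketched as below; the 0.9 confidence threshold and single-model setup are assumptions, not necessarily BUDA's actual schedule.

```python
import torch
import torch.nn.functional as F

def self_training_loss(seg_net, target_img, tau=0.9, ignore_index=255):
    with torch.no_grad():
        prob = seg_net(target_img).softmax(dim=1)  # (B, C, H, W) class probabilities
        conf, pseudo = prob.max(dim=1)             # per-pixel confidence and label
        pseudo[conf < tau] = ignore_index          # discard low-confidence pixels
    logits = seg_net(target_img)                   # gradient-carrying forward pass
    return F.cross_entropy(logits, pseudo, ignore_index=ignore_index)
```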
8. ProxyNCA++: Revisiting and Revitalizing Proxy Neighborhood Component Analysis [PDF] Back to contents
Eu Wern Teh, Terrance DeVries, Graham W. Taylor
Abstract: We consider the problem of distance metric learning (DML), where the task is to learn an effective similarity measure between images. We revisit ProxyNCA and incorporate several enhancements. We find that low temperature scaling is a performance-critical component and explain why it works. Besides, we also discover that Global Max Pooling works better in general when compared to Global Average Pooling. Additionally, our proposed fast moving proxies also addresses small gradient issue of proxies, and this component synergizes well with low temperature scaling and Global Max Pooling. Our enhanced model, called ProxyNCA++, achieves a 22.9 percentage point average improvement of Recall@1 across four different zero-shot retrieval datasets compared to the original ProxyNCA algorithm. Furthermore, we achieve state-of-the-art results on the CUB200, Cars196, Sop, and InShop datasets, achieving Recall@1 scores of 72.2, 90.1, 81.4, and 90.9, respectively.
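The low temperature scaling singled out as performance-critical is easy to see in a ProxyNCA-style loss: squared distances to one proxy per class feed a softmax, and dividing by a small temperature sharpens it. This is a minimal sketch with an illustrative temperature value, not the authors' full ProxyNCA++ implementation.

```python
import torch
import torch.nn.functional as F

def proxy_nca_loss(embeddings, labels, proxies, temperature=0.1):
    x = F.normalize(embeddings, dim=1)        # (B, D) L2-normalized embeddings
    p = F.normalize(proxies, dim=1)           # (C, D) one learned proxy per class
    dist = torch.cdist(x, p) ** 2             # (B, C) squared Euclidean distances
    # A small temperature sharpens the softmax over proxies, which the
    # paper identifies as a key driver of Recall@1.
    return F.cross_entropy(-dist / temperature, labels)
```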
9. An Attention-Based Deep Learning Model for Multiple Pedestrian Attributes Recognition [PDF] Back to contents
Ehsan Yaghoubi, Diana Borza, João Neves, Aruna Kumar, Hugo Proença
Abstract: The automatic characterization of pedestrians in surveillance footage is a tough challenge, particularly when the data is extremely diverse with cluttered backgrounds, and subjects are captured from varying distances, under multiple poses, with partial occlusion. Having observed that the state-of-the-art performance is still unsatisfactory, this paper provides a novel solution to the problem, with two-fold contributions: 1) considering the strong semantic correlation between the different full-body attributes, we propose a multi-task deep model that uses an element-wise multiplication layer to extract more comprehensive feature representations. In practice, this layer serves as a filter to remove irrelevant background features, and is particularly important to handle complex, cluttered data; and 2) we introduce a weighted-sum term to the loss function that not only relativizes the contribution of each task (kind of attributed) but also is crucial for performance improvement in multiple-attribute inference settings. Our experiments were performed on two well-known datasets (RAP and PETA) and point for the superiority of the proposed method with respect to the state-of-the-art. The code is available at this https URL.
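Both contributions sketch easily: an element-wise multiplication layer acting as a learned filter on the feature map, and a weighted sum over per-attribute losses. One possible reading, with assumed shapes and weights:

```python
import torch.nn as nn

class ElementwiseFilter(nn.Module):
    """One reading of the element-wise multiplication layer: a learned mask
    multiplies the features, suppressing irrelevant background activations."""
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, feats):                 # feats: (B, C, H, W)
        return feats * self.mask(feats)       # element-wise filtering

def weighted_multitask_loss(losses, weights):
    """Weighted-sum term over the per-attribute task losses."""
    return sum(w * l for w, l in zip(weights, losses))
```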
10. Map-Enhanced Ego-Lane Detection in the Missing Feature Scenarios [PDF] Back to contents
Xiaoliang Wang, Yeqiang Qian, Chunxiang Wang, Ming Yang
Abstract: As one of the most important tasks in autonomous driving systems, ego-lane detection has been extensively studied and has achieved impressive results in many scenarios. However, ego-lane detection in the missing feature scenarios is still an unsolved problem. To address this problem, previous methods have been devoted to proposing more complicated feature extraction algorithms, but they are very time-consuming and cannot deal with extreme scenarios. Different from others, this paper exploits prior knowledge contained in digital maps, which has a strong capability to enhance the performance of detection algorithms. Specifically, we employ the road shape extracted from OpenStreetMap as lane model, which is highly consistent with the real lane shape and irrelevant to lane features. In this way, only a few lane features are needed to eliminate the position error between the road shape and the real lane, and a search-based optimization algorithm is proposed. Experiments show that the proposed method can be applied to various scenarios and can run in real-time at a frequency of 20 Hz. At the same time, we evaluated the proposed method on the public KITTI Lane dataset where it achieves state-of-the-art performance. Moreover, our code will be open source after publication.
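The search-based optimization can be pictured as a one-dimensional search for the lateral offset that best snaps the OpenStreetMap road shape onto the few detected lane features. This is a deliberately simplified illustration of the idea, not the paper's algorithm:

```python
import numpy as np

def fit_lane(osm_polyline, feature_pts, offsets=np.linspace(-2.0, 2.0, 81)):
    """osm_polyline: (N, 2) road shape in the vehicle frame; feature_pts: (M, 2)."""
    best_off, best_cost = 0.0, np.inf
    for off in offsets:                           # candidate lateral shifts (meters)
        shifted = osm_polyline + np.array([off, 0.0])
        # Cost: mean distance from each lane feature to the shifted shape.
        d = np.linalg.norm(feature_pts[:, None, :] - shifted[None, :, :], axis=2)
        cost = d.min(axis=1).mean()
        if cost < best_cost:
            best_off, best_cost = off, cost
    return osm_polyline + np.array([best_off, 0.0])
```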
11. MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model [PDF] Back to contents
Han Fu, Rui Wu, Chenghao Liu, Jianling Sun
Abstract: Nowadays, driven by the increasing concern on diet and health, food computing has attracted enormous attention from both industry and research community. One of the most popular research topics in this domain is Food Retrieval, due to its profound influence on health-oriented applications. In this paper, we focus on the task of cross-modal retrieval between food images and cooking recipes. We present Modality-Consistent Embedding Network (MCEN) that learns modality-invariant representations by projecting images and texts to the same embedding space. To capture the latent alignments between modalities, we incorporate stochastic latent variables to explicitly exploit the interactions between textual and visual features. Importantly, our method learns the cross-modal alignments during training but computes embeddings of different modalities independently at inference time for the sake of efficiency. Extensive experimental results clearly demonstrate that the proposed MCEN outperforms all existing approaches on the benchmark Recipe1M dataset and requires less computational cost.
12. Learning Longterm Representations for Person Re-Identification Using Radio Signals [PDF] Back to contents
Lijie Fan, Tianhong Li, Rongyao Fang, Rumen Hristov, Yuan Yuan, Dina Katabi
Abstract: Person Re-Identification (ReID) aims to recognize a person-of-interest across different places and times. Existing ReID methods rely on images or videos collected using RGB cameras. They extract appearance features like clothes, shoes, hair, etc. Such features, however, can change drastically from one day to the next, leading to inability to identify people over extended time periods. In this paper, we introduce RF-ReID, a novel approach that harnesses radio frequency (RF) signals for longterm person ReID. RF signals traverse clothes and reflect off the human body; thus they can be used to extract more persistent human-identifying features like body size and shape. We evaluate the performance of RF-ReID on longitudinal datasets that span days and weeks, where the person may wear different clothes across days. Our experiments demonstrate that RF-ReID outperforms state-of-the-art RGB-based ReID approaches for long term person ReID. Our results also reveal two interesting features: First since RF signals work in the presence of occlusions and poor lighting, RF-ReID allows for person ReID in such scenarios. Second, unlike photos and videos which reveal personal and private information, RF signals are more privacy-preserving, and hence can help extend person ReID to privacy-concerned domains, like healthcare.
13. Model-based disentanglement of lens occlusions [PDF] Back to contents
Fabio Pizzati, Pietro Cerri, Raoul de Charette
Abstract: With lens occlusions, naive image-to-image networks fail to learn an accurate source to target mapping, due to the partial entanglement of the scene and occlusion domains. We propose an unsupervised model-based disentanglement training, which learns to disentangle scene from lens occlusion and can regress the occlusion model parameters from target database. The experiments demonstrate our method is able to handle varying types of occlusions (raindrops, dirt, watermarks, etc.) and generate highly realistic translations, qualitatively and quantitatively outperforming the state-of-the-art on multiple datasets.
14. Effect of Annotation Errors on Drone Detection with YOLOv3 [PDF] Back to contents
Aybora Koksal, Kutalmis Gokalp Ince, A. Aydin Alatan
Abstract: Following the recent advances in deep networks, object detection and tracking algorithms with deep learning backbones have been improved significantly; however, this rapid development resulted in the necessity of large amounts of annotated labels. Even if the details of such semi-automatic annotation processes for most of these datasets are not known precisely, especially for the video annotations, some automated labeling processes are usually employed. Unfortunately, such approaches might result in erroneous annotations. In this work, different types of annotation errors for the object detection problem are simulated, and the performance of a popular state-of-the-art object detector, YOLOv3, with erroneous annotations during training and testing stages is examined. Moreover, some inevitable annotation errors in the Anti-UAV Challenge dataset are also examined in this manner, while proposing a solution to correct such annotation errors of this valuable data set.
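The error types studied (localization noise, missing labels, spurious labels) can be simulated on ground-truth boxes roughly as follows; the rates and magnitudes are placeholders, not the paper's settings.

```python
import random

def corrupt_annotations(boxes, img_w, img_h, shift=0.1, p_drop=0.05, p_spurious=0.05):
    """boxes: list of (x, y, w, h) ground-truth boxes."""
    out = []
    for (x, y, w, h) in boxes:
        if random.random() < p_drop:
            continue                              # simulate a missing label
        dx = random.uniform(-shift, shift) * w    # simulate localization error
        dy = random.uniform(-shift, shift) * h
        out.append((x + dx, y + dy, w, h))
    if random.random() < p_spurious:              # simulate a spurious label
        w, h = random.uniform(20, 80), random.uniform(20, 80)
        out.append((random.uniform(0, img_w - w), random.uniform(0, img_h - h), w, h))
    return out
```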
15. Objects of violence: synthetic data for practical ML in human rights investigations [PDF] Back to contents
Lachlan Kermode, Jan Freyberg, Alican Akturk, Robert Trafford, Denis Kochetkov, Rafael Pardinas, Eyal Weizman, Julien Cornebise
Abstract: We introduce a machine learning workflow to search for, identify, and meaningfully triage videos and images of munitions, weapons, and military equipment, even when limited training data exists for the object of interest. This workflow is designed to expedite the work of OSINT ("open source intelligence") researchers in human rights investigations. It consists of three components: automatic rendering and annotating of synthetic datasets that make up for a lack of training data; training image classifiers from combined sets of photographic and synthetic data; and mtriage, an open source software that orchestrates these classifiers' deployment to triage public domain media, and visualise predictions in a web interface. We show that synthetic data helps to train classifiers more effectively, and that certain approaches yield better results for different architectures. We then demonstrate our workflow in two real-world human rights investigations: the use of the Triple-Chaser tear gas grenade against civilians, and the verification of allegations of military presence in Ukraine in 2014.
16. Face Quality Estimation and Its Correlation to Demographic and Non-Demographic Bias in Face Recognition [PDF] Back to contents
Philipp Terhörst, Jan Niklas Kolf, Naser Damer, Florian Kirchbuchner, Arjan Kuijper
Abstract: Face quality assessment aims at estimating the utility of a face image for the purpose of recognition. It is a key factor to achieve high face recognition performances. Currently, the high performance of these face recognition systems comes at the cost of a strong bias against demographic and non-demographic sub-groups. Recent work has shown that face quality assessment algorithms should adapt to the deployed face recognition system, in order to achieve highly accurate and robust quality estimations. However, this could lead to a bias transfer towards the face quality assessment, leading to discriminatory effects, e.g. during enrolment. In this work, we present an in-depth analysis of the correlation between bias in face recognition and face quality assessment. Experiments were conducted on two publicly available datasets captured under controlled and uncontrolled circumstances with two popular face embeddings. We evaluated four state-of-the-art solutions for face quality assessment towards biases to pose, ethnicity, and age. The experiments showed that the face quality assessment solutions assign significantly lower quality values towards subgroups affected by the recognition bias, demonstrating that these approaches are biased as well. This raises ethical questions towards fairness and discrimination which future works have to address.
17. DualConvMesh-Net: Joint Geodesic and Euclidean Convolutions on 3D Meshes [PDF] Back to contents
Jonas Schult, Francis Engelmann, Theodora Kontogianni, Bastian Leibe
Abstract: We propose DualConvMesh-Nets (DCM-Net) a family of deep hierarchical convolutional networks over 3D geometric data that combines two types of convolutions. The first type, geodesic convolutions, defines the kernel weights over mesh surfaces or graphs. That is, the convolutional kernel weights are mapped to the local surface of a given mesh. The second type, Euclidean convolutions, is independent of any underlying mesh structure. The convolutional kernel is applied on a neighborhood obtained from a local affinity representation based on the Euclidean distance between 3D points. Intuitively, geodesic convolutions can easily separate objects that are spatially close but have disconnected surfaces, while Euclidean convolutions can represent interactions between nearby objects better, as they are oblivious to object surfaces. To realize a multi-resolution architecture, we borrow well-established mesh simplification methods from the geometry processing domain and adapt them to define mesh-preserving pooling and unpooling operations. We experimentally show that combining both types of convolutions in our architecture leads to significant performance gains for 3D semantic segmentation, and we report competitive results on three scene segmentation benchmarks. Our models and code are publicly available.
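The difference between the two convolution types comes down to how a vertex's neighborhood is built: from 3D distance alone (Euclidean), or from hops along mesh edges (geodesic), which keeps spatially close but disconnected surfaces apart. A sketch of the two neighborhood queries under an assumed adjacency-list mesh representation:

```python
from collections import deque
import numpy as np

def euclidean_neighbors(verts, i, radius):
    """verts: (V, 3) vertex positions; ignores connectivity entirely."""
    d = np.linalg.norm(verts - verts[i], axis=1)
    return np.flatnonzero((d > 0) & (d < radius))

def geodesic_neighbors(adjacency, i, n_hops):
    """adjacency: dict vertex -> vertices sharing a mesh edge (BFS over edges)."""
    seen, frontier = {i}, deque([(i, 0)])
    while frontier:
        v, hops = frontier.popleft()
        if hops == n_hops:
            continue
        for u in adjacency[v]:
            if u not in seen:
                seen.add(u)
                frontier.append((u, hops + 1))
    return sorted(seen - {i})
```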
18. PaStaNet: Toward Human Activity Knowledge Engine [PDF] Back to contents
Yong-Lu Li, Liang Xu, Xinpeng Liu, Xijie Huang, Yue Xu, Shiyi Wang, Hao-Shu Fang, Ze Ma, Mingyang Chen, Cewu Lu
Abstract: Existing image-based activity understanding methods mainly adopt direct mapping, i.e. from image to activity concepts, which may encounter a performance bottleneck since the gap is huge. In light of this, we propose a new path: infer human part states first and then reason out the activities based on part-level semantics. Human Body Part States (PaSta) are fine-grained action semantic tokens, e.g. ⟨hand, hold, something⟩, which can compose the activities and help us step toward human activity knowledge engine. To fully utilize the power of PaSta, we build a large-scale knowledge base PaStaNet, which contains 7M+ PaSta annotations. And two corresponding models are proposed: first, we design a model named Activity2Vec to extract PaSta features, which aim to be general representations for various activities. Second, we use a PaSta-based Reasoning method to infer activities. Promoted by PaStaNet, our method achieves significant improvements, e.g. 6.4 and 13.9 mAP on full and one-shot sets of HICO in supervised learning, and 3.2 and 4.2 mAP on V-COCO and image-based AVA in transfer learning. Code and data are available at this http URL.
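The proposed two-stage path (part states first, activities second) can be sketched as follows; the dimensions, module names, and linear heads below are illustrative stand-ins for Activity2Vec and the PaSta-based reasoning model, not the released code.

```python
import torch
import torch.nn as nn

N_PARTS, PART_FEAT, N_PASTA, N_ACTS = 10, 256, 93, 600  # assumed sizes

class Activity2VecSketch(nn.Module):
    """Stage 1: per-part features -> PaSta (part-state) scores and an embedding."""
    def __init__(self):
        super().__init__()
        self.pasta_head = nn.Linear(PART_FEAT, N_PASTA)

    def forward(self, part_feats):                     # (B, N_PARTS, PART_FEAT)
        pasta_logits = self.pasta_head(part_feats)     # (B, N_PARTS, N_PASTA)
        embedding = pasta_logits.sigmoid().flatten(1)  # part-level semantic tokens
        return pasta_logits, embedding

class PaStaReasoningSketch(nn.Module):
    """Stage 2: compose activities from part-level semantics."""
    def __init__(self):
        super().__init__()
        self.act_head = nn.Linear(N_PARTS * N_PASTA, N_ACTS)

    def forward(self, embedding):
        return self.act_head(embedding)

part_feats = torch.randn(2, N_PARTS, PART_FEAT)        # e.g. pooled part features
_, emb = Activity2VecSketch()(part_feats)
print(PaStaReasoningSketch()(emb).shape)               # torch.Size([2, 600])
```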
19. Controllable Orthogonalization in Training DNNs [PDF] 返回目录
Lei Huang, Li Liu, Fan Zhu, Diwen Wan, Zehuan Yuan, Bo Li, Ling Shao
Abstract: Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1 and reduce redundancy in representation. This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI), to learn a layer-wise orthogonal weight matrix in DNNs. ONI works by iteratively stretching the singular values of a weight matrix towards 1. This property enables it to control the orthogonality of a weight matrix by its number of iterations. We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction. We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization (SN), and further outperforms SN by providing controllable orthogonality.
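The core computation, stretching singular values toward 1 with Newton's iteration, fits in a few lines. The NumPy sketch below uses the standard Newton-Schulz-style inverse-square-root iteration; the trace normalization and iteration counts are our choices for the toy example, not necessarily the paper's.

```python
import numpy as np

def oni_orthogonalize(W, n_iters=8):
    """Approximate (W W^T)^{-1/2} W with Newton's iteration.

    Each step pulls the singular values of the result closer to 1, so
    n_iters is the knob that controls how orthogonal the rows become.
    """
    V = W @ W.T
    t = np.trace(V)
    Vn = V / t                                # scale so the iteration converges
    B = np.eye(V.shape[0])
    for _ in range(n_iters):                  # Newton's iteration for Vn^{-1/2}
        B = 1.5 * B - 0.5 * B @ B @ B @ Vn
    return (B @ W) / np.sqrt(t)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))
for k in (1, 4, 12):
    Wo = oni_orthogonalize(W, n_iters=k)
    print(k, np.abs(Wo @ Wo.T - np.eye(16)).max())  # deviation shrinks with k
```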
20. Learning to Segment the Tail [PDF] 返回目录
Xinting Hu, Yi Jiang, Kaihua Tang, Jingyuan Chen, Chunyan Miao, Hanwang Zhang
Abstract: Real-world visual recognition requires handling the extreme sample imbalance in large-scale long-tailed data. We propose a "divide & conquer" strategy for the challenging LVIS task: divide the whole data into balanced parts and then apply incremental learning to conquer each one. This derives a novel learning paradigm: class-incremental few-shot learning, which is especially effective for the challenge evolving over time: 1) the class imbalance among the old-class knowledge review and 2) the few-shot data in new-class learning. We call our approach Learning to Segment the Tail (LST). In particular, we design an instance-level balanced replay scheme, which is a memory-efficient approximation to balance the instance-level samples from the old-class images. We also propose to use a meta-module for new-class learning, where the module parameters are shared across incremental phases, gaining the learning-to-learn knowledge incrementally, from the data-rich head to the data-poor tail. We empirically show that: at the expense of a little sacrifice of head-class forgetting, we can gain a significant 8.3% AP improvement for the tail classes with less than 10 instances, achieving an overall 2.0% AP boost for the whole 1,230 classes. (Code is available at this https URL.)
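A toy rendering of the instance-level balanced replay idea (data structures and names are ours): draw equally many stored instances per old class, so rehearsal batches are not dominated by head classes.

```python
import random
from collections import defaultdict

def balanced_replay(memory, per_class):
    """Sample up to per_class stored instances for each old class."""
    by_class = defaultdict(list)
    for feat, label in memory:
        by_class[label].append(feat)
    batch = []
    for label, items in by_class.items():
        k = min(per_class, len(items))
        batch += [(f, label) for f in random.sample(items, k)]
    return batch

memory = [(f"feat{i}", i % 3) for i in range(20)]  # toy memory with 3 old classes
print(len(balanced_replay(memory, per_class=2)))   # 6 = 3 classes x 2 instances
```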
21. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers [PDF] 返回目录
Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu
Abstract: We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as in most recent vision and language tasks. Our Pixel-BERT, which aligns semantic connections at the pixel and text level, solves the limitation of task-specific visual representation for vision and language tasks. It also relieves the cost of bounding box annotations and overcomes the imbalance between semantic labels in visual tasks and language semantics. To provide a better representation for downstream tasks, we pre-train a universal end-to-end model with image and sentence pairs from the Visual Genome dataset and the MS-COCO dataset. We propose to use a random pixel sampling mechanism to enhance the robustness of visual representation and to apply the Masked Language Model and Image-Text Matching as pre-training tasks. Extensive experiments on downstream tasks with our pre-trained model show that our approach achieves state-of-the-art results in downstream tasks, including Visual Question Answering (VQA), image-text retrieval, and Natural Language for Visual Reasoning for Real (NLVR). Particularly, we boost the performance of a single model on the VQA task by 2.17 points compared with SOTA under fair comparison.
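The random pixel sampling mechanism is, at its core, a per-image random subset of pixel feature tokens; a hypothetical PyTorch sketch under assumed shapes (the sample size k and feature dimensions are illustrative):

```python
import torch

def random_pixel_sample(feat_map, k):
    """Keep k random pixel tokens per image, a dropout-like regularizer
    that discourages reliance on any fixed spatial layout."""
    B, C, H, W = feat_map.shape
    tokens = feat_map.flatten(2).transpose(1, 2)          # (B, H*W, C)
    idx = torch.stack([torch.randperm(H * W)[:k] for _ in range(B)])
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))

sampled = random_pixel_sample(torch.randn(2, 64, 16, 16), k=100)
print(sampled.shape)  # torch.Size([2, 100, 64])
```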
22. Occlusion-Aware Depth Estimation with Adaptive Normal Constraints [PDF] 返回目录
Xiaoxiao Long, Lingjie Liu, Christian Theobalt, Wenping Wang
Abstract: We present a new learning-based method for multi-frame depth estimation from a color video, which is a fundamental problem in scene understanding, robot navigation or handheld 3D reconstruction. While recent learning-based methods estimate depth at high accuracy, 3D point clouds exported from their depth maps often fail to preserve important geometric features (e.g., corners, edges, planes) of man-made scenes. Widely-used pixel-wise depth errors do not specifically penalize inconsistency on these features. These inaccuracies are particularly severe when subsequent depth reconstructions are accumulated in an attempt to scan a full environment with man-made objects with this kind of features. Our depth estimation algorithm therefore introduces a Combined Normal Map (CNM) constraint, which is designed to better preserve high-curvature features and global planar regions. In order to further improve the depth estimation accuracy, we introduce a new occlusion-aware strategy that aggregates initial depth predictions from multiple adjacent views into one final depth map and one occlusion probability map for the current reference view. Our method outperforms the state-of-the-art in terms of depth estimation accuracy, and preserves essential geometric features of man-made indoor scenes much better than other algorithms.
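A miniature of the idea behind a normal-based constraint (not the paper's exact CNM construction, which fuses ground-truth normals and handles camera intrinsics): derive normals from the predicted depth by finite differences, then penalize their angular deviation from a reference normal map.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth):
    """Per-pixel normals from a depth map via finite differences
    (simplified: no camera intrinsics)."""
    dzdx = F.pad(depth[:, :, :, 1:] - depth[:, :, :, :-1], (0, 1, 0, 0))
    dzdy = F.pad(depth[:, :, 1:, :] - depth[:, :, :-1, :], (0, 0, 0, 1))
    n = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def normal_consistency_loss(pred_depth, ref_normals):
    """Penalize the angle between depth-derived and reference normals."""
    pred_n = normals_from_depth(pred_depth)
    return (1.0 - (pred_n * ref_normals).sum(dim=1)).mean()

d = torch.rand(2, 1, 8, 8, requires_grad=True)
ref = F.normalize(torch.rand(2, 3, 8, 8), dim=1)
print(normal_consistency_loss(d, ref))
```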
23. Robust Single-Image Super-Resolution via CNNs and TV-TV Minimization [PDF] 返回目录
Marija Vella, João F. C. Mota
Abstract: Single-image super-resolution is the process of increasing the resolution of an image, obtaining a high-resolution (HR) image from a low-resolution (LR) one. By leveraging large training datasets, convolutional neural networks (CNNs) currently achieve the state-of-the-art performance in this task. Yet, during testing/deployment, they fail to enforce consistency between the HR and LR images: if we downsample the output HR image, it never matches its LR input. Based on this observation, we propose to post-process the CNN outputs with an optimization problem that we call TV-TV minimization, which enforces consistency. As our extensive experiments show, such post-processing not only improves the quality of the images, in terms of PSNR and SSIM, but also makes the super-resolution task robust to operator mismatch, i.e., when the true downsampling operator is different from the one used to create the training dataset.
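In outline, the post-processing step solves a problem of the following form (our notation: x is the refined HR image, w the CNN output, b the LR input, and A the assumed downsampling operator):

```latex
\min_{x}\; \|x\|_{\mathrm{TV}} \;+\; \beta\,\|x - w\|_{\mathrm{TV}}
\quad \text{subject to} \quad A\,x = b
```

The constraint Ax = b enforces exactly the consistency the CNN alone fails to guarantee, while the second TV term keeps the refined image close to the learned output.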
24. Improving 3D Object Detection through Progressive Population Based Augmentation [PDF] 返回目录
Shuyang Cheng, Zhaoqi Leng, Ekin Dogus Cubuk, Barret Zoph, Chunyan Bai, Jiquan Ngiam, Yang Song, Benjamin Caine, Vijay Vasudevan, Congcong Li, Quoc V. Le, Jonathon Shlens, Dragomir Anguelov
Abstract: Data augmentation has been widely adopted for object detection in 3D point clouds. All previous efforts have focused on manually designing specific data augmentation methods for individual architectures, however no work has attempted to automate the design of data augmentation in 3D detection problems -- as is common in 2D image-based computer vision. In this work, we present the first attempt to automate the design of data augmentation policies for 3D object detection. We present an algorithm, termed Progressive Population Based Augmentation (PPBA). PPBA learns to optimize augmentation strategies by narrowing down the search space and adopting the best parameters discovered in previous iterations. On the KITTI test set, PPBA improves the StarNet detector by substantial margins on the moderate difficulty category of cars, pedestrians, and cyclists, outperforming all current state-of-the-art single-stage detection models. Additional experiments on the Waymo Open Dataset indicate that PPBA continues to effectively improve 3D object detection on a 20x larger dataset compared to KITTI. The magnitude of the improvements may be comparable to advances in 3D perception architectures and the gains come without an incurred cost at inference time. In subsequent experiments, we find that PPBA may be up to 10x more data efficient than baseline 3D detection models without augmentation, highlighting that 3D detection models may achieve competitive accuracy with far fewer labeled examples.
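A toy rendering of the progressive search loop (the objective, augmentation knobs, and mutation rule below are stand-ins; the real method optimizes detector metrics over a schedule of augmentation operations):

```python
import random

def mutate(params):
    """Perturb augmentation parameters around the current best (illustrative)."""
    return {k: min(1.0, max(0.0, v + random.uniform(-0.1, 0.1)))
            for k, v in params.items()}

def ppba_search(train_and_eval, init_params, population=4, generations=3):
    """Keep the best parameters found so far; explore mutations each generation."""
    best_params, best_score = init_params, train_and_eval(init_params)
    for _ in range(generations):
        for cand in [mutate(best_params) for _ in range(population)]:
            score = train_and_eval(cand)
            if score > best_score:
                best_params, best_score = cand, score
    return best_params, best_score

# Stand-in objective: pretend quality peaks when both knobs are near 0.5.
fake_metric = lambda p: -abs(p["flip_prob"] - 0.5) - abs(p["noise_mag"] - 0.5)
print(ppba_search(fake_metric, {"flip_prob": 0.2, "noise_mag": 0.8}))
```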
25. Tracking by Instance Detection: A Meta-Learning Approach [PDF] 返回目录
Guangting Wang, Chong Luo, Xiaoyan Sun, Zhiwei Xiong, Wenjun Zeng
Abstract: We consider the tracking problem as a special type of object detection problem, which we call instance detection. With proper initialization, a detector can be quickly converted into a tracker by learning the new instance from a single image. We find that model-agnostic meta-learning (MAML) offers a strategy to initialize the detector that satisfies our needs. We propose a principled three-step approach to build a high-performance tracker. First, pick any modern object detector trained with gradient descent. Second, conduct offline training (or initialization) with MAML. Third, perform domain adaptation using the initial frame. We follow this procedure to build two trackers, named Retina-MAML and FCOS-MAML, based on two modern detectors RetinaNet and FCOS. Evaluations on four benchmarks show that both trackers are competitive against state-of-the-art trackers. On OTB-100, Retina-MAML achieves the highest ever AUC of 0.712. On TrackingNet, FCOS-MAML ranks the first on the leader board with an AUC of 0.757 and the normalized precision of 0.822. Both trackers run in real-time at 40 FPS.
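Step three, domain adaptation on the initial frame, amounts to a few gradient steps on a cloned, meta-initialized detector. A hypothetical PyTorch sketch with a toy box regressor and a stand-in L1 detection loss:

```python
import copy
import torch
import torch.nn as nn

def adapt_on_first_frame(meta_detector, frame, gt_box, steps=5, lr=1e-2):
    """Clone the meta-initialized detector and fine-tune it on the single
    annotated first frame, yielding an instance-specific tracker."""
    tracker = copy.deepcopy(meta_detector)
    opt = torch.optim.SGD(tracker.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.l1_loss(tracker(frame), gt_box)  # stand-in loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tracker

# Toy "detector": flattens the frame and regresses one box (x, y, w, h).
detector = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))
frame = torch.randn(1, 3, 32, 32)
gt_box = torch.tensor([[0.4, 0.4, 0.2, 0.2]])
tracker = adapt_on_first_frame(detector, frame, gt_box)
```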
26. SSHFD: Single Shot Human Fall Detection with Occluded Joints Resilience [PDF] 返回目录
Umar Asif, Stefan Von Cavallar, Jianbin Tang, Stefan Harre
Abstract: Falling can have fatal consequences for elderly people especially if the fallen person is unable to call for help due to loss of consciousness or any injury. Automatic fall detection systems can assist through prompt fall alarms and by minimizing the fear of falling when living independently at home. Existing vision-based fall detection systems lack generalization to unseen environments due to challenges such as variations in physical appearances, different camera viewpoints, occlusions, and background clutter. In this paper, we explore ways to overcome the above challenges and present Single Shot Human Fall Detector (SSHFD), a deep learning based framework for automatic fall detection from a single image. This is achieved through two key innovations. First, we present a human pose based fall representation which is invariant to appearance characteristics. Second, we present neural network models for 3d pose estimation and fall recognition which are resilient to missing joints due to occluded body parts. Experiments on public fall datasets show that our framework successfully transfers knowledge of 3d pose estimation and fall recognition learnt purely from synthetic data to unseen real-world data, showcasing its generalization capability for accurate fall detection in real-world scenarios.
27. Alleviating Semantic-level Shift: A Semi-supervised Domain Adaptation Method for Semantic Segmentation [PDF] 返回目录
Zhonghao Wang, Yunchao Wei, Rogerio Feris, Jinjun Xiong, Wen-Mei Hwu, Thomas S. Huang, Honghui Shi
Abstract: Learning segmentation from synthetic data and adapting to real data can significantly relieve human efforts in labelling pixel-level masks. A key challenge of this task is how to alleviate the data distribution discrepancy between the source and target domains, i.e. reducing domain shift. The common approach to this problem is to minimize the discrepancy between feature distributions from different domains through adversarial training. However, directly aligning the feature distribution globally cannot guarantee consistency from a local view (i.e. semantic-level), which prevents certain semantic knowledge learned on the source domain from being applied to the target domain. To tackle this issue, we propose a semi-supervised approach named Alleviating Semantic-level Shift (ASS), which can successfully promote the distribution consistency from both global and local views. Specifically, leveraging a small number of labeled data from the target domain, we directly extract semantic-level feature representations from both the source and the target domains by averaging the features corresponding to same categories advised by pixel-level masks. We then feed the produced features to the discriminator to conduct semantic-level adversarial learning, which collaborates with the adversarial learning from the global view to better alleviate the domain shift. We apply our ASS to two domain adaptation tasks, from GTA5 to Cityscapes and from Synthia to Cityscapes. Extensive experiments demonstrate that: (1) ASS can significantly outperform the current unsupervised state-of-the-arts by employing a small number of annotated samples from the target domain; (2) ASS can beat the oracle model trained on the whole target dataset by over 3 points by augmenting the synthetic source data with annotated samples from the target domain without suffering from the prevalent problem of overfitting to the source domain.
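The semantic-level feature extraction reduces to class-wise masked average pooling over pixel features; a PyTorch sketch under assumed shapes (the discriminator and the adversarial loss are omitted):

```python
import torch

def semantic_level_features(features, masks, n_classes):
    """Average the pixel features belonging to each class, giving one
    representation per category for semantic-level alignment."""
    B, C, H, W = features.shape
    feats = features.flatten(2)                    # (B, C, H*W)
    labels = masks.flatten(1)                      # (B, H*W)
    per_class = []
    for c in range(n_classes):
        sel = (labels == c).float().unsqueeze(1)   # (B, 1, H*W)
        denom = sel.sum(-1).clamp(min=1.0)
        per_class.append((feats * sel).sum(-1) / denom)  # (B, C)
    return torch.stack(per_class, dim=1)           # (B, n_classes, C)

f = torch.randn(2, 64, 16, 16)
m = torch.randint(0, 5, (2, 16, 16))
print(semantic_level_features(f, m, n_classes=5).shape)  # torch.Size([2, 5, 64])
```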
28. Graph-based fusion for change detection in multi-spectral images [PDF] 返回目录
David Alejandro Jimenez Sierra, Hernán Darío Benítez Restrepo, Hernán Darío Vargas Cardonay, Jocelyn Chanussot
Abstract: In this paper we address the problem of change detection in multi-spectral images by proposing a data-driven framework of graph-based data fusion. The main steps of the proposed approach are: (i) The generation of a multi-temporal pixel based graph, by the fusion of intra-graphs of each temporal data; (ii) the use of Nyström extension to obtain the eigenvalues and eigenvectors of the fused graph, and the selection of the final change map. We validated our approach in two real cases of remote sensing according to both qualitative and quantitative analyses. The results confirm the potential of the proposed graph-based change detection algorithm outperforming state-of-the-art methods.
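Step (ii) relies on the Nyström extension, which recovers approximate eigenvectors of a large affinity matrix from a small sampled block. A NumPy sketch (kernel choice and sample size are ours, and the usual scaling constants are dropped):

```python
import numpy as np

def nystrom_eigs(K_mm, K_nm):
    """Approximate eigenpairs of an n x n affinity matrix from an m x m
    sampled block K_mm and the n x m cross-affinities K_nm."""
    vals, vecs = np.linalg.eigh(K_mm)
    vals = np.maximum(vals, 1e-12)        # guard against tiny negative values
    return vals, K_nm @ vecs / vals       # extend eigenvectors to all n points

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
K = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=-1) ** 2)
idx = rng.choice(200, 20, replace=False)
vals, vecs = nystrom_eigs(K[np.ix_(idx, idx)], K[:, idx])
print(vecs.shape)  # (200, 20): approximate eigenvectors, no full decomposition
```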
29. Scene-Adaptive Video Frame Interpolation via Meta-Learning [PDF] 返回目录
Myungsub Choi, Janghoon Choi, Sungyong Baik, Tae Hyun Kim, Kyoung Mu Lee
Abstract: Video frame interpolation is a challenging problem because there are different scenarios for each video depending on the variety of foreground and background motion, frame rate, and occlusion. It is therefore difficult for a single network with fixed parameters to generalize across different videos. Ideally, one could have a different network for each scenario, but this is computationally infeasible for practical applications. In this work, we propose to adapt the model to each video by making use of additional information that is readily available at test time and yet has not been exploited in previous works. We first show the benefits of `test-time adaptation' through simple fine-tuning of a network, then we greatly improve its efficiency by incorporating meta-learning. We obtain significant performance gains with only a single gradient update without any additional parameters. Finally, we show that our meta-learning framework can be easily employed to any video frame interpolation network and can consistently improve its performance on multiple benchmark datasets.
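Test-time adaptation here exploits supervision the test video provides for free: any three consecutive input frames give a training pair (the two outer frames) plus a ground-truth middle frame. A hypothetical sketch of the single-gradient-update variant, with a toy network standing in for a real interpolation model:

```python
import copy
import torch
import torch.nn.functional as F

def test_time_adapt(interp_net, frames, lr=1e-4):
    """One gradient step on a triplet the test video itself supplies."""
    net = copy.deepcopy(interp_net)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    f0, f1, f2 = frames                    # three consecutive frames
    loss = F.l1_loss(net(f0, f2), f1)      # f1 is a free ground-truth middle frame
    opt.zero_grad()
    loss.backward()
    opt.step()
    return net

class ToyInterp(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(6, 3, 3, padding=1)
    def forward(self, a, b):
        return self.conv(torch.cat([a, b], dim=1))

frames = [torch.randn(1, 3, 16, 16) for _ in range(3)]
adapted = test_time_adapt(ToyInterp(), frames)
```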
30. Consistent Multiple Sequence Decoding [PDF] 返回目录
Bicheng Xu, Leonid Sigal
Abstract: Sequence decoding is one of the core components of most visual-lingual models. However, typical neural decoders, when faced with decoding multiple, possibly correlated, sequences of tokens, resort to simple independent decoding schemes. In this paper, we introduce a consistent multiple sequence decoding architecture, which, while relatively simple, is general and allows for consistent and simultaneous decoding of an arbitrary number of sequences. Our formulation utilizes a consistency fusion mechanism, implemented using message passing in a Graph Neural Network (GNN), to aggregate context from related decoders. This context is then utilized as a secondary input, in addition to previously generated output, to make a prediction at a given step of decoding. Self-attention, in the GNN, is used to modulate the fusion mechanism locally at each node and each step in the decoding process. We show the efficacy of our consistent multiple sequence decoder on the task of dense relational image captioning and illustrate state-of-the-art performance (+5.2% in mAP) on the task. More importantly, we illustrate that the decoded sentences, for the same regions, are more consistent (improvement of 9.5%), while across images and regions they maintain diversity.
31. Monocular Camera Localization in Prior LiDAR Maps with 2D-3D Line Correspondences [PDF] 返回目录
Huai Yu, Weikun Zhen, Wen Yang, Ji Zhang, Sebastian Scherer
Abstract: Light-weight camera localization in existing maps is essential for vision-based navigation. Currently, visual and visual-inertial odometry (VO&VIO) techniques are well-developed for state estimation but with inevitable accumulated drifts and pose jumps upon loop closure. To overcome these problems, we propose an efficient monocular camera localization method in prior LiDAR maps using directly estimated 2D-3D line correspondences. To handle the appearance differences and modality gaps between untextured point clouds and images, geometric 3D lines are extracted offline from LiDAR maps while robust 2D lines are extracted online from video sequences. With the pose prediction from VIO, we can efficiently obtain coarse 2D-3D line correspondences. After that, the camera poses and 2D-3D correspondences are iteratively optimized by minimizing the projection error of correspondences and rejecting outliers. The experimental results on the EurocMav dataset and our collected dataset demonstrate that the proposed method can efficiently estimate camera poses without accumulated drifts or pose jumps in urban environments. The code and our collected data are available at this https URL.
32. Robust Single Rotation Averaging [PDF] 返回目录
Seong Hun Lee, Javier Civera
Abstract: We propose a novel method for single rotation averaging using the Weiszfeld algorithm. Our contribution is threefold: First, we propose a robust initialization based on the elementwise median of the input rotation matrices. Our initial solution is more accurate and robust than the commonly used chordal $L_2$-mean. Second, we propose an outlier rejection scheme that can be incorporated in the Weiszfeld algorithm to improve the robustness of $L_1$ rotation averaging. Third, we propose a method for approximating the chordal $L_1$-mean using the Weiszfeld algorithm. An extensive evaluation shows that both our method and the state of the art perform equally well with the proposed outlier rejection scheme, but ours is $2-4$ times faster.
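The third contribution, approximating the chordal L1-mean with the Weiszfeld algorithm, can be sketched as iteratively reweighted averaging followed by projection onto SO(3); the initialization and stopping rule below are simplifications of ours.

```python
import numpy as np

def project_to_so3(M):
    """Closest rotation to M in the Frobenius sense (via SVD)."""
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R

def weiszfeld_chordal_l1_mean(Rs, n_iters=20, eps=1e-9):
    """Reweight each rotation by the inverse of its chordal distance to the
    current estimate, average, and project back onto SO(3)."""
    Rs = np.asarray(Rs)
    R = project_to_so3(Rs.mean(axis=0))
    for _ in range(n_iters):
        w = np.array([1.0 / max(np.linalg.norm(Ri - R), eps) for Ri in Rs])
        R = project_to_so3(np.einsum('i,ijk->jk', w / w.sum(), Rs))
    return R

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

# Small perturbations of the identity plus one gross outlier.
Rs = [rot_x(0.01), rot_x(-0.02), rot_x(0.015), rot_x(2.5)]
print(np.round(weiszfeld_chordal_l1_mean(Rs), 3))  # near identity despite the outlier
```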
33. Memory-Efficient Incremental Learning Through Feature Adaptation [PDF] 返回目录
Ahmet Iscen, Jeffrey Zhang, Svetlana Lazebnik, Cordelia Schmid
Abstract: In this work we introduce an approach for incremental learning, which preserves feature descriptors instead of images unlike most existing work. Keeping such low-dimensional embeddings instead of images reduces the memory footprint significantly. We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding images. Feature adaptation is learned with a multi-layer perceptron, which is trained on feature pairs of an image corresponding to the outputs of the original and updated network. We validate experimentally that such a transformation generalizes well to the features of the previous set of classes, and maps features to a discriminative subspace in the feature space. As a result, the classifier is optimized jointly over new and old classes without requiring old class images. Experimental results show that our method achieves state-of-the-art classification accuracy in incremental learning benchmarks, while having at least an order of magnitude lower memory footprint compared to image preserving strategies.
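The feature adaptation step can be pictured as a small MLP trained on (old-space, new-space) feature pairs and then applied to the stored descriptors, so old-class memories follow the model into the updated feature space. A hypothetical PyTorch sketch (dimensions and training loop are illustrative):

```python
import torch
import torch.nn as nn

# Adapter from the old feature space to the updated one.
adapter = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

# Stand-in pairs: features of current images under the old and new networks.
old_feats, new_feats = torch.randn(256, 512), torch.randn(256, 512)
for _ in range(100):
    loss = nn.functional.mse_loss(adapter(old_feats), new_feats)
    opt.zero_grad()
    loss.backward()
    opt.step()

stored = torch.randn(1000, 512)          # old-class memory, no images kept
migrated = adapter(stored).detach()      # usable alongside new-class features
```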
34. Revisiting Pose-Normalization for Fine-Grained Few-Shot Recognition [PDF] 返回目录
Luming Tang, Davis Wertheimer, Bharath Hariharan
Abstract: Few-shot, fine-grained classification requires a model to learn subtle, fine-grained distinctions between different classes (e.g., birds) based on a few images alone. This requires a remarkable degree of invariance to pose, articulation and background. A solution is to use pose-normalized representations: first localize semantic parts in each image, and then describe images by characterizing the appearance of each part. While such representations are out of favor for fully supervised classification, we show that they are extremely effective for few-shot fine-grained classification. With a minimal increase in model capacity, pose normalization improves accuracy between 10 and 20 percentage points for shallow and deep architectures, generalizes better to new domains, and is effective for multiple few-shot algorithms and network backbones. Code is available at this https URL
35. Adversarial Learning for Personalized Tag Recommendation [PDF] 返回目录
Erik Quintanilla, Yogesh Rawat, Andrey Sakryukin, Mubarak Shah, Mohan Kankanhalli
Abstract: We have recently seen great progress in image classification due to the success of deep convolutional neural networks and the availability of large-scale datasets. Most of the existing work focuses on single-label image classification. However, there are usually multiple tags associated with an image. The existing works on multi-label classification are mainly based on lab curated labels. Humans assign tags to their images differently, which is mainly based on their interests and personal tagging behavior. In this paper, we address the problem of personalized tag recommendation and propose an end-to-end deep network which can be trained on large-scale datasets. The user-preference is learned within the network in an unsupervised way where the network performs joint optimization for user-preference and visual encoding. A joint training of user-preference and visual encoding allows the network to efficiently integrate the visual preference with tagging behavior for a better user recommendation. In addition, we propose the use of adversarial learning, which enforces the network to predict tags resembling user-generated tags. We demonstrate the effectiveness of the proposed model on two different large-scale and publicly available datasets, YFCC100M and NUS-WIDE. The proposed method achieves significantly better performance on both the datasets when compared to the baselines and other state-of-the-art methods. The code is publicly available at this https URL.
36. Generalized Zero-Shot Learning Via Over-Complete Distribution [PDF]
Rohit Keshari, Richa Singh, Mayank Vatsa
Abstract: A well-trained and generalized deep neural network (DNN) should be robust to both seen and unseen classes. However, the performance of most existing supervised DNN algorithms degrades on classes that are unseen in the training set. To learn a discriminative classifier which yields good performance in Zero-Shot Learning (ZSL) settings, we propose to generate an Over-Complete Distribution (OCD) using a Conditional Variational Autoencoder (CVAE) for both seen and unseen classes. In order to enforce the separability between classes and reduce the class scatter, we propose the use of Online Batch Triplet Loss (OBTL) and Center Loss (CL) on the generated OCD. The effectiveness of the framework is evaluated using both Zero-Shot Learning and Generalized Zero-Shot Learning protocols on three publicly available benchmark databases, SUN, CUB and AWA2. The results show that generating over-complete distributions and forcing the classifier to learn a transform function from overlapping to non-overlapping distributions can improve performance on both seen and unseen classes.
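For reference, the standard forms of the two losses applied to the generated features are given below (the paper's online-batch variant may differ in sampling and weighting); here $f_a, f_p, f_n$ are anchor/positive/negative features, $m$ a margin, $B$ the batch size, and $c_{y_i}$ the center of class $y_i$:

```latex
L_{\mathrm{triplet}} = \max\!\big(0,\; d(f_a, f_p) - d(f_a, f_n) + m\big), \qquad
L_{\mathrm{center}} = \frac{1}{2} \sum_{i=1}^{B} \lVert f_i - c_{y_i} \rVert_2^2 .
```

The triplet term enforces inter-class separability while the center term shrinks intra-class scatter, matching the two goals stated in the abstract.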
37. Synchronizing Probability Measures on Rotations via Optimal Transport [PDF]
Tolga Birdal, Michael Arbel, Umut Şimşekli, Leonidas Guibas
Abstract: We introduce a new paradigm, $\textit{measure synchronization}$, for synchronizing graphs with measure-valued edges. We formulate this problem as maximization of the cycle-consistency in the space of probability measures over relative rotations. In particular, we aim at estimating marginal distributions of absolute orientations by synchronizing the $\textit{conditional}$ ones, which are defined on the Riemannian manifold of quaternions. Such graph optimization on distributions-on-manifolds enables a natural treatment of multimodal hypotheses, ambiguities and uncertainties arising in many computer vision applications such as SLAM, SfM, and object pose estimation. We first formally define the problem as a generalization of the classical rotation graph synchronization, where in our case the vertices denote probability measures over rotations. We then measure the quality of the synchronization by using Sinkhorn divergences, which reduces to other popular metrics such as Wasserstein distance or the maximum mean discrepancy as limit cases. We propose a nonparametric Riemannian particle optimization approach to solve the problem. Even though the problem is non-convex, by drawing a connection to the recently proposed sparse optimization methods, we show that the proposed algorithm converges to the global optimum in a special case of the problem under certain conditions. Our qualitative and quantitative experiments show the validity of our approach and we bring in new perspectives to the study of synchronization.
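As a reminder of the limit-case claim in the abstract, the Sinkhorn divergence debiases entropy-regularized optimal transport (notation ours):

```latex
OT_\epsilon(\mu,\nu) = \min_{\pi \in \Pi(\mu,\nu)} \int c(x,y)\, \mathrm{d}\pi(x,y)
  + \epsilon\, \mathrm{KL}\!\left(\pi \,\middle\|\, \mu \otimes \nu\right),
\qquad
S_\epsilon(\mu,\nu) = OT_\epsilon(\mu,\nu) - \tfrac{1}{2}\, OT_\epsilon(\mu,\mu) - \tfrac{1}{2}\, OT_\epsilon(\nu,\nu),
```

with $\epsilon \to 0$ recovering the (Wasserstein) optimal-transport cost and $\epsilon \to \infty$ yielding a maximum mean discrepancy.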
38. Background Matting: The World is Your Green Screen [PDF]
Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steve Seitz, Ira Kemelmacher-Shlizerman
Abstract: We propose a method for creating a matte -- the per-pixel foreground color and alpha -- of a person by taking photos or videos in an everyday setting with a handheld camera. Most existing matting methods require a green-screen background or a manually created trimap to produce a good matte. Automatic, trimap-free methods are appearing, but are not of comparable quality. In our trimap-free approach, we ask the user to take an additional photo of the background without the subject at the time of capture. This step requires a small amount of foresight but is far less time-consuming than creating a trimap. We train a deep network with an adversarial loss to predict the matte. We first train a matting network with a supervised loss on ground-truth data with synthetic composites. To bridge the domain gap to real imagery with no labeling, we train another matting network guided by the first network and by a discriminator that judges the quality of composites. We demonstrate results on a wide variety of photos and videos and show significant improvement over the state of the art.
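The method rests on the classical per-pixel compositing equation, and the extra background photo is what makes the otherwise ill-posed inverse problem tractable (notation ours):

```latex
I = \alpha F + (1 - \alpha) B ,
```

where $I$ is the observed image, $F$ the foreground color, $\alpha$ the opacity, and $B$ the background. Since the user's clean-plate photo approximates $B$, the networks effectively only need to regress $F$ and $\alpha$, a far better-constrained problem than trimap-free matting from $I$ alone.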
39. GraphChallenge.org Sparse Deep Neural Network Performance [PDF]
Jeremy Kepner, Simon Alford, Vijay Gadepally, Michael Jones, Lauren Milechin, Albert Reuther, Ryan Robinett, Sid Samsi
Abstract: The MIT/IEEE/Amazon GraphChallenge (this http URL) encourages community approaches to developing new solutions for analyzing graphs and sparse data. Sparse AI analytics present unique scalability difficulties. The Sparse Deep Neural Network (DNN) Challenge draws upon prior challenges from machine learning, high performance computing, and visual analytics to create a challenge that is reflective of emerging sparse AI systems. The sparse DNN challenge is based on a mathematically well-defined DNN inference computation and can be implemented in any programming environment. In 2019 several sparse DNN challenge submissions were received from a wide range of authors and organizations. This paper presents a performance analysis of the best performers among these submissions. These submissions show that their state-of-the-art sparse DNN execution time, $T_{\rm DNN}$, is a strong function of the number of DNN operations performed, $N_{\rm op}$. The sparse DNN challenge provides a clear picture of current sparse DNN systems and underscores the need for new innovations to achieve high performance on very large sparse DNNs.
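The challenge's inference computation is a sequence of sparse matrix products followed by a clamped ReLU. A minimal SciPy sketch of one layer is below; the activation cap of 32 and the scalar per-layer bias follow our reading of the challenge specification and should be treated as assumptions:

```python
import numpy as np
from scipy import sparse

YMAX = 32.0  # assumed activation cap from the challenge spec

def sparse_dnn_layer(Y, W, b):
    Z = Y @ W                      # sparse-sparse matrix product (CSR)
    Z.data += b                    # bias applied to stored (nonzero) entries
    Z.data = np.minimum(np.maximum(Z.data, 0.0), YMAX)  # clamped ReLU
    Z.eliminate_zeros()            # keep the representation truly sparse
    return Z

# Toy inputs; real challenge layers reach millions of neurons.
Y = sparse.random(1024, 1024, density=0.01, format='csr')
W = sparse.random(1024, 1024, density=0.01, format='csr')
Y = sparse_dnn_layer(Y, W, b=-0.3)
```

Because each layer is a sparse-sparse product, total work scales with the number of nonzero operations, which is why $T_{\rm DNN}$ tracks $N_{\rm op}$ so closely.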
40. Single-Image HDR Reconstruction by Learning to Reverse the Camera Pipeline [PDF]
Yu-Lun Liu, Wei-Sheng Lai, Yu-Sheng Chen, Yi-Lung Kao, Ming-Hsuan Yang, Yung-Yu Chuang, Jia-Bin Huang
Abstract: Recovering a high dynamic range (HDR) image from a single low dynamic range (LDR) input image is challenging due to missing details in under-/over-exposed regions caused by quantization and saturation of camera sensors. In contrast to existing learning-based methods, our core idea is to incorporate the domain knowledge of the LDR image formation pipeline into our model. We model the HDR-to-LDR image formation pipeline as (1) dynamic range clipping, (2) non-linear mapping from a camera response function, and (3) quantization. We then propose to learn three specialized CNNs to reverse these steps. By decomposing the problem into specific sub-tasks, we impose effective physical constraints to facilitate the training of individual sub-networks. Finally, we jointly fine-tune the entire model end-to-end to reduce error accumulation. With extensive quantitative and qualitative experiments on diverse image datasets, we demonstrate that the proposed method performs favorably against state-of-the-art single-image HDR reconstruction algorithms.
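The three sub-networks invert the three stages of the forward model, which (in our own notation) can be written for scene irradiance $E$, camera response function $f$, and $b$-bit quantizer $\mathcal{Q}_b$ as:

```latex
I_{\mathrm{LDR}} = \mathcal{Q}_b\!\big( f(\, \min(E, 1) \,) \big),
```

so reconstruction proceeds in reverse order: dequantization, inversion of the estimated response via $f^{-1}$, and finally hallucination of content in the clipped highlight regions.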
41. Introducing Anisotropic Minkowski Functionals for Local Structure Analysis and Prediction of Biomechanical Strength of Proximal Femur Specimens [PDF]
Titas De
Abstract: Bone fragility and fractures caused by osteoporosis or injury are prevalent in adults over the age of 50 and can reduce their quality of life. Hence, predicting biomechanical bone strength, specifically of the proximal femur, through non-invasive imaging-based methods is an important goal for the diagnosis of osteoporosis as well as for estimating fracture risk. Dual X-ray absorptiometry (DXA) has been used as a standard clinical procedure for the assessment and diagnosis of bone strength and osteoporosis through bone mineral density (BMD) measurements. However, previous studies have shown that quantitative computed tomography (QCT) can be more sensitive and specific for trabecular bone characterization because it reduces the overlap effects and interference from the surrounding soft tissue and cortical shell. This study proposes a new method to predict the bone strength of proximal femur specimens from quantitative multi-detector computed tomography (MDCT) images. Texture analysis methods such as conventional statistical moments (BMD mean), Isotropic Minkowski Functionals (IMF) and Anisotropic Minkowski Functionals (AMF) are used to quantify BMD properties of the trabecular bone micro-architecture. Combinations of these extracted features are then used to predict the biomechanical strength of the femur specimens using sophisticated machine learning techniques such as multiregression (MultiReg) and support vector regression with a linear kernel (SVRlin). The prediction performance achieved with these feature sets is compared, using root mean square error (RMSE), to the standard approach that uses the mean BMD of the specimens with multiregression models.
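For context, the four scalar Minkowski functionals of a 3D body $K$ with smooth boundary $\partial K$ and principal curvatures $\kappa_1, \kappa_2$ are given below; the anisotropic versions used here add directional weighting to these integrals:

```latex
V(K) = \int_K \mathrm{d}v, \qquad
S(K) = \int_{\partial K} \mathrm{d}s, \qquad
M(K) = \frac{1}{2}\int_{\partial K} (\kappa_1 + \kappa_2)\, \mathrm{d}s, \qquad
\chi(K) = \frac{1}{4\pi}\int_{\partial K} \kappa_1 \kappa_2 \, \mathrm{d}s .
```

These capture volume, surface area, integral mean curvature, and the Euler characteristic, respectively, and together summarize the topology and geometry of the trabecular micro-architecture.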
42. Multi-Modal Video Forensic Platform for Investigating Post-Terrorist Attack Scenarios [PDF]
Alexander Schindler, Andrew Lindley, Anahid Jalali, Martin Boyer, Sergiu Gordea, Ross King
Abstract: The forensic investigation of a terrorist attack poses a significant challenge to the investigative authorities, as often several thousand hours of video footage must be viewed. Large-scale Video Analytic Platforms (VAP) assist law enforcement agencies (LEA) in identifying suspects and securing evidence. Current platforms focus primarily on the integration of different computer vision methods and thus are restricted to a single modality. We present a video analytic platform that integrates visual and audio analytic modules and fuses information from surveillance cameras and video uploads from eyewitnesses. Videos are analyzed according to their acoustic and visual content. Specifically, Audio Event Detection is applied to index the content according to attack-specific acoustic concepts. Audio similarity search is utilized to identify similar video sequences recorded from different perspectives. Visual object detection and tracking are used to index the content according to relevant concepts. Innovative user-interface concepts are introduced to harness the full potential of the heterogeneous results of the analytical modules, allowing investigators to follow up more quickly on leads and eyewitness reports.
43. Deep-n-Cheap: An Automated Search Framework for Low Complexity Deep Learning [PDF]
Sourya Dey, Saikrishna C. Kanala, Keith M. Chugg, Peter A. Beerel
Abstract: We present Deep-n-Cheap -- an open-source AutoML framework to search for deep learning models. This search includes both architecture and training hyperparameters, and supports convolutional neural networks and multi-layer perceptrons. Our framework is targeted for deployment on both benchmark and custom datasets, and as a result, offers a greater degree of search space customizability as compared to a more limited search over only pre-existing models from literature. We also introduce the technique of 'search transfer', which demonstrates the generalization capabilities of the models found by our framework to multiple datasets. Deep-n-Cheap includes a user-customizable complexity penalty which trades off performance with training time or number of parameters. Specifically, our framework results in models offering performance comparable to state-of-the-art while taking 1-2 orders of magnitude less time to train than models from other AutoML and model search frameworks. Additionally, this work investigates and develops various insights regarding the search process. In particular, we show the superiority of a greedy strategy and justify our choice of Bayesian optimization as the primary search methodology over random / grid search.
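The complexity penalty can be read as a scalarized search objective over candidate configurations. The form below is only an illustration of the idea (the framework's exact formulation, names, and constants may differ):

```python
import math

# Illustrative complexity-penalized objective: lower is better. `complexity`
# could be training wall-clock time or parameter count; w_c = 0 searches for
# accuracy alone, while larger w_c favors cheaper models. (Form assumed, not
# the framework's exact definition.)
def search_objective(val_loss: float, complexity: float, w_c: float) -> float:
    return math.log(val_loss) + w_c * math.log(complexity)

print(search_objective(val_loss=0.35, complexity=120.0, w_c=0.1))
```

Bayesian optimization then minimizes this scalar over the joint space of architecture and training hyperparameters.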
44. Learning Representations For Images With Hierarchical Labels [PDF]
Ankit Dhall
Abstract: Image classification has been studied extensively, but there has been limited work on using non-conventional, external guidance other than traditional image-label pairs to train such models. In this thesis we present a set of methods to leverage information about the semantic hierarchy induced by class labels. In the first part of the thesis, we inject label-hierarchy knowledge into an arbitrary classifier and empirically show that the availability of such external semantic information, in conjunction with the visual semantics from images, boosts overall performance. Taking a step further in this direction, we model the label-label and label-image interactions more explicitly by using order-preserving embedding-based models, prevalent in natural language, and tailor them to the domain of computer vision to perform image classification. Although contrasting in nature, both the CNN classifiers injected with hierarchical information and the embedding-based models outperform a hierarchy-agnostic model on the newly presented, real-world ETH Entomological Collection image dataset.
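Order-preserving models of this kind are typically trained with the order-embedding penalty of Vendrov et al., which turns the hierarchy's partial order into a coordinate-wise constraint on embeddings $u, v$ (which endpoint must dominate is a sign convention, and the thesis may use a variant):

```latex
E(u, v) = \big\lVert \max(0,\; v - u) \big\rVert_2^2 ,
```

which is zero exactly when $u$ dominates $v$ in every coordinate and grows with each violated coordinate; pairs related in the hierarchy are penalized by $E$ directly, while unrelated (negative) pairs are pushed to keep $E$ above a margin.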
45. Go Fetch: Mobile Manipulation in Unstructured Environments [PDF]
Kenneth Blomqvist, Michel Breyer, Andrei Cramariuc, Julian Förster, Margarita Grinvald, Florian Tschopp, Jen Jen Chung, Lionel Ott, Juan Nieto, Roland Siegwart
Abstract: With humankind facing new and increasingly large-scale challenges in the medical and domestic spheres, automation of the service sector carries tremendous potential for improved efficiency, quality, and safety of operations. Mobile robotics can offer solutions with a high degree of mobility and dexterity; however, these complex systems require a multitude of heterogeneous components to be carefully integrated into one consistent framework. This work presents a mobile manipulation system that combines perception, localization, navigation, motion planning and grasping skills into one common workflow for fetch-and-carry applications in unstructured indoor environments. The tight integration across the various modules is experimentally demonstrated on the task of finding a commonly available object in an office environment, grasping it, and delivering it to a desired drop-off location. The accompanying video is available at this https URL.
46. End-To-End Convolutional Neural Network for 3D Reconstruction of Knee Bones From Bi-Planar X-Ray Images [PDF]
Yoni Kasten, Daniel Doktofsky, Ilya Kovler
Abstract: We present an end-to-end Convolutional Neural Network (CNN) approach for 3D reconstruction of knee bones directly from two bi-planar X-ray images. Clinically, capturing 3D models of the bones is crucial for surgical planning, implant fitting, and postoperative evaluation. X-ray imaging significantly reduces the exposure of patients to ionizing radiation compared to Computed Tomography (CT) imaging, and is much more common and inexpensive compared to Magnetic Resonance Imaging (MRI) scanners. However, retrieving 3D models from such 2D scans is extremely challenging. In contrast to the common approach of statistically modeling the shape of each bone, our deep network learns the distribution of the bones' shapes directly from the training images. We train our model with both supervised and unsupervised losses using Digitally Reconstructed Radiograph (DRR) images generated from CT scans. To apply our model to X-ray data, we use style transfer to transform between X-ray and DRR modalities. As a result, at test time, without further optimization, our solution directly outputs a 3D reconstruction from a pair of bi-planar X-ray images, while preserving geometric constraints. Our results indicate that our deep learning model is very efficient, generalizes well and produces high quality reconstructions.
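DRRs are what make CT volumes usable as X-ray training data: a DRR simulates a radiograph by integrating the CT attenuation coefficient $\mu$ along each source-to-detector ray (notation ours):

```latex
\mathrm{DRR}(u, v) = \int_{\mathrm{ray}(u,v)} \mu\big(r(t)\big)\, \mathrm{d}t ,
```

which yields paired 2D projections and 3D ground truth for free; the style-transfer step then narrows the remaining appearance gap between synthetic DRRs and real X-rays.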
47. Image Denoising Using Sparsifying Transform Learning and Weighted Singular Values Minimization [PDF]
Yanwei Zhao, Ping Yang, Qiu Guan, Jianwei Zheng, Wanliang Wang
Abstract: In image denoising (IDN), the low-rank property is usually considered an important image prior. As a convex relaxation of low rank, nuclear-norm-based algorithms and their variants have attracted significant attention. These algorithms can be collectively called image-domain methods, whose common drawback is that they require a great number of iterations to reach an acceptable solution. Meanwhile, the sparsity of images in a certain transform domain has also been exploited for image denoising. Sparsifying transform learning algorithms can achieve extremely fast computation as well as desirable performance. By combining the advantages of the image domain and the transform domain in a general framework, we propose a sparsifying transform learning and weighted singular values minimization method (STLWSM) for IDN. The proposed method makes full use of the strengths of both domains. To solve the non-convex cost function, we also present an efficient alternative solution for acceleration. Experimental results show that the proposed STLWSM achieves improvement both visually and quantitatively, by a large margin, over state-of-the-art approaches that operate in a single domain alone. It also needs far fewer iterations than all the image-domain algorithms.
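The weighted singular values minimization referred to here generalizes the nuclear norm by weighting each singular value $\sigma_i(X)$ (standard definitions; the paper's weighting scheme may differ):

```latex
\lVert X \rVert_{w,*} = \sum_i w_i\, \sigma_i(X), \qquad
\hat{X} = U\, \mathcal{S}_w(\Sigma)\, V^{\top}, \quad
[\mathcal{S}_w(\Sigma)]_{ii} = \max(\sigma_i - w_i,\, 0),
```

where, for non-descending weights $w_i$, the weighted singular value thresholding operator $\mathcal{S}_w$ gives the closed-form proximal step: larger, more informative singular values are shrunk less than the small ones that mostly carry noise.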
48. Object-Centric Image Generation with Factored Depths, Locations, and Appearances [PDF]
Titas Anciukevicius, Christoph H. Lampert, Paul Henderson
Abstract: We present a generative model of images that explicitly reasons over the set of objects they show. Our model learns a structured latent representation that separates objects from each other and from the background; unlike prior works, it explicitly represents the 2D position and depth of each object, as well as an embedding of its segmentation mask and appearance. The model can be trained from images alone in a purely unsupervised fashion without the need for object masks or depth information. Moreover, it always generates complete objects, even though a significant fraction of training images contain occlusions. Finally, we show that our model can infer decompositions of novel images into their constituent objects, including accurate prediction of depth ordering and segmentation of occluded parts.
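Because the latent representation factors each object's depth, mask, and appearance explicitly, rendering reduces to depth-ordered alpha compositing. The numpy sketch below illustrates that decoder-side step under our own assumptions (larger depth means farther away; names and shapes are hypothetical, not the authors' code):

```python
import numpy as np

def compose(objects, background):
    """Back-to-front alpha compositing of (depth, alpha, rgb) objects."""
    canvas = background.copy()                     # H x W x 3
    for depth, alpha, rgb in sorted(objects, key=lambda o: -o[0]):
        # Nearer objects are painted last and therefore occlude farther ones.
        canvas = alpha[..., None] * rgb + (1.0 - alpha[..., None]) * canvas
    return canvas

H, W = 64, 64
bg = np.zeros((H, W, 3))
alpha = np.zeros((H, W)); alpha[20:40, 20:40] = 1.0
objects = [(2.0, alpha, np.ones((H, W, 3)))]       # one white square at depth 2
img = compose(objects, bg)
```

Note that each object's decoder still produces a complete mask and appearance; occlusion only happens at composition time, which is what lets the model generate whole objects even when training images show them partially hidden.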