Contents
14. SimPose: Effectively Learning DensePose and Surface Normals of People from Simulated Data [PDF] Abstract
16. Revisiting the Modifiable Areal Unit Problem in Deep Traffic Prediction with Visual Analytics [PDF] Abstract
30. Detecting Suspicious Behavior: How to Deal with Visual Similarity through Neural Networks [PDF] Abstract
46. Hybrid Deep Learning Gaussian Process for Diabetic Retinopathy Diagnosis and Uncertainty Quantification [PDF] Abstract
47. Comparative study of deep learning methods for the automatic segmentation of lung, lesion and lesion type in CT scans of COVID-19 patients [PDF] Abstract
49. Few shot domain adaptation for in situ macromolecule structural classification in cryo-electron tomograms [PDF] Abstract
50. Very Deep Super-Resolution of Remotely Sensed Images with Mean Square Error and Var-norm Estimators as Loss Functions [PDF] Abstract
53. Unsupervised Event Detection, Clustering, and Use Case Exposition in Micro-PMU Measurements [PDF] Abstract
Abstracts
1. Contrastive Learning for Unpaired Image-to-Image Translation [PDF] Back to contents
Taesung Park, Alexei A. Efros, Richard Zhang, Jun-Yan Zhu
Abstract: In image-to-image translation, each patch in the output should reflect the content of the corresponding patch in the input, independent of domain. We propose a straightforward method for doing so -- maximizing mutual information between the two, using a framework based on contrastive learning. The method encourages two elements (corresponding patches) to map to a similar point in a learned feature space, relative to other elements (other patches) in the dataset, referred to as negatives. We explore several critical design choices for making contrastive learning effective in the image synthesis setting. Notably, we use a multilayer, patch-based approach, rather than operate on entire images. Furthermore, we draw negatives from within the input image itself, rather than from the rest of the dataset. We demonstrate that our framework enables one-sided translation in the unpaired image-to-image translation setting, while improving quality and reducing training time. In addition, our method can even be extended to the training setting where each "domain" is only a single image.
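The core objective is an InfoNCE-style contrastive loss over patch features, where the positive pair is the corresponding input/output patch and the negatives are other patches drawn from the same input image. Below is a minimal sketch of such a patchwise loss; `feat_src`, `feat_out`, and the temperature value are illustrative names and settings, not the paper's code.

```python
# Minimal PatchNCE-style sketch: feat_src / feat_out are assumed to be
# L2-normalized patch features of shape (num_patches, dim), sampled at
# the same spatial locations of the input and translated images.
import torch
import torch.nn.functional as F

def patch_nce_loss(feat_src, feat_out, tau=0.07):
    logits = feat_out @ feat_src.t() / tau              # (N, N) similarities
    # Diagonal entries pair corresponding patches (positives); all other
    # patches of the SAME image serve as internal negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```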
2. Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild [PDF] Back to contents
Jason Y. Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, Angjoo Kanazawa
Abstract: We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene, all from a single image in-the-wild captured in an uncontrolled environment. Notably, our method runs on datasets without any scene- or object-level 3D supervision. Our key insight is that considering humans and objects jointly gives rise to "3D common sense" constraints that can be used to resolve ambiguity. In particular, we introduce a scale loss that learns the distribution of object size from data; an occlusion-aware silhouette re-projection loss to optimize object pose; and a human-object interaction loss to capture the spatial layout of objects with which humans interact. We empirically validate that our constraints dramatically reduce the space of likely 3D spatial configurations. We demonstrate our approach on challenging, in-the-wild images of humans interacting with large objects (such as bicycles, motorcycles, and surfboards) and handheld objects (such as laptops, tennis rackets, and skateboards). We quantify the ability of our approach to recover human-object arrangements and outline remaining challenges in this relatively unexplored domain. The project webpage can be found at this https URL.
3. Rewriting a Deep Generative Model [PDF] Back to contents
David Bau, Steven Liu, Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba
Abstract: A deep generative model such as a GAN learns to model a rich set of semantic and physical rules about the target distribution, but up to now, it has been obscure how such rules are encoded in the network, or how a rule could be changed. In this paper, we introduce a new problem setting: manipulation of specific rules encoded by a deep generative model. To address the problem, we propose a formulation in which the desired rule is changed by manipulating a layer of a deep network as a linear associative memory. We derive an algorithm for modifying one entry of the associative memory, and we demonstrate that several interesting structural rules can be located and modified within the layers of state-of-the-art generative models. We present a user interface to enable users to interactively change the rules of a generative model to achieve desired effects, and we show several proof-of-concept applications. Finally, results on multiple datasets demonstrate the advantage of our method against standard fine-tuning methods and edit transfer algorithms.
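Treating a layer weight W as a linear associative memory means it stores rules of the form W k ≈ v. A minimal way to overwrite one rule is the minimum-Frobenius-norm rank-one update sketched below; the paper's full algorithm constrains the update direction further, so treat this as an illustration of the idea only.

```python
import numpy as np

def edit_rule(W, k_star, v_star):
    """Smallest (Frobenius-norm) change to W enforcing W @ k_star == v_star."""
    k = k_star / np.dot(k_star, k_star)   # scaled key
    residual = v_star - W @ k_star        # error of the old rule
    return W + np.outer(residual, k)      # rank-one correction

W = np.random.randn(64, 32)
k_star, v_star = np.random.randn(32), np.random.randn(64)
assert np.allclose(edit_rule(W, k_star, v_star) @ k_star, v_star)
```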
4. LevelSet R-CNN: A Deep Variational Method for Instance Segmentation [PDF] Back to contents
Namdar Homayounfar, Yuwen Xiong, Justin Liang, Wei-Chiu Ma, Raquel Urtasun
Abstract: Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving. Currently, many state of the art models are based on the Mask R-CNN framework which, while very powerful, outputs masks at low resolutions which could result in imprecise boundaries. On the other hand, classic variational methods for segmentation impose desirable global and local data and geometry constraints on the masks by optimizing an energy functional. While mathematically elegant, their direct dependence on good initialization, non-robust image cues and manual setting of hyperparameters renders them unsuitable for modern applications. We propose LevelSet R-CNN, which combines the best of both worlds by obtaining powerful feature representations that are combined in an end-to-end manner with a variational segmentation framework. We demonstrate the effectiveness of our approach on COCO and Cityscapes datasets.
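For intuition on the variational side, a classic region-based energy functional (Chan-Vese) can be written differentiably with a smooth Heaviside, as in the toy sketch below. LevelSet R-CNN optimizes such an energy over learned features inside the detection head rather than over raw intensities, so this is only a simplified stand-in.

```python
import torch

def chan_vese_energy(phi, img, mu=0.1, eps=1.0):
    # phi: level-set map, img: grayscale image, both (H, W) float tensors
    H = 0.5 * (1 + torch.tanh(phi / eps))                 # smooth Heaviside
    c1 = (img * H).sum() / (H.sum() + 1e-8)               # mean inside
    c2 = (img * (1 - H)).sum() / ((1 - H).sum() + 1e-8)   # mean outside
    region = ((img - c1)**2 * H + (img - c2)**2 * (1 - H)).sum()
    dy, dx = torch.gradient(H)                            # boundary length term
    return region + mu * torch.sqrt(dy**2 + dx**2 + 1e-8).sum()
```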
5. Unsupervised Continuous Object Representation Networks for Novel View Synthesis [PDF] Back to contents
Nicolai Häni, Selim Engin, Jun-Jee Chao, Volkan Isler
Abstract: Novel View Synthesis (NVS) is concerned with the generation of novel views of a scene from one or more source images. NVS requires explicit reasoning about 3D object structure and unseen parts of the scene. As a result, current approaches rely on supervised training with either 3D models or multiple target images. We present Unsupervised Continuous Object Representation Networks (UniCORN), which encode the geometry and appearance of a 3D scene using a neural 3D representation. Our model is trained with only two source images per object, requiring no ground truth 3D models or target view supervision. Despite being unsupervised, UniCORN achieves comparable results to the state-of-the-art on challenging tasks, including novel view synthesis and single-view 3D reconstruction.
6. Multi-label Zero-shot Classification by Learning to Transfer from External Knowledge [PDF] Back to contents
He Huang, Yuanwei Chen, Wei Tang, Wenhao Zheng, Qing-Guo Chen, Yao Hu, Philip Yu
Abstract: Multi-label zero-shot classification aims to predict multiple unseen class labels for an input image. It is more challenging than its single-label counterpart. On one hand, the unconstrained number of labels assigned to each image makes the model more easily overfit to those seen classes. On the other hand, there is a large semantic gap between seen and unseen classes in the existing multi-label classification datasets. To address these difficult issues, this paper introduces a novel multi-label zero-shot classification framework by learning to transfer from external knowledge. We observe that ImageNet is commonly used to pretrain the feature extractor and has a large and fine-grained label space. This motivates us to exploit it as external knowledge to bridge the seen and unseen classes and promote generalization. Specifically, we construct a knowledge graph including not only classes from the target dataset but also those from ImageNet. Since ImageNet labels are not available in the target dataset, we propose a novel PosVAE module to infer their initial states in the extended knowledge graph. Then we design a relational graph convolutional network (RGCN) to propagate information among classes and achieve knowledge transfer. Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed approach.
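A single relational-GCN layer propagates features over the joint label graph (target classes plus ImageNet classes) with one weight matrix per relation type. The sketch below shows the generic propagation rule only; the adjacency normalization and variable names are assumptions, not the paper's implementation.

```python
import numpy as np

def rgcn_layer(H, adjacencies, W_rel, W_self):
    # H: (num_nodes, d_in); adjacencies: list of row-normalized (N, N)
    # matrices, one per relation type; W_rel: matching list of weights.
    out = H @ W_self                          # self-connection
    for A_r, W_r in zip(adjacencies, W_rel):
        out += A_r @ H @ W_r                  # message passing per relation
    return np.maximum(out, 0.0)               # ReLU
```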
7. Heatmap-based Vanishing Point boosts Lane Detection [PDF] Back to contents
Yin-Bo Liu, Ming Zeng, Qing-Hao Meng
Abstract: Vision-based lane detection (LD) is a key part of autonomous driving technology, and it is also a challenging problem. As one of the important constraints of scene composition, vanishing point (VP) may provide a useful clue for lane detection. In this paper, we propose a new multi-task fusion network architecture for high-precision lane detection. Firstly, the ERFNet was used as the backbone to extract the hierarchical features of the road image. Then, the lanes were detected using image segmentation. Finally, combining the output of lane detection and the hierarchical features extracted by the backbone, the lane VP was predicted using heatmap regression. The proposed fusion strategy was tested using the public CULane dataset. The experimental results suggest that the lane detection accuracy of our method outperforms those of state-of-the-art (SOTA) methods.
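Heatmap regression for the vanishing point amounts to rendering a Gaussian bump at the ground-truth VP, regressing it (e.g., with an L2 loss), and decoding the prediction with an argmax. A minimal sketch, with illustrative image size and sigma:

```python
import numpy as np

def vp_heatmap(h, w, vp_xy, sigma=5.0):
    # Gaussian bump centered at the ground-truth vanishing point
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - vp_xy[0])**2 + (ys - vp_xy[1])**2) / (2 * sigma**2))

def decode_vp(heatmap):
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return x, y

target = vp_heatmap(288, 800, vp_xy=(400, 150))   # training target
# the regression loss would be e.g. np.mean((pred_heatmap - target) ** 2)
```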
8. Dense Scene Multiple Object Tracking with Box-Plane Matching [PDF] Back to contents
Jinlong Peng, Yueyang Gu, Yabiao Wang, Chengjie Wang, Jilin Li, Feiyue Huang
Abstract: Multiple Object Tracking (MOT) is an important task in computer vision. MOT is still challenging due to the occlusion problem, especially in dense scenes. Following the tracking-by-detection framework, we propose the Box-Plane Matching (BPM) method to improve the MOT performance in dense scenes. First, we design the Layer-wise Aggregation Discriminative Model (LADM) to filter the noisy detections. Then, to associate remaining detections correctly, we introduce the Global Attention Feature Model (GAFM) to extract appearance features and use it to calculate the appearance similarity between history tracklets and current detections. Finally, we propose the Box-Plane Matching strategy to achieve data association according to the motion similarity and appearance similarity between tracklets and detections. With the effectiveness of the three modules, our team achieves the 1st place on the Track-1 leaderboard in the ACM MM Grand Challenge HiEve 2020.
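The final association step of a tracking-by-detection pipeline can be phrased as bipartite matching over a cost that blends appearance and motion similarity. The sketch below shows that generic step with a Hungarian solver; the weighting and the specific Box-Plane costs in the paper are more involved, so the names and the mixing weight are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(a, b):
    # a: (M, 4), b: (N, 4) boxes as (x1, y1, x2, y2)
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def associate(trk_feats, det_feats, trk_boxes, det_boxes, alpha=0.5):
    app = trk_feats @ det_feats.T             # cosine sim on unit features
    cost = -(alpha * app + (1 - alpha) * iou_matrix(trk_boxes, det_boxes))
    rows, cols = linear_sum_assignment(cost)  # Hungarian assignment
    return list(zip(rows, cols))              # (tracklet, detection) pairs
```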
9. Unsupervised Disentanglement GAN for Domain Adaptive Person Re-Identification [PDF] Back to contents
Yacine Khraimeche, Guillaume-Alexandre Bilodeau, David Steele, Harshad Mahadik
Abstract: While recent person re-identification (ReID) methods achieve high accuracy in a supervised setting, their generalization to an unlabelled domain is still an open problem. In this paper, we introduce a novel unsupervised disentanglement generative adversarial network (UD-GAN) to address the domain adaptation issue of supervised person ReID. Our framework jointly trains a ReID network for discriminative features extraction in a source labelled domain using identity annotation, and adapts the ReID model to an unlabelled target domain by learning disentangled latent representations on the domain. Identity-unrelated features in the target domain are distilled from the latent features. As a result, the ReID features better encompass the identity of a person in the unsupervised domain. We conducted experiments on the Market1501, DukeMTMC and MSMT17 datasets. Results show that the unsupervised domain adaptation problem in ReID is very challenging. Nevertheless, our method shows improvement in half of the domain transfers and achieves state-of-the-art performance for one of them.
10. Quantitative Distortion Analysis of Flattening Applied to the Scroll from En-Gedi [PDF] Back to contents
Clifford Seth Parker, William Brent Seales, Pnina Shor
Abstract: Non-invasive volumetric imaging can now capture the internal structure and detailed evidence of ink-based writing from within the confines of damaged and deteriorated manuscripts that cannot be physically opened. As demonstrated recently on the En-Gedi scroll, our "virtual unwrapping" software pipeline enables the recovery of substantial ink-based text from damaged artifacts at a quality high enough for serious critical textual analysis. However, the quality of the resulting images is defined by the subjective evaluation of scholars, and a choice of specific algorithms and parameters must be available at each stage in the pipeline in order to maximize the output quality.
11. Event-based Stereo Visual Odometry [PDF] Back to contents
Yi Zhou, Guillermo Gallego, Shaojie Shen
Abstract: Event-based cameras are bio-inspired vision sensors whose pixels work independently from each other and respond asynchronously to brightness changes, with microsecond resolution. Their advantages make it possible to tackle challenging scenarios in robotics, such as high-speed and high dynamic range scenes. We present a solution to the problem of visual odometry from the data acquired by a stereo event-based camera rig. Our system follows a parallel tracking-and-mapping approach, where novel solutions to each subproblem (3D reconstruction and camera pose estimation) are developed with two objectives in mind: being principled and efficient, for real-time operation with commodity hardware. To this end, we seek to maximize the spatio-temporal consistency of stereo event-based data while using a simple and efficient representation. Specifically, the mapping module builds a semi-dense 3D map of the scene by fusing depth estimates from multiple local viewpoints (obtained by spatio-temporal consistency) in a probabilistic fashion. The tracking module recovers the pose of the stereo rig by solving a registration problem that naturally arises due to the chosen map and event data representation. Experiments on publicly available datasets and on our own recordings demonstrate the versatility of the proposed method in natural scenes with general 6-DoF motion. The system successfully leverages the advantages of event-based cameras to perform visual odometry in challenging illumination conditions, such as low-light and high dynamic range, while running in real-time on a standard CPU. We release the software and dataset under an open source licence to foster research in the emerging topic of event-based SLAM.
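Event streams are usually summarized per pixel before geometric processing; one common representation is a "time surface" that stores an exponentially decayed version of the most recent event timestamp at each pixel. The sketch below shows that generic representation under assumed field names; the paper's spatio-temporal consistency objective is built on event representations of this kind, not necessarily this exact one.

```python
import numpy as np

def time_surface(events, h, w, t_ref, tau=0.03):
    # events: iterable of (t, x, y, polarity) tuples, timestamps in seconds
    last_t = np.full((h, w), -np.inf)
    for t, x, y, _ in events:
        last_t[y, x] = t                       # keep most recent timestamp
    return np.exp(-(t_ref - last_t) / tau)     # recent events map close to 1
```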
12. Epipolar-Guided Deep Object Matching for Scene Change Detection [PDF] Back to contents
Kento Doi, Ryuhei Hamaguchi, Shun Iwase, Rio Yokota, Yutaka Matsuo, Ken Sakurada
Abstract: This paper describes a viewpoint-robust object-based change detection network (OBJ-CDNet). Mobile cameras such as drive recorders capture images from different viewpoints each time due to differences in camera trajectory and shutter timing. However, previous methods for pixel-wise change detection are vulnerable to the viewpoint differences because they assume aligned image pairs as inputs. To cope with the difficulty, we introduce a deep graph matching network that establishes object correspondence between an image pair. The introduction enables us to detect object-wise scene changes without precise image alignment. For more accurate object matching, we propose an epipolar-guided deep graph matching network (EGMNet), which incorporates the epipolar constraint into the deep graph matching layer used in OBJ-CDNet. To evaluate our network's robustness against viewpoint differences, we created synthetic and real datasets for scene change detection from an image pair. The experimental results verified the effectiveness of our network.
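The epipolar constraint that guides matching can be evaluated with the standard Sampson distance: given the fundamental matrix F of the image pair, candidate correspondences with a large Sampson error can be penalized or pruned inside the matching layer. A self-contained sketch, assuming F and the candidate points are given:

```python
import numpy as np

def sampson_distance(F, x1, x2):
    # x1, x2: (N, 2) candidate correspondences in pixel coordinates
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])                 # homogeneous coordinates
    p2 = np.hstack([x2, ones])
    Fx1 = p1 @ F.T                             # rows: F @ x1_i
    Ftx2 = p2 @ F                              # rows: F^T @ x2_i
    num = np.sum(p2 * Fx1, axis=1) ** 2        # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den                           # small = consistent with F
```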
13. A new Local Radon Descriptor for Content-Based Image Search [PDF] Back to contents
Morteza Babaie, Hany Kashani, Meghana D. Kumar, Hamid R. Tizhoosh
Abstract: Content-based image retrieval (CBIR) is an essential part of computer vision research, especially in medical expert systems. Having a discriminative image descriptor with the least number of parameters for tuning is desirable in CBIR systems. In this paper, we introduce a new simple descriptor based on the histogram of local Radon projections. We also propose a very fast convolution-based local Radon estimator to overcome the slow process of Radon projections. We performed our experiments using pathology images (KimiaPath24) and lung CT patches and test our proposed solution for medical image processing. We achieved superior results compared with other histogram-based descriptors such as LBP and HoG as well as some pre-trained CNNs.
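A histogram-of-local-Radon-projections descriptor can be prototyped directly with scikit-image, as below; patch size, angles, and bin count are illustrative, and the paper replaces the direct transform with its fast convolution-based estimator.

```python
import numpy as np
from skimage.transform import radon

def local_radon_descriptor(patch, angles=(0, 45, 90, 135), bins=8):
    # patch: small 2D grayscale array (e.g., 16x16)
    sinogram = radon(patch.astype(float), theta=list(angles), circle=False)
    hist, _ = np.histogram(sinogram, bins=bins, density=True)
    return hist   # concatenate over a grid of patches for the image descriptor
```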
14. SimPose: Effectively Learning DensePose and Surface Normals of People from Simulated Data [PDF] Back to contents
Tyler Zhu, Per Karlsson, Christoph Bregler
Abstract: With a proliferation of generic domain-adaptation approaches, we report a simple yet effective technique for learning difficult per-pixel 2.5D and 3D regression representations of articulated people. We obtained strong sim-to-real domain generalization for the 2.5D DensePose estimation task and the 3D human surface normal estimation task. On the multi-person DensePose MSCOCO benchmark, our approach outperforms the state-of-the-art methods which are trained on real images that are densely labelled. This is an important result since obtaining human manifold's intrinsic uv coordinates on real images is time consuming and prone to labeling noise. Additionally, we present our model's 3D surface normal predictions on the MSCOCO dataset that lacks any real 3D surface normal labels. The key to our approach is to mitigate the "Inter-domain Covariate Shift" with a carefully selected training batch from a mixture of domain samples, a deep batch-normalized residual network, and a modified multi-task learning objective. Our approach is complementary to existing domain-adaptation techniques and can be applied to other dense per-pixel pose estimation problems.
15. Cascaded Non-local Neural Network for Point Cloud Semantic Segmentation [PDF] Back to contents
Mingmei Cheng, Le Hui, Jin Xie, Jian Yang, Hui Kong
Abstract: In this paper, we propose a cascaded non-local neural network for point cloud segmentation. The proposed network aims to build the long-range dependencies of point clouds for the accurate segmentation. Specifically, we develop a novel cascaded non-local module, which consists of the neighborhood-level, superpoint-level and global-level non-local blocks. First, in the neighborhood-level block, we extract the local features of the centroid points of point clouds by assigning different weights to the neighboring points. The extracted local features of the centroid points are then used to encode the superpoint-level block with the non-local operation. Finally, the global-level block aggregates the non-local features of the superpoints for semantic segmentation in an encoder-decoder framework. Benefiting from the cascaded structure, geometric structure information of different neighborhoods with the same label can be propagated. In addition, the cascaded structure can largely reduce the computational cost of the original non-local operation on point clouds. Experiments on different indoor and outdoor datasets show that our method achieves state-of-the-art performance and effectively reduces the time consumption and memory occupation.
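Each level of the cascade is built on a non-local (self-attention) block over point features. Below is a generic single block, sketched in PyTorch for inputs of shape (N, C); the neighbourhood, superpoint, and global variants in the paper differ mainly in which point set the block attends over, so treat this as the shared building block only.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, c, c_inner=None):
        super().__init__()
        c_inner = c_inner or c // 2
        self.q, self.k, self.v = (nn.Linear(c, c_inner) for _ in range(3))
        self.out = nn.Linear(c_inner, c)

    def forward(self, x):                      # x: (N, C) point features
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.t() / q.size(-1) ** 0.5, dim=-1)
        return x + self.out(attn @ v)          # residual connection
```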
16. Revisiting the Modifiable Areal Unit Problem in Deep Traffic Prediction with Visual Analytics [PDF] Back to contents
Wei Zeng, Chengqiao Lin, Juncong Lin, Jincheng Jiang, Jiazhi Xia, Cagatay Turkay, Wei Chen
Abstract: Deep learning methods are being increasingly used for urban traffic prediction where spatiotemporal traffic data is aggregated into sequentially organized matrices that are then fed into convolution-based residual neural networks. However, the widely known modifiable areal unit problem within such aggregation processes can lead to perturbations in the network inputs. This issue can significantly destabilize the feature embeddings and the predictions, rendering deep networks much less useful for the experts. This paper approaches this challenge by leveraging unit visualization techniques that enable the investigation of many-to-many relationships between dynamically varied multi-scalar aggregations of urban traffic data and neural network predictions. Through regular exchanges with a domain expert, we design and develop a visual analytics solution that integrates 1) a Bivariate Map equipped with an advanced bivariate colormap to simultaneously depict input traffic and prediction errors across space, 2) a Moran's I Scatterplot that provides local indicators of spatial association analysis, and 3) a Multi-scale Attribution View that arranges non-linear dot plots in a tree layout to promote model analysis and comparison across scales. We evaluate our approach through a series of case studies involving a real-world dataset of Shenzhen taxi trips, and through interviews with domain experts. We observe that geographical scale variations have important impact on prediction performances, and interactive visual exploration of dynamically varying inputs and outputs benefit experts in the development of deep traffic prediction models.
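For reference, the statistic behind the Moran's I scatterplot view is simple to compute: given a spatial weight matrix over the aggregation units, global Moran's I is a normalized spatial autocorrelation of the deviations from the mean. A minimal sketch, assuming the weight matrix `Wm` is given:

```python
import numpy as np

def morans_i(x, Wm):
    # x: (n,) values per areal unit; Wm: (n, n) spatial weight matrix
    z = x - x.mean()
    num = (z[:, None] * Wm * z[None, :]).sum()        # sum_ij w_ij z_i z_j
    return (len(x) / Wm.sum()) * num / (z ** 2).sum()
```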
17. Learning from Few Samples: A Survey [PDF] Back to contents
Nihar Bendre, Hugo Terashima Marín, Peyman Najafirad
Abstract: Deep neural networks have been able to outperform humans in some cases like image recognition and image classification. However, with the emergence of various novel categories, the ability to continuously widen the learning capability of such networks from limited samples, still remains a challenge. Techniques like Meta-Learning and/or few-shot learning showed promising results, where they can learn or generalize to a novel category/task based on prior knowledge. In this paper, we perform a study of the existing few-shot meta-learning techniques in the computer vision domain based on their method and evaluation metrics. We provide a taxonomy for the techniques and categorize them as data-augmentation, embedding, optimization and semantics based learning for few-shot, one-shot and zero-shot settings. We then describe the seminal work done in each category and discuss their approach towards solving the predicament of learning from few samples. Lastly we provide a comparison of these techniques on the commonly used benchmark datasets: Omniglot, and MiniImagenet, along with a discussion towards the future direction of improving the performance of these techniques towards the final goal of outperforming humans.
18. $grid2vec$: Learning Efficient Visual Representations via Flexible Grid-Graphs [PDF] Back to contents
Ali Hamdi, Du Yong Kim, Flora D. Salim
Abstract: We propose $grid2vec$, a novel approach for image representation learning based on Graph Convolutional Network (GCN). Existing visual representation methods suffer from several issues, such as requiring high-computation, losing in-depth structures, and being restricted to specific objects. $grid2vec$ converts an image to a low-dimensional feature vector. A key component of $grid2vec$ is Flexible Grid-Graphs, a spatially-adaptive method based on the image key-points, as a flexible grid, to generate the graph representation. It represents each image with a graph of unique node locations and edge distances. Nodes, in Flexible Grid-Graphs, describe the most representative patches in the image. We develop a multi-channel Convolutional Neural Network architecture to learn local features of each patch. We implement a hybrid node-embedding method, i.e., having spectral and non-spectral components. It aggregates the products of neighbours' features and node's eigenvector centrality score. We compare the performance of $grid2vec$ with a set of state-of-the-art representation learning and visual recognition models. $grid2vec$ has only $512$ features in comparison to a range from VGG16 with $25,090$ to NASNet with $487,874$. We show the models' superior accuracy in both binary and multi-class image classification. Although we utilise an imbalanced, low-size dataset, $grid2vec$ shows stable and superior results against the well-known base classifiers.
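One plausible reading of the centrality-weighted aggregation is sketched below with networkx: each node's embedding scales the summed features of its neighbours by the node's eigenvector centrality. The exact combination used in $grid2vec$ may differ, so names and structure here are illustrative.

```python
import networkx as nx
import numpy as np

def aggregate(G, feats):
    # G: grid-graph over patches; feats: dict node -> patch feature vector
    cent = nx.eigenvector_centrality_numpy(G)
    return {n: cent[n] * sum(feats[m] for m in G.neighbors(n))
            for n in G.nodes}
```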
19. Label or Message: A Large-Scale Experimental Survey of Texts and Objects Co-Occurrence [PDF] 返回目录
Koki Takeshita, Juntaro Shioyama, Seiichi Uchida
Abstract: Our daily life is surrounded by textual information. Nowadays, the automatic collection of textual information has become possible owing to the drastic improvement of scene text detectors and recognizers. The purpose of this paper is to conduct a large-scale survey of co-occurrence between visual objects (such as books and cars) and scene texts, using a large image dataset and a state-of-the-art scene text detector and recognizer. In particular, we focus on the function of "label" texts, which are attached to objects to detail those objects. By analyzing co-occurrence between objects and scene texts, it is possible to observe statistics about the label texts and understand how scene texts are useful for recognizing the objects, and vice versa.
20. NormalGAN: Learning Detailed 3D Human from a Single RGB-D Image [PDF] 返回目录
Lizhen Wang, Xiaochen Zhao, Tao Yu, Songtao Wang, Yebin Liu
Abstract: We propose NormalGAN, a fast adversarial learning-based method to reconstruct a complete and detailed 3D human from a single RGB-D image. Given a single front-view RGB-D image, NormalGAN performs two steps: front-view RGB-D rectification and back-view RGB-D inference. The final model is then generated by simply combining the front-view and back-view RGB-D information. However, inferring a back-view RGB-D image with high-quality geometric details and plausible texture is not trivial. Our key observation is: normal maps generally encode much more information about 3D surface details than RGB and depth images. Therefore, learning geometric details from normal maps is superior to other representations. In NormalGAN, an adversarial learning framework conditioned on normal maps is introduced, which is used not only to improve the front-view depth denoising performance, but also to infer the back-view depth image with surprisingly fine geometric details. Moreover, for texture recovery, we remove shading information from the front-view RGB image based on the refined normal map, which further improves the quality of the back-view color inference. Results and experiments on both a testing data set and real captured data demonstrate the superior performance of our approach. Given a consumer RGB-D sensor, NormalGAN can generate complete and detailed 3D human reconstruction results at 20 fps, which further enables convenient interactive experiences in telepresence, AR/VR and gaming scenarios.
21. Infrastructure-based Multi-Camera Calibration using Radial Projections [PDF] 返回目录
Yukai Lin, Viktor Larsson, Marcel Geppert, Zuzana Kukelova, Marc Pollefeys, Torsten Sattler
Abstract: Multi-camera systems are an important sensor platform for intelligent systems such as self-driving cars. Pattern-based calibration techniques can be used to calibrate the intrinsics of the cameras individually. However, extrinsic calibration of systems with little to no visual overlap between the cameras is a challenge. Given the camera intrinsics, infrastructure-based calibration techniques are able to estimate the extrinsics using 3D maps pre-built via SLAM or Structure-from-Motion. In this paper, we propose to fully calibrate a multi-camera system from scratch using an infrastructure-based approach. Assuming that the distortion is mainly radial, we introduce a two-stage approach. We first estimate the camera-rig extrinsics up to a single unknown translation component per camera. Next, we solve for both the intrinsic parameters and the missing translation components. Extensive experiments on multiple indoor and outdoor scenes with multiple multi-camera systems show that our calibration method achieves high accuracy and robustness. In particular, our approach is more robust than the naive approach of first estimating intrinsic parameters and pose per camera before refining the extrinsic parameters of the system. The implementation is available at this https URL.
22. Searching Collaborative Agents for Multi-plane Localization in 3D Ultrasound [PDF] 返回目录
Yuhao Huang, Xin Yang, Rui Li, Jikuan Qian, Xiaoqiong Huang, Wenlong Shi, Haoran Dou, Chaoyu Chen, Yuanji Zhang, Huanjia Luo, Alejandro Frangi, Yi Xiong, Dong Ni
Abstract: 3D ultrasound (US) is widely used due to its rich diagnostic information, portability and low cost. Automated standard plane (SP) localization in US volume not only improves efficiency and reduces user-dependence, but also boosts 3D US interpretation. In this study, we propose a novel Multi-Agent Reinforcement Learning (MARL) framework to localize multiple uterine SPs in 3D US simultaneously. Our contribution is two-fold. First, we equip the MARL with a one-shot neural architecture search (NAS) module to obtain the optimal agent for each plane. Specifically, Gradient-based search using Differentiable Architecture Sampler (GDAS) is employed to accelerate and stabilize the training process. Second, we propose a novel collaborative strategy to strengthen agents' communication. Our strategy uses recurrent neural network (RNN) to learn the spatial relationship among SPs effectively. Extensively validated on a large dataset, our approach achieves the accuracy of 7.05 degree/2.21mm, 8.62 degree/2.36mm and 5.93 degree/0.89mm for the mid-sagittal, transverse and coronal plane localization, respectively. The proposed MARL framework can significantly increase the plane localization accuracy and reduce the computational cost and model size.
23. Dynamic texture analysis for detecting fake faces in video sequences [PDF] 返回目录
Mattia Bonomi, Cecilia Pasquini, Giulia Boato
Abstract: The creation of manipulated multimedia content involving human characters has reached unprecedented realism in recent years, calling for automated techniques to expose synthetically generated faces in images and videos. This work explores the analysis of spatio-temporal texture dynamics of the video signal, with the goal of characterizing and distinguishing real and fake sequences. We propose to build a binary decision on the joint analysis of multiple temporal segments and, in contrast to previous approaches, to exploit the textural dynamics of both the spatial and temporal dimensions. This is achieved through the use of Local Derivative Patterns on Three Orthogonal Planes (LDP-TOP), a compact feature representation known to be an important asset for the detection of face spoofing attacks. Experimental analyses on state-of-the-art datasets of manipulated videos show the discriminative power of such descriptors in separating real and fake sequences, and also in identifying the creation method used. Linear Support Vector Machines (SVMs) are used which, despite their lower complexity, yield performance comparable to previously proposed deep models for fake content detection.
24. The Blessing and the Curse of the Noise behind Facial Landmark Annotations [PDF] 返回目录
Xiaoyu Xiang, Yang Cheng, Shaoyuan Xu, Qian Lin, Jan Allebach
Abstract: The evolving algorithms for 2D facial landmark detection empower people to recognize faces, analyze facial expressions, etc. However, existing methods still encounter problems of unstable facial landmarks when applied to videos. Because previous research shows that the instability of facial landmarks is caused by the inconsistency of labeling quality among the public datasets, we want to have a better understanding of the influence of annotation noise in them. In this paper, we make the following contributions: 1) we propose two metrics that quantitatively measure the stability of detected facial landmarks, 2) we model the annotation noise in an existing public dataset, 3) we investigate the influence of different types of noise in training face alignment neural networks, and propose corresponding solutions. Our results demonstrate improvements in both accuracy and stability of detected facial landmarks.
25. Weakly-Supervised Cell Tracking via Backward-and-Forward Propagation [PDF] 返回目录
Kazuya Nishimura, Junya Hayashida, Chenyang Wang, Dai Fei Elmer Ker, Ryoma Bise
Abstract: We propose a weakly-supervised cell tracking method that can train a convolutional neural network (CNN) by using only the annotation of "cell detection" (i.e., the coordinates of cell positions) without association information; such cell positions can be easily obtained by nuclear staining. First, we train a co-detection CNN that detects cells in successive frames by using weak labels. Our key assumption is that the co-detection CNN implicitly learns association in addition to detection. To obtain the association, we propose a backward-and-forward propagation method that analyzes the correspondence of cell positions in the outputs of the co-detection CNN. Experiments demonstrated that the proposed method can associate cells by analyzing the co-detection CNN. Even though the method uses only weak supervision, its performance was almost the same as that of the state-of-the-art supervised method. Code is publicly available at this https URL.
26. Instance Selection for GANs [PDF] 返回目录
Terrance DeVries, Michal Drozdzal, Graham W. Taylor
Abstract: Recent advances in Generative Adversarial Networks (GANs) have led to their widespread adoption for the purposes of generating high quality synthetic imagery. While capable of generating photo-realistic images, these models often produce unrealistic samples which fall outside of the data manifold. Several recently proposed techniques attempt to avoid spurious samples, either by rejecting them after generation, or by truncating the model's latent space. While effective, these methods are inefficient, as large portions of model capacity are dedicated towards representing samples that will ultimately go unused. In this work we propose a novel approach to improve sample quality: altering the training dataset via instance selection before model training has taken place. To this end, we embed data points into a perceptual feature space and use a simple density model to remove low density regions from the data manifold. By refining the empirical data distribution before training we redirect model capacity towards high-density regions, which ultimately improves sample fidelity. We evaluate our method by training a Self-Attention GAN on ImageNet at 64x64 resolution, where we outperform the current state-of-the-art models on this task while using 1/2 of the parameters. We also highlight training time savings by training a BigGAN on ImageNet at 128x128 resolution, achieving a 66% increase in Inception Score and a 16% improvement in FID over the baseline model with less than 1/4 the training time.
27. Hierarchical Action Classification with Network Pruning [PDF] 返回目录
Mahdi Davoodikakhki, KangKang Yin
Abstract: Research on human action classification has made significant progress in the past few years. Most deep learning methods focus on improving performance by adding more network components. We propose, however, to better utilize auxiliary mechanisms, including hierarchical classification, network pruning, and skeleton-based preprocessing, to boost the model's robustness and performance. We test the effectiveness of our method on four commonly used testing datasets: NTU RGB+D 60, NTU RGB+D 120, Northwestern-UCLA Multiview Action 3D, and the UTD Multimodal Human Action Dataset. Our experiments show that our method can achieve either comparable or better performance on all four datasets. In particular, our method sets up a new baseline for NTU 120, the largest dataset among the four. We also analyze our method with extensive comparisons and ablation studies.
28. Weakly Supervised Minirhizotron Image Segmentation with MIL-CAM [PDF] 返回目录
Guohao Yu, Alina Zare, Weihuang Xu, Roser Matamala, Joel Reyes-Cabrera, Felix B. Fritschi, Thomas E. Juenger
Abstract: We present a multiple instance learning class activation map (MIL-CAM) approach for pixel-level minirhizotron image segmentation given weak image-level labels. Minirhizotrons are used to image plant roots in situ. Minirhizotron imagery is often composed of soil containing a few long and thin root objects of small diameter. The roots prove to be challenging for existing semantic image segmentation methods to discriminate. In addition to learning from weak labels, our proposed MIL-CAM approach re-weights the root versus soil pixels during analysis for improved performance due to the heavy imbalance between soil and root pixels. The proposed approach outperforms other attention map and multiple instance learning methods for localization of root objects in minirhizotron imagery.
29. Action2Motion: Conditioned Generation of 3D Human Motions [PDF] 返回目录
Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, Li Cheng
Abstract: Action recognition is a relatively established task, where given an input sequence of human motion, the goal is to predict its action category. This paper, on the other hand, considers a relatively new problem, which could be thought of as an inverse of action recognition: given a prescribed action type, we aim to generate plausible human motion sequences in 3D. Importantly, the set of generated motions is expected to maintain its diversity to be able to explore the entire action-conditioned motion space; meanwhile, each sampled sequence should faithfully resemble natural human body articulation dynamics. Motivated by these objectives, we follow the physics law of human kinematics by adopting Lie Algebra theory to represent natural human motions; we also propose a temporal Variational Auto-Encoder (VAE) that encourages a diverse sampling of the motion space. A new 3D human motion dataset, HumanAct12, is also constructed. Empirical experiments over three distinct human motion datasets (including ours) demonstrate the effectiveness of our approach.
30. Detecting Suspicious Behavior: How to Deal with Visual Similarity through Neural Networks [PDF] 返回目录
Guillermo A. Martínez-Mascorro, José C. Ortiz-Bayliss, Hugo Terashima-Marín
Abstract: Suspicious behavior is likely to threaten security, assets, life, or freedom. This behavior has no particular pattern, which complicates the tasks of detecting and defining it. Even for human observers, it is difficult to spot suspicious behavior in surveillance videos. Some proposals to tackle abnormal and suspicious behavior-related problems are available in the literature. However, they usually suffer from high false-positive rates due to different classes with high visual similarity. The Pre-Crime Behavior method removes information related to the commission of a crime to focus on suspicious behavior before the crime happens. The resulting samples from different types of crime have a high visual similarity with normal-behavior samples. To address this problem, we implemented 3D Convolutional Neural Networks and trained them under different approaches. Also, we tested different values of the number-of-filters parameter to optimize computational resources. Finally, a comparison of the performance under the different training approaches shows the best option for improving suspicious behavior detection in surveillance videos.
31. Key Frame Proposal Network for Efficient Pose Estimation in Videos [PDF] 返回目录
Yuexi Zhang, Yin Wang, Octavia Camps, Mario Sznaier
Abstract: Human pose estimation in video relies on local information, either by estimating each frame independently or by tracking poses across frames. In this paper, we propose a novel method combining local approaches with global context. We introduce a lightweight, unsupervised key frame proposal network (K-FPN) to select informative frames and a learned dictionary to recover the entire pose sequence from these frames. The K-FPN speeds up the pose estimation and provides robustness to bad frames with occlusion, motion blur, and illumination changes, while the learned dictionary provides global dynamic context. Experiments on the Penn Action and sub-JHMDB datasets show that the proposed method achieves state-of-the-art accuracy, with substantial speed-up.
32. Crowdsampling the Plenoptic Function [PDF] 返回目录
Zhengqi Li, Wenqi Xian, Abe Davis, Noah Snavely
Abstract: Many popular tourist landmarks are captured in a multitude of online, public photos. These photos represent a sparse and unstructured sampling of the plenoptic function for a particular scene. In this paper, we present a new approach to novel view synthesis under time-varying illumination from such data. Our approach builds on the recent multi-plane image (MPI) format for representing local light fields under fixed viewing conditions. We introduce a new DeepMPI representation, motivated by observations on the sparsity structure of the plenoptic function, that allows for real-time synthesis of photorealistic views that are continuous in both space and across changes in lighting. Our method can synthesize the same compelling parallax and view-dependent effects as previous MPI methods, while simultaneously interpolating along changes in reflectance and illumination with time. We show how to learn a model of these effects in an unsupervised way from an unstructured collection of photos without temporal registration, demonstrating significant improvements over recent work in neural rendering. More information can be found at this http URL.
33. Domain Adaptive Semantic Segmentation Using Weak Labels [PDF] 返回目录
Sujoy Paul, Yi-Hsuan Tsai, Samuel Schulter, Amit K. Roy-Chowdhury, Manmohan Chandraker
Abstract: Learning semantic segmentation models requires a huge amount of pixel-wise labeling. However, labeled data may only be available abundantly in a domain different from the desired target domain, which only has minimal or no annotations. In this work, we propose a novel framework for domain adaptation in semantic segmentation with image-level weak labels in the target domain. The weak labels may be obtained based on a model prediction for unsupervised domain adaptation (UDA), or from a human annotator in a new weakly-supervised domain adaptation (WDA) paradigm for semantic segmentation. Using weak labels is both practical and useful, since (i) collecting image-level target annotations is comparably cheap in WDA and incurs no cost in UDA, and (ii) it opens the opportunity for category-wise domain alignment. Our framework uses weak labels to enable the interplay between feature alignment and pseudo-labeling, improving both in the process of domain adaptation. Specifically, we develop a weak-label classification module to enforce the network to attend to certain categories, and then use such training signals to guide the proposed category-wise alignment method. In experiments, we show considerable improvements with respect to the existing state-of-the-arts in UDA and present a new benchmark in the WDA setting. Project page is at this http URL.
34. An Improvement for Capsule Networks using Depthwise Separable Convolution [PDF] 返回目录
Nguyen Huu Phong, Bernardete Ribeiro
Abstract: Capsule Networks face a critical problem in computer vision in the sense that the image background can challenge their performance, although they learn very well on training data. In this work, we propose to improve the Capsule Networks' architecture by replacing the Standard Convolution with a Depthwise Separable Convolution. This new design significantly reduces the model's total parameters while increasing stability and offering competitive accuracy. In addition, the proposed model on $64\times64$ pixel images outperforms standard models on $32\times32$ and $64\times64$ pixel images. Moreover, we empirically evaluate these models with Deep Learning architectures using state-of-the-art Transfer Learning networks such as Inception V3 and MobileNet V1. The results show that Capsule Networks perform on par with Deep Learning models. To the best of our knowledge, this is the first work on the integration of Depthwise Separable Convolution into Capsule Networks.
35. Rethinking Recurrent Neural Networks and other Improvements for Image Classification [PDF] 返回目录
Nguyen Huu Phong, Bernardete Ribeiro
Abstract: Over the long history of Machine Learning, which dates back several decades, Recurrent Neural Networks (RNNs) have mainly been used for sequential data and time series, or generally 1D information. Even in some rare research on 2D images, the networks merely learn and generate data sequentially rather than for the recognition of images. In this research, we propose to integrate an RNN as an additional layer when designing image recognition models. Moreover, we develop End-to-End Ensemble Multi-models that are able to learn experts' predictions from several models. Besides, we extend the training strategy and softmax pruning, which overall leads our designs to perform comparably to top models on several datasets. The source code of the methods provided in this article is available at this https URL and this http URL.
36. Benchmarking and Comparing Multi-exposure Image Fusion Algorithms [PDF] 返回目录
Xingchen Zhang
Abstract: Multi-exposure image fusion (MEF) is an important area in computer vision and has attracted increasing interest in recent years. Apart from conventional algorithms, deep learning techniques have also been applied to multi-exposure image fusion. However, although much effort has been made on developing MEF algorithms, the lack of a benchmark makes it difficult to perform a fair and comprehensive performance comparison among MEF algorithms, thus significantly hindering the development of this field. In this paper, we fill this gap by proposing a benchmark for multi-exposure image fusion (MEFB), which consists of a test set of 100 image pairs, a code library of 16 algorithms, 20 evaluation metrics, 1600 fused images and a software toolkit. To the best of our knowledge, this is the first benchmark in the field of multi-exposure image fusion. Extensive experiments have been conducted using MEFB for comprehensive performance evaluation and for identifying effective algorithms. We expect that MEFB will serve as an effective platform for researchers to compare performances and investigate MEF algorithms.
37. Fully Dynamic Inference with Deep Neural Networks [PDF] 返回目录
Wenhan Xia, Hongxu Yin, Xiaoliang Dai, Niraj K. Jha
Abstract: Modern deep neural networks are powerful and widely applicable models that extract task-relevant information through multi-level abstraction. Their cross-domain success, however, is often achieved at the expense of computational cost, high memory bandwidth, and long inference latency, which prevents their deployment in resource-constrained and time-sensitive scenarios, such as edge-side inference and self-driving cars. While recently developed methods for creating efficient deep neural networks are making their real-world deployment more feasible by reducing model size, they do not fully exploit input properties on a per-instance basis to maximize computational efficiency and task accuracy. In particular, most existing methods typically use a one-size-fits-all approach that identically processes all inputs. Motivated by the fact that different images require different feature embeddings to be accurately classified, we propose a fully dynamic paradigm that imparts deep convolutional neural networks with hierarchical inference dynamics at the level of layers and individual convolutional filters/channels. Two compact networks, called Layer-Net (L-Net) and Channel-Net (C-Net), predict on a per-instance basis which layers or filters/channels are redundant and therefore should be skipped. L-Net and C-Net also learn how to scale retained computation outputs to maximize task accuracy. By integrating L-Net and C-Net into a joint design framework, called LC-Net, we consistently outperform state-of-the-art dynamic frameworks with respect to both efficiency and classification accuracy. On the CIFAR-10 dataset, LC-Net results in up to 11.9$\times$ fewer floating-point operations (FLOPs) and up to 3.3% higher accuracy compared to other dynamic inference methods. On the ImageNet dataset, LC-Net achieves up to 1.4$\times$ fewer FLOPs and up to 4.6% higher Top-1 accuracy than the other methods.
38. Single Image Cloud Detection via Multi-Image Fusion [PDF] Back to contents
Scott Workman, M. Usman Rafique, Hunter Blanton, Connor Greenwell, Nathan Jacobs
Abstract: Artifacts in imagery captured by remote sensing, such as clouds, snow, and shadows, present challenges for various tasks, including semantic segmentation and object detection. A primary challenge in developing algorithms for identifying such artifacts is the cost of collecting annotated training data. In this work, we explore how recent advances in multi-image fusion can be leveraged to bootstrap single image cloud detection. We demonstrate that a network optimized to estimate image quality also implicitly learns to detect clouds. To support the training and evaluation of our approach, we collect a large dataset of Sentinel-2 images along with a per-pixel semantic labelling for land cover. Through various experiments, we demonstrate that our method reduces the need for annotated training data and improves cloud detection performance.
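The pivotal observation here is that a network trained to score per-pixel image quality for fusion implicitly flags clouds, since cloudy pixels contribute little to a clean fused composite. A minimal reading of that idea (the threshold tau is an assumption, not a value from the paper):

import numpy as np

def cloud_mask_from_quality(quality_map: np.ndarray, tau: float = 0.5) -> np.ndarray:
    # Pixels the fusion network scores as low quality are flagged as cloud.
    return quality_map < tau

quality = np.random.rand(256, 256)      # stand-in for a predicted quality map
print(cloud_mask_from_quality(quality).mean())  # fraction flagged (~0.5 here)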
39. Learning To Pay Attention To Mistakes [PDF] Back to contents
Mou-Cheng Xu, Neil Oxtoby, Daniel C. Alexander, Joseph Jacob
Abstract: As evidenced in visual results in \cite{attenUnet}\cite{AttenUnet2018}\cite{InterSeg}\cite{UnetFrontNeuro}\cite{LearnActiveContour}, the periphery of foreground regions representing malignant tissues may be disproportionately assigned as belonging to the background class of healthy tissues in medical image segmentation. This leads to high false negative detection rates. In this paper, we propose a novel attention mechanism to directly address such high false negative rates, called Paying Attention to Mistakes. Our attention mechanism steers the models towards false positive identification, which counters the existing bias towards false negatives. The proposed mechanism has two complementary implementations: (a) "explicit" steering of the model to attend to a larger Effective Receptive Field on the foreground areas; (b) "implicit" steering towards false positives, by attending to a smaller Effective Receptive Field on the background areas. We validated our methods on three tasks: 1) binary dense prediction between vehicles and the background using CityScapes; 2) Enhanced Tumour Core segmentation with multi-modal MRI scans in BRATS2018; 3) segmenting stroke lesions using ultrasound images in ISLES2018. We compared our methods with state-of-the-art attention mechanisms in medical imaging, including self-attention, spatial-attention and spatial-channel mixed attention. Across all of the three different tasks, our models consistently outperform the baseline models in Intersection over Union (IoU) and/or Hausdorff Distance (HD). For instance, in the second task, the "explicit" implementation of our mechanism reduces the HD of the best baseline by more than $26\%$, whilst improving the IoU by more than $3\%$. We believe our proposed attention mechanism can benefit a wide range of medical and computer vision tasks, which suffer from over-detection of background.
40. Foveation for Segmentation of Ultra-High Resolution Images [PDF] Back to contents
Chen Jin, Ryutaro Tanno, Moucheng Xu, Thomy Mertzanidou, Daniel C. Alexander
Abstract: Segmentation of ultra-high resolution images is challenging because of their enormous size, consisting of millions or even billions of pixels. Typical solutions include dividing input images into patches of fixed size and/or down-sampling to meet memory constraints. Such operations incur information loss in the field-of-view (FoV), i.e., spatial coverage, and in the image resolution. The impact on segmentation performance is, however, as yet understudied. In this work, we start with a motivational experiment which demonstrates that the trade-off between FoV and resolution affects the segmentation performance on ultra-high resolution images, and furthermore, that its influence also varies spatially according to the local patterns in different areas. We then introduce a foveation module, a learnable "dataloader" which, for a given ultra-high resolution image, adaptively chooses the appropriate configuration (FoV/resolution trade-off) of the input patch to feed to the downstream segmentation model at each spatial location of the image. The foveation module is jointly trained with the segmentation network to maximise the task performance. We demonstrate on three publicly available high-resolution image datasets that the foveation module consistently improves segmentation performance over the cases trained with patches of a fixed FoV/resolution trade-off. Our approach achieves the SoTA performance on the DeepGlobe aerial image dataset. On the Gleason2019 histopathology dataset, our model achieves better segmentation accuracy for the two most clinically important and ambiguous classes (Gleason Grade 3 and 4) than the top performers in the challenge by 13.1% and 7.5%, and improves on the average performance of 6 human experts by 6.5% and 7.5%. Our code and trained models are available at this https URL.
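The trade-off the foveation module selects from can be made concrete: each candidate patch covers a different spatial extent (FoV) but is resampled to the same pixel budget, so a wider view costs resolution. A sketch, with the output size and candidate FoVs as illustrative assumptions:

import torch
import torch.nn.functional as F

def extract_patch(image, center, fov, out_size=128):
    # Crop a (fov x fov) window around `center`, then resample to a fixed
    # out_size: larger FoV -> coarser resolution, and vice versa.
    cy, cx = center
    half = fov // 2
    y0, x0 = max(cy - half, 0), max(cx - half, 0)
    crop = image[:, y0:y0 + fov, x0:x0 + fov]
    return F.interpolate(crop[None], size=(out_size, out_size),
                         mode="bilinear", align_corners=False)[0]

image = torch.randn(3, 4096, 4096)           # stand-in ultra-high-res image
patches = [extract_patch(image, (2048, 2048), fov) for fov in (128, 512, 2048)]
print([tuple(p.shape) for p in patches])     # all (3, 128, 128): same budget

The learnable part of the paper's module is choosing among such configurations per spatial location; this sketch only shows the candidate set.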
41. Deep Keypoint-Based Camera Pose Estimation with Geometric Constraints [PDF] Back to contents
You-Yi Jau, Rui Zhu, Hao Su, Manmohan Chandraker
Abstract: Estimating relative camera poses from consecutive frames is a fundamental problem in visual odometry (VO) and simultaneous localization and mapping (SLAM), where classic methods consisting of hand-crafted features and sampling-based outlier rejection have been a dominant choice for over a decade. Although multiple works propose to replace these modules with learning-based counterparts, most have not yet been as accurate, robust and generalizable as conventional methods. In this paper, we design an end-to-end trainable framework consisting of learnable modules for detection, feature extraction, matching and outlier rejection, while directly optimizing for the geometric pose objective. We show both quantitatively and qualitatively that pose estimation performance on par with the classic pipeline can be achieved. Moreover, we are able to show that, with end-to-end training, the key components of the pipeline can be significantly improved, which leads to better generalizability to unseen datasets compared to existing learning-based methods.
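For reference, the classic pipeline that this framework makes end-to-end trainable looks roughly as follows in OpenCV: hand-crafted features, brute-force matching, and RANSAC-based outlier rejection before pose recovery. The detector choice and thresholds are illustrative; img1 and img2 are grayscale arrays and K is the 3x3 intrinsics matrix.

import cv2
import numpy as np

def relative_pose(img1, img2, K):
    # Detect and describe keypoints (hand-crafted features).
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    # Match descriptors.
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Sampling-based outlier rejection + essential matrix estimation.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t  # rotation and unit translation direction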
42. Outlier-Robust Estimation: Hardness, Minimally-Tuned Algorithms, and Applications [PDF] Back to contents
Pasquale Antonante, Vasileios Tzoumas, Heng Yang, Luca Carlone
Abstract: Nonlinear estimation in robotics and vision is typically plagued with outliers due to wrong data association, or to incorrect detections from signal processing and machine learning methods. This paper introduces two unifying formulations for outlier-robust estimation, Generalized Maximum Consensus (G-MC) and Generalized Truncated Least Squares (G-TLS), and investigates fundamental limits, practical algorithms, and applications. Our first contribution is a proof that outlier-robust estimation is inapproximable: in the worst case, it is impossible to (even approximately) find the set of outliers, even with slower-than-polynomial-time algorithms (particularly, algorithms running in quasi-polynomial time). As a second contribution, we review and extend two general-purpose algorithms. The first, Adaptive Trimming (ADAPT), is combinatorial, and is suitable for G-MC; the second, Graduated Non-Convexity (GNC), is based on homotopy methods, and is suitable for G-TLS. We extend ADAPT and GNC to the case where the user does not have prior knowledge of the inlier-noise statistics (or the statistics may vary over time) and is unable to guess a reasonable threshold to separate inliers from outliers (as the one commonly used in RANSAC). We propose the first minimally-tuned algorithms for outlier rejection, that dynamically decide how to separate inliers from outliers. Our third contribution is an evaluation of the proposed algorithms on robot perception problems: mesh registration, image-based object detection (shape alignment), and pose graph optimization. ADAPT and GNC execute in real-time, are deterministic, outperform RANSAC, and are robust to 70-90% outliers. Their minimally-tuned versions also compare favorably with the state of the art, even though they do not rely on a noise bound for the inliers.
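The adaptive-trimming idea behind ADAPT can be sketched with linear least squares as a stand-in estimator: alternate between fitting on the current inlier set and discarding the worst residuals, with no inlier-noise threshold supplied by the user. The shrink factor, floor, and stopping rule below are illustrative simplifications; the actual algorithm decides adaptively when to stop.

import numpy as np

def adaptive_trimming(A, b, shrink=0.95, min_frac=0.3):
    # Alternate: solve on current inliers, then trim the largest residuals.
    idx = np.arange(len(b))
    while True:
        x, *_ = np.linalg.lstsq(A[idx], b[idx], rcond=None)
        r = np.abs(A[idx] @ x - b[idx])
        keep = max(int(shrink * len(idx)), int(min_frac * len(b)))
        if keep == len(idx):
            return x, idx
        idx = idx[np.argsort(r)[:keep]]   # keep the best-fitting points

# Toy problem: line fit with 40% gross outliers.
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 100)
A = np.c_[t, np.ones_like(t)]
b = 3 * t + 1 + 0.01 * rng.standard_normal(100)
b[:40] += rng.uniform(5, 10, 40)
x, inliers = adaptive_trimming(A, b)
print(np.round(x, 2))                     # close to [3. 1.]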
43. Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval [PDF] Back to contents
Aneeshan Sain, Ayan Kumar Bhunia, Yongxin Yang, Tao Xiang, Yi-Zhe Song
Abstract: Sketch as an image search query is an ideal alternative to text in capturing the fine-grained visual details. Prior successes on fine-grained sketch-based image retrieval (FG-SBIR) have demonstrated the importance of tackling the unique traits of sketches as opposed to photos, e.g., temporal vs. static, strokes vs. pixels, and abstract vs. pixel-perfect. In this paper, we study a further trait of sketches that has been overlooked to date, that is, they are hierarchical in terms of the levels of detail -- a person typically sketches up to various extents of detail to depict an object. This hierarchical structure is often visually distinct. In this paper, we design a novel network that is capable of cultivating sketch-specific hierarchies and exploiting them to match sketch with photo at corresponding hierarchical levels. In particular, features from a sketch and a photo are enriched using cross-modal co-attention, coupled with hierarchical node fusion at every level, to form a better embedding space in which to conduct retrieval. Experiments on common benchmarks show that our method outperforms the state of the art by a significant margin.
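A hedged sketch of the cross-modal co-attention step: each sketch patch feature is enriched with photo context and vice versa. This is a single-head simplification with the learned projection matrices omitted; shapes are illustrative.

import torch
import torch.nn.functional as F

def co_attention(sketch_feats, photo_feats, d=64):
    # Cross-attend in both directions and add the attended context back.
    attn_sp = F.softmax(sketch_feats @ photo_feats.transpose(-2, -1) / d ** 0.5, dim=-1)
    attn_ps = F.softmax(photo_feats @ sketch_feats.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (sketch_feats + attn_sp @ photo_feats,
            photo_feats + attn_ps @ sketch_feats)

s = torch.randn(8, 49, 64)   # 8 sketches, 7x7 patch grid, 64-d features
p = torch.randn(8, 49, 64)   # matching photos
s2, p2 = co_attention(s, p)
print(s2.shape, p2.shape)    # torch.Size([8, 49, 64]) twice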
44. Unselfie: Translating Selfies to Neutral-pose Portraits in the Wild [PDF] Back to contents
Liqian Ma, Zhe Lin, Connelly Barnes, Alexei A. Efros, Jingwan Lu
Abstract: Due to the ubiquity of smartphones, it is popular to take photos of one's self, or "selfies." Such photos are convenient to take, because they do not require specialized equipment or a third-party photographer. However, in selfies, constraints such as human arm length often make the body pose look unnatural. To address this issue, we introduce $\textit{unselfie}$, a novel photographic transformation that automatically translates a selfie into a neutral-pose portrait. To achieve this, we first collect an unpaired dataset, and introduce a way to synthesize paired training data for self-supervised learning. Then, to $\textit{unselfie}$ a photo, we propose a new three-stage pipeline, where we first find a target neutral pose, inpaint the body texture, and finally refine and composite the person on the background. To obtain a suitable target neutral pose, we propose a novel nearest pose search module that makes the reposing task easier and enables the generation of multiple neutral-pose results among which users can choose the best one they like. Qualitative and quantitative evaluations show the superiority of our pipeline over alternatives.
45. Generative Classifiers as a Basis for Trustworthy Computer Vision [PDF] Back to contents
Radek Mackowiak, Lynton Ardizzone, Ullrich Köthe, Carsten Rother
Abstract: With the maturing of deep learning systems, trustworthiness is becoming increasingly important for model assessment. We understand trustworthiness as the combination of explainability and robustness. Generative classifiers (GCs) are a promising class of models that are said to naturally accomplish these qualities. However, this has mostly been demonstrated on simple datasets such as MNIST, SVHN and CIFAR in the past. In this work, we firstly develop an architecture and training scheme that allows GCs to be trained on the ImageNet classification task, a more relevant level of complexity for practical computer vision. The resulting models use an invertible neural network architecture and achieve a competitive ImageNet top-1 accuracy of up to 76.2%. Secondly, we show the large potential of GCs for trustworthiness. Explainability and some aspects of robustness are vastly improved compared to standard feed-forward models, even when the GCs are just applied naively. While not all trustworthiness problems are solved completely, we argue from our observations that GCs are an extremely promising basis for further algorithms and modifications, as have been developed in the past for feedforward models to increase their trustworthiness. We release our trained model for download in the hope that it serves as a starting point for various other generative classification tasks, in much the same way as pretrained ResNet models do for discriminative classification.
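The decision rule of a generative classifier is worth spelling out: fit a density p(x|y) per class and predict argmax_y p(x|y)p(y). In the sketch below, Gaussian class-conditionals stand in for the paper's invertible neural networks, which supply exact log-likelihoods in the same role.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, size=(200, 2))   # training data, class 0
X1 = rng.normal(loc=+2.0, size=(200, 2))   # training data, class 1

# One density model per class (Gaussians as stand-ins for invertible nets).
models = [multivariate_normal(X.mean(0), np.cov(X.T)) for X in (X0, X1)]
log_prior = np.log(0.5)

def predict(x):
    # Bayes rule: argmax over log p(x|y) + log p(y).
    return int(np.argmax([m.logpdf(x) + log_prior for m in models]))

print(predict([-1.8, -2.1]), predict([2.2, 1.9]))  # 0 1

Because the class scores are likelihoods rather than raw logits, a low p(x|y) under every class directly signals an unfamiliar input, which is one commonly cited source of the explainability benefits discussed above.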
46. Hybrid Deep Learning Gaussian Process for Diabetic Retinopathy Diagnosis and Uncertainty Quantification [PDF] Back to contents
Santiago Toledo-Cortés, Melissa De La Pava, Oscar Perdómo, Fabio A. González
Abstract: Diabetic Retinopathy (DR) is one of the microvascular complications of Diabetes Mellitus, which remains one of the leading causes of blindness worldwide. Computational models based on Convolutional Neural Networks represent the state of the art for the automatic detection of DR using eye fundus images. Most of the current work addresses this problem as a binary classification task. However, including grade estimation and quantification of prediction uncertainty can potentially increase the robustness of the model. In this paper, a hybrid Deep Learning-Gaussian process method for DR diagnosis and uncertainty quantification is presented. This method combines the representational power of deep learning with the ability of Gaussian process models to generalize from small datasets. The results show that uncertainty quantification in the predictions improves the interpretability of the method as a diagnostic support tool. The source code to replicate the experiments is publicly available at this https URL.
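A minimal sketch of the hybrid: a (pretrained) CNN maps each fundus image to a feature vector, and a Gaussian process regresses the DR grade on those features, with the predictive standard deviation as the uncertainty estimate. Random vectors stand in for CNN embeddings; the kernel and noise settings are assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 32))                 # stand-in CNN features
grade = feats[:, 0] + 0.1 * rng.standard_normal(100)   # stand-in DR grades

gp = GaussianProcessRegressor(kernel=RBF(length_scale=5.0), alpha=1e-2)
gp.fit(feats, grade)
mean, std = gp.predict(rng.standard_normal((5, 32)), return_std=True)
print(np.round(mean, 2), np.round(std, 2))             # grade + uncertainty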
47. Comparative study of deep learning methods for the automatic segmentation of lung, lesion and lesion type in CT scans of COVID-19 patients [PDF] Back to contents
Sofie Tilborghs, Ine Dirks, Lucas Fidon, Siri Willems, Tom Eelbode, Jeroen Bertels, Bart Ilsen, Arne Brys, Adriana Dubbeldam, Nico Buls, Panagiotis Gonidakis, Sebastián Amador Sánchez, Annemiek Snoeckx, Paul M. Parizel, Johan de Mey, Dirk Vandermeulen, Tom Vercauteren, David Robben, Dirk Smeets, Frederik Maes, Jef Vandemeulebroucke, Paul Suetens
Abstract: Recent research on COVID-19 suggests that CT imaging provides useful information to assess disease progression and assist diagnosis, in addition to help understanding the disease. There is an increasing number of studies that propose to use deep learning to provide fast and accurate quantification of COVID-19 using chest CT scans. The main tasks of interest are the automatic segmentation of lung and lung lesions in chest CT scans of confirmed or suspected COVID-19 patients. In this study, we compare twelve deep learning algorithms using a multi-center dataset, including both open-source and in-house developed algorithms. Results show that ensembling different methods can boost the overall test set performance for lung segmentation, binary lesion segmentation and multiclass lesion segmentation, resulting in mean Dice scores of 0.982, 0.724 and 0.469, respectively. The resulting binary lesions were segmented with a mean absolute volume error of 91.3 ml. In general, the task of distinguishing different lesion types was more difficult, with a mean absolute volume difference of 152 ml and mean Dice scores of 0.369 and 0.523 for consolidation and ground glass opacity, respectively. All methods perform binary lesion segmentation with an average volume error that is better than visual assessment by human raters, suggesting these methods are mature enough for a large-scale evaluation for use in clinical practice.
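For reference, the two ingredients the evaluation rests on, sketched in NumPy: the Dice overlap score and one simple way to ensemble binary masks by majority vote (the paper's exact ensembling scheme may differ).

import numpy as np

def dice(pred, gt, eps=1e-8):
    # Dice = 2*|P & G| / (|P| + |G|), the overlap metric reported above.
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

def majority_vote(masks):
    # A pixel is foreground if more than half of the models say so.
    return np.mean(masks, axis=0) > 0.5

gt = np.zeros((64, 64), bool)
gt[20:40, 20:40] = True
preds = [np.roll(gt, s, axis=0) for s in (-1, 0, 2)]   # three noisy models
print(round(dice(majority_vote(preds), gt), 3))        # 1.0: vote recovers gt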
48. Searching for Pneumothorax in Half a Million Chest X-Ray Images [PDF] Back to contents
Antonio Sze-To, Hamid Tizhoosh
Abstract: Pneumothorax, a collapsed or dropped lung, is a fatal condition typically detected on a chest X-ray by an experienced radiologist. Due to a shortage of such experts, automated detection systems based on deep neural networks have been developed. Nevertheless, applying such systems in practice remains a challenge. These systems, which mostly compute a single probability as output, may not be enough for diagnosis. On the contrary, content-based medical image retrieval (CBIR) systems, such as image search, can assist clinicians for diagnostic purposes by enabling them to compare the case they are examining with previous (already diagnosed) cases. However, there is a lack of study on such attempts. In this study, we explored the use of image search to classify pneumothorax among chest X-ray images. All chest X-ray images were first tagged with deep pretrained features, which were obtained from existing deep learning models. Given a query chest X-ray image, the majority voting of the top K retrieved images was then used as a classifier, in which similar cases from the archive of past cases are provided alongside the probability output. In our experiments, 551,383 chest X-ray images were obtained from three large recently released public datasets. Using 10-fold cross-validation, it is shown that image search on deep pretrained features achieved promising results compared to those obtained by traditional classifiers trained on the same features. To the best of our knowledge, this is the first study to demonstrate that deep pretrained features can be used for CBIR of pneumothorax in half a million chest X-ray images.
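A minimal sketch of classification by image search as described: retrieve the K most similar past cases in deep-feature space and take a majority vote of their confirmed diagnoses. Cosine similarity and k=11 are illustrative assumptions.

import numpy as np

def search_classify(query_feat, archive_feats, archive_labels, k=11):
    # Normalize, rank the archive by cosine similarity, vote over the top k.
    a = archive_feats / np.linalg.norm(archive_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    top_k = np.argsort(a @ q)[-k:]
    votes = archive_labels[top_k]
    return int(votes.sum() > k // 2), top_k  # label + cases to show clinician

rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 128))   # stand-in deep features
labels = rng.integers(0, 2, 1000)          # 1 = pneumothorax
pred, similar = search_classify(rng.standard_normal(128), feats, labels)
print(pred, similar[:3])

Unlike a bare probability, the retrieved cases themselves can be displayed next to the query, which is the diagnostic-support argument made above.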
49. Few shot domain adaptation for in situ macromolecule structural classification in cryo-electron tomograms [PDF] Back to contents
Liangyong Yu, Ran Li, Xiangrui Zeng, Hongyi Wang, Jie Jin, Ge Yang, Rui Jiang, Min Xu
Abstract: Motivation: Cryo-Electron Tomography (cryo-ET) visualizes the structure and spatial organization of macromolecules and their interactions with other subcellular components inside single cells in the close-to-native state at sub-molecular resolution. Such information is critical for the accurate understanding of cellular processes. However, subtomogram classification remains one of the major challenges for the systematic recognition and recovery of the macromolecule structures in cryo-ET because of imaging limits and data quantity. Recently, deep learning has significantly improved the throughput and accuracy of large-scale subtomogram classification. However, it is often difficult to get enough high-quality annotated subtomogram data for supervised training due to the enormous expense of labeling. To tackle this problem, it is beneficial to utilize another already annotated dataset to assist the training process. However, due to the discrepancy of image intensity distribution between source domain and target domain, the model trained on subtomograms in the source domain may perform poorly in predicting subtomogram classes in the target domain. Results: In this paper, we adapt a few-shot domain adaptation method for deep learning based cross-domain subtomogram classification. The essential idea of our method consists of two parts: 1) take full advantage of the distribution of plentiful unlabeled target domain data, and 2) exploit the correlation between the whole source domain dataset and the few labeled target domain data. Experiments conducted on simulated and real datasets show that our method achieves significant improvement on cross-domain subtomogram classification compared with baseline methods.
50. Very Deep Super-Resolution of Remotely Sensed Images with Mean Square Error and Var-norm Estimators as Loss Functions [PDF] Back to contents
Antigoni Panagiotopoulou, Lazaros Grammatikopoulos, Eleni Charou, Emmanuel Bratsolis, Nicholas Madamopoulos, John Petrogonas
Abstract: In this work, the very deep super-resolution (VDSR) method is presented for improving the spatial resolution of remotely sensed (RS) images for scale factor 4. The VDSR net is re-trained with Sentinel-2 images and with drone aero orthophoto images, thus becoming RS-VDSR and Aero-VDSR, respectively. A novel loss function, the Var-norm estimator, is proposed in the regression layer of the convolutional neural network during re-training and prediction. According to numerical and optical comparisons, the proposed nets RS-VDSR and Aero-VDSR can outperform VDSR during prediction with RS images. RS-VDSR outperforms VDSR by up to 3.16 dB in terms of PSNR on Sentinel-2 images.
51. Black-box Adversarial Sample Generation Based on Differential Evolution [PDF] Back to contents
Junyu Lin, Lei Xu, Yingqi Liu, Xiangyu Zhang
Abstract: Deep Neural Networks (DNNs) are being used in various daily tasks such as object detection, speech processing, and machine translation. However, it is known that DNNs suffer from robustness problems -- perturbed inputs called adversarial samples leading to misbehaviors of DNNs. In this paper, we propose a black-box technique called Black-box Momentum Iterative Fast Gradient Sign Method (BMI-FGSM) to test the robustness of DNN models. The technique does not require any knowledge of the structure or weights of the target DNN. Compared to existing white-box testing techniques that require accessing model internal information such as gradients, our technique approximates gradients through Differential Evolution and uses the approximated gradients to construct adversarial samples. Experimental results show that our technique can achieve 100% success in generating adversarial samples to trigger misclassification, and over 95% success in generating samples to trigger misclassification to a specific target output label. It also demonstrates better perturbation distance and better transferability. Compared to the state-of-the-art black-box technique, our technique is more efficient. Furthermore, we conduct testing on the commercial Aliyun API and successfully trigger its misbehavior within a limited number of queries, demonstrating the feasibility of real-world black-box attacks.
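A sketch of the two ingredients, assuming only loss-query access to the model: a black-box gradient estimate (random finite differences here, where the paper uses Differential Evolution) plugged into the momentum-iterative FGSM update. All hyper-parameters are illustrative.

import numpy as np

def estimate_grad(f, x, sigma=1e-2, n=50, rng=None):
    # Gradient estimate from loss queries only; stands in for the paper's
    # Differential-Evolution-based approximation.
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(n):
        u = rng.standard_normal(x.shape)
        g += (f(x + sigma * u) - f(x - sigma * u)) / (2 * sigma) * u
    return g / n

def momentum_ifgsm(f, x, eps=0.03, steps=10, mu=1.0):
    # Accumulate L1-normalized gradients with momentum, step along their
    # sign, and keep the perturbation inside the L_inf ball of radius eps.
    x_adv, g = x.copy(), np.zeros_like(x)
    alpha = eps / steps
    for _ in range(steps):
        grad = estimate_grad(f, x_adv)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv

loss = lambda x: float((x ** 2).sum())          # toy stand-in for model loss
print(momentum_ifgsm(loss, np.ones(4))[:2])     # ~[1.03 1.03]: at the eps edge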
52. DeepPeep: Exploiting Design Ramifications to Decipher the Architecture of Compact DNNs [PDF] Back to contents
Nandan Kumar Jha, Sparsh Mittal, Binod Kumar, Govardhan Mattela
Abstract: The remarkable predictive performance of deep neural networks (DNNs) has led to their adoption in service domains of unprecedented scale and scope. However, the widespread adoption and growing commercialization of DNNs have underscored the importance of intellectual property (IP) protection. Devising techniques to ensure IP protection has become necessary due to the increasing trend of outsourcing DNN computations to untrusted accelerators in cloud-based services. The design methodologies and hyper-parameters of DNNs are crucial information, and leaking them may cause massive economic loss to the organization. Furthermore, knowledge of a DNN's architecture can increase the success probability of an adversarial attack in which an adversary perturbs the inputs to alter the prediction. In this work, we devise a two-stage attack methodology, "DeepPeep", which exploits the distinctive characteristics of design methodologies to reverse-engineer the architecture of building blocks in compact DNNs. We show the efficacy of "DeepPeep" on P100 and P4000 GPUs. Additionally, we propose intelligent design-maneuvering strategies for thwarting IP theft through the DeepPeep attack, and propose "Secure MobileNet-V1". Interestingly, compared to vanilla MobileNet-V1, Secure MobileNet-V1 provides a significant reduction in inference latency ($\approx$60%) and an improvement in predictive performance ($\approx$2%) with very low memory and computation overheads.
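The abstract does not reveal which design ramifications DeepPeep exploits, so the following is only a generic illustration of architecture fingerprinting: profile candidate building blocks on hardware the attacker controls, then match an observed per-layer latency against those fingerprints. The block zoo, the input shape, and the `observed` timing are hypothetical placeholders, not the paper's actual side channel.

```python
import time
import torch
import torch.nn as nn

# Candidate building blocks that characterize compact-DNN design styles.
blocks = {
    "standard_conv": nn.Conv2d(64, 64, 3, padding=1),
    "depthwise_separable": nn.Sequential(
        nn.Conv2d(64, 64, 3, padding=1, groups=64),  # depthwise
        nn.Conv2d(64, 64, 1)),                       # pointwise
}

def mean_latency(mod, x, reps=50):
    """Average forward-pass latency of a block on the attacker's hardware."""
    with torch.no_grad():
        mod(x)                                       # warm-up
        t0 = time.perf_counter()
        for _ in range(reps):
            mod(x)
        return (time.perf_counter() - t0) / reps

x = torch.randn(1, 64, 56, 56)
fingerprints = {name: mean_latency(m, x) for name, m in blocks.items()}

# Hypothetical per-layer latency observed from the victim's execution;
# match it to the closest fingerprint to guess the block type.
observed = 0.004
guess = min(fingerprints, key=lambda n: abs(fingerprints[n] - observed))
print(f"closest building block: {guess}")
```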
53. Unsupervised Event Detection, Clustering, and Use Case Exposition in Micro-PMU Measurements [PDF] 返回目录
Armin Aligholian, Alireza Shahsavari, Emma Stewart, Ed Cortez, Hamed Mohsenian-Rad
Abstract: Distribution-level phasor measurement units, a.k.a. micro-PMUs, report a large volume of high-resolution phasor measurements that constitute event signatures of the various phenomena occurring all across power distribution feeders. In order to implement an event-based analysis that has useful applications for the utility operator, one needs to extract these events from a large volume of micro-PMU data. However, due to the infrequent, unscheduled, and unknown nature of the events, it is often a challenge even to figure out what kinds of events are out there to capture and scrutinize. In this paper, we seek to address this open problem by developing an unsupervised approach that requires minimal prior human knowledge. First, we develop an unsupervised event detection method based on the concept of Generative Adversarial Networks (GANs). It works by training deep neural networks that learn the characteristics of the normal trends in micro-PMU measurements, and accordingly detects an event whenever there is an abnormality. We also propose a two-step unsupervised clustering method based on a novel linear mixed-integer programming formulation. It helps us categorize events based on their origin in the first step and on their similarity in the second step. The active nature of the proposed clustering method makes it capable of identifying new clusters of events on an ongoing basis. The proposed unsupervised event detection and clustering methods are applied to real-world micro-PMU data. Results show that they can outperform the prevalent methods in the literature. These methods also facilitate our further analysis to identify important clusters of events, which leads to unmasking several use cases that could be of value to the utility operator.
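The abstract describes the detector only at the level of "learn the normal trends with a GAN, flag abnormality"; a minimal sketch of that pattern follows, assuming fixed-length sliding windows of measurements and using the discriminator's score as the anomaly signal. The window length, network sizes, and threshold are assumptions, and the paper's architecture and scoring rule may differ.

```python
import torch
import torch.nn as nn

WIN = 128  # samples per sliding window of micro-PMU measurements (assumed)

# Small 1-D GAN over measurement windows, trained only on event-free data.
G = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, WIN))
D = nn.Sequential(nn.Linear(WIN, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):
    """One adversarial step on a batch of normal windows, shape (batch, WIN)."""
    n = real.size(0)
    fake = G(torch.randn(n, 32))
    # Discriminator: real normal windows -> 1, generated windows -> 0.
    opt_d.zero_grad()
    d_loss = (bce(D(real), torch.ones(n, 1)) +
              bce(D(fake.detach()), torch.zeros(n, 1)))
    d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(n, 1))
    g_loss.backward(); opt_g.step()

def is_event(window, threshold=0.3):
    """After training, D scores windows that deviate from the learned
    normal trends low; flag those as events."""
    with torch.no_grad():
        return D(window.view(1, -1)).item() < threshold
```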
54. Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation [PDF] 返回目录
Yu Xiang, Christopher Xie, Arsalan Mousavian, Dieter Fox
Abstract: Segmenting unseen objects in cluttered scenes is an important skill that robots need to acquire in order to perform tasks in new environments. In this work, we propose a new method for unseen object instance segmentation by learning RGB-D feature embeddings from synthetic data. A metric learning loss function is utilized to learn to produce pixel-wise feature embeddings such that pixels from the same object are close to each other and pixels from different objects are separated in the embedding space. With the learned feature embeddings, a mean shift clustering algorithm can be applied to discover and segment unseen objects. We further improve the segmentation accuracy with a new two-stage clustering algorithm. Our method demonstrates that non-photorealistic synthetic RGB and depth images can be used to learn feature embeddings that transfer well to real-world images for unseen object instance segmentation.
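A standard way to realize the described metric-learning objective is a pull/push (discriminative) loss over pixel embeddings, with mean shift clustering applied at test time; the sketch below follows that generic recipe in PyTorch plus scikit-learn. The margins `delta_v` and `delta_d`, the bandwidth, and the loss form are illustrative; the paper's exact loss and its two-stage clustering variant are not reproduced here.

```python
import torch
from sklearn.cluster import MeanShift

def pixel_embedding_loss(emb, inst, delta_v=0.5, delta_d=1.5):
    """emb: (D, H, W) pixel-wise embeddings; inst: (H, W) instance ids.
    Pull pixels toward their object's centroid; push centroids apart."""
    D = emb.size(0)
    emb, inst = emb.view(D, -1), inst.view(-1)
    ids = inst.unique()
    centroids, pull = [], 0.0
    for k in ids:
        e = emb[:, inst == k]                    # embeddings of one object
        c = e.mean(dim=1, keepdim=True)
        centroids.append(c)
        # Hinged intra-object variance: only penalize pixels beyond delta_v.
        pull = pull + torch.clamp((e - c).norm(dim=0) - delta_v,
                                  min=0).pow(2).mean()
    C = torch.cat(centroids, dim=1).t()          # (K, D) centroid matrix
    K = C.size(0)
    if K > 1:
        # Hinged inter-object distance: centroids closer than 2*delta_d repel.
        dist = torch.cdist(C, C)
        push = torch.clamp(2 * delta_d - dist, min=0).pow(2)
        push = push[~torch.eye(K, dtype=torch.bool)].mean()
    else:
        push = 0.0
    return pull / len(ids) + push

def segment(emb, bandwidth=0.7):
    """At test time, cluster pixel embeddings with mean shift; each
    cluster id becomes one (unseen) object instance mask."""
    D, H, W = emb.shape
    feats = emb.view(D, -1).t().detach().cpu().numpy()
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(feats)
    return labels.reshape(H, W)
```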
55. OrcVIO: Object residual constrained Visual-Inertial Odometry [PDF] 返回目录
Mo Shan, Qiaojun Feng, Nikolay Atanasov
Abstract: Introducing object-level semantic information into a simultaneous localization and mapping (SLAM) system is critical. It not only improves performance but also enables tasks specified in terms of meaningful objects. This work presents OrcVIO, a visual-inertial odometry approach tightly coupled with tracking and optimization over structured object models. OrcVIO differentiates through semantic-feature and bounding-box reprojection errors to perform batch optimization over the poses and shapes of objects. The estimated object states aid real-time incremental optimization over the IMU-camera states. The ability of OrcVIO to perform accurate trajectory estimation and large-scale object-level mapping is evaluated using real data.
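The abstract names a bounding-box reprojection error but does not define it; the sketch below shows a simplified, axis-aligned version of such a residual: project the corners of an object's 3-D box with the current pose and compare the tight 2-D box of the projections against the detection. OrcVIO's actual object model (shape parameters, semantic keypoints) is richer, so treat this as an illustration of the residual's role in batch optimization.

```python
import numpy as np

def project(K, R, t, pts3d):
    """Pinhole projection of (N, 3) world points into pixel coordinates."""
    cam = R @ pts3d.T + t.reshape(3, 1)   # world frame -> camera frame
    uv = (K @ cam)[:2] / cam[2]           # perspective divide
    return uv.T                           # (N, 2)

def bbox_reprojection_residual(K, R, t, corners3d, det_box):
    """Residual between the projection of an object's eight 3-D bounding-box
    corners and a detected 2-D box (xmin, ymin, xmax, ymax). Driving this
    4-vector to zero in a least-squares solver makes the projected box fit
    the detection, constraining both camera pose and object state."""
    uv = project(K, R, t, corners3d)
    proj_box = np.array([uv[:, 0].min(), uv[:, 1].min(),
                         uv[:, 0].max(), uv[:, 1].max()])
    return proj_box - np.asarray(det_box)
```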
Note: the Chinese summaries in this digest are machine-translated, and the cover image is a word cloud of the paper titles.