Contents
3. Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures [PDF] Abstract
8. SPOC learner's final grade prediction based on a novel sampling batch normalization embedded neural network method [PDF] Abstract
11. Practical Auto-Calibration for Spatial Scene-Understanding from Crowdsourced Dashcamera Videos [PDF] Abstract
13. Robots Understanding Contextual Information in Human-Centered Environments using Weakly Supervised Mask Data Distillation [PDF] Abstract
14. Cluster, Split, Fuse, and Update: Meta-Learning for Open Compound Domain Adaptive Semantic Segmentation [PDF] Abstract
16. FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion [PDF] Abstract
20. Point-Level Temporal Action Localization: Bridging Fully-supervised Proposals to Weakly-supervised Losses [PDF] Abstract
24. Unsupervised Domain Adaptation from Synthetic to Real Images for Anchorless Object Detection [PDF] Abstract
26. Representing Ambiguity in Registration Problems with Conditional Invertible Neural Networks [PDF] Abstract
30. NeuralQAAD: An Efficient Differentiable Framework for High Resolution Point Cloud Compression [PDF] Abstract
40. Semantic-Guided Representation Enhancement for Self-supervised Monocular Trained Depth Estimation [PDF] Abstract
43. DeepGamble: Towards unlocking real-time player intelligence using multi-layer instance segmentation and attribute detection [PDF] Abstract
46. Automatic Vertebra Localization and Identification in CT by Spine Rectification and Anatomically-constrained Optimization [PDF] Abstract
50. Do not repeat these mistakes -- a critical appraisal of applications of explainable artificial intelligence for image based COVID-19 detection [PDF] Abstract
52. Enhance Multimodal Transformer With External Label And In-Domain Pretrain: Hateful Meme Challenge Winning Solution [PDF] Abstract
54. Frozen-to-Paraffin: Categorization of Histological Frozen Sections by the Aid of Paraffin Sections and Generative Adversarial Networks [PDF] Abstract
57. Towards broader generalization of deep learning methods for multiple sclerosis lesion segmentation [PDF] Abstract
59. Learning Collision-Free Space Detection from Stereo Images: Homography Matrix Brings Better Data Augmentation [PDF] Abstract
Abstracts
1. FLAVR: Flow-Agnostic Video Representations for Fast Frame Interpolation [PDF] Back to Contents
Tarun Kalluri, Deepak Pathak, Manmohan Chandraker, Du Tran
Abstract: A majority of approaches solve the problem of video frame interpolation by computing bidirectional optical flow between adjacent frames of a video followed by a suitable warping algorithm to generate the output frames. However, methods relying on optical flow often fail to model occlusions and complex non-linear motions directly from the video and introduce additional bottlenecks unsuitable for real-time deployment. To overcome these limitations, we propose a flexible and efficient architecture that makes use of 3D space-time convolutions to enable end-to-end learning and inference for the task of video frame interpolation. Our method efficiently learns to reason about non-linear motions, complex occlusions and temporal abstractions, resulting in improved performance on video interpolation, while requiring no additional inputs in the form of optical flow or depth maps. Due to its simplicity, our proposed method improves the inference speed by 384x compared to the current most accurate method and 23x compared to the current fastest on 8x interpolation. In addition, we evaluate our model on a wide range of challenging settings and consistently demonstrate superior qualitative and quantitative results compared with current methods on various popular benchmarks including Vimeo-90K, UCF101, DAVIS, Adobe, and GoPro. Finally, we demonstrate that video frame interpolation can serve as a useful self-supervised pretext task for action recognition, optical flow estimation, and motion magnification.
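The core idea, replacing flow estimation and warping with 3D space-time convolutions that map a stack of context frames directly to the target frame, can be illustrated with a minimal sketch. The `SpaceTimeInterpolator` module and its layer sizes below are assumptions for illustration, not the published FLAVR architecture:

```python
# Minimal sketch of flow-free frame interpolation with 3D space-time
# convolutions (hypothetical layer sizes, not the published FLAVR network).
import torch
import torch.nn as nn

class SpaceTimeInterpolator(nn.Module):
    """Maps a stack of context frames directly to an intermediate frame."""
    def __init__(self, in_frames=4, feat=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, feat, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat, feat, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(inplace=True),
        )
        # Collapse the temporal axis and predict RGB for the middle frame.
        self.head = nn.Conv2d(feat * in_frames, 3, kernel_size=3, padding=1)

    def forward(self, frames):               # frames: (B, 3, T, H, W)
        x = self.encoder(frames)             # (B, feat, T, H, W)
        b, c, t, h, w = x.shape
        x = x.reshape(b, c * t, h, w)        # fuse time into channels
        return torch.sigmoid(self.head(x))   # (B, 3, H, W), no optical flow

frames = torch.rand(1, 3, 4, 64, 64)         # four context frames
mid = SpaceTimeInterpolator()(frames)
print(mid.shape)                             # torch.Size([1, 3, 64, 64])
```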
2. GTA: Global Temporal Attention for Video Action Understanding [PDF] Back to Contents
Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, Abhinav Shrivastava
Abstract: Self-attention learns pairwise interactions via dot products to model long-range dependencies, yielding great improvements for video action recognition. In this paper, we seek a deeper understanding of self-attention for temporal modeling in videos. In particular, we demonstrate that the entangled modeling of spatial-temporal information by flattening all pixels is sub-optimal, failing to capture temporal relationships among frames explicitly. We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. Unlike conventional self-attention that computes an instance-specific attention matrix, GTA randomly initializes a global attention matrix that is intended to learn stable temporal structures to generalize across different samples. GTA is further augmented with a cross-channel multi-head fashion to exploit feature interactions for better temporal modeling. We apply GTA not only on pixels but also on semantically similar regions identified automatically by a learned transformation matrix. Extensive experiments on 2D and 3D networks demonstrate that our approach consistently enhances the temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
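The key contrast with standard self-attention, a T x T temporal attention map that is a learned, sample-independent parameter rather than being computed from the input, can be sketched as below. This is a minimal illustration; the published GTA additionally decouples spatial attention and adds the cross-channel multi-head mechanism:

```python
# Sketch of a globally shared temporal attention matrix (illustrative only).
import torch
import torch.nn as nn

class GlobalTemporalAttention(nn.Module):
    def __init__(self, num_frames, dim):
        super().__init__()
        # Unlike self-attention, the T x T attention map is a learned
        # parameter shared by every sample, not computed from the input.
        self.attn = nn.Parameter(torch.randn(num_frames, num_frames))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                    # x: (B, T, dim) per-frame features
        weights = self.attn.softmax(dim=-1)  # stable temporal structure
        out = torch.einsum('st,btd->bsd', weights, x)
        return x + self.proj(out)            # residual connection

feats = torch.rand(2, 8, 256)                # 8 frames, 256-d features
print(GlobalTemporalAttention(8, 256)(feats).shape)  # torch.Size([2, 8, 256])
```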
3. Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures [PDF] Back to Contents
David Ding, Felix Hill, Adam Santoro, Matt Botvinick
Abstract: Neural networks have achieved success in a wide array of perceptual tasks, but it is often stated that they are incapable of solving tasks that require higher-level reasoning. Two new task domains, CLEVRER and CATER, have recently been developed to focus on reasoning, as opposed to perception, in the context of spatio-temporal interactions between objects. Initial experiments on these domains found that neuro-symbolic approaches, which couple a logic engine and language parser with a neural perceptual front-end, substantially outperform fully-learned distributed networks, a finding that was taken to support the above thesis. Here, we show on the contrary that a fully-learned neural network with the right inductive biases can perform substantially better than all previous neural-symbolic models on both of these tasks, particularly on questions that most emphasize reasoning over perception. Our model makes critical use of both self-attention and learned "soft" object-centric representations, as well as BERT-style semi-supervised predictive losses. These flexible biases allow our model to surpass the previous neuro-symbolic state-of-the-art using less than 60% of available labelled data. Together, these results refute the neuro-symbolic thesis laid out by previous work involving these datasets, and they provide evidence that neural networks can indeed learn to reason effectively about the causal, dynamic structure of physical events.
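A minimal sketch of the central inductive bias, self-attention applied over learned "soft" object-centric slots rather than raw pixels, is given below. The simple dot-product slot pooling and the layer sizes are simplifying assumptions, not the authors' full model:

```python
# Sketch of self-attention over "soft" object-centric slots (illustrative).
import torch
import torch.nn as nn

class ObjectSlotAttention(nn.Module):
    def __init__(self, num_slots=8, dim=128):
        super().__init__()
        self.slot_queries = nn.Parameter(torch.randn(num_slots, dim))
        self.reasoner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, pixel_feats):  # (B, N_pixels, dim) from a CNN front-end
        # Soft assignment of pixels to object slots via dot-product attention.
        logits = torch.einsum('sd,bnd->bsn', self.slot_queries, pixel_feats)
        slots = torch.einsum('bsn,bnd->bsd', logits.softmax(dim=-1), pixel_feats)
        return self.reasoner(slots)  # relational reasoning over object slots

feats = torch.rand(2, 196, 128)      # e.g. a 14x14 feature map, flattened
print(ObjectSlotAttention()(feats).shape)  # torch.Size([2, 8, 128])
```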
4. Object-Centric Neural Scene Rendering [PDF] Back to Contents
Michelle Guo, Alireza Fathi, Jiajun Wu, Thomas Funkhouser
Abstract: We present a method for composing photorealistic scenes from captured images of objects. Our work builds upon neural radiance fields (NeRFs), which implicitly model the volumetric density and directionally-emitted radiance of a scene. While NeRFs synthesize realistic pictures, they only model static scenes and are closely tied to specific imaging conditions. This property makes NeRFs hard to generalize to new scenarios, including new lighting or new arrangements of objects. Instead of learning a scene radiance field as a NeRF does, we propose to learn object-centric neural scattering functions (OSFs), a representation that models per-object light transport implicitly using a lighting- and view-dependent neural network. This enables rendering scenes even when objects or lights move, without retraining. Combined with a volumetric path tracing procedure, our framework is capable of rendering both intra- and inter-object light transport effects including occlusions, specularities, shadows, and indirect illumination. We evaluate our approach on scene composition and show that it generalizes to novel illumination conditions, producing photorealistic, physically accurate renderings of multi-object scenes.
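The central object, a lighting- and view-dependent neural scattering function, can be sketched as a small MLP mapping a 3D point plus incoming-light and outgoing-view directions to density and scattered radiance. This is a simplified stand-in; the published OSFs use positional encodings and are rendered with volumetric path tracing:

```python
# Sketch of an object-centric scattering function (illustrative sizes).
import torch
import torch.nn as nn

class ScatteringMLP(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Inputs: 3 (point) + 3 (light direction) + 3 (view direction).
        self.net = nn.Sequential(
            nn.Linear(9, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))  # sigma + RGB

    def forward(self, x, light_dir, view_dir):
        out = self.net(torch.cat([x, light_dir, view_dir], dim=-1))
        sigma = torch.relu(out[..., :1])    # non-negative volumetric density
        rgb = torch.sigmoid(out[..., 1:])   # direction-dependent radiance
        return sigma, rgb

pts = torch.rand(1024, 3)
sigma, rgb = ScatteringMLP()(pts, torch.rand(1024, 3), torch.rand(1024, 3))
print(sigma.shape, rgb.shape)  # torch.Size([1024, 1]) torch.Size([1024, 3])
```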
5. NAPA: Neural Art Human Pose Amplifier [PDF] Back to Contents
Qingfu Wan, Oliver Lu
Abstract: This is the project report for CSCI-GA.2271-001. We target human pose estimation in artistic images. For this goal, we design an end-to-end system that uses neural style transfer for pose regression. We collect a 277-style set for arbitrary style transfer and build an artistic 281-image test set. We directly run pose regression on the test set and show promising results. For pose regression, we propose a 2d-induced bone map from which pose is lifted. To help such a lifting, we additionally annotate the pseudo 3d labels of the full in-the-wild MPII dataset. Further, we append another style transfer as self supervision to improve 2d. We perform extensive ablation studies to analyze the introduced features. We also compare end-to-end with per-style training and allude to the tradeoff between style transfer and pose regression. Lastly, we generalize our model to the real-world human dataset and show its potentiality as a generic pose model. We explain the theoretical foundation in Appendix. We release our code, data, and video at this https URL.
6. Geometric Surface Image Prediction for Image Recognition Enhancement [PDF] Back to Contents
Tanasai Sucontphunt
Abstract: This work presents a method to predict a geometric surface image from a photograph to assist in image recognition. To recognize objects, several images from different conditions are required for training a model or fine-tuning a pre-trained model. In this work, a geometric surface image is introduced as a better representation than its color image counterpart to overcome lighting conditions. The surface image is predicted from a color image. To do so, the geometric surface image together with its color photographs are firstly trained with Generative Adversarial Networks (GAN) model. The trained generator model is then used to predict the geometric surface image from the input color image. The evaluation on a case study of an amulet recognition shows that the predicted geometric surface images contain less ambiguity than their color images counterpart under different lighting conditions and can be used effectively for assisting in image recognition task.
7. Detecting Invisible People [PDF] Back to Contents
Tarasha Khurana, Achal Dave, Deva Ramanan
Abstract: Monocular object detection and tracking have improved drastically in recent years, but rely on a key assumption: that objects are visible to the camera. Many offline tracking approaches reason about occluded objects post-hoc, by linking together tracklets after the object re-appears, making use of reidentification (ReID). However, online tracking in embodied robotic agents (such as a self-driving vehicle) fundamentally requires object permanence, which is the ability to reason about occluded objects before they re-appear. In this work, we re-purpose tracking benchmarks and propose new metrics for the task of detecting invisible objects, focusing on the illustrative case of people. We demonstrate that current detection and tracking systems perform dramatically worse on this task. We introduce two key innovations to recover much of this performance drop. We treat occluded object detection in temporal sequences as a short-term forecasting challenge, bringing to bear tools from dynamic sequence prediction. Second, we build dynamic models that explicitly reason in 3D, making use of observations produced by state-of-the-art monocular depth estimation networks. To our knowledge, ours is the first work to demonstrate the effectiveness of monocular depth estimation for the task of tracking and detecting occluded objects. Our approach strongly improves by 11.4% over the baseline in ablations and by 5.0% over the state-of-the-art in F1 score.
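The forecasting idea, lifting 2D detections into 3D with a monocular depth estimate and propagating them while the person is occluded, can be sketched with a constant-velocity model. This is a simplification (the paper learns the dynamics from data), and the camera intrinsics below are made-up values:

```python
# Sketch of occlusion-aware forecasting in 3D (simplified dynamics).
import numpy as np

def lift_to_3d(box_center, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with estimated depth z into camera coords."""
    u, v = box_center
    z = depth
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def forecast_occluded(track_3d, n_steps):
    """Propagate the last observed 3D state with constant velocity."""
    velocity = track_3d[-1] - track_3d[-2]
    return [track_3d[-1] + (k + 1) * velocity for k in range(n_steps)]

# Two observed frames (pixel center + depth) before the person disappears.
obs = [lift_to_3d((320, 240), 8.0, 700, 700, 320, 240),
       lift_to_3d((310, 240), 7.5, 700, 700, 320, 240)]
print(forecast_occluded(obs, n_steps=3))  # predicted 3D positions while hidden
```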
8. SPOC learner's final grade prediction based on a novel sampling batch normalization embedded neural network method [PDF] Back to Contents
Zhuonan Liang, Ziheng Liu, Huaze Shi, Yunlong Chen, Yanbin Cai, Yating Liang, Yafan Feng, Yuqing Yang, Jing Zhang, Peng Fu
Abstract: Recent years have witnessed the rapid growth of Small Private Online Courses (SPOC), which can be highly customized and personalized to adapt to varying educational demands, and machine learning techniques have been explored to summarize and predict learner performance, mostly focusing on the final grade. However, the final grade of SPOC learners is generally seriously imbalanced, which handicaps the training of prediction models. To solve this problem, a sampling batch normalization embedded deep neural network (SBNEDNN) method is developed in this paper. First, a combined indicator is defined to measure the distribution of the data, then a rule is established to guide the sampling process. Second, batch normalization (BN) modified layers are embedded into a fully connected neural network to solve the data imbalance problem. Experimental comparisons with three other deep learning methods demonstrate the superiority of the proposed method.
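The two ingredients named in the abstract, rebalanced batch sampling and BN layers embedded in a fully connected network, can be sketched as follows. Inverse-frequency weighting stands in for the paper's combined distribution indicator, which the abstract does not specify, and the layer sizes are assumptions:

```python
# Sketch: class-rebalanced batches + a BN-embedded fully connected network.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

X = torch.randn(1000, 20)                    # learner activity features
y = (torch.rand(1000) > 0.9).long()          # imbalanced final grades

# Inverse-frequency weights make rare grades as likely to be drawn as common ones.
class_counts = torch.bincount(y).float()
weights = (1.0 / class_counts)[y]
loader = DataLoader(TensorDataset(X, y), batch_size=64,
                    sampler=WeightedRandomSampler(weights, num_samples=len(y)))

model = nn.Sequential(
    nn.Linear(20, 64), nn.BatchNorm1d(64), nn.ReLU(),
    nn.Linear(64, 32), nn.BatchNorm1d(32), nn.ReLU(),
    nn.Linear(32, 2))                        # pass / fail logits

xb, yb = next(iter(loader))
print(model(xb).shape, yb.float().mean())    # batches are roughly balanced
```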
9. FINED: Fast Inference Network for Edge Detection [PDF] Back to Contents
Jan Kristanto Wibisono, Hsueh-Ming Hang
Abstract: In this paper, we address the design of lightweight deep learning-based edge detection. The deep learning technology offers a significant improvement on the edge detection accuracy. However, typical neural network designs have very high model complexity, which prevents it from practical usage. In contrast, we propose a Fast Inference Network for Edge Detection (FINED), which is a lightweight neural net dedicated to edge detection. By carefully choosing proper components for edge detection purpose, we can achieve the state-of-the-art accuracy in edge detection while significantly reducing its complexity. Another key contribution in increasing the inferencing speed is introducing the training helper concept. The extra subnetworks (training helper) are employed in training but not used in inferencing. It can further reduce the model complexity and yet maintain the same level of accuracy. Our experiments show that our systems outperform all the current edge detectors at about the same model (parameter) size.
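The training helper concept, auxiliary branches that receive supervision during training but are skipped at inference, can be sketched as follows. This is an illustrative toy network, not the published FINED architecture:

```python
# Sketch of "training helper" side branches: supervised during training,
# skipped at inference so the deployed model stays lightweight.
import torch
import torch.nn as nn

class EdgeNet(nn.Module):
    def __init__(self, feat=32):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, feat, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.helper = nn.Conv2d(feat, 1, 1)  # trained, never used at test time
        self.head = nn.Conv2d(feat, 1, 1)

    def forward(self, x, training_helpers=False):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        edge = torch.sigmoid(self.head(f2))
        if training_helpers:                 # extra supervision during training
            return edge, torch.sigmoid(self.helper(f1))
        return edge                          # lightweight inference path

img = torch.rand(1, 3, 64, 64)
edge, side = EdgeNet()(img, training_helpers=True)
print(edge.shape, side.shape)
```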
10. mDALU: Multi-Source Domain Adaptation and Label Unification with Partial Datasets [PDF] Back to Contents
Rui Gong, Dengxin Dai, Yuhua Chen, Wen Li, Luc Van Gool
Abstract: Object recognition advances very rapidly these days. One challenge is to generalize existing methods to new domains, to more classes and/or to new data modalities. In order to avoid annotating one dataset for each of these new cases, one needs to combine and reuse existing datasets that may belong to different domains, have partial annotations, and/or have different data modalities. This paper treats this task as a multi-source domain adaptation and label unification (mDALU) problem and proposes a novel method for it. Our method consists of a partially-supervised adaptation stage and a fully-supervised adaptation stage. In the former, partial knowledge is transferred from multiple source domains to the target domain and fused therein. Negative transfer between unmatched label space is mitigated via three new modules: domain attention, uncertainty maximization and attention-guided adversarial alignment. In the latter, knowledge is transferred in the unified label space after a label completion process with pseudo-labels. We verify the method on three different tasks, image classification, 2D semantic image segmentation, and joint 2D-3D semantic segmentation. Extensive experiments show that our method outperforms all competing methods significantly.
11. Practical Auto-Calibration for Spatial Scene-Understanding from Crowdsourced Dashcamera Videos [PDF] Back to Contents
Hemang Chawla, Matti Jukola, Shabbir Marzban, Elahe Arani, Bahram Zonooz
Abstract: Spatial scene-understanding, including dense depth and ego-motion estimation, is an important problem in computer vision for autonomous vehicles and advanced driver assistance systems. Thus, it is beneficial to design perception modules that can utilize crowdsourced videos collected from arbitrary vehicular onboard or dashboard cameras. However, the intrinsic parameters corresponding to such cameras are often unknown or change over time. Typical manual calibration approaches require objects such as a chessboard or additional scene-specific information. On the other hand, automatic camera calibration does not have such requirements. Yet, the automatic calibration of dashboard cameras is challenging as forward and planar navigation results in critical motion sequences with reconstruction ambiguities. Structure reconstruction of complete visual-sequences that may contain tens of thousands of images is also computationally untenable. Here, we propose a system for practical monocular onboard camera auto-calibration from crowdsourced videos. We show the effectiveness of our proposed system on the KITTI raw, Oxford RobotCar, and the crowdsourced D$^2$-City datasets in varying conditions. Finally, we demonstrate its application for accurate monocular dense depth and ego-motion estimation on uncalibrated videos.
12. Improved Image Matting via Real-time User Clicks and Uncertainty Estimation [PDF] Back to Contents
Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Hanqing Zhao, Weiming Zhang, Nenghai Yu
Abstract: Image matting is a fundamental and challenging problem in computer vision and graphics. Most existing matting methods leverage a user-supplied trimap as an auxiliary input to produce good alpha matte. However, obtaining high-quality trimap itself is arduous, thus restricting the application of these methods. Recently, some trimap-free methods have emerged, however, the matting quality is still far behind the trimap-based methods. The main reason is that, without the trimap guidance in some cases, the target network is ambiguous about which is the foreground target. In fact, choosing the foreground is a subjective procedure and depends on the user's intention. To this end, this paper proposes an improved deep image matting framework which is trimap-free and only needs several user click interactions to eliminate the ambiguity. Moreover, we introduce a new uncertainty estimation module that can predict which parts need polishing and a following local refinement module. Based on the computation budget, users can choose how many local parts to improve with the uncertainty guidance. Quantitative and qualitative results show that our method performs better than existing trimap-free methods and comparably to state-of-the-art trimap-based methods with minimal user effort.
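The uncertainty-guided local refinement can be sketched as selecting the most uncertain patches under a fixed budget. This is a minimal illustration: the patch size, the budget, and the `select_patches_to_refine` helper are assumptions, not the paper's modules:

```python
# Sketch: rank patches by predicted uncertainty; refine only the top-K,
# where K plays the role of the user's computation budget.
import torch
import torch.nn.functional as F

def select_patches_to_refine(uncertainty, patch=16, budget_k=4):
    """Return (row, col) indices of the budget_k most uncertain patches."""
    # Average the uncertainty inside each patch, then rank patches.
    per_patch = F.avg_pool2d(uncertainty, kernel_size=patch)  # (1,1,H/p,W/p)
    flat = per_patch.flatten()
    top = flat.topk(budget_k).indices
    cols = per_patch.shape[-1]
    return [(int(i) // cols, int(i) % cols) for i in top]

uncertainty = torch.rand(1, 1, 64, 64)  # hypothetical uncertainty-head output
print(select_patches_to_refine(uncertainty))  # patches sent to the refiner
```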
13. Robots Understanding Contextual Information in Human-Centered Environments using Weakly Supervised Mask Data Distillation [PDF] Back to Contents
Daniel Dworakowski, Goldie Nejat
Abstract: Contextual information in human environments, such as signs, symbols, and objects provide important information for robots to use for exploration and navigation. To identify and segment contextual information from complex images obtained in these environments, data-driven methods such as Convolutional Neural Networks (CNNs) are used. However, these methods require large amounts of human labeled data which are slow and time-consuming to obtain. Weakly supervised methods address this limitation by generating pseudo segmentation labels (PSLs). In this paper, we present the novel Weakly Supervised Mask Data Distillation (WeSuperMaDD) architecture for autonomously generating PSLs using CNNs not specifically trained for the task of context segmentation; i.e., CNNs trained for object classification, image captioning, etc. WeSuperMaDD uniquely generates PSLs using learned image features from sparse and limited diversity data; common in robot navigation tasks in human-centred environments (malls, grocery stores). Our proposed architecture uses a new mask refinement system which automatically searches for the PSL with the fewest foreground pixels that satisfies cost constraints. This removes the need for handcrafted heuristic rules. Extensive experiments successfully validated the performance of WeSuperMaDD in generating PSLs for datasets with text of various scales, fonts, and perspectives in multiple indoor/outdoor environments. A comparison with Naive, GrabCut, and Pyramid methods found a significant improvement in label and segmentation quality. Moreover, a context segmentation CNN trained using the WeSuperMaDD architecture achieved measurable improvements in accuracy compared to one trained with Naive PSLs. Our method also had comparable performance to existing state-of-the-art text detection and segmentation methods on real datasets without requiring segmentation labels for training.
14. Cluster, Split, Fuse, and Update: Meta-Learning for Open Compound Domain Adaptive Semantic Segmentation [PDF] Back to Contents
Rui Gong, Yuhua Chen, Danda Pani Paudel, Yawei Li, Ajad Chhatkuli, Wen Li, Dengxin Dai, Luc Van Gool
Abstract: Open compound domain adaptation (OCDA) is a domain adaptation setting, where target domain is modeled as a compound of multiple unknown homogeneous domains, which brings the advantage of improved generalization to unseen domains. In this work, we propose a principled meta-learning based approach to OCDA for semantic segmentation, MOCDA, by modeling the unlabeled target domain continuously. Our approach consists of four key steps. First, we cluster target domain into multiple sub-target domains by image styles, extracted in an unsupervised manner. Then, different sub-target domains are split into independent branches, for which batch normalization parameters are learnt to treat them independently. A meta-learner is thereafter deployed to learn to fuse sub-target domain-specific predictions, conditioned upon the style code. Meanwhile, we learn to online update the model by model-agnostic meta-learning (MAML) algorithm, thus to further improve generalization. We validate the benefits of our approach by extensive experiments on synthetic-to-real knowledge transfer benchmark datasets, where we achieve the state-of-the-art performance in both compound and open domains.
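The first of the four steps, clustering the unlabeled target domain into sub-target domains by image style, can be sketched with channel-wise statistics as a crude style code. Using mean/std as the style descriptor is an assumption for illustration; MOCDA extracts styles with an unsupervised learned encoder:

```python
# Sketch of the "cluster" step: k-means over simple per-image style codes.
import numpy as np
from sklearn.cluster import KMeans

def style_code(image):
    """Per-channel mean and std, a crude image-style descriptor."""
    flat = image.reshape(3, -1)
    return np.concatenate([flat.mean(axis=1), flat.std(axis=1)])

images = np.random.rand(200, 3, 32, 32)      # unlabeled target images
codes = np.stack([style_code(im) for im in images])
sub_domain = KMeans(n_clusters=4, n_init=10).fit_predict(codes)
print(np.bincount(sub_domain))               # sizes of the sub-target domains
```

Each resulting sub-domain would then get its own batch-normalization branch, with a meta-learner fusing the branch predictions as described above.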
15. Artificial Dummies for Urban Dataset Augmentation [PDF] Back to Contents
Antonín Vobecký, David Hurych, Michal Uřičář, Patrick Pérez, Josef Šivic
Abstract: Existing datasets for training pedestrian detectors in images suffer from limited appearance and pose variation. The most challenging scenarios are rarely included because they are too difficult to capture due to safety reasons, or they are very unlikely to happen. The strict safety requirements in assisted and autonomous driving applications call for an extra high detection accuracy also in these rare situations. Having the ability to generate people images in arbitrary poses, with arbitrary appearances and embedded in different background scenes with varying illumination and weather conditions, is a crucial component for the development and testing of such applications. The contributions of this paper are three-fold. First, we describe an augmentation method for controlled synthesis of urban scenes containing people, thus producing rare or never-seen situations. This is achieved with a data generator (called DummyNet) with disentangled control of the pose, the appearance, and the target background scene. Second, the proposed generator relies on novel network architecture and associated loss that takes into account the segmentation of the foreground person and its composition into the background scene. Finally, we demonstrate that the data generated by our DummyNet improve performance of several existing person detectors across various datasets as well as in challenging situations, such as night-time conditions, where only a limited amount of training data is available. In the setup with only day-time data available, we improve the night-time detector by $17\%$ log-average miss rate over the detector trained with the day-time data only.
16. FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion [PDF] Back to Contents
Lina Liu, Xibin Song, Xiaoyang Lyu, Junwei Diao, Mengmeng Wang, Yong Liu, Liangjun Zhang
Abstract: Depth completion aims to recover a dense depth map from a sparse depth map, with the corresponding color image as input. Recent approaches mainly formulate depth completion as a one-stage end-to-end learning task, which outputs dense depth maps directly. However, the feature extraction and supervision in one-stage frameworks are insufficient, limiting the performance of these approaches. To address this problem, we propose a novel end-to-end residual learning framework, which formulates depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage. First, a coarse dense depth map is obtained by a simple CNN framework. Then, a refined depth map is further obtained using a residual learning strategy in the coarse-to-fine stage, with the coarse depth map and color image as input. Specifically, in the coarse-to-fine stage, a channel shuffle extraction operation is utilized to extract more representative features from the color image and coarse depth map, and an energy-based fusion operation is exploited to effectively fuse the features obtained by the channel shuffle operation, thus leading to more accurate and refined depth maps. We achieve SoTA performance in RMSE on the KITTI benchmark. Extensive experiments on other datasets further demonstrate the superiority of our approach over current state-of-the-art depth completion approaches.
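The channel shuffle extraction operation presumably follows the ShuffleNet-style shuffle, which interleaves channels across groups so that color and depth features mix; a minimal sketch under that assumption:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave feature channels across groups (ShuffleNet-style).

    x: array of shape (N, C, H, W) with C divisible by `groups`.
    Reshape to (N, g, C//g, H, W), swap the two channel axes, flatten back.
    """
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)  # swap group axis and per-group axis
    return x.reshape(n, c, h, w)

x = np.arange(2 * 6 * 1 * 1).reshape(2, 6, 1, 1)
print(channel_shuffle(x, groups=2)[0, :, 0, 0])  # channel order 0,3,1,4,2,5
```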
17. HeadGAN: Video-and-Audio-Driven Talking Head Synthesis [PDF] 返回目录
Michail Christos Doukas, Stefanos Zafeiriou, Viktoriia Sharmanska
Abstract: Recent attempts to solve the problem of talking head synthesis using a single reference image have shown promising results. However, most of them fail to address the identity preservation problem, or perform poorly in terms of photo-realism, especially in extreme head poses. We propose HeadGAN, a novel reenactment approach that conditions synthesis on 3D face representations, which can be extracted from any driving video and adapted to the facial geometry of any source. We improve the plausibility of mouth movements by utilising audio features as a complementary input to the Generator. Quantitative and qualitative experiments demonstrate the merits of our approach.
18. Geometry Enhancements from Visual Content: Going Beyond Ground Truth [PDF] 返回目录
Liran Azaria, Dan Raviv
Abstract: This work presents a new cyclic architecture that extracts high-frequency patterns from images and re-inserts them as geometric features. This procedure allows us to enhance the resolution of low-cost depth sensors, capturing fine details on the one hand and remaining faithful to the scanned ground truth on the other. We present state-of-the-art results for depth super-resolution tasks, as well as visually attractive, enhanced generated 3D models.
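The high-frequency extraction step can be illustrated with a simple high-pass residual, i.e., the image minus a Gaussian-smoothed copy of itself; this is an illustrative stand-in, not necessarily the paper's exact operator:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def high_frequency(img, sigma=2.0):
    """High-pass residual: the image minus its Gaussian-smoothed version.

    What remains is fine texture and edge detail, i.e., the kind of
    high-frequency pattern that would be re-inserted as geometric features.
    """
    return img - gaussian_filter(img, sigma)

img = np.random.rand(64, 64)
detail = high_frequency(img)  # roughly zero-mean, dominated by fine detail
```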
19. Robust Factorization Methods Using a Gaussian/Uniform Mixture Model [PDF] 返回目录
Andrei Zaharescu, Radu Horaud
Abstract: In this paper, we address the problem of building a class of robust factorization algorithms that solve for the shape and motion parameters with both affine (weak perspective) and perspective camera models. We introduce a Gaussian/uniform mixture model and its associated EM algorithm. This allows us to address robust parameter estimation within a data clustering approach. We propose a technique that works with any affine factorization method and makes it robust to outliers. In addition, we show how such a framework can be further embedded into an iterative perspective factorization scheme. We carry out a large number of experiments to validate our algorithms and to compare them with existing ones. We also compare our approach with factorization methods that use M-estimators.
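The Gaussian/uniform mixture makes outlier handling explicit: the E-step assigns each residual a posterior inlier probability, which then weights the least-squares M-step. A minimal one-dimensional sketch (all variable names illustrative):

```python
import numpy as np

def e_step(residuals, mu, sigma, pi_in, support):
    """Posterior inlier probability of each residual.

    Inliers follow N(mu, sigma^2); outliers follow a uniform density
    1/support. The returned weights in [0, 1] can feed a weighted
    least-squares M-step of the factorization.
    """
    gauss = np.exp(-0.5 * ((residuals - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    unif = 1.0 / support
    return pi_in * gauss / (pi_in * gauss + (1.0 - pi_in) * unif)

r = np.array([0.1, -0.2, 5.0])  # the 5.0 residual is an obvious outlier
print(e_step(r, mu=0.0, sigma=0.5, pi_in=0.9, support=20.0))
```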
20. Point-Level Temporal Action Localization: Bridging Fully-supervised Proposals to Weakly-supervised Losses [PDF] 返回目录
Chen Ju, Peisen Zhao, Ya Zhang, Yanfeng Wang, Qi Tian
Abstract: Point-Level temporal action localization (PTAL) aims to localize actions in untrimmed videos with only one timestamp annotation for each action instance. Existing methods adopt the frame-level prediction paradigm to learn from the sparse single-frame labels. However, such a framework inevitably suffers from a large solution space. This paper explores the proposal-based prediction paradigm for point-level annotations, which has the advantage of a more constrained solution space and consistent predictions among neighboring frames. The point-level annotations are first used as keypoint supervision to train a keypoint detector. At the location prediction stage, a simple but effective mapper module, which enables back-propagation of training errors, is then introduced to bridge the fully-supervised framework with weak supervision. To the best of our knowledge, this is the first work to leverage the fully-supervised paradigm for the point-level setting. Experiments on THUMOS14, BEOID, and GTEA verify the effectiveness of our proposed method both quantitatively and qualitatively, and demonstrate that our method outperforms state-of-the-art methods.
21. Canny-VO: Visual Odometry with RGB-D Cameras based on Geometric 3D-2D Edge Alignment [PDF] 返回目录
Yi Zhou, Hongdong Li, Laurent Kneip
Abstract: The present paper reviews the classical problem of free-form curve registration and applies it to an efficient RGBD visual odometry system called Canny-VO, as it efficiently tracks all Canny edge features extracted from the images. Two replacements for the distance transformation commonly used in edge registration are proposed: Approximate Nearest Neighbour Fields and Oriented Nearest Neighbour Fields. 3D-2D edge alignment benefits from these alternative formulations in terms of both efficiency and accuracy. It removes the need for the more computationally demanding paradigms of data-to-model registration, bilinear interpolation, and sub-gradient computation. To ensure robustness of the system in the presence of outliers and sensor noise, the registration is formulated as a maximum a posteriori problem, and the resulting weighted least squares objective is solved by the iteratively re-weighted least squares method. A variety of robust weight functions are investigated and the optimal choice is made based on the statistics of the residual errors. Efficiency is further boosted by an adaptively sampled definition of the nearest neighbour fields. Extensive evaluations on public SLAM benchmark sequences demonstrate state-of-the-art performance and an advantage over classical Euclidean distance fields.
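The iteratively re-weighted least squares loop at the core of the registration is generic; the Huber weight below is just one member of the "variety of robust weight functions" the paper investigates:

```python
import numpy as np

def irls(A, b, iters=20, delta=1.0):
    """Robust linear fit by iteratively re-weighted least squares.

    Each iteration solves a weighted least-squares problem whose Huber
    weights down-weight residuals larger than `delta`.
    """
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(iters):
        r = A @ x - b
        w = np.where(np.abs(r) <= delta, 1.0,
                     delta / np.maximum(np.abs(r), 1e-12))  # Huber weights
        sw = np.sqrt(w)
        x = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]
    return x

A = np.array([[1.0, 1], [1, 2], [1, 3], [1, 4]])
b = np.array([1.1, 1.9, 3.2, 30.0])  # the last observation is corrupted
print(irls(A, b))                    # close to intercept 0, slope 1
```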
22. Cross-Domain Grouping and Alignment for Domain Adaptive Semantic Segmentation [PDF] 返回目录
Minsu Kim, Sunghun Joung, Seungryong Kim, JungIn Park, Ig-Jae Kim, Kwanghoon Sohn
Abstract: Existing techniques that adapt semantic segmentation networks across source and target domains within deep convolutional neural networks (CNNs) deal with all the samples from the two domains in a global or category-aware manner. They do not consider inter-class variation within the target domain itself or within an estimated category, which limits their ability to encode domains with a multi-modal data distribution. To overcome this limitation, we introduce a learnable clustering module and a novel domain adaptation framework called cross-domain grouping and alignment. To cluster the samples across domains with the aim of maximizing domain alignment without forgetting precise segmentation ability on the source domain, we present two loss functions that encourage semantic consistency and orthogonality among the clusters. We also present a loss to address the class imbalance problem, another limitation of previous methods. Our experiments show that our method consistently boosts adaptation performance in semantic segmentation, outperforming the state of the art on various domain adaptation settings.
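The orthogonality objective among clusters is not spelled out in the abstract; one generic formulation, assumed here purely for illustration, penalizes the Gram matrix of L2-normalized cluster centroids for deviating from the identity:

```python
import numpy as np

def orthogonality_loss(centroids, eps=1e-8):
    """Penalize overlap between cluster centroids.

    centroids: (K, D) array, one feature centroid per cluster. After L2
    normalization, the Gram matrix of mutually orthogonal centroids is the
    identity, so its squared deviation from I measures cluster overlap.
    """
    c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + eps)
    gram = c @ c.T
    return ((gram - np.eye(len(c))) ** 2).sum()

print(orthogonality_loss(np.eye(3)))                       # 0.0, orthogonal
print(orthogonality_loss(np.array([[1.0, 0], [1.0, 0]])))  # 2.0, collapsed
```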
23. FMODetect: Robust Detection and Trajectory Estimation of Fast Moving Objects [PDF] 返回目录
Denys Rozumnyi, Jiri Matas, Filip Sroubek, Marc Pollefeys, Martin R. Oswald
Abstract: We propose the first learning-based approach for detection and trajectory estimation of fast moving objects. Such objects are highly blurred and move over large distances within one video frame. Fast moving objects are associated with a deblurring and matting problem, also called deblatting. Instead of solving the complex deblatting problem jointly, we split the problem into matting and deblurring and solve them separately. The proposed method first detects all fast moving objects as a truncated distance function to the trajectory. Subsequently, a matting and fitting network for each detected object estimates the object trajectory and its blurred appearance without background. For sharp appearance estimation, we propose an energy-minimization-based deblurring. State-of-the-art methods are outperformed in terms of trajectory estimation and sharp appearance reconstruction. Compared to other methods, such as deblatting, inference is several orders of magnitude faster and allows applications such as real-time fast moving object detection and retrieval in large video collections.
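The truncated distance function that the detector regresses can be rasterized for a given trajectory as follows; this brute-force sketch uses an assumed parameterization (sampled trajectory points, Euclidean distance, hard clipping):

```python
import numpy as np

def truncated_distance_field(h, w, trajectory, trunc=10.0):
    """Per-pixel Euclidean distance to a sampled trajectory, clipped at `trunc`.

    trajectory: (K, 2) array of (x, y) points sampled along the motion path.
    Pixels farther than `trunc` from the path saturate to `trunc`.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2).astype(float)  # (H*W, 1, 2)
    d = np.linalg.norm(pix - trajectory[None, :, :], axis=-1)          # (H*W, K)
    return np.minimum(d.min(axis=1), trunc).reshape(h, w)

# A linear motion path across a 64x64 frame:
traj = np.stack([np.linspace(5, 55, 50), np.linspace(10, 40, 50)], axis=1)
field = truncated_distance_field(64, 64, traj)  # 0 on the path, trunc far away
```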
24. Unsupervised Domain Adaptation from Synthetic to Real Images for Anchorless Object Detection [PDF] 返回目录
Tobias Scheck, Ana Perez Grassi, Gangolf Hirtz
Abstract: Synthetic images are one of the most promising solutions to avoid the high costs associated with generating annotated datasets to train supervised convolutional neural networks (CNNs). However, to allow networks to generalize knowledge from synthetic to real images, domain adaptation methods are necessary. This paper implements unsupervised domain adaptation (UDA) methods on an anchorless object detector. Given their good performance, anchorless detectors are increasingly attracting attention in the field of object detection. While their results are comparable to the well-established anchor-based methods, anchorless detectors are considerably faster. In our work, we use CenterNet, one of the most recent anchorless architectures, for a domain adaptation problem involving synthetic images. Taking advantage of the architecture of anchorless detectors, we propose to adapt two UDA methods, viz., entropy minimization and maximum squares loss, originally developed for segmentation, to object detection. Our results show that the proposed UDA methods can increase the mAP from 61% to 69% with respect to direct transfer on the considered anchorless detector. The code is available: this https URL.
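Entropy minimization, one of the two UDA objectives adapted here, has a compact generic form: on unlabeled target-domain images, minimize the mean per-pixel prediction entropy so the detector commits to confident outputs. A sketch assuming a dense class-probability output:

```python
import numpy as np

def entropy_loss(probs, eps=1e-8):
    """Mean Shannon entropy of dense class-probability predictions.

    probs: (N, C, H, W) softmax outputs on unlabeled target-domain images.
    Minimizing this term pushes predictions toward confident, low-entropy
    peaks, which is the adaptation signal.
    """
    ent = -(probs * np.log(probs + eps)).sum(axis=1)  # (N, H, W)
    return ent.mean()

p = np.full((1, 3, 4, 4), 1.0 / 3.0)  # maximally uncertain prediction
print(entropy_loss(p))                # ~log(3) = 1.0986, the worst case
```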
25. Seeing Behind Objects for 3D Multi-Object Tracking in RGB-D Sequences [PDF] 返回目录
Norman Müller, Yu-Shiang Wong, Niloy J. Mitra, Angela Dai, Matthias Nießner
Abstract: Multi-object tracking from RGB-D video sequences is a challenging problem due to the combination of changing viewpoints, motion, and occlusions over time. We observe that having the complete geometry of objects aids in their tracking, and thus propose to jointly infer the complete geometry of objects while tracking them as they move rigidly over time. Our key insight is that inferring the complete geometry of the objects significantly helps in tracking. By hallucinating unseen regions of objects, we can obtain additional correspondences for the same instance, thus providing robust tracking even under strong changes of appearance. From a sequence of RGB-D frames, we detect objects in each frame and learn to predict their complete object geometry as well as a dense correspondence mapping into a canonical space. This allows us to derive 6DoF poses for the objects in each frame, along with their correspondence between frames, providing robust object tracking across the RGB-D sequence. Experiments on both synthetic and real-world RGB-D data demonstrate that we achieve state-of-the-art performance on dynamic object tracking. Furthermore, we show that our object completion significantly helps tracking, providing an improvement of $6.5\%$ in mean MOTA.
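Given dense correspondences into a canonical space, a per-frame 6DoF pose is classically recovered with a Procrustes/Kabsch fit; the sketch below shows that geometric core (the paper's pipeline is learned, so treat this only as the standard closed-form step one would build on):

```python
import numpy as np

def kabsch(src, dst):
    """Best-fit rigid transform with R @ src_i + t ~= dst_i.

    src, dst: (N, 3) corresponding points, e.g. canonical-space predictions
    and their observed counterparts in the current frame.
    """
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

src = np.random.rand(100, 3)
t0, th = np.array([1.0, 2.0, 3.0]), 0.3
Rz = np.array([[np.cos(th), -np.sin(th), 0.0],
               [np.sin(th),  np.cos(th), 0.0],
               [0.0,         0.0,        1.0]])
dst = src @ Rz.T + t0
R, t = kabsch(src, dst)  # recovers Rz and t0
```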
26. Representing Ambiguity in Registration Problems with Conditional Invertible Neural Networks [PDF] 返回目录
Darya Trofimova, Tim Adler, Lisa Kausch, Lynton Ardizzone, Klaus Maier-Hein, Ulrich Köthe, Carsten Rother, Lena Maier-Hein
Abstract: Image registration is the basis for many applications in the fields of medical image computing and computer assisted interventions. One example is the registration of 2D X-ray images with preoperative three-dimensional computed tomography (CT) images in intraoperative surgical guidance systems. Due to the high safety requirements in medical applications, estimating registration uncertainty is of crucial importance in such a scenario. However, previously proposed methods, including classical iterative registration methods and deep learning-based methods, have one characteristic in common: they lack the capacity to represent the fact that a registration problem may be inherently ambiguous, meaning that multiple (substantially different) plausible solutions exist. To tackle this limitation, we explore the application of invertible neural networks (INNs) as the core component of a registration methodology. In the proposed framework, INNs enable going beyond point estimates as network output by representing the possible solutions to a registration problem by a probability distribution that encodes different plausible solutions via multiple modes. In a first feasibility study, we test the approach in a 2D/3D registration setting by registering spinal CT volumes to X-ray images. To this end, we simulate the X-ray images taken by a C-Arm with multiple orientations using the principle of digitally reconstructed radiographs (DRRs). Due to the symmetry of the human spine, there are potentially multiple substantially different poses of the C-Arm that can lead to similar projections. The hypothesis of this work is that the proposed approach is able to identify multiple solutions in such ambiguous registration problems.
27. docExtractor: An off-the-shelf historical document element extraction [PDF] 返回目录
Tom Monnier, Mathieu Aubry
Abstract: We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents without requiring any real data annotation. We demonstrate that it provides high-quality performance as an off-the-shelf system across a wide variety of datasets and leads to results on par with the state of the art when fine-tuned. We argue that the performance obtained without fine-tuning on a specific dataset is critical for applications, in particular in digital humanities, and that the line-level page segmentation we address is the most relevant for a general-purpose element extraction engine. We rely on a fast generator of rich synthetic documents and design a fully convolutional network, which we show to generalize better than a detection-based approach. Furthermore, we introduce a new public dataset dubbed IlluHisDoc dedicated to the fine evaluation of illustration segmentation in historical documents.
28. Research on All-content Text Recognition Method for Financial Ticket Image [PDF] 返回目录
Fukang Tian, Haiyu Wu, Bo Xu
Abstract: With the development of the economy, the number of financial tickets is increasing rapidly. The traditional manual invoice reimbursement and financial accounting system brings an ever-growing burden to financial accountants. Therefore, based on the research and analysis of a large number of real financial ticket data, we designed an accurate and efficient all-content text detection and recognition method based on deep learning. This method has higher recognition accuracy and recall rate and can meet the actual requirements of financial accounting work. In addition, we propose a Financial Ticket Character Recognition Framework (FTCRF). According to the characteristics of Chinese character recognition, this framework contains a two-step information extraction method, which can improve the speed of Chinese character recognition. The experimental results show that the average recognition accuracy of this method is 91.75\% for character sequences and 87\% for whole tickets. The availability and effectiveness of this method are verified by a commercial application system, which significantly improves the efficiency of the financial accounting system.
29. Dilated-Scale-Aware Attention ConvNet For Multi-Class Object Counting [PDF] 返回目录
Wei Xu, Dingkang Liang, Yixiao Zheng, Zhanyu Ma
Abstract: Object counting aims to estimate the number of objects in images. The leading counting approaches focus on the single-category counting task and achieve impressive performance. Note that there are multiple categories of objects in real scenes. Multi-class object counting expands the scope of application of the object counting task. The multi-target detection task can achieve multi-class object counting in some scenarios. However, it requires datasets annotated with bounding boxes. Compared with the point annotations used in mainstream object counting, box-level annotations are more difficult to obtain. In this paper, we propose a simple yet efficient counting network based on point-level annotations. Specifically, we first change the traditional output channel from one to the number of categories to achieve multi-class counting. Since all categories of objects use the same feature extractor in our proposed framework, their features would interfere mutually in the shared feature space. We further design a multi-mask structure to suppress harmful interaction among objects. Extensive experiments on the challenging benchmarks illustrate that the proposed method achieves state-of-the-art counting performance.
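With the head widened from one channel to one density map per category, per-class counts follow by integrating each map; a minimal sketch assuming a density-map counting formulation:

```python
import numpy as np

def per_class_counts(density_maps):
    """Integrate predicted density maps into per-class object counts.

    density_maps: (C, H, W) output of a counting head with one channel per
    category instead of the usual single channel.
    """
    return density_maps.sum(axis=(1, 2))

pred = np.zeros((3, 32, 32))
pred[0, 4:6, 4:6] = 0.25       # four cells summing to 1: one object
print(per_class_counts(pred))  # approximately [1., 0., 0.]
```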
30. NeuralQAAD: An Efficient Differentiable Framework for High Resolution Point Cloud Compression [PDF] 返回目录
Nicolas Wagner, Ulrich Schwanecke
Abstract: In this paper, we propose NeuralQAAD, a differentiable point cloud compression framework that is fast, robust to sampling, and applicable to high resolutions. Previous work that is able to handle complex and non-smooth topologies is hardly scalable to more than just a few thousand points. We tackle the task with a novel neural network architecture characterized by weight sharing and autodecoding. Our architecture uses parameters much more efficiently than previous work, allowing it to be deeper and more scalable. Furthermore, we show that the currently only tractable training criterion for point cloud compression, the Chamfer distance, performs poorly at high resolutions. To overcome this issue, we pair our architecture with a new training procedure based upon a quadratic assignment problem (QAP), for which we state two approximation algorithms. We solve the QAP in parallel to gradient descent. This procedure acts as a surrogate loss and allows us to implicitly minimize the more expressive Earth Mover's Distance (EMD), even for point clouds with well over $10^6$ points. As evaluating the EMD on high-resolution point clouds is intractable, we propose a divide-and-conquer approach based on k-d trees, the EM-kD, as a scalable and fast but still reliable upper bound for the EMD. NeuralQAAD is demonstrated on COMA, D-FAUST, and Skulls to significantly outperform the current state of the art both visually and in terms of the EM-kD. Skulls is a novel dataset of skull CT scans which we will make publicly available together with our implementation of NeuralQAAD.
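For reference, the Chamfer distance that the paper argues breaks down at high resolution is the symmetric mean nearest-neighbour distance between two point sets; a brute-force sketch:

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    For every point, take the squared distance to its nearest neighbour in
    the other set and average both directions. The (N, M) distance matrix
    makes this O(N*M), which is what becomes costly at high resolution.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (N, M)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

a = np.random.rand(128, 3)
print(chamfer(a, a))  # 0.0 for identical clouds
```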
31. Deep Layout of Custom-size Furniture through Multiple-domain Learning [PDF] 返回目录
Xinhan Di, Pengqian Yu, Danfeng Yang, Hong Zhu, Changyu Sun, YinDong Liu
Abstract: In this paper, we propose a multiple-domain model for producing a custom-size furniture layout in an interior scene. This model aims to support professional interior designers in producing interior decoration solutions with custom-size furniture more quickly. The proposed model combines a deep layout module, a domain attention module, a dimensional domain transfer module, and a custom-size module in end-to-end training. Compared with prior work on scene synthesis, our proposed model enhances the ability to auto-layout custom-size furniture in interior rooms. We conduct our experiments on a real-world interior layout dataset that contains $710,700$ designs from professional designers. Our numerical results demonstrate that the proposed model yields higher-quality layouts of custom-size furniture in comparison with the state-of-the-art model.
32. Class-incremental Learning with Rectified Feature-Graph Preservation [PDF] 返回目录
Cheng-Hsun Lei, Yi-Hsin, Wen-Hsiao Peng, Wei-Chen Chiu
Abstract: In this paper, we address the problem of distillation-based class-incremental learning with a single head. A central theme of this task is to learn new classes that arrive in sequential phases over time while keeping the model's capability of recognizing seen classes, with only limited memory for preserving seen data samples. Many regularization strategies have been proposed to mitigate the phenomenon of catastrophic forgetting. To better understand the essence of these regularizations, we introduce a feature-graph preservation perspective. Insights into their merits and faults motivate our weighted-Euclidean regularization for old knowledge preservation. We further propose rectified cosine normalization and show how it can work with binary cross-entropy to increase class separation for effective learning of new classes. Experimental results on both CIFAR-100 and ImageNet datasets demonstrate that our method outperforms the state-of-the-art approaches in reducing classification error, easing catastrophic forgetting, and encouraging evenly balanced accuracy over different classes. Our project page is at: this https URL.
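The abstract does not define the "rectified" part, so the sketch below shows only plain cosine normalization paired with binary cross-entropy, as an assumed baseline for the idea of magnitude-free class scores:

```python
import numpy as np

def cosine_logits(features, weights, scale=16.0, eps=1e-8):
    """Class scores as scaled cosine similarity (cosine normalization).

    features: (N, D) embeddings; weights: (C, D), one prototype per class.
    Normalizing both sides removes the magnitude bias that otherwise favors
    classes from the most recent training phase.
    """
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    w = weights / (np.linalg.norm(weights, axis=1, keepdims=True) + eps)
    return scale * f @ w.T

def bce(logits, targets, eps=1e-8):
    """Binary cross-entropy over per-class sigmoid outputs."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps)).mean()

x, w = np.random.randn(4, 32), np.random.randn(10, 32)
y = np.eye(10)[[0, 3, 3, 7]]  # one-hot targets for 4 samples, 10 classes
print(bce(cosine_logits(x, w), y))
```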
33. KOALAnet: Blind Super-Resolution using Kernel-Oriented Adaptive Local Adjustment [PDF] 返回目录
Soo Ye Kim, Hyeonjun Sim, Munchurl Kim
Abstract: Blind super-resolution (SR) methods aim to generate a high-quality high-resolution image from a low-resolution image containing unknown degradations. However, natural images contain various types and amounts of blur: some may be due to the inherent degradation characteristics of the camera, but some may even be intentional, for aesthetic purposes (e.g., the Bokeh effect). In the case of the latter, it becomes highly difficult for SR methods to disentangle the blur that should be removed from the blur that should be left as is. In this paper, we propose a novel blind SR framework based on kernel-oriented adaptive local adjustment (KOALA) of SR features, called KOALAnet, which jointly learns spatially-variant degradation and restoration kernels in order to adapt to the spatially-variant blur characteristics of real images. Our KOALAnet outperforms recent blind SR methods on synthesized LR images obtained with randomized degradations, and we further show that the proposed KOALAnet produces the most natural results, without over-sharpening, for artistic photographs with intentional blur, by effectively handling images that mix in-focus and out-of-focus areas.
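Spatially-variant kernels mean every pixel gets its own filter. Applying a predicted kernel field can be sketched with a brute-force loop (illustrative only; practical implementations vectorize this with unfold-style operations):

```python
import numpy as np

def apply_local_kernels(img, kernels):
    """Filter every pixel with its own kernel.

    img: (H, W) image; kernels: (H, W, k, k) per-pixel kernels, e.g. the
    output of a degradation- or restoration-kernel prediction branch.
    """
    h, w = img.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + k, x:x + k]  # neighbourhood of (y, x)
            out[y, x] = (patch * kernels[y, x]).sum()
    return out

img = np.random.rand(16, 16)
box = np.full((16, 16, 3, 3), 1.0 / 9.0)  # every pixel: a 3x3 box blur
blurred = apply_local_kernels(img, box)
```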
34. Fast 3D Image Moments [PDF] 返回目录
William Diggin, Michael Diggin
Abstract: An algorithm to efficiently compute the moments of volumetric images is disclosed. The approach reduces processing time by significantly lowering the computational complexity. Specifically, the algorithm reduces multiplicative complexity from O(n^3) to O(n). Several 2D projection images of the 3D volume are generated. The algorithm computes a set of 2D moments from those 2D images. Those 2D moments are then used to derive the 3D volumetric moments. Examples of use in MRI or CT and related analysis demonstrate the benefit of the Discrete Projection Moment Algorithm. The approach is also useful for computing the moments of a 3D object from a small set of 2D tomographic images of that object.
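The projection trick can be checked directly for axis-aligned projections: any 3D moment with at least one zero exponent equals a 2D moment of the corresponding projection, because the sum over the projected axis factors out. A worked verification:

```python
import numpy as np

def moment3d(vol, p, q, r):
    """Brute-force 3D moment M_pqr = sum_x sum_y sum_z x^p y^q z^r f(x,y,z)."""
    x, y, z = np.indices(vol.shape)
    return (x**p * y**q * z**r * vol).sum()

def moment2d(img, p, q):
    """2D moment m_pq = sum_x sum_y x^p y^q g(x,y)."""
    x, y = np.indices(img.shape)
    return (x**p * y**q * img).sum()

vol = np.random.rand(8, 8, 8)
proj_xy = vol.sum(axis=2)  # project the volume along z
# M_{p,q,0} of the volume equals the (p, q) moment of the z-projection:
assert np.isclose(moment3d(vol, 2, 1, 0), moment2d(proj_xy, 2, 1))
```

Moments with all three exponents nonzero are not recoverable from the three axis-aligned projections alone, which is presumably why the method generates several projection images.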
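A useful special case behind the projection trick: any 3D raw moment with a zero exponent on one axis equals a 2D moment of the projection along that axis, because summing the volume over that axis first does not change the weighted sum. A small NumPy sketch of this identity (variable names and sizes are illustrative; recovering moments with all exponents nonzero requires the additional projections described in the paper):

```python
import numpy as np

def moment_2d(img, p, q):
    """Raw 2D moment M_pq = sum_{x,y} x^p * y^q * img[x, y]."""
    x = np.arange(img.shape[0]).reshape(-1, 1)
    y = np.arange(img.shape[1]).reshape(1, -1)
    return float((x ** p * y ** q * img).sum())

V = np.random.rand(32, 32, 32)   # toy volume V(x, y, z)
P_z = V.sum(axis=2)              # 2D projection along z

# Direct computation of M_210 over the full volume ...
x = np.arange(32).reshape(-1, 1, 1)
y = np.arange(32).reshape(1, -1, 1)
M_210_direct = float((x ** 2 * y * V).sum())
# ... equals a 2D moment of the z-projection, computed far more cheaply.
assert np.isclose(M_210_direct, moment_2d(P_z, 2, 1))
```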
35. Towards Improving Spatiotemporal Action Recognition in Videos [PDF] 返回目录
Shentong Mo, Xiaoqing Tan, Jingfei Xia, Pinxu Ren
Abstract: Spatiotemporal action recognition deals with locating and classifying actions in videos. Motivated by the latest state-of-the-art real-time object detector You Only Watch Once (YOWO), we aim to modify its structure to increase action detection precision and reduce computational time. Specifically, we propose four novel approaches in attempts to improve YOWO and address the imbalanced class issue in videos by modifying the loss function. We consider two moderate-sized datasets to apply our modification of YOWO - the popular Joint-annotated Human Motion Data Base (J-HMDB-21) and a private dataset of restaurant video footage provided by a Carnegie Mellon University-based startup, this http URL. The latter involves fast-moving actions with small objects as well as unbalanced data classes, making the task of action localization more challenging. We implement our proposed methods in the GitHub repository this https URL.
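The abstract does not spell out the exact loss modification, but a common way to counter class imbalance by changing the loss is the focal loss of Lin et al. (2017), sketched here in PyTorch as an illustration of the kind of change meant, not as the paper's formulation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Down-weight easy, well-classified examples so that rare classes
    contribute relatively more to the gradient."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)               # probability of the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```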
36. FAWA: Fast Adversarial Watermark Attack on Optical Character Recognition (OCR) Systems [PDF] 返回目录
Lu Chen, Jiao Sun, Wei Xu
Abstract: Deep neural networks (DNNs) significantly improved the accuracy of optical character recognition (OCR) and inspired many important applications. Unfortunately, OCRs also inherit the vulnerability of DNNs to adversarial examples. Different from colorful vanilla images, text images usually have clear backgrounds. Adversarial examples generated by most existing adversarial attacks are unnatural and pollute the background severely. To address this issue, we propose the Fast Adversarial Watermark Attack (FAWA) against sequence-based OCR models in a white-box manner. By disguising the perturbations as watermarks, we can make the resulting adversarial images appear natural to human eyes and achieve a perfect attack success rate. FAWA works with either gradient-based or optimization-based perturbation generation. In both letter-level and word-level attacks, our experiments show that in addition to natural appearance, FAWA achieves a 100% attack success rate with 60% smaller perturbations and 78% fewer iterations on average. In addition, we further extend FAWA to support full-color watermarks, other languages, and even the OCR accuracy-enhancing mechanism.
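A simplified sketch of the core idea: take a gradient step on the model's loss but confine the perturbation to a watermark-shaped region, keeping the clean background untouched. The mask, step size, and the plain classification loss are assumptions for illustration; the actual attack targets sequence-based OCR models and iterates and refines this procedure considerably.

```python
import torch
import torch.nn.functional as F

def watermark_attack_step(model, image, target, mask, eps=2.0 / 255):
    """One white-box gradient step restricted to the watermark region.

    mask: binary tensor, 1 inside the watermark shape, 0 elsewhere.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), target)
    loss.backward()
    # Perturb only inside the watermark so the background stays clean.
    adv = image + eps * image.grad.sign() * mask
    return adv.clamp(0.0, 1.0).detach()
```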
37. Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [PDF] 返回目录
Feixiang Lu, Zongdai Liu, Hui Miao, Peng Wang, Liangjun Zhang, Ruigang Yang, Dinesh Manocha, Bin Zhou
Abstract: Holistically understanding an object and its 3D movable parts through visual perception models is essential for enabling an autonomous agent to interact with the world. For autonomous driving, the dynamics and states of vehicle parts such as doors, the trunk, and the bonnet can provide meaningful semantic information and interaction states, which are essential to ensure the safety of the self-driving vehicle. Existing visual perception models mainly focus on coarse parsing such as object bounding box detection or pose estimation and rarely tackle these situations. In this paper, we address this important problem for autonomous driving by solving two critical issues using visual data augmentation. First, to deal with data scarcity, we propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images and then reconstructing human-vehicle interaction scenarios. This allows us to directly edit the real images using the aligned 3D parts, yielding effective training data generation for learning robust deep neural networks (DNNs). Second, to benchmark the quality of 3D part understanding, we collect a large dataset in real-world driving scenarios with vehicles in uncommon states (VUS), i.e., with the door or trunk opened, etc. Experiments demonstrate that our trained network with visual data augmentation largely outperforms other baselines in terms of 2D detection and instance segmentation accuracy. Our network yields large improvements in discovering and understanding these uncommon cases. Moreover, we plan to release all of the source code, the dataset, and the trained model on GitHub.
38. Image Inpainting Guided by Coherence Priors of Semantics and Textures [PDF] 返回目录
Liang Liao, Jing Xiao, Zheng Wang, Chia-Wen Lin, Shin'ichi Satoh
Abstract: Existing inpainting methods have achieved promising performance in recovering defected images of specific scenes. However, filling holes involving multiple semantic categories remains challenging due to the obscure semantic boundaries and the mixture of different semantic textures. In this paper, we introduce coherence priors between the semantics and textures, which make it possible to concentrate on completing separate textures in a semantic-wise manner. Specifically, we adopt a multi-scale joint optimization framework to first model the coherence priors and then optimize image inpainting and semantic segmentation accordingly in an interleaved, coarse-to-fine manner. A Semantic-Wise Attention Propagation (SWAP) module is devised to refine completed image textures across scales by exploring non-local semantic coherence, which effectively mitigates mix-up of textures. We also propose two coherence losses to constrain the consistency between the semantics and the inpainted image in terms of the overall structure and detailed textures. Experimental results demonstrate the superiority of our proposed method for challenging cases with complex holes.
39. Teach me to segment with mixed supervision: Confident students become masters [PDF] 返回目录
Jose Dolz, Christian Desrosiers, Ismail Ben Ayed
Abstract: Deep segmentation neural networks require large training datasets with pixel-wise segmentations, which are expensive to obtain in practice. Mixed supervision could mitigate this difficulty, with a small fraction of the data containing complete pixel-wise annotations, while the rest being less supervised, e.g., only a handful of pixels are labeled. In this work, we propose a dual-branch architecture, where the upper branch (teacher) receives strong annotations, while the bottom one (student) is driven by limited supervision and guided by the upper branch. In conjunction with a standard cross-entropy over the labeled pixels, our novel formulation integrates two important terms: (i) a Shannon entropy loss defined over the less-supervised images, which encourages confident student predictions at the bottom branch; and (ii) a Kullback-Leibler (KL) divergence, which transfers the knowledge from the predictions generated by the strongly supervised branch to the less-supervised branch, and guides the entropy (student-confidence) term to avoid trivial solutions. Very interestingly, we show that the synergy between the entropy and KL divergence yields substantial improvements in performances. Furthermore, we discuss an interesting link between Shannon-entropy minimization and standard pseudo-mask generation and argue that the former should be preferred over the latter for leveraging information from unlabeled pixels. Through a series of quantitative and qualitative experiments, we show the effectiveness of the proposed formulation in segmenting the left-ventricle endocardium in MRI images. We demonstrate that our method significantly outperforms other strategies to tackle semantic segmentation within a mixed-supervision framework. More interestingly, and in line with recent observations in classification, we show that the branch trained with reduced supervision largely outperforms the teacher.
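Putting the three terms together, the objective described above is a weighted sum of (i) cross-entropy on strongly labeled pixels, (ii) a Shannon-entropy penalty on the student's predictions, and (iii) a teacher-to-student KL term. A hedged PyTorch sketch follows; the weights and tensor shapes are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def mixed_supervision_loss(student_logits, teacher_logits, labels,
                           labeled_mask, w_ent=0.1, w_kl=1.0):
    """student_logits, teacher_logits: (B, C, H, W); labels: (B, H, W);
    labeled_mask: (B, H, W) bool, True where strong labels exist."""
    # (i) standard cross-entropy, restricted to strongly labeled pixels
    ce_map = F.cross_entropy(student_logits, labels, reduction="none")
    ce = ce_map[labeled_mask].mean()
    # (ii) Shannon entropy of student predictions (confidence term)
    p = F.softmax(student_logits, dim=1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=1).mean()
    # (iii) KL divergence distilling teacher predictions into the student
    kl = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction="batchmean")
    return ce + w_ent * entropy + w_kl * kl
```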
40. Semantic-Guided Representation Enhancement for Self-supervised Monocular Trained Depth Estimation [PDF] 返回目录
Rui Li, Qing Mao, Pei Wang, Xiantuo He, Yu Zhu, Jinqiu Sun, Yanning Zhang
Abstract: Self-supervised depth estimation has shown its great effectiveness in producing high quality depth maps given only image sequences as input. However, its performance usually drops when estimating on border areas or objects with thin structures due to the limited depth representation ability. In this paper, we address this problem by proposing a semantic-guided depth representation enhancement method, which promotes both local and global depth feature representations by leveraging rich contextual information. Instead of the single depth network used in conventional paradigms, we propose an extra semantic segmentation branch to offer extra contextual features for depth estimation. Based on this framework, we enhance the local feature representation by sampling the point-based features that lie on the semantic edges and feeding them to an individual Semantic-guided Edge Enhancement module (SEEM), which is specifically designed to promote depth estimation on challenging semantic borders. Then, we improve the global feature representation by proposing a semantic-guided multi-level attention mechanism, which enhances the semantic and depth features by exploring pixel-wise correlations in the multi-level depth decoding scheme. Extensive experiments validate the distinct superiority of our method in capturing highly accurate depth on challenging image areas such as semantic category borders and thin objects. Both quantitative and qualitative experiments on KITTI show that our method outperforms the state-of-the-art methods.
41. NUTA: Non-uniform Temporal Aggregation for Action Recognition [PDF] 返回目录
Xinyu Li, Chunhui Liu, Bing Shuai, Yi Zhu, Hao Chen, Joseph Tighe
Abstract: In the world of action recognition research, one primary focus has been on how to construct and train networks to model the spatial-temporal volume of an input video. These methods typically uniformly sample a segment of an input clip (along the temporal dimension). However, not all parts of a video are equally important to determine the action in the clip. In this work, we focus instead on learning where to extract features, so as to focus on the most informative parts of the video. We propose a method called the non-uniform temporal aggregation (NUTA), which aggregates features only from informative temporal segments. We also introduce a synchronization method that allows our NUTA features to be temporally aligned with traditional uniformly sampled video features, so that both local and clip-level features can be combined. Our model has achieved state-of-the-art performance on four widely used large-scale action-recognition datasets (Kinetics400, Kinetics700, Something-something V2 and Charades). In addition, we have created a visualization to illustrate how the proposed NUTA method selects only the most relevant parts of a video clip.
42. Classification of Smoking and Calling using Deep Learning [PDF] 返回目录
Miaowei Wang, Alexander William Mohacey, Hongyu Wang, James Apfel
Abstract: Since 2014, very deep convolutional neural networks have been proposed and have become a must-have weapon for champions in all kinds of competitions. In this report, a pipeline is introduced to classify smoking and calling by modifying a pretrained Inception V3. Brightness enhancement based on deep learning is implemented to improve this classification task, along with other useful training tricks. Based on the qualitative and quantitative results, it can be concluded that this pipeline is practical and achieves high accuracy even with small, biased samples.
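As a concrete illustration of the transfer-learning setup described above, one can load an ImageNet-pretrained Inception V3 from torchvision, freeze the backbone, and retrain only a new classification head. The three-way label set (smoking / calling / neither) and the freezing policy are assumptions for illustration:

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained Inception V3 and freeze its backbone.
model = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False

# New heads for an assumed 3-way task: smoking / calling / neither.
model.fc = nn.Linear(model.fc.in_features, 3)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, 3)
```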
43. DeepGamble: Towards unlocking real-time player intelligence using multi-layer instance segmentation and attribute detection [PDF] 返回目录
Danish Syed, Naman Gandhi, Arushi Arora, Nilesh Kadam
Abstract: Annually the gaming industry spends approximately $15 billion in marketing reinvestment. However, this amount is spent without any consideration for the skill and luck of the player. For a casino, an unskilled player could fetch ~4 times more revenue than a skilled player. This paper describes a video recognition system that is based on an extension of the Mask R-CNN model. Our system digitizes the game of blackjack by detecting cards and player bets in real time and processing the decisions players take, in order to create accurate player personas. Our proposed supervised learning approach consists of a specialized three-stage pipeline that takes images from two viewpoints of the casino table and does instance segmentation to generate masks on proposed regions of interest. These predicted masks along with derivative features are used to classify image attributes that are passed onto the next stage to assimilate the gameplay understanding. Our end-to-end model yields an accuracy of ~95% for the main bet detection and ~97% for card detection in a controlled environment, trained using a transfer learning approach with 900 training examples. Our approach is generalizable and scalable and shows promising results in varied gaming scenarios and test data. Such granular data helps in understanding a player's deviation from the optimal strategy and thereby separates the player's skill from the luck of the game. Our system also assesses the likelihood of card counting by correlating the player's betting pattern to the deck's scaled count. Such a system lets casinos flag fraudulent activity, calculate expected personalized profitability for each player, and tailor their marketing reinvestment decisions.
44. FaceDet3D: Facial Expressions with 3D Geometric Detail Prediction [PDF] 返回目录
ShahRukh Athar, Albert Pumarola, Francesc Moreno-Noguer, Dimitris Samaras
Abstract: Facial Expressions induce a variety of high-level details on the 3D face geometry. For example, a smile causes the wrinkling of cheeks or the formation of dimples, while being angry often causes wrinkling of the forehead. Morphable Models (3DMMs) of the human face fail to capture such fine details in their PCA-based representations and consequently cannot generate such details when used to edit expressions. In this work, we introduce FaceDet3D, a first-of-its-kind method that generates, from a single image, geometric facial details that are consistent with any desired target expression. The facial details are represented as a vertex displacement map and then used by a Neural Renderer to photo-realistically render novel images of any single input image in any desired expression and view. The Project website is: this http URL
45. FasteNet: A Fast Railway Fastener Detector [PDF] 返回目录
Jun Jet Tai, Mauro S. Innocente, Owais Mehmood
Abstract: In this work, a novel high-speed railway fastener detector is introduced. This fully convolutional network, dubbed FasteNet, foregoes the notion of bounding boxes and performs detection directly on a predicted saliency map. FasteNet uses transposed convolutions and skip connections; the effective receptive field of the network is 1.5$\times$ larger than the average size of a fastener, enabling the network to make predictions with high confidence without sacrificing output resolution. In addition, due to the saliency map approach, the network is able to vote for the presence of a fastener up to 30 times per fastener, boosting prediction accuracy. FasteNet is capable of running at 110 FPS on an Nvidia GTX 1080 while taking inputs of 1600$\times$512 with an average of 14 fasteners per image. Our source is open here: this https URL\_FasteNet.git
46. Automatic Vertebra Localization and Identification in CT by Spine Rectification and Anatomically-constrained Optimization [PDF] 返回目录
Fakai Wang, Kang Zheng, Le Lu, Jing Xiao, Min Wu, Shun Miao
Abstract: Accurate vertebra localization and identification are required in many clinical applications of spine disorder diagnosis and surgery planning. However, significant challenges are posed in this task by highly varying pathologies (such as vertebral compression fracture, scoliosis, and vertebral fixation) and imaging conditions (such as limited field of view and metal streak artifacts). This paper proposes a robust and accurate method that effectively exploits the anatomical knowledge of the spine to facilitate vertebra localization and identification. A key point localization model is trained to produce activation maps of vertebra centers. They are then re-sampled along the spine centerline to produce spine-rectified activation maps, which are further aggregated into 1-D activation signals. Following this, an anatomically-constrained optimization module is introduced to jointly search for the optimal vertebra centers under a soft constraint that regulates the distance between vertebrae and a hard constraint on the consecutive vertebra indices. When evaluated on a major public benchmark of 302 highly pathological CT images, the proposed method achieves a state-of-the-art identification (id.) rate of 97.4%, outperforming the best competing method (94.7% id. rate) by halving the relative id. error rate.
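To make the constrained search concrete, here is a toy 1-D objective in the same spirit: reward strong activation responses at the chosen centers and softly penalize consecutive spacings that deviate from a prior, while the hard ordering constraint is enforced by only searching strictly increasing index sequences. The names, weights, and spacing prior are illustrative assumptions, not the paper's formulation.

```python
def vertebra_energy(centers, activations, spacing_prior=20.0, lam=0.5):
    """Energy of a candidate list of (sorted) vertebra centers along the
    rectified spine axis. Lower is better."""
    # Data term: prefer centers with strong activation responses.
    data = -sum(activations[c] for c in centers)
    # Soft constraint: consecutive vertebrae roughly spacing_prior apart.
    spacing = sum((b - a - spacing_prior) ** 2
                  for a, b in zip(centers, centers[1:]))
    return data + lam * spacing

# Hard constraint: only strictly increasing index sequences are searched,
# so vertebra ordering can never be violated by the optimizer.
```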
47. Robust One Shot Audio to Video Generation [PDF] 返回目录
Neeraj Kumar, Srishti Goel, Ankur Narang, Mujtaba Hasan
Abstract: Audio to Video generation is an interesting problem that has numerous applications across industry verticals including film making, multimedia, marketing, education and others. High-quality video generation with expressive facial movements is a challenging problem that involves complex learning steps for generative adversarial networks. Further, enabling one-shot learning for an unseen single image increases the complexity of the problem while simultaneously making it more applicable to practical scenarios. In this paper, we propose a novel approach, OneShotA2V, to synthesize a talking-person video of arbitrary length using as input an audio signal and a single unseen image of a person. OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person. Further, it feeds the features generated from the audio input directly into a generative adversarial network, and it adapts to any given unseen selfie by applying few-shot learning with only a few output-update epochs. OneShotA2V leverages a spatially adaptive normalization based multi-level generator and a multiple multi-level discriminators based architecture. The input audio clip is not restricted to any specific language, which gives the method multilingual applicability. Experimental evaluation demonstrates superior performance of OneShotA2V as compared to Realistic Speech-Driven Facial Animation with GANs (RSDGAN) [43], Speech2Vid [8], and other approaches, on multiple quantitative metrics including: SSIM (structural similarity index), PSNR (peak signal to noise ratio) and CPBD (image sharpness). Further, qualitative evaluation and Online Turing tests demonstrate the efficacy of our approach.
48. Exploring Vicinal Risk Minimization for Lightweight Out-of-Distribution Detection [PDF] 返回目录
Deepak Ravikumar, Sangamesh Kodge, Isha Garg, Kaushik Roy
Abstract: Deep neural networks have found widespread adoption in solving complex tasks ranging from image recognition to natural language processing. However, these networks make confident mispredictions when presented with data that does not belong to the training distribution, i.e. out-of-distribution (OoD) samples. In this paper we explore whether the property of Vicinal Risk Minimization (VRM) to smoothly interpolate between different class boundaries helps to train better OoD detectors. We apply VRM to existing OoD detection techniques and show their improved performance. We observe that existing OoD detectors have significant memory and compute overhead, hence we leverage VRM to develop an OoD detector with minimal overhead. Our detection method introduces an auxiliary class for classifying OoD samples. We utilize mixup in two ways to implement Vicinal Risk Minimization. First, we perform mixup within the same class, and second, we perform mixup with Gaussian noise when training the auxiliary class. Our method achieves near-competitive performance with significantly less compute and memory overhead when compared to existing OoD detection techniques. This facilitates the deployment of OoD detection on edge devices and expands our understanding of Vicinal Risk Minimization for use in training OoD detectors.
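The two mixup variants mentioned above are straightforward to sketch in PyTorch; the shapes, the Beta parameter, and the assumption that a batch is drawn from a single class are illustrative:

```python
import torch

def mixup(x1, x2, alpha=0.4):
    """Vicinal risk minimization via mixup: train on convex combinations
    of inputs (labels would be mixed with the same coefficient)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x1 + (1 - lam) * x2

x = torch.rand(8, 3, 32, 32)                     # assume a same-class batch
within_class = mixup(x, x[torch.randperm(8)])    # variant (1): within-class mixup
with_noise = mixup(x, torch.randn_like(x))       # variant (2): auxiliary OoD class
```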
49. Masksembles for Uncertainty Estimation [PDF] 返回目录
Nikita Durasov, Timur Bagautdinov, Pierre Baque, Pascal Fua
Abstract: Deep neural networks have amply demonstrated their prowess but estimating the reliability of their predictions remains challenging. Deep Ensembles are widely considered as being one of the best methods for generating uncertainty estimates but are very expensive to train and evaluate. MC-Dropout is another popular alternative, which is less expensive, but also less reliable. Our central intuition is that there is a continuous spectrum of ensemble-like models of which MC-Dropout and Deep Ensembles are extreme examples. The first uses an effectively infinite number of highly correlated models while the second relies on a finite number of independent models. To combine the benefits of both, we introduce Masksembles. Instead of randomly dropping parts of the network as in MC-dropout, Masksembles relies on a fixed number of binary masks, which are parameterized in a way that allows changing correlations between individual models. Namely, by controlling the overlap between the masks and their density, one can choose the optimal configuration for the task at hand. This leads to a simple and easy-to-implement method with performance on par with Ensembles at a fraction of the cost. We experimentally validate Masksembles on two widely used datasets, CIFAR10 and ImageNet.
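A minimal sketch of the mechanism: a fixed set of binary masks over one layer's activations, one mask per ensemble member, with each forward pass selecting a mask. Plain Bernoulli sampling is used here for simplicity, whereas the paper parameterizes mask overlap and density explicitly:

```python
import torch
import torch.nn as nn

class Masksembles(nn.Module):
    """Fixed binary masks shared across training and inference."""
    def __init__(self, features, n_masks=4, density=0.5):
        super().__init__()
        masks = (torch.rand(n_masks, features) < density).float()
        self.register_buffer("masks", masks)

    def forward(self, x, member):
        # Each ensemble member sees the same input through its own mask.
        return x * self.masks[member]

layer = Masksembles(features=128)
x = torch.rand(2, 128)
preds = torch.stack([layer(x, m) for m in range(4)])  # 4 partially correlated members
```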
50. Do not repeat these mistakes -- a critical appraisal of applications of explainable artificial intelligence for image based COVID-19 detection [PDF] 返回目录
Weronika Hryniewska, Przemysław Bombiński, Patryk Szatkowski, Paulina Tomaszewska, Przemysław Biecek
Abstract: The sudden outbreak and uncontrolled spread of the COVID-19 disease is one of the most important global problems today. In a short period of time, it has led to the development of many deep neural network models for COVID-19 detection, with modules for explainability. In this work, we carry out a systematic analysis of various aspects of the proposed models. Our analysis revealed numerous mistakes made at different stages of data acquisition, model development, and explanation construction. We overview the approaches proposed in the surveyed ML articles and indicate typical errors emerging from a lack of deep understanding of the radiography domain. We present the perspectives of both radiologists (the experts in the field) and deep learning engineers dealing with model explanations. The final result is a proposed checklist with the minimum conditions to be met by a reliable COVID-19 diagnostic model.
51. Multiple Sclerosis Lesion Segmentation -- A Survey of Supervised CNN-Based Methods [PDF] 返回目录
Huahong Zhang, Ipek Oguz
Abstract: Lesion segmentation is a core task for the quantitative analysis of MRI scans of Multiple Sclerosis patients. The recent success of deep learning techniques in a variety of medical image analysis applications has renewed community interest in this challenging problem and led to a burst of activity in new algorithm development. In this survey, we investigate supervised CNN-based methods for MS lesion segmentation. We decompose the reviewed works into their algorithmic components and discuss each separately. For methods that provide evaluations on public benchmark datasets, we report comparisons between their results.
52. Enhance Multimodal Transformer With External Label And In-Domain Pretrain: Hateful Meme Challenge Winning Solution [PDF] 返回目录
Ron Zhu
Abstract: Hateful meme detection is a recently introduced research area that requires both visual and linguistic understanding of the meme, as well as some background knowledge, to perform well on the task. This technical report summarises the first-place solution of the Hateful Meme Detection Challenge 2020, which extends state-of-the-art visual-linguistic transformers to tackle this problem. At the end of the report, we also point out the shortcomings of the current methodology and possible directions for improving it.
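A tiny sketch of the "external label" idea from the title: externally detected image tags can be appended to the meme text before it enters the multimodal transformer. The tag names and the [SEP] separator are assumptions for illustration, not the solution's exact format.

```python
def build_text_input(caption: str, external_labels: list) -> str:
    """Concatenate the meme caption with external image-derived tags."""
    return caption + " [SEP] " + " ".join(external_labels)

if __name__ == "__main__":
    print(build_text_input("some meme text", ["person", "protest_sign"]))
```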
53. CosSGD: Nonlinear Quantization for Communication-efficient Federated Learning [PDF] 返回目录
Yang He, Maximilian Zenk, Mario Fritz
Abstract: Federated learning facilitates learning across clients without transferring local data on these clients to a central server. Despite the success of the federated learning method, there remains room for improvement in communicating the most critical information to update a model under limited communication conditions, which would extend this learning scheme to a wider range of application scenarios. In this work, we propose a nonlinear quantization for compressed stochastic gradient descent, which can be easily utilized in federated learning. Based on the proposed quantization, our system significantly reduces the communication cost, by up to three orders of magnitude, while largely maintaining the convergence and accuracy of the training process. Extensive experiments are conducted on image classification and brain tumor semantic segmentation using the MNIST, CIFAR-10, and BraTS datasets, where we show state-of-the-art effectiveness and impressive communication efficiency.
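The abstract does not spell out the quantizer, so the sketch below shows only the generic shape of a nonlinear gradient quantizer, using a logarithmic mapping as a stand-in for the paper's scheme; the function names and the 4-bit default are illustrative.

```python
import numpy as np

def quantize(grad: np.ndarray, bits: int = 4, eps: float = 1e-12):
    """Nonlinear (here logarithmic) gradient quantization sketch.

    Magnitudes are encoded in log-space so small gradients keep relative
    precision; only signs, integer codes, and a per-tensor range are sent,
    which is where the communication savings come from.
    """
    sign = np.sign(grad).astype(np.int8)
    mag = np.abs(grad) + eps
    lo, hi = np.log(mag.min()), np.log(mag.max())
    levels = (1 << bits) - 1
    codes = np.round((np.log(mag) - lo) / max(hi - lo, eps) * levels).astype(np.uint8)
    return sign, codes, (lo, hi)

def dequantize(sign, codes, rng, bits: int = 4):
    lo, hi = rng
    levels = (1 << bits) - 1
    return sign * np.exp(lo + codes.astype(np.float64) / levels * (hi - lo))

if __name__ == "__main__":
    g = np.random.default_rng(0).normal(size=5)
    s, c, r = quantize(g)
    print(g)
    print(dequantize(s, c, r))  # coarse reconstruction from 4-bit codes
```

In a federated round, each client would send (sign, codes, range) instead of full-precision gradients, and the server would dequantize before aggregation.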
54. Frozen-to-Paraffin: Categorization of Histological Frozen Sections by the Aid of Paraffin Sections and Generative Adversarial Networks [PDF] 返回目录
Michael Gadermayr, Maximilian Tschuchnig, Lea Maria Stangassinger, Christina Kreutzer, Sebastien Couillard-Despres, Gertie Janneke Oostingh, Anton Hittmair
Abstract: In contrast to paraffin sections, frozen sections can be generated quickly during surgical interventions. This procedure allows surgeons to wait for histological findings during the intervention and to base intra-operative decisions on the outcome of the histology. However, compared to paraffin sections, the quality of frozen sections is typically lower, leading to a higher rate of misclassification. In this work, we investigated the effect of the section type on automated decision support approaches for the classification of thyroid cancer. This was enabled by a dataset consisting of pairs of sections for individual patients. Moreover, we investigated whether a frozen-to-paraffin translation could help optimize classification scores. Finally, we propose a specific data augmentation strategy to deal with a small amount of training data and to increase classification accuracy even further.
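A minimal PyTorch sketch of how such a frozen-to-paraffin translation could be slotted into a classification pipeline; both the generator and the classifier below are toy placeholders, not the architectures used in the paper.

```python
import torch
import torch.nn as nn

def classify_with_translation(frozen_batch, generator, classifier):
    """Translate frozen-section patches to paraffin appearance, then classify.

    generator stands in for a trained image-to-image network (e.g. a GAN
    generator); it is frozen here because only the classifier is trained.
    """
    with torch.no_grad():
        paraffin_like = generator(frozen_batch)
    return classifier(paraffin_like)

if __name__ == "__main__":
    generator = nn.Conv2d(3, 3, 3, padding=1)  # toy stand-in for a GAN generator
    classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 2))
    x = torch.randn(4, 3, 64, 64)              # a batch of frozen-section patches
    print(classify_with_translation(x, generator, classifier).shape)  # -> (4, 2)
```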
55. Hypothesis Disparity Regularized Mutual Information Maximization [PDF] 返回目录
Qicheng Lao, Xiang Jiang, Mohammad Havaei
Abstract: We propose a hypothesis disparity regularized mutual information maximization (HDMI) approach to tackle unsupervised hypothesis transfer -- as an effort towards unifying hypothesis transfer learning (HTL) and unsupervised domain adaptation (UDA) -- where the knowledge from a source domain is transferred solely through hypotheses and adapted to the target domain in an unsupervised manner. In contrast to the prevalent HTL and UDA approaches that typically use a single hypothesis, HDMI employs multiple hypotheses to leverage the underlying distributions of the source and target hypotheses. To better utilize the crucial relationship among different hypotheses -- as opposed to optimizing each hypothesis independently and without constraints -- while adapting to the unlabeled target domain through mutual information maximization, HDMI incorporates a hypothesis disparity regularization that coordinates the target hypotheses to jointly learn better target representations while preserving more transferable source knowledge with better-calibrated prediction uncertainty. HDMI achieves state-of-the-art adaptation performance on benchmark datasets for UDA in the context of HTL, without the need to access the source data during adaptation.
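The sketch below shows one plausible form of the objective: per-head information maximization (confident, class-balanced predictions) plus a disparity term that pulls the heads' predictions toward their mean. The exact regularizer and weighting in the paper may differ.

```python
import torch
import torch.nn.functional as F

def hdmi_loss(logits_per_head, disparity_weight: float = 0.1):
    """Assumed form of an HDMI-style objective on one unlabeled target batch."""
    probs = [F.softmax(l, dim=1) for l in logits_per_head]
    im = 0.0
    for p in probs:
        cond_ent = -(p * torch.log(p + 1e-8)).sum(1).mean()   # confidence term
        marg = p.mean(0)
        marg_ent = -(marg * torch.log(marg + 1e-8)).sum()     # diversity term
        im = im + cond_ent - marg_ent                         # minimizing this maximizes MI
    mean_p = torch.stack(probs).mean(0)
    disparity = sum(F.kl_div(torch.log(p + 1e-8), mean_p, reduction="batchmean")
                    for p in probs)                           # keep hypotheses coordinated
    return im / len(probs) + disparity_weight * disparity

if __name__ == "__main__":
    heads = [torch.randn(16, 10, requires_grad=True) for _ in range(3)]
    hdmi_loss(heads).backward()  # gradients flow to every hypothesis
```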
56. Iterative label cleaning for transductive and semi-supervised few-shot learning [PDF] 返回目录
Michalis Lazarou, Yannis Avrithis, Tania Stathaki
Abstract: Few-shot learning amounts to learning representations and acquiring knowledge such that novel tasks may be solved with both supervision and data being limited. Improved performance is possible by transductive inference, where the entire test set is available concurrently, and by semi-supervised learning, where more unlabeled data is available. These problems are closely related because there is little or no adaptation of the representation in novel tasks. Focusing on these two settings, we introduce a new algorithm that leverages the manifold structure of the labeled and unlabeled data distribution to predict pseudo-labels, while balancing over classes and using the loss value distribution of a limited-capacity classifier to select the cleanest labels, iteratively improving the quality of the pseudo-labels. Our solution sets a new state of the art on four benchmark datasets, namely \emph{mini}ImageNet, \emph{tiered}ImageNet, CUB, and CIFAR-FS, while being robust to feature space pre-processing and the quantity of available data.
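As a concrete illustration of the selection step, here is a small NumPy sketch that keeps, per class, the pseudo-labels with the lowest loss under the classifier's own predictions. The manifold-based label propagation and the iteration loop from the abstract are omitted, and all names are illustrative.

```python
import numpy as np

def select_clean_pseudo_labels(probs: np.ndarray, keep_per_class: int):
    """probs: (n_samples, n_classes) predictions of a limited-capacity classifier.

    Cross-entropy of a sample against its own pseudo-label acts as a
    cleanness score; selecting per class keeps the kept set class-balanced.
    """
    pseudo = probs.argmax(1)
    loss = -np.log(probs[np.arange(len(probs)), pseudo] + 1e-8)
    kept = []
    for c in np.unique(pseudo):
        idx = np.where(pseudo == c)[0]
        kept.extend(idx[np.argsort(loss[idx])[:keep_per_class]])  # cleanest first
    return np.array(sorted(kept)), pseudo

if __name__ == "__main__":
    p = np.random.default_rng(0).dirichlet(np.ones(5), size=50)
    kept, labels = select_clean_pseudo_labels(p, keep_per_class=3)
    print(kept, labels[kept])
```

Each outer iteration would retrain the classifier on the kept labels and re-score the remainder, gradually improving pseudo-label quality.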
57. Towards broader generalization of deep learning methods for multiple sclerosis lesion segmentation [PDF] 返回目录
Reda Abdellah Kamraoui, Vinh-Thong Ta, Thomas Tourdias, Boris Mansencal, José V Manjon, Pierrick Coupé
Abstract: Recently, segmentation methods based on Convolutional Neural Networks (CNNs) have shown promising performance in automatic Multiple Sclerosis (MS) lesion segmentation. These techniques have even outperformed human experts in controlled evaluation conditions. However, state-of-the-art approaches trained to perform well on highly controlled datasets fail to generalize to clinical data from unseen datasets. Instead of proposing yet another improvement in segmentation accuracy, we propose a novel method, called DeepLesionBrain (DLB), that is robust to domain shift and performs well on unseen datasets. This generalization property results from three main contributions. First, DLB is based on a large ensemble of compact 3D CNNs. This ensemble strategy ensures a robust prediction despite the risk of generalization failure of some individual networks. Second, DLB includes a new image-quality data augmentation to reduce dependency on training data specificity (e.g., acquisition protocol). Finally, to learn a more generalizable representation of MS lesions, we propose hierarchical specialization learning (HSL). HSL is performed by pre-training a generic network over the whole brain before using its weights as initialization for locally specialized networks. In this way, DLB learns both generic features extracted at the global image level and specific features extracted at the local image level. At the time of publishing this paper, DLB is among the top 3 performing published methods on the ISBI Challenge while using only half of the available modalities. DLB generalization has also been compared to other state-of-the-art approaches in cross-dataset experiments on MSSEG'16, the ISBI challenge, and in-house datasets. DLB improves segmentation performance and generalization over classical techniques, and thus offers a robust approach better suited for clinical practice.
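A minimal sketch of the HSL initialization described above: a generic network pretrained on the whole brain seeds one compact network per region. The region partitioning, fine-tuning, and ensembling are omitted, and the toy architecture is not the paper's.

```python
import copy
import torch.nn as nn

def hierarchical_specialization(generic: nn.Module, n_regions: int):
    """Clone the pretrained generic network as the init of each local network.

    Each clone would then be fine-tuned only on patches from its own brain
    region, keeping the generic global features while specializing locally.
    """
    return [copy.deepcopy(generic) for _ in range(n_regions)]

if __name__ == "__main__":
    generic = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
                            nn.Conv3d(8, 2, 1))  # toy stand-in for a compact 3D CNN
    local_nets = hierarchical_specialization(generic, n_regions=4)
    print(len(local_nets), "regional networks initialized from the generic one")
```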
58. When Physical Unclonable Function Meets Biometrics [PDF] 返回目录
Kavya Dayananda, Nima Karimian
Abstract: As the Covid-19 pandemic grips the world, healthcare systems are being reshaped and e-health concepts are becoming more likely to be accepted. Wearable devices often carry sensitive information from users that is exposed to security and privacy risks. Moreover, users have always been concerned that devices could be counterfeited between the fabrication process and the vendors' storage. Hence, securing personal data is becoming a crucial obligation, and device verification is another challenge. To address these issues, biometric authentication and physically unclonable functions (PUFs) need to be put in place to mitigate the security and privacy risks for users. Among biometric modalities, Electrocardiogram (ECG) based biometrics have become popular, as they can authenticate patients and monitor their vital signs. However, researchers have recently started to study the vulnerabilities of ECG biometric systems and have tried to address the issue of spoofing. Moreover, most wearables are equipped with a CPU and memory. Thus, non-volatile memory (NVM) based PUFs can easily be placed in the device to avoid counterfeiting. However, much research has challenged the unclonability characteristics of PUFs. Thus, a careful study of these attacks should be sufficient to address the need. In this paper, our aim is to provide a comprehensive study of the state-of-the-art developments in biometrics-enabled hardware security.
59. Learning Collision-Free Space Detection from Stereo Images: Homography Matrix Brings Better Data Augmentation [PDF] 返回目录
Rui Fan, Hengli Wang, Peide Cai, Jin Wu, Mohammud Junaid Bocus, Lei Qiao, Ming Liu
Abstract: Collision-free space detection is a critical component of autonomous vehicle perception. The state-of-the-art algorithms are typically based on supervised learning, and the performance of such approaches always depends on the quality and amount of labeled training data. Additionally, it remains an open challenge to train deep convolutional neural networks (DCNNs) using only a small quantity of training samples. Therefore, this paper mainly explores an effective training data augmentation approach that can be employed to improve the overall DCNN performance when additional images captured from different views are available. Since the pixels of the collision-free space (generally regarded as a planar surface) in two images captured from different views can be associated by a homography matrix, the scene in the target image can be transformed into the reference view. This provides a simple but effective way of generating training data from additional multi-view images. Extensive experimental results, obtained with six state-of-the-art semantic segmentation DCNNs on three datasets, demonstrate the effectiveness of our proposed training data augmentation algorithm for enhancing collision-free space detection performance. When validated on the KITTI road benchmark, our approach provides the best results for stereo vision-based collision-free space detection.
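To illustrate the augmentation, the sketch below warps an additional view and its free-space label into the reference view with OpenCV, assuming the 3x3 homography H for the road plane is already known (e.g. derived from the stereo geometry and an estimated ground plane); the toy H and image sizes are illustrative.

```python
import cv2
import numpy as np

def augment_with_homography(extra_img, extra_label, H, out_size):
    """Warp a planar (road) region from another view into the reference view.

    Warping both the image and its collision-free-space label with the same
    homography yields an extra, geometrically consistent training sample.
    """
    w, h = out_size
    img = cv2.warpPerspective(extra_img, H, (w, h), flags=cv2.INTER_LINEAR)
    lbl = cv2.warpPerspective(extra_label, H, (w, h), flags=cv2.INTER_NEAREST)
    return img, lbl

if __name__ == "__main__":
    img = np.zeros((120, 160, 3), np.uint8)
    lbl = np.zeros((120, 160), np.uint8)
    H = np.array([[1.0, 0.02, 5.0],
                  [0.0, 1.00, 2.0],
                  [0.0, 0.00, 1.0]])  # toy homography between the two views
    aug_img, aug_lbl = augment_with_homography(img, lbl, H, (160, 120))
    print(aug_img.shape, aug_lbl.shape)
```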