目录
3. Assessing Pattern Recognition Performance of Neuronal Cultures through Accurate Simulation [PDF] 摘要
9. LGENet: Local and Global Encoder Network for Semantic Segmentation of Airborne Laser Scanning Point Clouds [PDF] 摘要
11. A Holistically-Guided Decoder for Deep Representation Learning with Applications to Semantic Segmentation and Object Detection [PDF] 摘要
19. Content Masked Loss: Human-Like Brush Stroke Planning in a Reinforcement Learning Painting Agent [PDF] 摘要
23. Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations [PDF] 摘要
28. Exploring Motion Boundaries in an End-to-End Network for Vision-based Parkinson's Severity Assessment [PDF] 摘要
31. Improving 3D convolutional neural network comprehensibility via interactive visualization of relevance maps: Evaluation in Alzheimer's disease [PDF] 摘要
33. Spectral Reflectance Estimation Using Projector with Unknown Spectral Power Distribution [PDF] 摘要
36. Treadmill Assisted Gait Spoofing (TAGS): An Emerging Threat to wearable Sensor-based Gait Authentication [PDF] 摘要
摘要
1. PC-RGNN: Point Cloud Completion and Graph Neural Network for 3D Object Detection [PDF] 返回目录
Yanan Zhang, Di Huang, Yunhong Wang
Abstract: LiDAR-based 3D object detection is an important task for autonomous driving and current approaches suffer from sparse and partial point clouds of distant and occluded objects. In this paper, we propose a novel two-stage approach, namely PC-RGNN, dealing with such challenges by two specific solutions. On the one hand, we introduce a point cloud completion module to recover high-quality proposals of dense points and entire views with original structures preserved. On the other hand, a graph neural network module is designed, which comprehensively captures relations among points through a local-global attention mechanism as well as multi-scale graph based context aggregation, substantially strengthening encoded features. Extensive experiments on the KITTI benchmark show that the proposed approach outperforms the previous state-of-the-art baselines by remarkable margins, highlighting its effectiveness.
摘要:基于LiDAR的3D目标检测是自动驾驶中的一项重要任务,而当前方法受困于远处和被遮挡物体的稀疏、不完整点云。本文提出了一种新颖的两阶段方法PC-RGNN,通过两个针对性的方案应对上述挑战。一方面,我们引入点云补全模块,以恢复出具有稠密点和完整视图、且保留原始结构的高质量候选(proposals)。另一方面,我们设计了一个图神经网络模块,通过局部-全局注意力机制以及基于多尺度图的上下文聚合来全面捕获点之间的关系,从而显著增强编码特征。在KITTI基准上的大量实验表明,所提方法以明显优势超越此前最先进的基线,凸显了其有效性。
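The abstract only names the local-global attention idea; below is a minimal, illustrative NumPy sketch of attention-weighted feature aggregation over a k-nearest-neighbour graph built on a point cloud. It is not the authors' implementation, and all function and variable names are hypothetical.

```python
import numpy as np

def knn_graph(points, k):
    """Return indices of the k nearest neighbours of every point (brute force)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    return np.argsort(d2, axis=1)[:, 1:k + 1]                      # drop column 0 (the point itself)

def graph_attention_aggregate(points, feats, k=8):
    """Aggregate neighbour features with softmax attention from feature similarity."""
    idx = knn_graph(points, k)                            # (N, k)
    neigh = feats[idx]                                    # (N, k, C)
    logits = (feats[:, None, :] * neigh).sum(-1)          # (N, k) centre-neighbour similarity
    attn = np.exp(logits - logits.max(1, keepdims=True))
    attn /= attn.sum(1, keepdims=True)                    # softmax over the k neighbours
    return (attn[..., None] * neigh).sum(1)               # (N, C) aggregated features

if __name__ == "__main__":
    pts = np.random.rand(128, 3)    # toy point cloud
    f = np.random.rand(128, 16)     # per-point features
    print(graph_attention_aggregate(pts, f).shape)  # (128, 16)
```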
2. Learning Complex 3D Human Self-Contact [PDF] 返回目录
Mihai Fieraru, Mihai Zanfir, Elisabeta Oneata, Alin-Ionut Popa, Vlad Olaru, Cristian Sminchisescu
Abstract: Monocular estimation of three dimensional human self-contact is fundamental for detailed scene analysis including body language understanding and behaviour modeling. Existing 3d reconstruction methods do not focus on body regions in self-contact and consequently recover configurations that are either far from each other or self-intersecting, when they should just touch. This leads to perceptually incorrect estimates and limits impact in those very fine-grained analysis domains where detailed 3d models are expected to play an important role. To address such challenges we detect self-contact and design 3d losses to explicitly enforce it. Specifically, we develop a model for Self-Contact Prediction (SCP), that estimates the body surface signature of self-contact, leveraging the localization of self-contact in the image, during both training and inference. We collect two large datasets to support learning and evaluation: (1) HumanSC3D, an accurate 3d motion capture repository containing $1,032$ sequences with $5,058$ contact events and $1,246,487$ ground truth 3d poses synchronized with images collected from multiple views, and (2) FlickrSC3D, a repository of $3,969$ images, containing $25,297$ surface-to-surface correspondences with annotated image spatial support. We also illustrate how more expressive 3d reconstructions can be recovered under self-contact signature constraints and present monocular detection of face-touch as one of the multiple applications made possible by more accurate self-contact models.
摘要:对三维人体自接触的单目估计是详细场景分析(包括肢体语言理解和行为建模)的基础。现有的3D重建方法不关注处于自接触状态的身体区域,因此在这些区域本应恰好接触时,往往恢复出彼此远离或自相交的构型。这会导致感知上不正确的估计,并限制了其在那些期望详细3D模型发挥重要作用的细粒度分析领域中的影响。为应对这些挑战,我们检测自接触并设计3D损失以显式地施加该约束。具体而言,我们开发了一个自接触预测(SCP)模型,它在训练和推理过程中利用图像中自接触的定位来估计自接触的身体表面特征(signature)。我们收集了两个大型数据集以支持学习和评估:(1)HumanSC3D,一个精确的3D动作捕捉数据库,包含1,032个序列、5,058个接触事件以及与多视角图像同步的1,246,487个真值3D姿态;(2)FlickrSC3D,一个包含3,969张图像的数据库,带有25,297个标注了图像空间支持的表面到表面对应关系。我们还展示了在自接触特征约束下如何恢复更具表现力的3D重建,并以面部触摸的单目检测为例,说明更精确的自接触模型所能实现的多种应用之一。
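To make the "3d losses to explicitly enforce it" idea concrete, here is a toy sketch of a self-contact penalty, assuming contact is given as pairs of body-surface vertex indices. The real SCP losses are richer; this only illustrates the basic constraint.

```python
import numpy as np

def self_contact_loss(vertices, contact_pairs):
    """Penalise the distance between body-surface vertices annotated as being in contact.

    vertices      : (V, 3) reconstructed 3D body-surface points
    contact_pairs : (P, 2) index pairs that the contact signature says should touch
    """
    a = vertices[contact_pairs[:, 0]]
    b = vertices[contact_pairs[:, 1]]
    return np.linalg.norm(a - b, axis=1).mean()

# toy usage: two vertices flagged as touching but reconstructed 5 cm apart
verts = np.zeros((10, 3))
verts[3] = [0.0, 0.0, 0.05]
print(self_contact_loss(verts, np.array([[2, 3]])))  # 0.05
```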
3. Assessing Pattern Recognition Performance of Neuronal Cultures through Accurate Simulation [PDF] 返回目录
Gabriele Lagani, Raffaele Mazziotti, Fabrizio Falchi, Claudio Gennaro, Guido Marco Cicchini, Tommaso Pizzorusso, Federico Cremisi, Giuseppe Amato
Abstract: Previous work has shown that it is possible to train neuronal cultures on Multi-Electrode Arrays (MEAs), to recognize very simple patterns. However, this work was mainly focused to demonstrate that it is possible to induce plasticity in cultures, rather than performing a rigorous assessment of their pattern recognition performance. In this paper, we address this gap by developing a methodology that allows us to assess the performance of neuronal cultures on a learning task. Specifically, we propose a digital model of the real cultured neuronal networks; we identify biologically plausible simulation parameters that allow us to reliably reproduce the behavior of real cultures; we use the simulated culture to perform handwritten digit recognition and rigorously evaluate its performance; we also show that it is possible to find improved simulation parameters for the specific task, which can guide the creation of real cultures.
摘要:先前的工作表明,可以在多电极阵列(MEA)上训练神经元培养物,使其识别非常简单的模式。但是,这些工作主要是为了证明可以在培养物中诱导可塑性,而不是对其模式识别性能进行严格评估。在本文中,我们通过开发一种方法来弥补这一空白,该方法使我们能够评估神经元培养物在学习任务上的表现。具体来说,我们提出了真实培养神经元网络的数字模型;我们确定了生物学上合理的仿真参数,使我们能够可靠地重现真实培养物的行为;我们使用模拟培养物执行手写数字识别并严格评估其性能;我们还表明,有可能为特定任务找到更优的仿真参数,从而指导真实培养物的构建。
4. Boosting Monocular Depth Estimation with Lightweight 3D Point Fusion [PDF] 返回目录
Lam Huynh, Phong Nguyen, Jiri Matas, Esa Rahtu, Janne Heikkila
Abstract: In this paper, we address the problem of fusing monocular depth estimation with a conventional multi-view stereo or SLAM to exploit the best of both worlds, that is, the accurate dense depth of the first one and lightweightness of the second one. More specifically, we use a conventional pipeline to produce a sparse 3D point cloud that is fed to a monocular depth estimation network to enhance its performance. In this way, we can achieve accuracy similar to multi-view stereo with a considerably smaller number of weights. We also show that even as few as 32 points is sufficient to outperform the best monocular depth estimation methods, and around 200 points to gain full advantage of the additional information. Moreover, we demonstrate the efficacy of our approach by integrating it with a SLAM system built-in on mobile devices.
摘要:在本文中,我们解决了将单目深度估计与传统多视图立体或SLAM融合的问题,以兼取两者之长,即前者精确稠密的深度和后者的轻量性。更具体地说,我们使用传统流水线生成稀疏3D点云,并将其馈入单目深度估计网络以提升其性能。通过这种方式,我们可以用少得多的权重实现接近多视图立体的精度。我们还表明,即使只有32个点也足以超越最佳的单目深度估计方法,而大约200个点即可充分利用这些附加信息。此外,我们通过将该方法与移动设备上内置的SLAM系统集成,证明了其有效性。
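As a rough illustration of the kind of fusion described above, the sketch below rasterises a handful of sparse 3D points into a sparse depth channel and stacks it onto the RGB image as network input. This is a generic illustration under assumed intrinsics, not the paper's exact pipeline.

```python
import numpy as np

def make_sparse_depth_input(rgb, points_cam, K):
    """Project sparse 3D points (camera frame) into a depth channel and stack it onto the image.

    rgb        : (H, W, 3) image in [0, 1]
    points_cam : (N, 3) points in the camera coordinate frame (z > 0)
    K          : (3, 3) camera intrinsics
    """
    h, w, _ = rgb.shape
    depth = np.zeros((h, w), dtype=np.float32)              # 0 marks "no measurement"
    uvw = (K @ points_cam.T).T                               # pinhole projection
    u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[ok], u[ok]] = points_cam[ok, 2]                  # keep metric depth at hit pixels
    return np.concatenate([rgb, depth[..., None]], axis=-1)  # (H, W, 4) network input

K = np.array([[500.0, 0, 160], [0, 500.0, 120], [0, 0, 1]])
inp = make_sparse_depth_input(np.random.rand(240, 320, 3), np.random.rand(32, 3) + [0, 0, 2], K)
print(inp.shape)  # (240, 320, 4)
```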
5. Trying Bilinear Pooling in Video-QA [PDF] 返回目录
Thomas Winterbottom, Sarah Xiao, Alistair McLean, Noura Al Moubayed
Abstract: Bilinear pooling (BLP) refers to a family of operations recently developed for fusing features from different modalities predominantly developed for VQA models. A bilinear (outer-product) expansion is thought to encourage models to learn interactions between two feature spaces and has experimentally outperformed `simpler' vector operations (concatenation and element-wise-addition/multiplication) on VQA benchmarks. Successive BLP techniques have yielded higher performance with lower computational expense and are often implemented alongside attention mechanisms. However, despite significant progress in VQA, BLP methods have not been widely applied to more recently explored video question answering (video-QA) tasks. In this paper, we begin to bridge this research gap by applying BLP techniques to various video-QA benchmarks, namely: TVQA, TGIF-QA, Ego-VQA and MSVD-QA. We share our results on the TVQA baseline model, and the recently proposed heterogeneous-memory-enchanced multimodal attention (HME) model. Our experiments include both simply replacing feature concatenation in the existing models with BLP, and a modified version of the TVQA baseline to accommodate BLP we name the `dual-stream' model. We find that our relatively simple integration of BLP does not increase, and mostly harms, performance on these video-QA benchmarks. Using recently proposed theoretical multimodal fusion taxonomies, we offer insight into why BLP-driven performance gain for video-QA benchmarks may be more difficult to achieve than in earlier VQA models. We suggest a few additional `best-practices' to consider when applying BLP to video-QA. We stress that video-QA models should carefully consider where the complex representational potential from BLP is actually needed to avoid computational expense on `redundant' fusion.
摘要:双线性池化(BLP)指的是近来为融合不同模态特征而发展出的一类操作,主要针对VQA模型而开发。双线性(外积)展开被认为能鼓励模型学习两个特征空间之间的交互,并且在VQA基准上的实验表现优于"更简单的"向量运算(拼接和逐元素加法/乘法)。后续的BLP技术以更低的计算开销取得了更高的性能,并且通常与注意力机制一起使用。然而,尽管VQA取得了重大进展,BLP方法尚未被广泛应用于最近兴起的视频问答(video-QA)任务。在本文中,我们开始通过将BLP技术应用于多个视频问答基准(即TVQA、TGIF-QA、Ego-VQA和MSVD-QA)来弥合这一研究空白。我们分享了在TVQA基线模型以及最近提出的异构记忆增强多模态注意力(HME)模型上的结果。我们的实验既包括简单地用BLP替换现有模型中的特征拼接,也包括为适配BLP而修改的TVQA基线版本,我们称之为"双流"模型。我们发现,这种相对简单的BLP集成并不会提升、反而大多会损害这些视频问答基准上的性能。借助最近提出的多模态融合理论分类,我们分析了为什么在视频问答基准上由BLP驱动的性能增益可能比早期VQA模型更难实现。我们还建议了在将BLP应用于视频问答时值得考虑的一些额外"最佳实践"。我们强调,视频问答模型应仔细考虑究竟在何处真正需要BLP的复杂表示能力,以避免在"冗余"融合上浪费计算开销。
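A small NumPy sketch contrasting feature concatenation with full bilinear (outer-product) fusion of two modality vectors, which is the distinction the abstract relies on. Practical BLP variants (e.g. compact bilinear pooling) use approximations to avoid the quadratic blow-up shown here; those are omitted.

```python
import numpy as np

def concat_fusion(v, q):
    """'Simpler' fusion: stack the two modality vectors."""
    return np.concatenate([v, q])              # (Dv + Dq,)

def bilinear_fusion(v, q):
    """Full bilinear fusion: every pairwise interaction between the two vectors."""
    return np.outer(v, q).reshape(-1)          # (Dv * Dq,)

video_feat = np.random.rand(512)
text_feat = np.random.rand(300)
print(concat_fusion(video_feat, text_feat).shape)    # (812,)
print(bilinear_fusion(video_feat, text_feat).shape)  # (153600,) -- why compact approximations matter
```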
6. Temporal Bilinear Encoding Network of Audio-Visual Features at Low Sampling Rates [PDF] 返回目录
Feiyan Hu, Eva Mohedano, Noel O'Connor, Kevin McGuinness
Abstract: Current deep learning based video classification architectures are typically trained end-to-end on large volumes of data and require extensive computational resources. This paper aims to exploit audio-visual information in video classification with a 1 frame per second sampling rate. We propose Temporal Bilinear Encoding Networks (TBEN) for encoding both audio and visual long range temporal information using bilinear pooling and demonstrate bilinear pooling is better than average pooling on the temporal dimension for videos with low sampling rate. We also embed the label hierarchy in TBEN to further improve the robustness of the classifier. Experiments on the FGA240 fine-grained classification dataset using TBEN achieve a new state-of-the-art (hit@1=47.95%). We also exploit the possibility of incorporating TBEN with multiple decoupled modalities like visual semantic and motion features: experiments on UCF101 sampled at 1 FPS achieve close to state-of-the-art accuracy (hit@1=91.03%) while requiring significantly less computational resources than competing approaches for both training and prediction.
摘要:当前基于深度学习的视频分类架构通常在大量数据上进行端到端训练,并且需要大量计算资源。本文旨在以每秒1帧的采样率利用视频分类中的视听信息。我们提出了时间双线性编码网络(TBEN),使用双线性池化对音频和视觉的长程时间信息进行编码,并证明对于低采样率的视频,在时间维度上双线性池化优于平均池化。我们还将标签层次结构嵌入TBEN,以进一步提高分类器的鲁棒性。使用TBEN在FGA240细粒度分类数据集上的实验取得了新的最高水平(hit@1=47.95%)。我们还探索了将TBEN与视觉语义和运动特征等多种解耦模态结合的可能性:在以1 FPS采样的UCF101上的实验取得了接近最先进水平的准确率(hit@1=91.03%),同时在训练和预测上所需的计算资源都显著少于其他竞争方法。
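To illustrate the temporal comparison in the abstract, here is a toy sketch of average pooling versus symmetric (second-order) bilinear pooling over a sequence of per-frame features sampled at 1 FPS. The names are hypothetical and this is not the TBEN implementation.

```python
import numpy as np

def temporal_average_pooling(feats):
    """feats: (T, C) per-frame features -> (C,) clip descriptor."""
    return feats.mean(axis=0)

def temporal_bilinear_pooling(feats):
    """Second-order pooling: average of per-frame outer products, flattened to a vector."""
    T, C = feats.shape
    gram = np.einsum('tc,td->cd', feats, feats) / T   # (C, C) channel-channel correlations
    return gram.reshape(-1)                           # (C*C,)

clip = np.random.rand(8, 64)          # 8 frames sampled at 1 FPS, 64-d features each
print(temporal_average_pooling(clip).shape)   # (64,)
print(temporal_bilinear_pooling(clip).shape)  # (4096,)
```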
7. SegGroup: Seg-Level Supervision for 3D Instance and Semantic Segmentation [PDF] 返回目录
An Tao, Yueqi Duan, Yi Wei, Jiwen Lu, Jie Zhou
Abstract: Most existing point cloud instance and semantic segmentation methods heavily rely on strong supervision signals, which require point-level labels for every point in the scene. However, strong supervision suffers from large annotation cost, arousing the need to study efficient annotating. In this paper, we propose a new form of weak supervision signal, namely seg-level labels, for point cloud instance and semantic segmentation. Based on the widely-used over-segmentation as pre-processor, we only annotate one point for each instance to obtain seg-level labels. We further design a segment grouping network (SegGroup) to generate pseudo point-level labels by hierarchically grouping the unlabeled segments into the relevant nearby labeled segments, so that existing methods can directly consume the pseudo labels for training. Experimental results show that our SegGroup achieves comparable results with the fully annotated point-level supervised methods on both point cloud instance and semantic segmentation tasks and outperforms the recent scene-level and subcloud-level supervised methods significantly.
摘要:大多数现有的点云实例分割和语义分割方法严重依赖强监督信号,即需要为场景中的每个点提供点级标签。但是,强监督带来巨大的标注成本,因此需要研究高效的标注方式。在本文中,我们为点云实例分割和语义分割提出了一种新形式的弱监督信号,即段级(seg-level)标签。基于广泛使用的过分割作为预处理,我们仅为每个实例标注一个点即可获得段级标签。我们进一步设计了一个段分组网络(SegGroup),通过将未标注的段分层地归入邻近的已标注段来生成伪点级标签,使现有方法可以直接使用这些伪标签进行训练。实验结果表明,我们的SegGroup在点云实例分割和语义分割任务上取得了与完全标注的点级监督方法相当的结果,并显著优于近期的场景级和子云级监督方法。
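A toy sketch of the seg-level idea: one annotated point per instance labels its over-segmentation segment, and unlabeled segments inherit the label of the nearest labeled segment by centroid distance. SegGroup's actual grouping is hierarchical and learned; this only conveys the flavour, and every name below is made up.

```python
import numpy as np

def propagate_seg_labels(points, seg_ids, labeled_points):
    """points: (N, 3); seg_ids: (N,) over-segmentation id per point;
    labeled_points: {point_index: label} -- one clicked point per instance.
    Returns a pseudo label for every point."""
    n_segs = seg_ids.max() + 1
    centroids = np.stack([points[seg_ids == s].mean(0) for s in range(n_segs)])
    seg_label = np.full(n_segs, -1)
    for idx, lab in labeled_points.items():
        seg_label[seg_ids[idx]] = lab                      # seg-level label from the single click
    labeled = np.where(seg_label >= 0)[0]
    for s in range(n_segs):                                # unlabeled segments copy the nearest labeled one
        if seg_label[s] < 0:
            d = np.linalg.norm(centroids[labeled] - centroids[s], axis=1)
            seg_label[s] = seg_label[labeled[d.argmin()]]
    return seg_label[seg_ids]                              # pseudo point-level labels

pts = np.random.rand(200, 3)
segs = np.repeat(np.arange(10), 20)                        # 10 segments of 20 points each
print(propagate_seg_labels(pts, segs, {0: 1, 25: 2})[:10])
```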
8. On Modality Bias in the TVQA Dataset [PDF] 返回目录
Thomas Winterbottom, Sarah Xiao, Alistair McLean, Noura Al Moubayed
Abstract: TVQA is a large scale video question answering (video-QA) dataset based on popular TV shows. The questions were specifically designed to require "both vision and language understanding to answer". In this work, we demonstrate an inherent bias in the dataset towards the textual subtitle modality. We infer said bias both directly and indirectly, notably finding that models trained with subtitles learn, on-average, to suppress video feature contribution. Our results demonstrate that models trained on only the visual information can answer ~45% of the questions, while using only the subtitles achieves ~68%. We find that a bilinear pooling based joint representation of modalities damages model performance by 9% implying a reliance on modality specific information. We also show that TVQA fails to benefit from the RUBi modality bias reduction technique popularised in VQA. By simply improving text processing using BERT embeddings with the simple model first proposed for TVQA, we achieve state-of-the-art results (72.13%) compared to the highly complex STAGE model (70.50%). We recommend a multimodal evaluation framework that can highlight biases in models and isolate visual and textual reliant subsets of data. Using this framework we propose subsets of TVQA that respond exclusively to either or both modalities in order to facilitate multimodal modelling as TVQA originally intended.
摘要:TVQA是一个基于热门电视节目的大规模视频问答(video-QA)数据集。其中的问题被特意设计为"需要同时理解视觉和语言才能回答"。在这项工作中,我们证明了该数据集对文本字幕模态存在固有偏向。我们通过直接和间接两种方式推断出这种偏向,尤其发现使用字幕训练的模型平均而言会学会抑制视频特征的贡献。我们的结果表明,仅使用视觉信息训练的模型可以回答约45%的问题,而仅使用字幕则可以达到约68%。我们发现,基于双线性池化的模态联合表示会使模型性能下降9%,这表明模型依赖于特定模态的信息。我们还表明,TVQA无法从在VQA中流行的RUBi模态偏差消减技术中获益。仅通过在最初为TVQA提出的简单模型中使用BERT嵌入改进文本处理,我们就取得了优于高度复杂的STAGE模型(70.50%)的最新结果(72.13%)。我们建议采用一个多模态评估框架,它可以凸显模型中的偏差,并分离出依赖视觉和依赖文本的数据子集。基于该框架,我们提出了仅对其中一种或两种模态作出响应的TVQA子集,以促进TVQA最初设想的多模态建模。
9. LGENet: Local and Global Encoder Network for Semantic Segmentation of Airborne Laser Scanning Point Clouds [PDF] 返回目录
Yaping Lin, George Vosselman, Yanpeng Cao, Michael Ying Yang
Abstract: Interpretation of Airborne Laser Scanning (ALS) point clouds is a critical procedure for producing various geo-information products like 3D city models, digital terrain models and land use maps. In this paper, we present a local and global encoder network (LGENet) for semantic segmentation of ALS point clouds. Adapting the KPConv network, we first extract features by both 2D and 3D point convolutions to allow the network to learn more representative local geometry. Then global encoders are used in the network to exploit contextual information at the object and point level. We design a segment-based Edge Conditioned Convolution to encode the global context between segments. We apply a spatial-channel attention module at the end of the network, which not only captures the global interdependencies between points but also models interactions between channels. We evaluate our method on two ALS datasets namely, the ISPRS benchmark dataset and DCF2019 dataset. For the ISPRS benchmark dataset, our model achieves state-of-the-art results with an overall accuracy of 0.845 and an average F1 score of 0.737. With regards to the DFC2019 dataset, our proposed network achieves an overall accuracy of 0.984 and an average F1 score of 0.834.
摘要:机载激光扫描(ALS)点云的解译是生产3D城市模型、数字地形模型和土地利用图等各类地理信息产品的关键环节。在本文中,我们提出了用于ALS点云语义分割的局部和全局编码器网络(LGENet)。在KPConv网络的基础上,我们首先通过2D和3D点卷积提取特征,使网络学习到更具代表性的局部几何。然后在网络中使用全局编码器,在目标和点两个层级利用上下文信息。我们设计了基于段的边缘条件卷积(Edge Conditioned Convolution)来编码段之间的全局上下文。我们在网络末端应用了空间-通道注意力模块,它不仅捕获点之间的全局相互依赖,还对通道之间的交互进行建模。我们在两个ALS数据集(ISPRS基准数据集和DFC2019数据集)上评估了我们的方法。在ISPRS基准数据集上,我们的模型取得了最先进的结果,总体精度为0.845,平均F1得分为0.737。在DFC2019数据集上,我们提出的网络取得了0.984的总体精度和0.834的平均F1得分。
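For intuition about the spatial-channel attention mentioned above, here is a minimal NumPy sketch of a gating block of the squeeze-and-excitation flavour applied to per-point features. The actual LGENet module models full pairwise interdependencies between points and channels, which this simplification does not; all weights below are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_channel_attention(feats, w_c, w_s):
    """feats: (N, C) per-point features; w_c: (C, C) channel weights; w_s: (C, 1) spatial weights."""
    channel_gate = sigmoid(feats.mean(axis=0) @ w_c)     # (C,)  reweights channels globally
    spatial_gate = sigmoid(feats @ w_s)                  # (N, 1) reweights individual points
    return feats * channel_gate[None, :] * spatial_gate  # gated features, same shape as input

N, C = 1024, 32
x = np.random.randn(N, C)
out = spatial_channel_attention(x, np.random.randn(C, C) * 0.1, np.random.randn(C, 1) * 0.1)
print(out.shape)  # (1024, 32)
```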
10. STNet: Scale Tree Network with Multi-level Auxiliator for Crowd Counting [PDF] 返回目录
Mingjie Wang, Hao Cai, Xianfeng Han, Jun Zhou, Minglun Gong
Abstract: Crowd counting remains a challenging task because the presence of drastic scale variation, density inconsistency, and complex background can seriously degrade the counting accuracy. To battle the ingrained issue of accuracy degradation, we propose a novel and powerful network called Scale Tree Network (STNet) for accurate crowd counting. STNet consists of two key components: a Scale-Tree Diversity Enhancer and a Semi-supervised Multi-level Auxiliator. Specifically, the Diversity Enhancer is designed to enrich scale diversity, which alleviates limitations of existing methods caused by insufficient level of scales. A novel tree structure is adopted to hierarchically parse coarse-to-fine crowd regions. Furthermore, a simple yet effective Multi-level Auxiliator is presented to aid in exploiting generalisable shared characteristics at multiple levels, allowing more accurate pixel-wise background cognition. The overall STNet is trained in an end-to-end manner, without the needs for manually tuning loss weights between the main and the auxiliary tasks. Extensive experiments on four challenging crowd datasets demonstrate the superiority of the proposed method.
摘要:人群计数仍然是一项具有挑战性的任务,因为剧烈的尺度变化、密度不一致以及复杂背景会严重降低计数精度。为了解决这一根深蒂固的精度下降问题,我们提出了一种新颖而强大的网络,称为尺度树网络(STNet),用于精确的人群计数。STNet由两个关键组件组成:尺度树多样性增强器和半监督多级辅助器。具体而言,多样性增强器旨在丰富尺度多样性,缓解现有方法因尺度层级不足而带来的局限。我们采用一种新颖的树结构来对人群区域进行由粗到细的层次解析。此外,我们提出了一种简单而有效的多级辅助器,帮助在多个层级上挖掘可泛化的共享特征,从而实现更准确的逐像素背景认知。整个STNet以端到端方式训练,无需手动调节主任务与辅助任务之间的损失权重。在四个具有挑战性的人群数据集上的大量实验证明了所提方法的优越性。
11. A Holistically-Guided Decoder for Deep Representation Learning with Applications to Semantic Segmentation and Object Detection [PDF] 返回目录
Jianbo Liu, Sijie Ren, Yuanjie Zheng, Xiaogang Wang, Hongsheng Li
Abstract: Both high-level and high-resolution feature representations are of great importance in various visual understanding tasks. To acquire high-resolution feature maps with high-level semantic information, one common strategy is to adopt dilated convolutions in the backbone networks to extract high-resolution feature maps, such as the dilatedFCN-based methods for semantic segmentation. However, due to many convolution operations are conducted on the high-resolution feature maps, such methods have large computational complexity and memory consumption. In this paper, we propose one novel holistically-guided decoder which is introduced to obtain the high-resolution semantic-rich feature maps via the multi-scale features from the encoder. The decoding is achieved via novel holistic codeword generation and codeword assembly operations, which take advantages of both the high-level and low-level features from the encoder features. With the proposed holistically-guided decoder, we implement the EfficientFCN architecture for semantic segmentation and HGD-FPN for object detection and instance segmentation. The EfficientFCN achieves comparable or even better performance than state-of-the-art methods with only 1/3 of their computational costs for semantic segmentation on PASCAL Context, PASCAL VOC, ADE20K datasets. Meanwhile, the proposed HGD-FPN achieves $>2\%$ higher mean Average Precision (mAP) when integrated into several object detection frameworks with ResNet-50 encoding backbones.
摘要:高层语义与高分辨率的特征表示在各类视觉理解任务中都非常重要。为了获得带有高层语义信息的高分辨率特征图,一种常见策略是在骨干网络中采用空洞卷积来提取高分辨率特征图,例如基于dilatedFCN的语义分割方法。然而,由于许多卷积运算是在高分辨率特征图上进行的,这类方法具有很大的计算复杂度和内存开销。在本文中,我们提出了一种新颖的整体引导解码器,用于借助编码器的多尺度特征获得高分辨率、语义丰富的特征图。解码通过新颖的整体码字生成和码字组装操作实现,同时利用了编码器特征中的高层和低层信息。借助所提出的整体引导解码器,我们实现了用于语义分割的EfficientFCN架构,以及用于目标检测和实例分割的HGD-FPN。在PASCAL Context、PASCAL VOC和ADE20K数据集的语义分割任务上,EfficientFCN仅以最先进方法约1/3的计算开销就取得了相当甚至更好的性能。同时,所提出的HGD-FPN在集成到若干使用ResNet-50编码骨干的目标检测框架中时,平均精度均值(mAP)提高超过2%。
12. SCNet: Training Inference Sample Consistency for Instance Segmentation [PDF] 返回目录
Thang Vu, Haeyong Kang, Chang D. Yoo
Abstract: Cascaded architectures have brought significant performance improvement in object detection and instance segmentation. However, there are lingering issues regarding the disparity in the Intersection-over-Union (IoU) distribution of the samples between training and inference. This disparity can potentially exacerbate detection accuracy. This paper proposes an architecture referred to as Sample Consistency Network (SCNet) to ensure that the IoU distribution of the samples at training time is close to that at inference time. Furthermore, SCNet incorporates feature relay and utilizes global contextual information to further reinforce the reciprocal relationships among classifying, detecting, and segmenting sub-tasks. Extensive experiments on the standard COCO dataset reveal the effectiveness of the proposed method over multiple evaluation metrics, including box AP, mask AP, and inference speed. In particular, while running 38\% faster, the proposed SCNet improves the AP of the box and mask predictions by respectively 1.3 and 2.3 points compared to the strong Cascade Mask R-CNN baseline. Code is available at \url{this https URL}.
摘要:级联架构为目标检测和实例分割带来了显著的性能提升。然而,训练与推理阶段样本的交并比(IoU)分布存在差异,这一问题一直悬而未决。这种差异可能会损害检测精度。本文提出了一种称为样本一致性网络(SCNet)的架构,以确保训练时样本的IoU分布与推理时接近。此外,SCNet引入了特征中继,并利用全局上下文信息进一步强化分类、检测和分割子任务之间的相互关系。在标准COCO数据集上的大量实验表明了所提方法在多个评估指标(包括框AP、掩码AP和推理速度)上的有效性。特别地,与强大的Cascade Mask R-CNN基线相比,所提出的SCNet在运行速度快38%的同时,将框预测和掩码预测的AP分别提高了1.3和2.3个点。代码可在\url{this https URL}获取。
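For reference, this is the IoU quantity whose train/inference distribution mismatch the abstract refers to: a minimal box-IoU function plus a toy comparison of a coarse versus a refined proposal against the same ground truth (the boxes are made up for illustration).

```python
import numpy as np

def box_iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2). Returns intersection-over-union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt = (10, 10, 50, 50)
coarse_proposal = (18, 18, 58, 58)     # what a first stage might output
refined_proposal = (12, 11, 52, 51)    # what a later cascade stage might output
print(round(box_iou(coarse_proposal, gt), 3), round(box_iou(refined_proposal, gt), 3))  # 0.471 0.863
```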
13. CodeVIO: Visual-Inertial Odometry with Learned Optimizable Dense Depth [PDF] 返回目录
Xingxing Zuo, Nathaniel Merrill, Wei Li, Yong Liu, Marc Pollefeys, Guoquan Huang
Abstract: In this work, we present a lightweight, tightly-coupled deep depth network and visual-inertial odometry (VIO) system, which can provide accurate state estimates and dense depth maps of the immediate surroundings. Leveraging the proposed lightweight Conditional Variational Autoencoder (CVAE) for depth inference and encoding, we provide the network with previously marginalized sparse features from VIO to increase the accuracy of initial depth prediction and generalization capability. The compact encoded depth maps are then updated jointly with navigation states in a sliding window estimator in order to provide the dense local scene geometry. We additionally propose a novel method to obtain the CVAE's Jacobian which is shown to be more than an order of magnitude faster than previous works, and we additionally leverage First-Estimate Jacobian (FEJ) to avoid recalculation. As opposed to previous works relying on completely dense residuals, we propose to only provide sparse measurements to update the depth code and show through careful experimentation that our choice of sparse measurements and FEJs can still significantly improve the estimated depth maps. Our full system also exhibits state-of-the-art pose estimation accuracy, and we show that it can run in real-time with single-thread execution while utilizing GPU acceleration only for the network and code Jacobian.
摘要:在这项工作中,我们提出了一个轻量级、紧耦合的深度估计网络与视觉惯性里程计(VIO)系统,它可以提供精确的状态估计和对周围环境的稠密深度图。借助所提出的轻量级条件变分自编码器(CVAE)进行深度推断与编码,我们将VIO中先前被边缘化的稀疏特征提供给网络,以提高初始深度预测的准确性和泛化能力。随后,紧凑编码的深度图与导航状态在滑动窗口估计器中联合更新,以提供稠密的局部场景几何。我们还提出了一种求取CVAE雅可比矩阵的新方法,其速度比以往工作快一个数量级以上,并进一步利用首次估计雅可比(FEJ)来避免重复计算。与以往依赖完全稠密残差的工作不同,我们建议仅使用稀疏测量来更新深度编码,并通过细致的实验表明,我们所选择的稀疏测量和FEJ仍然可以显著改善估计的深度图。我们的完整系统还具有最先进的位姿估计精度,并且我们证明它可以在单线程执行下实时运行,而GPU加速仅用于网络和深度编码雅可比的计算。
14. Hyperspectral Image Semantic Segmentation in Cityscapes [PDF] 返回目录
Yuxing Huang, Erqi Huang, Linsen Chen, Shaodi You, Ying Fu, Qiu Shen
Abstract: High-resolution hyperspectral images (HSIs) contain the response of each pixel in different spectral bands, which can be used to effectively distinguish various objects in complex scenes. While HSI cameras have become low cost, algorithms based on it has not been well exploited. In this paper, we focus on a novel topic, semi-supervised semantic segmentation in cityscapes using this http URL is based on the idea that high-resolution HSIs in city scenes contain rich spectral information, which can be easily associated to semantics without manual labeling. Therefore, it enables low cost, highly reliable semantic segmentation in complex scenes.Specifically, in this paper, we introduce a semi-supervised HSI semantic segmentation network, which utilizes spectral information to improve the coarse labels to a finer degree.The experimental results show that our method can obtain highly competitive labels and even have higher edge fineness than artificial fine labels in some classes. At the same time, the results also show that the optimized labels can effectively improve the effect of semantic segmentation. The combination of HSIs and semantic segmentation proves that HSIs have great potential in high-level visual tasks.
摘要:高分辨率高光谱图像(HSI)包含不同光谱带中每个像素的响应,可用于有效区分复杂场景中的各种对象。虽然HSI摄像机已经变得低成本,但基于它的算法尚未得到充分利用。在本文中,我们重点关注一个新颖的主题,即使用此http URL的城市景观中的半监督语义分割是基于以下思想:城市场景中的高分辨率HSI包含丰富的光谱信息,无需手动标记即可轻松将其与语义相关联。因此,它可以在复杂场景中实现低成本,高度可靠的语义分割。具体而言,本文介绍了一种半监督的HSI语义分割网络,该网络利用频谱信息将粗略标签改进到更好的程度。实验结果表明我们的方法可以在某些类别中获得比人造精细标签更高的竞争力标签,甚至具有更高的边缘精细度。同时,结果还表明,优化后的标签可以有效提高语义分割的效果。 HSI和语义分割的结合证明了HSI在高级视觉任务中具有巨大的潜力。
15. Frequency Consistent Adaptation for Real World Super Resolution [PDF] 返回目录
Xiaozhong Ji, Guangpin Tao, Yun Cao, Ying Tai, Tong Lu, Chengjie Wang, Jilin Li, Feiyue Huang
Abstract: Recent deep-learning based Super-Resolution (SR) methods have achieved remarkable performance on images with known degradation. However, these methods always fail in real-world scene, since the Low-Resolution (LR) images after the ideal degradation (e.g., bicubic down-sampling) deviate from real source domain. The domain gap between the LR images and the real-world images can be observed clearly on frequency density, which inspires us to explictly narrow the undesired gap caused by incorrect degradation. From this point of view, we design a novel Frequency Consistent Adaptation (FCA) that ensures the frequency domain consistency when applying existing SR methods to the real scene. We estimate degradation kernels from unsupervised images and generate the corresponding LR images. To provide useful gradient information for kernel estimation, we propose Frequency Density Comparator (FDC) by distinguishing the frequency density of images on different scales. Based on the domain-consistent LR-HR pairs, we train easy-implemented Convolutional Neural Network (CNN) SR models. Extensive experiments show that the proposed FCA improves the performance of the SR model under real-world setting achieving state-of-the-art results with high fidelity and plausible perception, thus providing a novel effective framework for real-world SR application.
摘要:最近基于深度学习的超分辨率(SR)方法在退化方式已知的图像上取得了卓越的性能。然而,这些方法在真实场景中往往会失效,因为经过理想退化(例如双三次下采样)得到的低分辨率(LR)图像偏离了真实的源域。LR图像与真实图像之间的域差距可以在频率密度上被清晰地观察到,这启发我们显式地缩小由不正确退化造成的这种不期望的差距。从这个角度出发,我们设计了一种新颖的频率一致性自适应(FCA)方法,在将现有SR方法应用于真实场景时确保频域一致性。我们从无监督图像中估计退化核,并生成相应的LR图像。为了给退化核估计提供有用的梯度信息,我们提出了频率密度比较器(FDC),用于区分不同尺度图像的频率密度。基于域一致的LR-HR图像对,我们训练了易于实现的卷积神经网络(CNN)SR模型。大量实验表明,所提出的FCA在真实场景设置下提升了SR模型的性能,以高保真度和合理的感知质量取得了最先进的结果,从而为真实世界SR应用提供了一个新颖而有效的框架。
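One simple, hand-crafted way to inspect the frequency-density gap the abstract mentions is to compare radially averaged FFT power spectra of two image patches. The paper's FDC is a learned comparator, so the NumPy sketch below is only a stand-in for intuition; the inputs are random placeholders.

```python
import numpy as np

def radial_power_spectrum(img, n_bins=16):
    """Radially averaged power spectrum of a grayscale image patch."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)                 # radius of each frequency bin
    bins = np.linspace(0, r.max() + 1e-6, n_bins + 1)
    idx = np.digitize(r.ravel(), bins) - 1
    return np.array([power.ravel()[idx == b].mean() for b in range(n_bins)])

real_lr = np.random.rand(64, 64)       # stand-in for a real-world LR patch
bicubic_lr = np.random.rand(64, 64)    # stand-in for an ideally degraded LR patch
gap = np.abs(np.log(radial_power_spectrum(real_lr)) - np.log(radial_power_spectrum(bicubic_lr)))
print(gap.mean())                      # crude per-band measure of the frequency-density gap
```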
16. AU-Guided Unsupervised Domain Adaptive Facial Expression Recognition [PDF] 返回目录
Kai Wang, Yuxin Gu, Xiaojiang Peng, Baigui Sun, Hao Li
Abstract: The domain diversities including inconsistent annotation and varied image collection conditions inevitably exist among different facial expression recognition (FER) datasets, which pose an evident challenge for adapting the FER model trained on one dataset to another one. Recent works mainly focus on domain-invariant deep feature learning with adversarial learning mechanism, ignoring the sibling facial action unit (AU) detection task which has obtained great progress. Considering AUs objectively determine facial expressions, this paper proposes an AU-guided unsupervised Domain Adaptive FER (AdaFER) framework. In AdaFER, we first leverage an advanced model for AU detection on both source and target domain. Then, we compare the AU results to perform AU-guided annotating, i.e., target faces that own the same AUs with source faces would inherit the labels from source domain. Meanwhile, to achieve domain-invariant compact features, we utilize an AU-guided triplet training which randomly collects anchor-positive-negative triplets on both domains with AUs. We conduct extensive experiments on several popular benchmarks and show that AdaFER achieves state-of-the-art results on all the benchmarks.
摘要:不同的面部表情识别(FER)数据集之间不可避免地存在标注不一致、图像采集条件各异等领域差异,这给将在一个数据集上训练的FER模型迁移到另一个数据集带来了明显挑战。近期工作主要集中在基于对抗学习机制的领域不变深度特征学习上,而忽略了已取得巨大进展的姊妹任务——面部动作单元(AU)检测。考虑到AU客观地决定了面部表情,本文提出了一个AU引导的无监督领域自适应FER(AdaFER)框架。在AdaFER中,我们首先利用先进模型在源域和目标域上进行AU检测。然后,我们比较AU结果以执行AU引导的标注,即与源人脸拥有相同AU的目标人脸将继承源域的标签。同时,为了获得领域不变的紧凑特征,我们采用AU引导的三元组训练,利用AU在两个域上随机采样锚点-正样本-负样本三元组。我们在多个流行基准上进行了大量实验,表明AdaFER在所有基准上均取得了最先进的结果。
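The triplet part of the training is standard; below is a NumPy sketch of a margin-based triplet loss where, as the abstract describes, the anchor and positive are faces sharing the same AU activation pattern and the negative has a different one. The AU vectors and features are random placeholders, not AdaFER's actual pipeline.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pulling anchor-positive together and pushing anchor-negative apart."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

# AU-guided sampling: positive shares the anchor's AU pattern, negative does not
aus = np.array([[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]])   # detected AUs for three faces
feats = np.random.randn(3, 128)                              # embeddings of the same three faces
anchor, positive, negative = feats[0], feats[1], feats[2]
assert (aus[0] == aus[1]).all() and not (aus[0] == aus[2]).all()
print(triplet_loss(anchor, positive, negative))
```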
17. TDN: Temporal Difference Networks for Efficient Action Recognition [PDF] 返回目录
Limin Wang, Zhan Tong, Bin Ji, Gangshan Wu
Abstract: Temporal modeling still remains challenging for action recognition in videos. To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition. The core of our TDN is to devise an efficient temporal module (TDM) by explicitly leveraging a temporal difference operator, and systematically assess its effect on short-term and long-term motion modeling. To fully capture temporal information over the entire video, our TDN is established with a two-level difference modeling paradigm. Specifically, for local motion modeling, temporal difference over consecutive frames is used to supply 2D CNNs with finer motion pattern, while for global motion modeling, temporal difference across segments is incorporated to capture long-range structure for motion feature excitation. TDN provides a simple and principled temporal modeling framework and could be instantiated with the existing CNNs at a small extra computational cost. Our TDN presents a new state of the art on the Something-Something V1 and V2 datasets and is on par with the best performance on the Kinetics-400 dataset. In addition, we conduct in-depth ablation studies and plot the visualization results of our TDN, hopefully providing insightful analysis on temporal difference operation. We release the code at this https URL.
摘要:时间建模对于视频中的动作识别仍然具有挑战性。为了缓解这个问题,本文提出了一种新的视频体系结构,称为时间差分网络(TDN),重点是捕获多尺度时间信息以实现高效的动作识别。TDN的核心是通过显式利用时间差分算子来设计高效的时间模块(TDM),并系统地评估其对短期和长期运动建模的影响。为了充分捕获整个视频中的时间信息,我们的TDN采用两级差分建模范式。具体来说,对于局部运动建模,利用连续帧之间的时间差分为2D CNN提供更精细的运动模式;而对于全局运动建模,则引入跨片段的时间差分来捕获用于运动特征激励的长程结构。TDN提供了一个简单且有原则的时间建模框架,并且可以在很小的额外计算成本下与现有CNN结合实例化。我们的TDN在Something-Something V1和V2数据集上达到了新的最先进水平,并与Kinetics-400数据集上的最佳性能相当。此外,我们进行了深入的消融研究并给出了TDN的可视化结果,希望能对时间差分操作提供有洞察力的分析。我们在此https URL上发布代码。
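A minimal sketch (an assumption, not the released TDN code) of the temporal difference operator at the heart of the method: consecutive-frame differences act as a cheap short-term motion representation that a 2D CNN can consume alongside the appearance of a center frame.

import torch

def local_temporal_difference(clip: torch.Tensor) -> torch.Tensor:
    """clip: (B, T, C, H, W) stacked frames. Returns (B, T-1, C, H, W)
    frame-to-frame differences approximating short-term motion."""
    return clip[:, 1:] - clip[:, :-1]

clip = torch.randn(2, 8, 3, 112, 112)      # a toy 8-frame clip
motion = local_temporal_difference(clip)    # (2, 7, 3, 112, 112) motion cues
appearance = clip[:, clip.shape[1] // 2]    # center frame for the 2D CNN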
18. PointINet: Point Cloud Frame Interpolation Network [PDF] 返回目录
Fan Lu, Guang Chen, Sanqing Qu, Zhijun Li, Yinlong Liu, Alois Knoll
Abstract: LiDAR point cloud streams are usually sparse in time dimension, which is limited by hardware performance. Generally, the frame rates of mechanical LiDAR sensors are 10 to 20 Hz, which is much lower than other commonly used sensors like cameras. To overcome the temporal limitations of LiDAR sensors, a novel task named Point Cloud Frame Interpolation is studied in this paper. Given two consecutive point cloud frames, Point Cloud Frame Interpolation aims to generate intermediate frame(s) between them. To achieve that, we propose a novel framework, namely Point Cloud Frame Interpolation Network (PointINet). Based on the proposed method, the low frame rate point cloud streams can be upsampled to higher frame rates. We start by estimating bi-directional 3D scene flow between the two point clouds and then warp them to the given time step based on the 3D scene flow. To fuse the two warped frames and generate intermediate point cloud(s), we propose a novel learning-based points fusion module, which simultaneously takes two warped point clouds into consideration. We design both quantitative and qualitative experiments to evaluate the performance of the point cloud frame interpolation method and extensive experiments on two large scale outdoor LiDAR datasets demonstrate the effectiveness of the proposed PointINet. Our code is available at this https URL.
摘要:受硬件性能限制,LiDAR点云流通常在时间维度上是稀疏的。通常,机械式LiDAR传感器的帧率为10到20 Hz,远低于相机等其他常用传感器。为了克服LiDAR传感器的时间限制,本文研究了一项新任务,即点云帧插值。给定两个连续的点云帧,点云帧插值旨在在它们之间生成中间帧。为此,我们提出了一个新颖的框架,即点云帧插值网络(PointINet)。基于所提方法,可以将低帧率的点云流上采样到更高的帧率。我们首先估计两个点云之间的双向3D场景流,然后根据3D场景流将它们扭曲到给定的时间步。为了融合两个扭曲后的帧并生成中间点云,我们提出了一种新颖的基于学习的点融合模块,该模块同时考虑两个扭曲后的点云。我们设计了定量和定性实验来评估点云帧插值方法的性能,在两个大规模室外LiDAR数据集上的大量实验证明了所提出的PointINet的有效性。我们的代码可从以下https URL获得。
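A hedged sketch of the interpolation idea in the abstract: warp both point clouds toward an intermediate time t along the estimated bi-directional 3D scene flow, then fuse them. Plain concatenation is used below as a stand-in for the paper's learned points fusion module.

import numpy as np

def interpolate_frame(p0: np.ndarray, flow_0to1: np.ndarray,
                      p1: np.ndarray, flow_1to0: np.ndarray,
                      t: float) -> np.ndarray:
    """p0: (N, 3) and p1: (M, 3) point clouds at times 0 and 1; flow_* are the
    per-point 3D scene flow vectors. Returns an approximate cloud at time t."""
    warped_from_0 = p0 + t * flow_0to1          # move frame-0 points forward
    warped_from_1 = p1 + (1.0 - t) * flow_1to0  # move frame-1 points backward
    return np.concatenate([warped_from_0, warped_from_1], axis=0)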
19. Content Masked Loss: Human-Like Brush Stroke Planning in a Reinforcement Learning Painting Agent [PDF] 返回目录
Peter Schaldenbrand, Jean Oh
Abstract: The objective of most Reinforcement Learning painting agents is to minimize the loss between a target image and the paint canvas. Human painter artistry emphasizes important features of the target image rather than simply reproducing it (DiPaola 2007). Using adversarial or L2 losses in the RL painting models, although its final output is generally a work of finesse, produces a stroke sequence that is vastly different from that which a human would produce since the model does not have knowledge about the abstract features in the target image. In order to increase the human-like planning of the model without the use of expensive human data, we introduce a new loss function for use with the model's reward function: Content Masked Loss. In the context of robot painting, Content Masked Loss employs an object detection model to extract features which are used to assign higher weight to regions of the canvas that a human would find important for recognizing content. The results, based on 332 human evaluators, show that the digital paintings produced by our Content Masked model show detectable subject matter earlier in the stroke sequence than existing methods without compromising on the quality of the final painting.
摘要:大多数强化学习绘画智能体的目标是最小化目标图像与画布之间的损失。人类画家的艺术创作强调目标图像的重要特征,而不是简单地复制它(DiPaola 2007)。在RL绘画模型中使用对抗损失或L2损失,尽管其最终输出通常颇为精巧,但产生的笔画序列与人类的笔画序列有很大不同,因为模型并不了解目标图像中的抽象特征。为了在不使用昂贵的人工数据的情况下提升模型的类人规划能力,我们引入了一种与模型奖励函数配合使用的新损失函数:Content Masked Loss。在机器人绘画的背景下,Content Masked Loss使用对象检测模型提取特征,并据此为画布上人类认为对识别内容重要的区域分配更高的权重。基于332位人类评估者的结果表明,由我们的Content Masked模型生成的数字绘画能在笔画序列中比现有方法更早地呈现出可辨识的主题,而不会影响最终绘画的质量。
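A minimal, assumed sketch of a content-masked reconstruction loss: an object-recognition model provides a per-pixel importance map over the target image, and reconstruction error in important regions is weighted more heavily. The weighting scheme below is an illustration, not the authors' exact formulation.

import torch

def content_masked_loss(canvas: torch.Tensor, target: torch.Tensor,
                        importance: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """canvas, target: (B, C, H, W); importance: (B, 1, H, W) in [0, 1],
    e.g. derived from object-detection activations on the target image."""
    weights = 1.0 + alpha * importance            # important pixels count more
    return (weights * (canvas - target) ** 2).mean()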
20. 3D Object Classification on Partial Point Clouds: A Practical Perspective [PDF] 返回目录
Zelin Xu, Ke Chen, Tong Zhang, C. L. Philip Chen, Kui Jia
Abstract: A point cloud is a popular shape representation adopted in 3D object classification, which covers the whole surface of an object and is usually well aligned. However, such an assumption can be invalid in practice, as point clouds collected in real-world scenarios are typically scanned from visible object parts observed under arbitrary SO(3) viewpoint, which are thus incomplete due to self and inter-object occlusion. In light of this, this paper introduces a practical setting to classify partial point clouds of object instances under any poses. Compared to the classification of complete object point clouds, such a problem is made more challenging in view of geometric similarities of local shape across object classes and intra-class dissimilarities of geometries restricted by their observation view. We consider that specifying the location of partial point clouds on their object surface is essential to alleviate suffering from the aforementioned challenges, which can be solved via an auxiliary task of 6D object pose estimation. To this end, a novel algorithm in an alignment-classification manner is proposed in this paper, which consists of an alignment module predicting object pose for the rigid transformation of visible point clouds to their canonical pose and a typical point classifier such as PointNet++ and DGCNN. Experiment results on the popular ModelNet40 and ScanNet datasets, which are adapted to a single-view partial setting, demonstrate the proposed method can outperform three alternative schemes extended from representative point cloud classifiers for complete point clouds.
摘要:点云是3D对象分类中流行的形状表示形式,它覆盖对象的整个表面,并且通常对齐良好。但是,这种假设在实践中可能并不成立,因为在现实场景中采集的点云通常是从任意SO(3)视角下观察到的可见物体部分扫描得到的,由于自遮挡和物体间遮挡而不完整。有鉴于此,本文引入了一种实用设定:对任意姿态下物体实例的部分点云进行分类。与完整物体点云的分类相比,由于不同物体类别间局部形状的几何相似性,以及受观察视角限制的类内几何差异,该问题更具挑战性。我们认为,确定部分点云在其物体表面上的位置对于缓解上述挑战至关重要,这可以通过6D物体姿态估计的辅助任务来解决。为此,本文提出了一种对齐-分类方式的新算法,它由一个对齐模块和一个典型的点分类器(如PointNet++和DGCNN)组成,对齐模块预测物体姿态,以便将可见点云刚性变换到其规范姿态。在适配为单视角部分观测设定的流行ModelNet40和ScanNet数据集上的实验结果表明,所提出的方法优于从面向完整点云的代表性点云分类器扩展而来的三种替代方案。
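A hedged sketch of the alignment-classification pipeline: an alignment module predicts a rigid transform that maps the observed partial cloud to its canonical pose, and a standard point classifier (e.g. PointNet++ or DGCNN) then labels the aligned cloud. Both sub-networks below are placeholders supplied by the caller.

import torch
import torch.nn as nn

class AlignThenClassify(nn.Module):
    def __init__(self, align_net: nn.Module, classifier: nn.Module):
        super().__init__()
        self.align_net = align_net    # assumed to predict (B, 3, 4) rigid transforms
        self.classifier = classifier  # any point cloud classifier

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        """points: (B, N, 3) partial clouds observed under arbitrary SO(3) poses."""
        transform = self.align_net(points)                   # (B, 3, 4)
        R, t = transform[:, :, :3], transform[:, :, 3:]      # rotation, translation
        canonical = torch.bmm(points, R.transpose(1, 2)) + t.transpose(1, 2)
        return self.classifier(canonical)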
21. Self-supervised Learning with Fully Convolutional Networks [PDF] 返回目录
Zhengeng Yang, Hongshan Yu, Yong He, Zhi-Hong Mao, Ajmal Mian
Abstract: Although deep learning based methods have achieved great success in many computer vision tasks, their performance relies on a large number of densely annotated samples that are typically difficult to obtain. In this paper, we focus on the problem of learning representation from unlabeled data for semantic segmentation. Inspired by two patch-based methods, we develop a novel self-supervised learning framework by formulating the Jigsaw Puzzle problem as a patch-wise classification process and solving it with a fully convolutional network. By learning to solve a Jigsaw Puzzle problem with 25 patches and transferring the learned features to semantic segmentation task on Cityscapes dataset, we achieve a 5.8 percentage point improvement over the baseline model that initialized from random values. Moreover, experiments show that our self-supervised learning method can be applied to different datasets and models. In particular, we achieved competitive performance with the state-of-the-art methods on the PASCAL VOC2012 dataset using significant fewer training images.
摘要:尽管基于深度学习的方法已在许多计算机视觉任务中取得了巨大成功,但其性能依赖于大量通常难以获得的密集标注样本。在本文中,我们关注从未标注数据中学习表示以用于语义分割的问题。受两种基于图像块的方法的启发,我们将拼图(Jigsaw Puzzle)问题表述为逐块分类过程,并用全卷积网络对其求解,从而开发了一种新颖的自监督学习框架。通过学习求解包含25个图像块的拼图问题,并将学习到的特征迁移到Cityscapes数据集的语义分割任务上,相对于从随机值初始化的基线模型,我们获得了5.8个百分点的提升。此外,实验表明我们的自监督学习方法可以应用于不同的数据集和模型。特别是,我们在PASCAL VOC2012数据集上仅使用明显更少的训练图像,就取得了与最先进方法相当的性能。
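A minimal sketch (an assumption, not the paper's code) of casting the 25-patch Jigsaw Puzzle as patch-wise classification: patches are shuffled, and the label at each grid slot is the original index of the patch placed there, so a fully convolutional network can predict a 5x5 grid of 25-way labels.

import torch

def make_jigsaw_sample(img: torch.Tensor, grid: int = 5):
    """img: (C, H, W) with H and W divisible by grid.
    Returns the shuffled image and a (grid*grid,) tensor of position labels."""
    c, h, w = img.shape
    ph, pw = h // grid, w // grid
    patches = img.unfold(1, ph, ph).unfold(2, pw, pw)          # (C, g, g, ph, pw)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(grid * grid, c, ph, pw)
    perm = torch.randperm(grid * grid)
    shuffled = patches[perm]
    rows = torch.cat([torch.cat(list(shuffled[r * grid:(r + 1) * grid]), dim=2)
                      for r in range(grid)], dim=1)
    return rows, perm                                          # perm[i]: original index at slot i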
22. Flow-based Generative Models for Learning Manifold to Manifold Mappings [PDF] 返回目录
Xingjian Zhen, Rudrasis Chakraborty, Liu Yang, Vikas Singh
Abstract: Many measurements or observations in computer vision and machine learning manifest as non-Euclidean data. While recent proposals (like spherical CNN) have extended a number of deep neural network architectures to manifold-valued data, and this has often provided strong improvements in performance, the literature on generative models for manifold data is quite sparse. Partly due to this gap, there are also no modality transfer/translation models for manifold-valued data whereas numerous such methods based on generative models are available for natural images. This paper addresses this gap, motivated by a need in brain imaging -- in doing so, we expand the operating range of certain generative models (as well as generative models for modality transfer) from natural images to images with manifold-valued measurements. Our main result is the design of a two-stream version of GLOW (flow-based invertible generative models) that can synthesize information of a field of one type of manifold-valued measurements given another. On the theoretical side, we introduce three kinds of invertible layers for manifold-valued data, which are not only analogous to their functionality in flow-based generative models (e.g., GLOW) but also preserve the key benefits (determinants of the Jacobian are easy to calculate). For experiments, on a large dataset from the Human Connectome Project (HCP), we show promising results where we can reliably and accurately reconstruct brain images of a field of orientation distribution functions (ODF) from diffusion tensor images (DTI), where the latter has a $5\times$ faster acquisition time but at the expense of worse angular resolution.
摘要:计算机视觉和机器学习中的许多测量或观察都表现为非欧几里得数据。尽管最近的工作(例如球面CNN)已将许多深度神经网络体系结构扩展到流形值数据,并且这通常带来了显著的性能提升,但关于流形数据生成模型的文献仍然很少。部分由于这一空白,也没有针对流形值数据的模态转换模型,而针对自然图像,基于生成模型的此类方法却很多。本文针对脑成像的需求来弥补这一空白:我们将某些生成模型(以及用于模态转换的生成模型)的适用范围从自然图像扩展到带有流形值测量的图像。我们的主要成果是设计了一个双流版本的GLOW(基于流的可逆生成模型),它可以在给定一种流形值测量场的情况下合成另一种流形值测量场的信息。在理论方面,我们为流形值数据引入了三种可逆层,它们不仅在功能上类似于基于流的生成模型(例如GLOW)中的对应层,而且保留了关键优点(雅可比行列式易于计算)。在实验方面,在人类连接组计划(HCP)的大型数据集上,我们展示了有希望的结果:可以从扩散张量图像(DTI)可靠而准确地重建方向分布函数(ODF)场的大脑图像,而DTI的采集时间快5倍,但代价是较差的角分辨率。
23. Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations [PDF] 返回目录
Adel Ahmadyan, Liangkai Zhang, Jianing Wei, Artsiom Ablavatski, Matthias Grundmann
Abstract: 3D object detection has recently become popular due to many applications in robotics, augmented reality, autonomy, and image retrieval. We introduce the Objectron dataset to advance the state of the art in 3D object detection and foster new research and applications, such as 3D object tracking, view synthesis, and improved 3D shape representation. The dataset contains object-centric short videos with pose annotations for nine categories and includes 4 million annotated images in 14,819 annotated videos. We also propose a new evaluation metric, 3D Intersection over Union, for 3D object detection. We demonstrate the usefulness of our dataset in 3D object detection tasks by providing baseline models trained on this dataset. Our dataset and evaluation source code are available online at http://www.objectron.dev
摘要:由于在机器人技术、增强现实、自主系统和图像检索中的众多应用,3D对象检测近来变得很流行。我们引入Objectron数据集,以推进3D对象检测的最新水平,并促进新的研究与应用,如3D对象跟踪、视图合成和改进的3D形状表示。该数据集包含以对象为中心的短视频,带有9个类别的姿态注释,在14,819个带注释的视频中包含400万张带注释的图像。我们还为3D对象检测提出了一种新的评估指标,即3D交并比(3D IoU)。通过提供在此数据集上训练的基线模型,我们证明了该数据集在3D对象检测任务中的有用性。我们的数据集和评估源代码可从http://www.objectron.dev在线获得。
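A hedged sketch of 3D Intersection over Union for axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax). The Objectron metric handles general oriented boxes, which requires polytope intersection; this simplified version only illustrates the definition.

def iou_3d_axis_aligned(a, b) -> float:
    """a, b: 6-tuples (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)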
24. Relightable 3D Head Portraits from a Smartphone Video [PDF] 返回目录
Artem Sevastopolsky, Savva Ignatiev, Gonzalo Ferrer, Evgeny Burnaev, Victor Lempitsky
Abstract: In this work, a system for creating a relightable 3D portrait of a human head is presented. Our neural pipeline operates on a sequence of frames captured by a smartphone camera with the flash blinking (flash-no flash sequence). A coarse point cloud reconstructed via structure-from-motion software and multi-view denoising is then used as a geometric proxy. Afterwards, a deep rendering network is trained to regress dense albedo, normals, and environmental lighting maps for arbitrary new viewpoints. Effectively, the proxy geometry and the rendering network constitute a relightable 3D portrait model, that can be synthesized from an arbitrary viewpoint and under arbitrary lighting, e.g. directional light, point light, or an environment map. The model is fitted to the sequence of frames with human face-specific priors that enforce the plausibility of albedo-lighting decomposition and operates at the interactive frame rate. We evaluate the performance of the method under varying lighting conditions and at the extrapolated viewpoints and compare with existing relighting methods.
摘要:在这项工作中,我们提出了一个用于创建可重光照的3D人头肖像的系统。我们的神经管线处理由智能手机相机在闪光灯闪烁情况下拍摄的帧序列(闪光-无闪光序列)。然后,将通过运动恢复结构软件和多视图降噪重建的粗糙点云用作几何代理。之后,训练一个深度渲染网络,为任意新视角回归稠密的反照率、法线和环境光照图。实际上,代理几何和渲染网络共同构成了一个可重光照的3D肖像模型,可以在任意视角和任意光照(例如定向光、点光源或环境贴图)下进行合成。该模型利用针对人脸的先验拟合到帧序列上,这些先验保证了反照率-光照分解的合理性,并以交互式帧率运行。我们评估了该方法在不同光照条件下以及外推视角下的性能,并与现有的重光照方法进行了比较。
25. Toward Transformer-Based Object Detection [PDF] 返回目录
Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk
Abstract: Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input, demonstrating that as compared to convolutional networks, transformer-based architectures can achieve competitive results on benchmark classification tasks. However, the computational complexity of the attention operator means that we are limited to low-resolution inputs. For more complex tasks such as detection or segmentation, maintaining a high input resolution is crucial to ensure that models can properly identify and reflect fine details in their output. This naturally raises the question of whether or not transformer-based architectures such as the Vision Transformer are capable of performing tasks other than classification. In this paper, we determine that Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results. The model that we propose, ViT-FRCNN, demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance. We also investigate improvements over a standard detection backbone, including superior performance on out-of-domain images, better performance on large objects, and a lessened reliance on non-maximum suppression. We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
摘要:由于能够在海量数据上进行预训练,然后通过微调迁移到更小、更具体的任务,Transformer已成为自然语言处理中的主导模型。Vision Transformer是将纯Transformer模型直接应用于图像输入的首次重大尝试,表明与卷积网络相比,基于Transformer的体系结构可以在基准分类任务上取得有竞争力的结果。但是,注意力算子的计算复杂度意味着我们只能使用低分辨率输入。对于检测或分割等更复杂的任务,保持高输入分辨率对于确保模型能够正确识别并在输出中反映精细细节至关重要。这自然引出了一个问题:诸如Vision Transformer之类的基于Transformer的体系结构是否能够执行分类以外的任务。在本文中,我们确定Vision Transformer可以作为骨干网络由通用检测任务头使用,以产生有竞争力的COCO结果。我们提出的模型ViT-FRCNN展示了Transformer的几种已知特性,包括大规模预训练能力和快速微调性能。我们还研究了相对于标准检测骨干网络的改进,包括在域外图像上的出色性能、在大型物体上的更好表现以及对非极大值抑制的依赖降低。我们将ViT-FRCNN视为通往目标检测等复杂视觉任务的纯Transformer解决方案的重要踏脚石。
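A minimal sketch (an assumption, not the paper's code) of the key interface in a ViT-based detector: the transformer's patch tokens are reshaped back into a 2D feature map so that a standard detection head (e.g. an RPN plus RoI head) can consume it like the output of a CNN backbone.

import torch

def tokens_to_feature_map(tokens: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """tokens: (B, 1 + grid_h*grid_w, D) transformer output with a leading [CLS] token.
    Returns a (B, D, grid_h, grid_w) feature map for the detection head."""
    patch_tokens = tokens[:, 1:, :]                      # drop the [CLS] token
    b, n, d = patch_tokens.shape
    assert n == grid_h * grid_w
    return patch_tokens.transpose(1, 2).reshape(b, d, grid_h, grid_w)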
26. Learning Compositional Radiance Fields of Dynamic Human Heads [PDF] 返回目录
Ziyan Wang, Timur Bagautdinov, Stephen Lombardi, Tomas Simon, Jason Saragih, Jessica Hodgins, Michael Zollhöfer
Abstract: Photorealistic rendering of dynamic humans is an important ability for telepresence systems, virtual shopping, synthetic data generation, and more. Recently, neural rendering methods, which combine techniques from computer graphics and machine learning, have created high-fidelity models of humans and objects. Some of these methods do not produce results with high-enough fidelity for driveable human models (Neural Volumes) whereas others have extremely long rendering times (NeRF). We propose a novel compositional 3D representation that combines the best of previous methods to produce both higher-resolution and faster results. Our representation bridges the gap between discrete and continuous volumetric representations by combining a coarse 3D-structure-aware grid of animation codes with a continuous learned scene function that maps every position and its corresponding local animation code to its view-dependent emitted radiance and local volume density. Differentiable volume rendering is employed to compute photo-realistic novel views of the human head and upper body as well as to train our novel representation end-to-end using only 2D supervision. In addition, we show that the learned dynamic radiance field can be used to synthesize novel unseen expressions based on a global animation code. Our approach achieves state-of-the-art results for synthesizing novel views of dynamic human heads and the upper body.
摘要:动态人体的真实感渲染是远程呈现系统、虚拟购物、合成数据生成等的重要能力。最近,结合了计算机图形学和机器学习技术的神经渲染方法已经创建了人和物体的高保真模型。这些方法中有些(如Neural Volumes)无法为可驱动的人体模型产生足够高保真度的结果,而另一些(如NeRF)则渲染时间极长。我们提出了一种新颖的组合式3D表示,它结合了先前方法的优点,能够产生更高分辨率且更快的结果。我们的表示通过将粗糙的、具有3D结构感知的动画编码网格与连续学习的场景函数相结合,弥合了离散与连续体积表示之间的鸿沟;该场景函数将每个位置及其对应的局部动画编码映射到与视角相关的发射辐射度和局部体密度。我们采用可微体渲染来计算人头和上半身的逼真新视角,并仅使用2D监督端到端地训练这一新颖表示。此外,我们表明,学习到的动态辐射场可用于基于全局动画编码合成前所未见的新表情。我们的方法在合成动态人头和上半身新视角方面取得了最先进的结果。
27. Attention-based Image Upsampling [PDF] 返回目录
Souvik Kundu, Hesham Mostafa, Sharath Nittur Sridhar, Sairam Sundaresan
Abstract: Convolutional layers are an integral part of many deep neural network solutions in computer vision. Recent work shows that replacing the standard convolution operation with mechanisms based on self-attention leads to improved performance on image classification and object detection tasks. In this work, we show how attention mechanisms can be used to replace another canonical operation: strided transposed convolution. We term our novel attention-based operation attention-based upsampling since it increases/upsamples the spatial dimensions of the feature maps. Through experiments on single image super-resolution and joint-image upsampling tasks, we show that attention-based upsampling consistently outperforms traditional upsampling methods based on strided transposed convolution or based on adaptive filters while using fewer parameters. We show that the inherent flexibility of the attention mechanism, which allows it to use separate sources for calculating the attention coefficients and the attention targets, makes attention-based upsampling a natural choice when fusing information from multiple image modalities.
摘要:卷积层是计算机视觉中许多深度神经网络解决方案不可或缺的一部分。最近的工作表明,用基于自注意力的机制替换标准卷积运算可以提高图像分类和对象检测任务的性能。在这项工作中,我们展示了如何使用注意力机制代替另一种规范操作:跨步转置卷积。我们将这种新颖的基于注意力的操作称为基于注意力的上采样,因为它增大/上采样了特征图的空间尺寸。通过在单图像超分辨率和联合图像上采样任务上的实验,我们表明基于注意力的上采样在使用更少参数的情况下,始终优于基于跨步转置卷积或自适应滤波器的传统上采样方法。我们表明,注意力机制固有的灵活性使其可以使用不同的来源来计算注意力系数和注意力目标,这使基于注意力的上采样成为融合多种图像模态信息时的自然选择。
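A hedged sketch of attention-based upsampling: instead of a strided transposed convolution, each output location is computed by attending over the low-resolution feature map. Using nearest-neighbour copies of the input as queries and global (rather than local) attention are simplifying assumptions made here for brevity.

import torch
import torch.nn.functional as F

def attention_upsample(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """x: (B, C, H, W). Returns (B, C, scale*H, scale*W)."""
    b, c, h, w = x.shape
    queries = F.interpolate(x, scale_factor=scale, mode="nearest")   # (B, C, sH, sW)
    q = queries.flatten(2).transpose(1, 2)                           # (B, sH*sW, C)
    kv = x.flatten(2).transpose(1, 2)                                # (B, H*W, C)
    attn = torch.softmax(q @ kv.transpose(1, 2) / c ** 0.5, dim=-1)  # (B, sH*sW, H*W)
    out = attn @ kv                                                  # (B, sH*sW, C)
    return out.transpose(1, 2).reshape(b, c, scale * h, scale * w)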
28. Exploring Motion Boundaries in an End-to-End Network for Vision-based Parkinson's Severity Assessment [PDF] 返回目录
Amirhossein Dadashzadeh, Alan Whone, Michal Rolinski, Majid Mirmehdi
Abstract: Evaluating neurological disorders such as Parkinson's disease (PD) is a challenging task that requires the assessment of several motor and non-motor functions. In this paper, we present an end-to-end deep learning framework to measure PD severity in two important components, hand movement and gait, of the Unified Parkinson's Disease Rating Scale (UPDRS). Our method leverages on an Inflated 3D CNN trained by a temporal segment framework to learn spatial and long temporal structure in video data. We also deploy a temporal attention mechanism to boost the performance of our model. Further, motion boundaries are explored as an extra input modality to assist in obfuscating the effects of camera motion for better movement assessment. We ablate the effects of different data modalities on the accuracy of the proposed network and compare with other popular architectures. We evaluate our proposed method on a dataset of 25 PD patients, obtaining 72.3% and 77.1% top-1 accuracy on hand movement and gait tasks respectively.
摘要:评估帕金森病(PD)等神经系统疾病是一项具有挑战性的任务,需要评估多种运动和非运动功能。在本文中,我们提出了一个端到端深度学习框架,用于在统一帕金森病评定量表(UPDRS)的两个重要组成部分——手部运动和步态——上衡量PD的严重程度。我们的方法利用由时间分段框架训练的Inflated 3D CNN来学习视频数据中的空间结构和长时程时间结构。我们还引入了时间注意力机制来提升模型性能。此外,我们探索将运动边界作为一种额外的输入模态,以帮助弱化相机运动的影响,从而实现更好的运动评估。我们通过消融实验分析了不同数据模态对所提网络准确率的影响,并与其他流行的体系结构进行了比较。我们在25名PD患者的数据集上评估了所提方法,分别在手部运动和步态任务上获得了72.3%和77.1%的top-1准确率。
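A minimal sketch (an assumption) of computing motion boundaries as the spatial gradient magnitude of a dense optical-flow field; such maps can serve as the extra input modality mentioned above, since they are less sensitive to uniform camera motion than raw flow.

import numpy as np

def motion_boundaries(flow: np.ndarray) -> np.ndarray:
    """flow: (H, W, 2) optical flow (u, v). Returns an (H, W) boundary map in [0, 1]."""
    du_dy, du_dx = np.gradient(flow[..., 0])
    dv_dy, dv_dx = np.gradient(flow[..., 1])
    mag = np.sqrt(du_dx ** 2 + du_dy ** 2 + dv_dx ** 2 + dv_dy ** 2)
    return mag / (mag.max() + 1e-8)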
29. Object Detection based on OcSaFPN in Aerial Images with Noise [PDF] 返回目录
Chengyuan Li, Jun Liu, Hailong Hong, Wenju Mao, Chenjie Wang, Chudi Hu, Xin Su, Bin Luo
Abstract: Taking the deep learning-based algorithms into account has become a crucial way to boost object detection performance in aerial images. While various neural network representations have been developed, previous works are still inefficient to investigate the noise-resilient performance, especially on aerial images with noise taken by the cameras with telephoto lenses, and most of the research is concentrated in the field of denoising. Of course, denoising usually requires an additional computational burden to obtain higher quality images, while noise-resilient is more of a description of the robustness of the network itself to different noises, which is an attribute of the algorithm itself. For this reason, the work will be started by analyzing the noise-resilient performance of the neural network, and then propose two hypotheses to build a noise-resilient structure. Based on these hypotheses, we compare the noise-resilient ability of the Oct-ResNet with frequency division processing and the commonly used ResNet. In addition, previous feature pyramid networks used for aerial object detection tasks are not specifically designed for the frequency division feature maps of the Oct-ResNet, and they usually lack attention to bridging the semantic gap between diverse feature maps from different depths. On the basis of this, a novel octave convolution-based semantic attention feature pyramid network (OcSaFPN) is proposed to get higher accuracy in object detection with noise. The proposed algorithm tested on three datasets demonstrates that the proposed OcSaFPN achieves a state-of-the-art detection performance with Gaussian noise or multiplicative noise. In addition, more experiments have proved that the OcSaFPN structure can be easily added to existing algorithms, and the noise-resilient ability can be effectively improved.
摘要:采用基于深度学习的算法已成为提升航空图像目标检测性能的关键途径。尽管已经发展出多种神经网络表示,但以往的工作仍然很少研究抗噪性能,尤其是在使用长焦镜头相机拍摄的含噪航空图像上,而且大部分研究集中在去噪领域。当然,去噪通常需要额外的计算负担才能获得更高质量的图像,而抗噪性更多地描述网络本身对不同噪声的鲁棒性,是算法自身的属性。因此,本工作首先分析神经网络的抗噪性能,然后提出两个用于构建抗噪结构的假设。基于这些假设,我们比较了采用频率划分处理的Oct-ResNet与常用ResNet的抗噪能力。此外,以往用于航空目标检测任务的特征金字塔网络并非专为Oct-ResNet的频率划分特征图设计,它们通常也不够重视弥合来自不同深度的特征图之间的语义鸿沟。在此基础上,我们提出了一种新颖的基于八度卷积的语义注意力特征金字塔网络(OcSaFPN),以在含噪情况下获得更高的目标检测精度。在三个数据集上的测试表明,所提出的OcSaFPN在高斯噪声或乘性噪声下均达到了最先进的检测性能。此外,更多实验证明OcSaFPN结构可以很容易地加入现有算法,并能有效提升其抗噪能力。
30. Separation and Concentration in Deep Networks [PDF] 返回目录
John Zarka, Florentin Guth, Stéphane Mallat
Abstract: Numerical experiments demonstrate that deep neural network classifiers progressively separate class distributions around their mean, achieving linear separability on the training set, and increasing the Fisher discriminant ratio. We explain this mechanism with two types of operators. We prove that a rectifier without biases applied to sign-invariant tight frames can separate class means and increase Fisher ratios. On the opposite, a soft-thresholding on tight frames can reduce within-class variabilities while preserving class means. Variance reduction bounds are proved for Gaussian mixture models. For image classification, we show that separation of class means can be achieved with rectified wavelet tight frames that are not learned. It defines a scattering transform. Learning $1 \times 1$ convolutional tight frames along scattering channels and applying a soft-thresholding reduces within-class variabilities. The resulting scattering network reaches the classification accuracy of ResNet-18 on CIFAR-10 and ImageNet, with fewer layers and no learned biases.
摘要:数值实验表明,深度神经网络分类器会逐渐围绕类均值分离各类分布,在训练集上实现线性可分,并提高Fisher判别比。我们用两类算子解释这一机制。我们证明,作用于符号不变紧框架的无偏置整流器可以分离类均值并提高Fisher比。相反,对紧框架系数进行软阈值处理可以在保留类均值的同时减少类内差异。我们为高斯混合模型证明了方差缩减界。对于图像分类,我们表明利用无需学习的经整流的小波紧框架即可实现类均值的分离,它定义了一个散射变换。沿散射通道学习$1 \times 1$卷积紧框架并应用软阈值可减少类内差异。由此得到的散射网络在CIFAR-10和ImageNet上达到了ResNet-18的分类精度,层数更少且没有学习到的偏置。
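A minimal illustration of the soft-thresholding operator discussed in the abstract: it shrinks coefficients toward zero, reducing within-class variability of frame coefficients, while large and consistent coefficients (which carry the class means) survive the shrinkage.

import torch

def soft_threshold(x: torch.Tensor, lam: float) -> torch.Tensor:
    """Elementwise soft-thresholding: sign(x) * max(|x| - lam, 0)."""
    return torch.sign(x) * torch.clamp(torch.abs(x) - lam, min=0.0)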
31. Improving 3D convolutional neural network comprehensibility via interactive visualization of relevance maps: Evaluation in Alzheimer's disease [PDF] 返回目录
Martin Dyrba, Moritz Hanzig, Slawek Altenstein, Sebastian Bader, Tommaso Ballarini, Frederic Brosseron, Katharina Buerger, Daniel Cantré, Peter Dechent, Laura Dobisch, Emrah Düzel, Michael Ewers, Klaus Fliessbach, Wenzel Glanz, John D. Haynes, Michael T. Heneka, Daniel Janowitz, Deniz Baris Keles, Ingo Kilimann, Christoph Laske, Franziska Maier, Coraline D. Metzger, Matthias H. Munk, Robert Perneczky, Oliver Peters, Lukas Preis, Josef Priller, Boris Rauchmann, Nina Roy, Klaus Scheffler, Anja Schneider, Björn H. Schott, Annika Spottke, Eike J. Spruth, Marc-André Weber, Birgit Ertl-Wagner, Michael Wagner, Jens Wiltfang, Frank Jessen, Stefan J. Teipel
Abstract: Although convolutional neural networks (CNN) achieve high diagnostic accuracy for detecting Alzheimer's disease (AD) dementia based on magnetic resonance imaging (MRI) scans, they are not yet applied in clinical routine. One important reason for this is a lack of model comprehensibility. Recently developed visualization methods for deriving CNN relevance maps may help to fill this gap. We investigated whether models with higher accuracy also rely more on discriminative brain regions predefined by prior knowledge. We trained a CNN for the detection of AD in N=663 T1-weighted MRI scans of patients with dementia and amnestic mild cognitive impairment (MCI) and verified the accuracy of the models via cross-validation and in three independent samples including N=1655 cases. We evaluated the association of relevance scores and hippocampus volume to validate the clinical utility of this approach. To improve model comprehensibility, we implemented an interactive visualization of 3D CNN relevance maps. Across three independent datasets, group separation showed high accuracy for AD dementia vs. controls (AUC$\geq$0.92) and moderate accuracy for MCI vs. controls (AUC$\approx$0.75). Relevance maps indicated that hippocampal atrophy was considered as the most informative factor for AD detection, with additional contributions from atrophy in other cortical and subcortical regions. Relevance scores within the hippocampus were highly correlated with hippocampal volumes (Pearson's r$\approx$-0.81). The relevance maps highlighted atrophy in regions that we had hypothesized a priori. This strengthens the comprehensibility of the CNN models, which were trained in a purely data-driven manner based on the scans and diagnosis labels. The high hippocampus relevance scores and high performance achieved in independent samples support the validity of the CNN models in the detection of AD-related MRI abnormalities.
摘要:尽管卷积神经网络(CNN)基于磁共振成像(MRI)扫描检测阿尔茨海默病(AD)痴呆时能达到很高的诊断准确率,但尚未在临床常规中应用。一个重要原因是模型缺乏可理解性。最近发展的用于导出CNN相关性图的可视化方法可能有助于填补这一空白。我们研究了准确率更高的模型是否也更依赖先验知识预先定义的可判别脑区。我们在N=663例痴呆及遗忘型轻度认知障碍(MCI)患者的T1加权MRI扫描上训练CNN以检测AD,并通过交叉验证以及包含N=1655例的三个独立样本验证了模型的准确性。我们评估了相关性得分与海马体积的关联,以验证该方法的临床实用性。为了提高模型可理解性,我们实现了3D CNN相关性图的交互式可视化。在三个独立数据集中,分组判别在AD痴呆与对照之间表现出高准确率(AUC≥0.92),在MCI与对照之间表现出中等准确率(AUC≈0.75)。相关性图表明,海马萎缩被视为AD检测中信息量最大的因素,其他皮质及皮质下区域的萎缩也有额外贡献。海马内的相关性得分与海马体积高度相关(Pearson's r≈-0.81)。相关性图突出了我们先验假设区域的萎缩。这增强了CNN模型的可理解性,而这些模型是仅基于扫描和诊断标签以纯数据驱动方式训练的。在独立样本中获得的高海马相关性得分和高性能支持了CNN模型在检测AD相关MRI异常中的有效性。
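For illustration, the hippocampus validation step described above (relating relevance inside the hippocampus to hippocampal volume via Pearson correlation) can be reproduced in miniature. The sketch below uses synthetic arrays and a hypothetical helper name (hippocampus_relevance_score); it is not the authors' code, and the data are random stand-ins.

```python
import numpy as np
from scipy.stats import pearsonr

def hippocampus_relevance_score(relevance_map, hippocampus_mask):
    """Sum of positive relevance inside a binary hippocampus mask (both 3D arrays)."""
    return float(np.clip(relevance_map, 0, None)[hippocampus_mask].sum())

# Toy data standing in for per-subject relevance maps, masks and volumes.
rng = np.random.default_rng(0)
scores, volumes = [], []
mask = np.zeros((32, 32, 32), dtype=bool)
mask[10:20, 10:20, 10:20] = True                      # hypothetical hippocampus ROI
for _ in range(20):
    vol = rng.uniform(2000, 4000)                     # hippocampal volume in mm^3
    rel = rng.random((32, 32, 32)) * (5000 - vol)     # relevance grows as volume shrinks
    scores.append(hippocampus_relevance_score(rel, mask))
    volumes.append(vol)

r, p = pearsonr(scores, volumes)                      # expected to be strongly negative
print(f"Pearson r = {r:.2f} (p = {p:.2g})")
```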
32. Multimodal Transfer Learning-based Approaches for Retinal Vascular Segmentation [PDF] 返回目录
José Morano, Álvaro S. Hervella, Noelia Barreira, Jorge Novo, José Rouco
Abstract: In ophthalmology, the study of the retinal microcirculation is a key issue in the analysis of many ocular and systemic diseases, like hypertension or diabetes. This motivates the research on improving the retinal vasculature segmentation. Nowadays, Fully Convolutional Neural Networks (FCNs) usually represent the most successful approach to image segmentation. However, the success of these models is conditioned by an adequate selection and adaptation of the architectures and techniques used, as well as the availability of a large amount of annotated data. These two issues become especially relevant when applying FCNs to medical image segmentation as, first, the existing models are usually adjusted from broad domain applications over photographic images, and second, the amount of annotated data is usually scarcer. In this work, we present multimodal transfer learning-based approaches for retinal vascular segmentation, performing a comparative study of recent FCN architectures. In particular, to overcome the annotated data scarcity, we propose the novel application of self-supervised network pretraining that takes advantage of existing unlabelled multimodal data. The results demonstrate that the self-supervised pretrained networks obtain significantly better vascular masks with less training in the target task, independently of the network architecture, and that some FCN architecture advances motivated by broad domain applications do not translate into significant improvements over the vasculature segmentation task.
摘要:在眼科领域,视网膜微循环的研究是分析许多眼部和全身性疾病(如高血压或糖尿病)的关键问题。这推动了改进视网膜血管分割的研究。如今,全卷积神经网络(FCN)通常代表最成功的图像分割方法。然而,这些模型的成功取决于对所用架构和技术的适当选择与调整,以及大量标注数据的可用性。将FCN应用于医学图像分割时,这两个问题尤为突出:首先,现有模型通常是针对摄影图像的广域应用进行调整的;其次,标注数据通常更为稀缺。在这项工作中,我们提出了基于多模态迁移学习的视网膜血管分割方法,并对最新的FCN架构进行了比较研究。特别是,为了克服标注数据稀缺的问题,我们提出了自监督网络预训练的新应用,以利用现有的未标注多模态数据。结果表明,无论网络架构如何,自监督预训练的网络在目标任务中只需较少的训练即可获得明显更好的血管分割掩模,而某些面向广域应用的FCN架构改进并未在血管分割任务上带来显著提升。
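A minimal sketch of the two-stage idea described above: first pretrain a small FCN, without vessel labels, to predict a second unlabelled modality from the colour fundus image, then transfer the encoder and fine-tune on the annotated vessel-segmentation task. The tiny architecture, tensor shapes and single optimisation steps below are toy stand-ins under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Minimal encoder-decoder standing in for a full segmentation FCN (e.g. a U-Net)."""
    def __init__(self, out_channels):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(32, out_channels, 1)

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Stage 1: self-supervised multimodal pretraining (no vessel annotations needed).
pretrain_net = TinyFCN(out_channels=1)
opt = torch.optim.Adam(pretrain_net.parameters(), lr=1e-3)
fundus = torch.rand(4, 3, 64, 64)            # unlabelled retinography batch (toy data)
other_modality = torch.rand(4, 1, 64, 64)    # paired unlabelled modality (toy data)
F.mse_loss(pretrain_net(fundus), other_modality).backward()
opt.step()

# Stage 2: transfer the pretrained encoder and fine-tune for vessel segmentation.
seg_net = TinyFCN(out_channels=1)
seg_net.encoder.load_state_dict(pretrain_net.encoder.state_dict())
opt = torch.optim.Adam(seg_net.parameters(), lr=1e-4)
vessel_mask = (torch.rand(4, 1, 64, 64) > 0.5).float()   # toy annotated masks
F.binary_cross_entropy_with_logits(seg_net(fundus), vessel_mask).backward()
opt.step()
```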
33. Spectral Reflectance Estimation Using Projector with Unknown Spectral Power Distribution [PDF] 返回目录
Hironori Hidaka, Yusuke Monno, Masatoshi Okutomi
Abstract: A lighting-based multispectral imaging system using an RGB camera and a projector is one of the most practical and low-cost systems to acquire multispectral observations for estimating the scene's spectral reflectance information. However, existing projector-based systems assume that the spectral power distribution (SPD) of each projector primary is known, which requires additional equipment such as a spectrometer to measure the SPD. In this paper, we present a method for jointly estimating the spectral reflectance and the SPD of each projector primary. In addition to adopting a common spectral reflectance basis model, we model the projector's SPD by a low-dimensional model using basis functions obtained by a newly collected projector's SPD database. Then, the spectral reflectances and the projector's SPDs are alternatively estimated based on the basis models. We experimentally show the performance of our joint estimation using a different number of projected illuminations and investigate the potential of the spectral reflectance estimation using a projector with unknown SPD.
摘要:使用RGB相机和投影仪的基于照明的多光谱成像系统,是获取多光谱观测以估计场景光谱反射率信息的最实用、最低成本的系统之一。然而,现有的基于投影仪的系统假定每个投影仪原色(primary)的光谱功率分布(SPD)是已知的,这需要诸如光谱仪之类的附加设备来测量SPD。在本文中,我们提出了一种联合估计光谱反射率和每个投影仪原色SPD的方法。除了采用常用的光谱反射率基模型之外,我们还利用由新收集的投影仪SPD数据库得到的基函数,用低维模型对投影仪的SPD进行建模。然后,基于这些基模型交替估计光谱反射率和投影仪的SPD。我们通过实验展示了使用不同数量投影照明时联合估计的性能,并研究了使用SPD未知的投影仪进行光谱反射率估计的潜力。
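The alternating estimation can be illustrated with a single-pixel toy example: both the reflectance and each projector primary's SPD are expressed in low-dimensional bases, and the two coefficient sets are solved in turn by linear least squares. The camera sensitivities and basis matrices below are random stand-ins (the real method uses measured sensitivities, a reflectance basis, an SPD basis learned from a projector database, and many pixels), and the global scale ambiguity between reflectance and illumination is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 31                       # number of wavelength samples (e.g. 400-700 nm, 10 nm steps)
S = rng.random((W, 3))       # camera spectral sensitivities (toy stand-in)
B_r = rng.random((W, 6))     # reflectance basis functions (toy stand-in)
B_l = rng.random((W, 4))     # projector-SPD basis functions (toy stand-in)

# Ground truth, used only to synthesise the RGB observations under each primary.
a_true = rng.random(6)                         # reflectance coefficients
b_true = rng.random((4, 3))                    # SPD coefficients, one column per primary
r_true = B_r @ a_true
obs = np.stack([S.T @ ((B_l @ b_true[:, k]) * r_true) for k in range(3)])  # 3x3 RGB

a, b = rng.random(6), rng.random((4, 3))       # initial guesses
for _ in range(50):                            # alternating least squares
    # 1) fix the SPDs, solve for the reflectance coefficients.
    M = np.vstack([S.T @ (np.diag(B_l @ b[:, k]) @ B_r) for k in range(3)])
    a, *_ = np.linalg.lstsq(M, obs.reshape(-1), rcond=None)
    # 2) fix the reflectance, solve for each primary's SPD coefficients.
    r = B_r @ a
    for k in range(3):
        Mk = S.T @ (np.diag(r) @ B_l)
        b[:, k], *_ = np.linalg.lstsq(Mk, obs[k], rcond=None)

print("reflectance RMSE:", np.sqrt(np.mean((B_r @ a - r_true) ** 2)))
```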
34. A Surrogate Lagrangian Relaxation-based Model Compression for Deep Neural Networks [PDF] 返回目录
Deniz Gurevin, Shanglin Zhou, Lynn Pepin, Bingbing Li, Mikhail Bragin, Caiwen Ding, Fei Miao
Abstract: Network pruning is a widely used technique to reduce computation cost and model size for deep neural networks. However, the typical three-stage pipeline, i.e., training, pruning and retraining (fine-tuning) significantly increases the overall training trails. For instance, the retraining process could take up to 80 epochs for ResNet-18 on ImageNet, that is 70% of the original model training trails. In this paper, we develop a systematic weight-pruning optimization approach based on Surrogate Lagrangian relaxation (SLR), which is tailored to overcome difficulties caused by the discrete nature of the weight-pruning problem while ensuring fast convergence. We decompose the weight-pruning problem into subproblems, which are coordinated by updating Lagrangian multipliers. Convergence is then accelerated by using quadratic penalty terms. We evaluate the proposed method on image classification tasks, i.e., ResNet-18, ResNet-50 and VGG-16 using ImageNet and CIFAR-10, as well as object detection tasks, i.e., YOLOv3 and YOLOv3-tiny using COCO 2014, PointPillars using KITTI 2017, and Ultra-Fast-Lane-Detection using TuSimple lane detection dataset. Numerical testing results demonstrate that with the adoption of the Surrogate Lagrangian Relaxation method, our SLR-based weight-pruning optimization approach achieves a high model accuracy even at the hard-pruning stage without retraining for many epochs, such as on PointPillars object detection model on KITTI dataset where we achieve 9.44x compression rate by only retraining for 3 epochs with less than 1% accuracy loss. As the compression rate increases, SLR starts to perform better than ADMM and the accuracy gap between them increases. SLR achieves 15.2% better accuracy than ADMM on PointPillars after pruning under 9.49x compression. Given a limited budget of retraining epochs, our approach quickly recovers the model accuracy.
摘要:网络剪枝是减少深度神经网络计算成本和模型大小的一种广泛使用的技术。然而,典型的三阶段流程,即训练、剪枝和再训练(微调),会显著增加整体训练量。例如,在ImageNet上,ResNet-18的再训练过程最多可能需要80个epoch,即原始模型训练量的70%。在本文中,我们开发了一种基于代理拉格朗日松弛(SLR)的系统性权重剪枝优化方法,该方法旨在克服权重剪枝问题的离散性所带来的困难,同时确保快速收敛。我们将权重剪枝问题分解为若干子问题,通过更新拉格朗日乘子来协调这些子问题,再利用二次惩罚项加速收敛。我们在图像分类任务(使用ImageNet和CIFAR-10的ResNet-18、ResNet-50和VGG-16)以及目标检测任务(使用COCO 2014的YOLOv3和YOLOv3-tiny、使用KITTI 2017的PointPillars、使用TuSimple车道检测数据集的Ultra-Fast-Lane-Detection)上评估了所提出的方法。数值测试结果表明,采用代理拉格朗日松弛方法后,我们基于SLR的权重剪枝优化方法即使在硬剪枝阶段也能达到较高的模型精度,而无需多个epoch的再训练;例如,在KITTI数据集上的PointPillars目标检测模型中,我们仅再训练3个epoch即可达到9.44倍的压缩率,精度损失不到1%。随着压缩率的提高,SLR开始优于ADMM,两者之间的精度差距也随之增大。在9.49倍压缩下剪枝后,SLR在PointPillars上的精度比ADMM高15.2%。在再训练epoch预算有限的情况下,我们的方法可以快速恢复模型精度。
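The decomposition described above can be sketched for a single layer as an ADMM-style loop, which is a simplification of the paper's SLR updates: a dense copy of the weights is pulled toward a sparse auxiliary copy through a quadratic penalty, the auxiliary copy is projected onto the sparsity constraint, and Lagrangian multipliers coordinate the two subproblems. The weights, keep ratio and step sizes below are illustrative only; in a real pipeline the first subproblem is an SGD step on the task loss plus the penalty term.

```python
import numpy as np

def project_to_sparsity(w, keep_ratio):
    """Hard-threshold: keep the largest-magnitude fraction of weights, zero the rest."""
    k = max(1, int(keep_ratio * w.size))
    thresh = np.partition(np.abs(w).ravel(), -k)[-k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))   # stand-in for one trained layer's weights
z = w.copy()                    # auxiliary copy, constrained to be sparse
u = np.zeros_like(w)            # Lagrangian multipliers
rho = 1e-2                      # quadratic-penalty coefficient

for _ in range(100):
    # Subproblem 1: pull the dense weights toward the sparse copy
    # (gradient step on (rho/2) * ||w - z + u||^2; the task loss is omitted here).
    w = w - rho * (w - z + u)
    # Subproblem 2: project onto the sparsity constraint (keep 10% of the weights).
    z = project_to_sparsity(w + u, keep_ratio=0.10)
    # Multiplier update coordinating the two subproblems.
    u = u + w - z

print("nonzero fraction after hard pruning:", np.mean(project_to_sparsity(w, 0.10) != 0.0))
```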
35. Information-Preserving Contrastive Learning for Self-Supervised Representations [PDF] 返回目录
Tianhong Li, Lijie Fan, Yuan Yuan, Hao He, Yonglong Tian, Dina Katabi
Abstract: Contrastive learning is very effective at learning useful representations without supervision. Yet contrastive learning has its limitations. It can learn a shortcut that is irrelevant to the downstream task, and discard relevant information. Past work has addressed this limitation via custom data augmentations that eliminate the shortcut. This solution however does not work for data modalities that are not interpretable by humans, e.g., radio signals. For such modalities, it is hard for a human to guess which shortcuts may exist in the signal, or how to alter the radio signals to eliminate the shortcuts. Even for visual data, sometimes eliminating the shortcut may be undesirable. The shortcut may be irrelevant to one downstream task but important to another. In this case, it is desirable to learn a representation that captures both the shortcut information and the information relevant to the other downstream task. This paper presents information-preserving contrastive learning (IPCL), a new framework for unsupervised representation learning that preserves relevant information even in the presence of shortcuts. We empirically show that IPCL addresses the above problems and outperforms contrastive learning on radio signals and learning RGB data representation with different features that support different downstream tasks.
摘要:对比学习在无监督地学习有用表示方面非常有效。然而,对比学习也有其局限性。它可能学到与下游任务无关的捷径,并丢弃相关信息。以往的工作通过消除捷径的自定义数据增强来解决这一限制。然而,该解决方案不适用于人类无法直观理解的数据模态,例如无线电信号。对于这类模态,人们很难猜测信号中可能存在哪些捷径,或如何改变无线电信号以消除捷径。即使对于视觉数据,有时消除捷径也可能是不可取的:捷径可能与某一下游任务无关,但对另一任务很重要。在这种情况下,我们希望学习一种既捕获捷径信息又捕获与其他下游任务相关信息的表示。本文提出了信息保持对比学习(IPCL),这是一种新的无监督表示学习框架,即使在存在捷径的情况下也能保留相关信息。实验表明,IPCL解决了上述问题,并且无论是在无线电信号上,还是在学习以不同特征支持不同下游任务的RGB数据表示上,均优于对比学习。
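For context, the standard contrastive (InfoNCE) objective that such frameworks build on, and that can latch onto shortcut features, looks roughly like the sketch below. This is the generic baseline loss, not the IPCL objective itself.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE: each embedding in z1 should match its positive in z2 and repel
    every other sample in the batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature          # pairwise cosine similarities
    targets = torch.arange(z1.size(0))        # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: embeddings of two augmented views of the same batch of samples.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce_loss(z1, z2).item())
```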
36. Treadmill Assisted Gait Spoofing (TAGS): An Emerging Threat to wearable Sensor-based Gait Authentication [PDF] 返回目录
Rajesh Kumar, Can Isik, Vir V Phoha
Abstract: In this work, we examine the impact of Treadmill Assisted Gait Spoofing (TAGS) on Wearable Sensor-based Gait Authentication (WSGait). We consider more realistic implementation and deployment scenarios than the previous study, which focused only on the accelerometer sensor and a fixed set of features. Specifically, we consider the situations in which the implementation of WSGait could be using one or more sensors embedded into modern smartphones. Besides, it could be using different sets of features or different classification algorithms, or both. Despite the use of a variety of sensors, feature sets (ranked by mutual information), and six different classification algorithms, TAGS was able to increase the average False Accept Rate (FAR) from 4% to 26%. Such a considerable increase in the average FAR, especially under the stringent implementation and deployment scenarios considered in this study, calls for a further investigation into the design of evaluations of WSGait before its deployment for public use.
摘要:在这项工作中,我们研究了跑步机辅助步态欺骗(TAGS)对基于可穿戴传感器的步态认证(WSGait)的影响。与仅关注加速度计传感器和固定特征集的先前研究相比,我们考虑了更贴近实际的实现和部署场景。具体来说,我们考虑了WSGait的实现可能使用嵌入在现代智能手机中的一个或多个传感器的情况;此外,它可能使用不同的特征集或不同的分类算法,或两者兼有。尽管使用了多种传感器、按互信息排序的特征集以及六种不同的分类算法,TAGS仍能将平均错误接受率(FAR)从4%提高到26%。平均FAR的如此大幅增加,尤其是在本研究考虑的严格实现和部署场景下,要求在将WSGait部署供公众使用之前,对其评估设计进行进一步研究。
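The evaluation ingredients mentioned above (feature sets ranked by mutual information, a classifier, and the False Accept Rate as the attack metric) can be approximated with scikit-learn as in the sketch below. The random data, feature count, train/test split and classifier choice are placeholders, not the study's actual protocol.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))        # gait feature vectors (toy data)
y = rng.integers(0, 2, size=300)      # 1 = genuine user, 0 = impostor (toy labels)

# Rank features by mutual information with the genuine/impostor label.
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:10]       # keep the 10 most informative features

clf = RandomForestClassifier(random_state=0).fit(X[:200][:, top], y[:200])
pred = clf.predict(X[200:][:, top])
true = y[200:]

# False Accept Rate: fraction of impostor attempts classified as genuine.
impostor = true == 0
far = np.mean(pred[impostor] == 1) if impostor.any() else float("nan")
print(f"FAR = {far:.2%}")
```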
37. Fast 3-dimensional estimation of the Foveal Avascular Zone from OCTA [PDF] 返回目录
Giovanni Ometto, Giovanni Montesano, Usha Chakravarthy, Frank Kee, Ruth E. Hogg, David P. Crabb
Abstract: The area of the foveal avascular zone (FAZ) from en face images of optical coherence tomography angiography (OCTA) is one of the most common measurements based on this technology. However, its use in clinic is limited by the high variation of the FAZ area across normal subjects, while the calculation of the volumetric measurement of the FAZ is limited by the high noise that characterizes OCTA scans. We designed an algorithm that exploits the higher signal-to-noise ratio of en face images to efficiently identify the capillary network of the inner retina in 3 dimensions (3D), under the assumption that the capillaries in separate plexuses do not overlap. The network is then processed with morphological operations to identify the 3D FAZ within the bounding segmentations of the inner retina. The FAZ volume and area in different plexuses were calculated for a dataset of 430 eyes. Then, the measurements were analyzed using linear mixed effect models to identify differences between three groups of eyes: healthy, diabetic without diabetic retinopathy (DR) and diabetic with DR. Results showed significant differences in the FAZ volume between the different groups but not in the area measurements. These results suggest that the volumetric FAZ could be a better diagnostic marker than the planar FAZ area. The efficient methodology that we introduced could allow the fast calculation of the FAZ volume in clinics, as well as providing the 3D segmentation of the capillary network of the inner retina.
摘要:从光学相干断层扫描血管造影(OCTA)的正面(en face)图像中测量中央凹无血管区(FAZ)的面积,是基于该技术最常见的测量之一。然而,其临床应用受限于正常受试者间FAZ面积的高度变异,而FAZ体积测量的计算则受限于OCTA扫描特有的高噪声。我们设计了一种算法,在假设不同血管丛中的毛细血管互不重叠的前提下,利用正面图像较高的信噪比,高效地在三维(3D)中识别视网膜内层的毛细血管网络。随后通过形态学运算处理该网络,以在视网膜内层的边界分割内识别3D FAZ。我们针对430只眼的数据集计算了不同血管丛中的FAZ体积和面积。然后,使用线性混合效应模型分析测量值,以识别三组眼睛之间的差异:健康组、无糖尿病性视网膜病变(DR)的糖尿病组和伴DR的糖尿病组。结果表明,不同组之间的FAZ体积存在显著差异,而面积测量则无显著差异。这些结果表明,体积FAZ可能比平面FAZ面积是更好的诊断指标。我们提出的高效方法可以在临床中快速计算FAZ体积,并提供视网膜内层毛细血管网络的3D分割。
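A rough sketch of the morphological step: starting from a binary 3D capillary volume, close small gaps in the network, invert it, and keep the connected avascular component containing the foveal centre as the candidate 3D FAZ. The toy volume, structuring element and voxel size below are hypothetical; the authors' pipeline builds the capillary volume from the en face OCTA images of the separate plexuses.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
vessels = rng.random((64, 64, 8)) > 0.6      # toy binary capillary volume (True = vessel)
vessels[24:40, 24:40, :] = False             # carve out a toy avascular region at the centre

# Close small gaps in the capillary network, then look at the avascular space.
closed = ndimage.binary_closing(vessels, structure=np.ones((3, 3, 1)))
avascular = ~closed

# Keep the connected avascular component containing the foveal centre
# (in this toy volume the centre voxel is known to be avascular).
labels, _ = ndimage.label(avascular)
centre = tuple(s // 2 for s in labels.shape)
faz = labels == labels[centre]

voxel_mm3 = 0.01 * 0.01 * 0.005              # hypothetical voxel size in mm^3
print("FAZ volume (mm^3):", faz.sum() * voxel_mm3)
print("FAZ en-face area (mm^2):", faz.any(axis=2).sum() * 0.01 * 0.01)
```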
注:中文为机器翻译结果!封面为论文标题词云图!