[arXiv Papers] Computer Vision and Pattern Recognition 2020-10-20

Contents

1. PseudoSeg: Designing Pseudo Labels for Semantic Segmentation [PDF] Abstract
2. Self-supervised Co-training for Video Representation Learning [PDF] Abstract
3. Multiple Pedestrians and Vehicles Tracking in Aerial Imagery: A Comprehensive Study [PDF] Abstract
4. Detecting Hands and Recognizing Physical Contact in the Wild [PDF] Abstract
5. Multi-Stage Fusion for One-Click Segmentation [PDF] Abstract
6. Attention Augmented ConvLSTM for Environment Prediction [PDF] Abstract
7. Multi-Modal Super Resolution for Dense Microscopic Particle Size Estimation [PDF] Abstract
8. Multiclass Wound Image Classification using an Ensemble Deep CNN-based Classifier [PDF] Abstract
9. Brain Atlas Guided Attention U-Net for White Matter Hyperintensity Segmentation [PDF] Abstract
10. Learning to Reconstruct and Segment 3D Objects [PDF] Abstract
11. Teacher-Student Competition for Unsupervised Domain Adaption [PDF] Abstract
12. On the Generalisation Capabilities of Fingerprint Presentation Attack Detection Methods in the Short Wave Infrared Domain [PDF] Abstract
13. Domain Generalized Person Re-Identification via Cross-Domain Episodic Learning [PDF] Abstract
14. A Versatile Crack Inspection Portable System based on Classifier Ensemble and Controlled Illumination [PDF] Abstract
15. RONELD: Robust Neural Network Output Enhancement for Active Lane Detection [PDF] Abstract
16. Weakly-supervised Learning For Catheter Segmentation in 3D Frustum Ultrasound [PDF] Abstract
17. Deep Multi-path Network Integrating Incomplete Biomarker and Chest CT Data for Evaluating Lung Cancer Risk [PDF] Abstract
18. Emerging Trends of Multimodal Research in Vision and Language [PDF] Abstract
19. A Backbone Replaceable Fine-tuning Network for Stable Face Alignment [PDF] Abstract
20. Softer Pruning, Incremental Regularization [PDF] Abstract
21. Noisy-LSTM: Improving Temporal Awareness for Video Semantic Segmentation [PDF] Abstract
22. Synthesizing the Unseen for Zero-shot Object Detection [PDF] Abstract
23. Comprehensive evaluation of no-reference image quality assessment algorithms on KADID-10k database [PDF] Abstract
24. Image Captioning with Visual Object Representations Grounded in the Textual Modality [PDF] Abstract
25. SD-DefSLAM: Semi-Direct Monocular SLAM for Deformable and Intracorporeal Scenes [PDF] Abstract
26. A combined full-reference image quality assessment method based on convolutional activation maps [PDF] Abstract
27. SHREC 2020 track: 6D Object Pose Estimation [PDF] Abstract
28. The efficacy of Neural Planning Metrics: A meta-analysis of PKL on nuScenes [PDF] Abstract
29. SelfVoxeLO: Self-supervised LiDAR Odometry with Voxel-based Deep Neural Networks [PDF] Abstract
30. SMA-STN: Segmented Movement-Attending Spatiotemporal Network for Micro-Expression Recognition [PDF] Abstract
31. Semantic-Guided Inpainting Network for Complex Urban Scenes Manipulation [PDF] Abstract
32. A Two-stage Unsupervised Approach for Low light Image Enhancement [PDF] Abstract
33. Language and Visual Entity Relationship Graph for Agent Navigation [PDF] Abstract
34. Double-Uncertainty Weighted Method for Semi-supervised Learning [PDF] Abstract
35. Bi-Real Net V2: Rethinking Non-linearity for 1-bit CNNs and Going Beyond [PDF] Abstract
36. Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition [PDF] Abstract
37. Modality-Pairing Learning for Brain Tumor Segmentation [PDF] Abstract
38. DeepReflecs: Deep Learning for Automotive Object Classification with Radar Reflections [PDF] Abstract
39. Extraction of Discrete Spectra Modes from Video Data Using a Deep Convolutional Koopman Network [PDF] Abstract
40. MCGKT-Net: Multi-level Context Gating Knowledge Transfer Network for Single Image Deraining [PDF] Abstract
41. Continual Unsupervised Domain Adaptation with Adversarial Learning [PDF] Abstract
42. Intelligent Reference Curation for Visual Place Recognition via Bayesian Selective Fusion [PDF] Abstract
43. Self-Supervised Visual Attention Learning for Vehicle Re-Identification [PDF] Abstract
44. Unsupervised Domain Adaptation for Spatio-Temporal Action Localization [PDF] Abstract
45. Rotation Invariant Aerial Image Retrieval with Group Convolutional Metric Learning [PDF] Abstract
46. MaskNet: A Fully-Convolutional Network to Estimate Inlier Points [PDF] Abstract
47. Gaussian Constrained Attention Network for Scene Text Recognition [PDF] Abstract
48. Localized Interactive Instance Segmentation [PDF] Abstract
49. Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering [PDF] Abstract
50. Movement-induced Priors for Deep Stereo [PDF] Abstract
51. Variational Capsule Encoder [PDF] Abstract
52. View-Invariant Gait Recognition with Attentive Recurrent Learning of Partial Representations [PDF] Abstract
53. Gait Recognition using Multi-Scale Partial Representation Transformation with Capsules [PDF] Abstract
54. Graphite: GRAPH-Induced feaTure Extraction for Point Cloud Registration [PDF] Abstract
55. RADIATE: A Radar Dataset for Automotive Perception [PDF] Abstract
56. Multimodal semantic forecasting based on conditional generation of future features [PDF] Abstract
57. Exploiting Context for Robustness to Label Noise in Active Learning [PDF] Abstract
58. Deep Structured Prediction for Facial Landmark Detection [PDF] Abstract
59. Covapixels [PDF] Abstract
60. FGAGT: Flow-Guided Adaptive Graph Tracking [PDF] Abstract
61. Image-based Automated Species Identification: Can Virtual Data Augmentation Overcome Problems of Insufficient Sampling? [PDF] Abstract
62. Multiple Future Prediction Leveraging Synthetic Trajectories [PDF] Abstract
63. Temporal Binary Representation for Event-Based Action Recognition [PDF] Abstract
64. Distortion-aware Monocular Depth Estimation for Omnidirectional Images [PDF] Abstract
65. Boosting High-Level Vision with Joint Compression Artifacts Reduction and Super-Resolution [PDF] Abstract
66. Sensitivity and Specificity Evaluation of Deep Learning Models for Detection of Pneumoperitoneum on Chest Radiographs [PDF] Abstract
67. Finding Physical Adversarial Examples for Autonomous Driving with Fast and Differentiable Image Compositing [PDF] Abstract
68. A Grid-based Representation for Human Action Recognition [PDF] Abstract
69. Efficient and Compact Convolutional Neural Network Architectures for Non-temporal Real-time Fire Detection [PDF] Abstract
70. Directed Variational Cross-encoder Network for Few-shot Multi-image Co-segmentation [PDF] Abstract
71. Discovering Pattern Structure Using Differentiable Compositing [PDF] Abstract
72. The NVIDIA PilotNet Experiments [PDF] Abstract
73. DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement [PDF] Abstract
74. Gradient Aware Cascade Network for Multi-Focus Image Fusion [PDF] Abstract
75. Self-Selective Context for Interaction Recognition [PDF] Abstract
76. Picture-to-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images [PDF] Abstract
77. Robust Face Alignment by Multi-order High-precision Hourglass Network [PDF] Abstract
78. PolarDet: A Fast, More Precise Detector for Rotated Target in Aerial Images [PDF] Abstract
79. A Self-supervised Cascaded Refinement Network for Point Cloud Completion [PDF] Abstract
80. CQ-VAE: Coordinate Quantized VAE for Uncertainty Estimation with Application to Disk Shape Analysis from Lumbar Spine MRI Images [PDF] Abstract
81. Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering [PDF] Abstract
82. DEAL: Difficulty-aware Active Learning for Semantic Segmentation [PDF] Abstract
83. Automatic Tree Ring Detection using Jacobi Sets [PDF] Abstract
84. MeshMVS: Multi-View Stereo Guided Mesh Reconstruction [PDF] Abstract
85. Long-Term Face Tracking for Crowded Video-Surveillance Scenarios [PDF] Abstract
86. Active Domain Adaptation via Clustering Uncertainty-weighted Embeddings [PDF] Abstract
87. Generalized Intersection Algorithms with Fixpoints for Image Decomposition Learning [PDF] Abstract
88. Ensembling Low Precision Models for Binary Biomedical Image Segmentation [PDF] Abstract
89. Zoom-CAM: Generating Fine-grained Pixel Annotations from Image Labels [PDF] Abstract
90. A Survey of Machine Learning Techniques in Adversarial Image Forensics [PDF] Abstract
91. RobustBench: a standardized adversarial robustness benchmark [PDF] Abstract
92. Agent-based Simulation Model and Deep Learning Techniques to Evaluate and Predict Transportation Trends around COVID-19 [PDF] Abstract
93. Optimism in the Face of Adversity: Understanding and Improving Deep Learning through Adversarial Robustness [PDF] Abstract
94. Practical Frank-Wolfe algorithms [PDF] Abstract
95. GASNet: Weakly-supervised Framework for COVID-19 Lesion Segmentation [PDF] Abstract
96. Measuring breathing induced oesophageal motion and its dosimetric impact [PDF] Abstract
97. Semantic Histogram Based Graph Matching for Real-Time Multi-Robot Global Localization in Large Scale Environment [PDF] Abstract
98. Meta-learning the Learning Trends Shared Across Tasks [PDF] Abstract
99. MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch Normalization [PDF] Abstract
100. Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders [PDF] Abstract
101. Unsupervised Foveal Vision Neural Networks with Top-Down Attention [PDF] Abstract
102. Feature Importance Ranking for Deep Learning [PDF] Abstract
103. Shape Constrained CNN for Cardiac MR Segmentation with Simultaneous Prediction of Shape and Pose Parameters [PDF] Abstract
104. Light Stage Super-Resolution: Continuous High-Frequency Relighting [PDF] Abstract
105. Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning [PDF] Abstract
106. A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences [PDF] Abstract
107. Class-incremental Learning with Pre-allocated Fixed Classifiers [PDF] Abstract
108. A general approach to compute the relevance of middle-level input features [PDF] Abstract
109. CT Image Segmentation for Inflamed and Fibrotic Lungs Using a Multi-Resolution Convolutional Neural Network [PDF] Abstract

Abstracts

1. PseudoSeg: Designing Pseudo Labels for Semantic Segmentation [PDF] Back to Contents
  Yuliang Zou, Zizhao Zhang, Han Zhang, Chun-Liang Li, Xiao Bian, Jia-Bin Huang, Tomas Pfister
Abstract: Recent advances in semi-supervised learning (SSL) demonstrate that a combination of consistency regularization and pseudo-labeling can effectively improve image classification accuracy in the low-data regime. Compared to classification, semantic segmentation tasks incur much higher labeling costs, so they benefit greatly from data-efficient training methods. However, the structured outputs of segmentation make it difficult to apply existing SSL strategies directly (e.g., in designing pseudo-labeling and augmentation). To address this problem, we present a simple and novel re-design of pseudo-labeling to generate well-calibrated structured pseudo labels for training with unlabeled or weakly-labeled data. Our proposed pseudo-labeling strategy is agnostic to the network structure and is applied in a one-stage consistency training framework. We demonstrate the effectiveness of the proposed pseudo-labeling strategy in both low-data and high-data regimes. Extensive experiments validate that pseudo labels generated by wisely fusing diverse sources, together with strong data augmentation, are crucial to consistency training for segmentation. The source code is available at this https URL.
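
For intuition, a minimal sketch of confidence-thresholded pseudo-label generation for segmentation follows (PyTorch; the threshold value and the ignore-index convention are assumptions, and the paper's actual scheme additionally fuses predictions from multiple sources):

import torch.nn.functional as F

def make_pseudo_labels(logits, threshold=0.9):
    # logits: (B, C, H, W) raw segmentation outputs on unlabeled images
    probs = F.softmax(logits, dim=1)
    confidence, labels = probs.max(dim=1)   # per-pixel winning class and its confidence
    labels[confidence < threshold] = 255    # mark low-confidence pixels as ignored
    return labels                           # train with cross_entropy(..., ignore_index=255)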

2. Self-supervised Co-training for Video Representation Learning [PDF] Back to Contents
  Tengda Han, Weidi Xie, Andrew Zisserman
Abstract: The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme to improve the popular infoNCE loss, exploiting the complementary information from different views, RGB streams and optical flow, of the same data source by using one view to obtain positive class samples for the other; (iii) we thoroughly evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval. In both cases, the proposed approach demonstrates state-of-the-art or comparable performance with other self-supervised approaches, whilst being significantly more efficient to train, i.e. requiring far less training data to achieve similar performance.
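
For reference, the instance-based InfoNCE loss that the paper builds on can be written in its common in-batch form (a sketch; the co-training step, which additionally mines positives from the complementary view such as optical flow, is omitted here):

import torch
import torch.nn.functional as F

def info_nce(query, keys, temperature=0.07):
    # query, keys: (B, D) embeddings of two augmented views of the same clips;
    # keys[i] is the positive for query[i], the remaining keys act as negatives.
    q = F.normalize(query, dim=1)
    k = F.normalize(keys, dim=1)
    logits = q @ k.t() / temperature                    # (B, B) scaled cosine similarities
    targets = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)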

3. Multiple Pedestrians and Vehicles Tracking in Aerial Imagery: A Comprehensive Study [PDF] Back to Contents
  Seyed Majid Azimi, Maximilian Kraus, Reza Bahmanyar, Peter Reinartz
Abstract: In this paper, we address various challenges in multi-pedestrian and vehicle tracking in high-resolution aerial imagery by intensive evaluation of a number of traditional and Deep Learning based Single- and Multi-Object Tracking methods. We also describe our proposed Deep Learning based Multi-Object Tracking method AerialMPTNet that fuses appearance, temporal, and graphical information using a Siamese Neural Network, a Long Short-Term Memory, and a Graph Convolutional Neural Network module for a more accurate and stable tracking. Moreover, we investigate the influence of the Squeeze-and-Excitation layers and Online Hard Example Mining on the performance of AerialMPTNet. To the best of our knowledge, we are the first in using these two for a regression-based Multi-Object Tracking. Additionally, we studied and compared the L1 and Huber loss functions. In our experiments, we extensively evaluate AerialMPTNet on three aerial Multi-Object Tracking datasets, namely AerialMPT and KIT AIS pedestrian and vehicle datasets. Qualitative and quantitative results show that AerialMPTNet outperforms all previous methods for the pedestrian datasets and achieves competitive results for the vehicle dataset. In addition, Long Short-Term Memory and Graph Convolutional Neural Network modules enhance the tracking performance. Moreover, using Squeeze-and-Excitation and Online Hard Example Mining significantly helps for some cases while degrades the results for other cases. In addition, according to the results, L1 yields better results with respect to Huber loss for most of the scenarios. The presented results provide a deep insight into challenges and opportunities of the aerial Multi-Object Tracking domain, paving the way for future research.

4. Detecting Hands and Recognizing Physical Contact in the Wild [PDF] Back to Contents
  Supreeth Narasimhaswamy, Trung Nguyen, Minh Hoai
Abstract: We investigate a new problem of detecting hands and recognizing their physical contact state in unconstrained conditions. This is a challenging inference task given the need to reason beyond the local appearance of hands. The lack of training annotations indicating which object or parts of an object the hand is in contact with further complicates the task. We propose a novel convolutional network based on Mask-RCNN that can jointly learn to localize hands and predict their physical contact to address this problem. The network uses outputs from another object detector to obtain locations of objects present in the scene. It uses these outputs and hand locations to recognize the hand's contact state using two attention mechanisms. The first attention mechanism is based on the hand and a region's affinity, enclosing the hand and the object, and densely pools features from this region to the hand region. The second attention module adaptively selects salient features from this plausible region of contact. To develop and evaluate our method's performance, we introduce a large-scale dataset called ContactHands, containing unconstrained images annotated with hand locations and contact states. The proposed network, including the parameters of attention modules, is end-to-end trainable. This network achieves approximately 7% relative improvement over a baseline network that was built on the vanilla Mask-RCNN architecture and trained for recognizing hand contact states.

5. Multi-Stage Fusion for One-Click Segmentation [PDF] Back to Contents
  Soumajit Majumder, Ansh Khurana, Abhinav Rai, Angela Yao
Abstract: Segmenting objects of interest in an image is an essential building block of applications such as photo-editing and image analysis. Under interactive settings, one should achieve good segmentations while minimizing user input. Current deep learning-based interactive segmentation approaches use early fusion and incorporate user cues at the image input layer. Since segmentation CNNs have many layers, early fusion may weaken the influence of user interactions on the final prediction results. As such, we propose a new multi-stage guidance framework for interactive segmentation. By incorporating user cues at different stages of the network, we allow user interactions to impact the final segmentation output in a more direct way. Our proposed framework has a negligible increase in parameter count compared to early-fusion frameworks. We perform extensive experimentation on the standard interactive instance segmentation and one-click segmentation benchmarks and report state-of-the-art performance.

6. Attention Augmented ConvLSTM for Environment Prediction [PDF] Back to Contents
  Bernard Lange, Masha Itkina, Mykel J. Kochenderfer
Abstract: Safe and proactive planning in robotic systems generally requires accurate predictions of the environment. Prior work on environment prediction applied video frame prediction techniques to bird's-eye view environment representations, such as occupancy grids. ConvLSTM-based frameworks used previously often result in significant blurring and vanishing of moving objects, thus hindering their applicability for use in safety-critical applications. In this work, we propose two extensions to the ConvLSTM to address these issues. We present the Temporal Attention Augmented ConvLSTM (TAAConvLSTM) and Self-Attention Augmented ConvLSTM (SAAConvLSTM) frameworks for spatiotemporal occupancy prediction, and demonstrate improved performance over baseline architectures on the real-world KITTI and Waymo datasets.

7. Multi-Modal Super Resolution for Dense Microscopic Particle Size Estimation [PDF] Back to Contents
  Sarvesh Patil, Chava Y P D Phani Rajanish, Naveen Margankunte
Abstract: Particle Size Analysis (PSA) is an important process carried out in a number of industries, which can significantly influence the properties of the final product. A ubiquitous instrument for this purpose is the Optical Microscope (OM). However, OMs are often prone to drawbacks like low resolution, small focal depth, and edge features being masked due to diffraction. We propose a powerful application of a combination of two Conditional Generative Adversarial Networks (cGANs) that Super Resolve OM images to look like Scanning Electron Microscope (SEM) images. We further demonstrate the use of a custom object detection module that can perform efficient PSA of the super-resolved particles on both, densely and sparsely packed images. The PSA results obtained from the super-resolved images have been benchmarked against human annotators, and results obtained from the corresponding SEM images. The proposed models show a generalizable way of multi-modal image translation and super-resolution for accurate particle size estimation.

8. Multiclass Wound Image Classification using an Ensemble Deep CNN-based Classifier [PDF] Back to Contents
  Behrouz Rostami, D.M. Anisuzzaman, Chuanbo Wang, Sandeep Gopalakrishnan, Jeffrey Niezgoda, Zeyun Yu
Abstract: Acute and chronic wounds are a challenge to healthcare systems around the world and affect many people's lives annually. Wound classification is a key step in wound diagnosis that would help clinicians to identify an optimal treatment procedure. Hence, having a high-performance classifier assists the specialists in the field to classify the wounds with less financial and time costs. Different machine learning and deep learning-based wound classification methods have been proposed in the literature. In this study, we have developed an ensemble Deep Convolutional Neural Network-based classifier to classify wound images including surgical, diabetic, and venous ulcers, into multi-classes. The output classification scores of two classifiers (patch-wise and image-wise) are fed into a Multi-Layer Perceptron to provide a superior classification performance. A 5-fold cross-validation approach is used to evaluate the proposed method. We obtained maximum and average classification accuracy values of 96.4% and 94.28% for binary and 91.9% and 87.7% for 3-class classification problems. The results show that our proposed method can be used effectively as a decision support system in classification of wound images or other related clinical applications.
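
A sketch of the score-fusion step described above (layer sizes here are assumptions): the class-score vectors of the patch-wise and image-wise classifiers are concatenated and mapped to the final prediction by a small multi-layer perceptron.

import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    def __init__(self, n_classes=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_classes, hidden),  # concatenated scores from both classifiers
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, patch_scores, image_scores):
        return self.net(torch.cat([patch_scores, image_scores], dim=1))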

9. Brain Atlas Guided Attention U-Net for White Matter Hyperintensity Segmentation [PDF] Back to Contents
  Zicong Zhang, Kimerly Powell, Changchang Yin, Shilei Cao, Dani Gonzalez, Yousef Hannawi, Ping Zhang
Abstract: White Matter Hyperintensities (WMH) are the most common manifestation of cerebral small vessel disease (cSVD) on the brain MRI. Accurate WMH segmentation algorithms are important to determine cSVD burden and its clinical consequences. Most existing WMH segmentation algorithms require both fluid attenuated inversion recovery (FLAIR) images and T1-weighted images as inputs. However, T1-weighted images are typically not part of standard clinical scans which are acquired for patients with acute stroke. In this paper, we propose a novel brain atlas guided attention U-Net (BAGAU-Net) that leverages only FLAIR images with a spatially-registered white matter (WM) brain atlas to yield competitive WMH segmentation performance. Specifically, we designed a dual-path segmentation model with two novel connecting mechanisms, namely multi-input attention module (MAM) and attention fusion module (AFM) to fuse the information from two paths for accurate results. Experiments on two publicly available datasets show the effectiveness of the proposed BAGAU-Net. With only FLAIR images and WM brain atlas, BAGAU-Net outperforms the state-of-the-art method with T1-weighted images, paving the way for effective development of WMH segmentation. Availability: this https URL

10. Learning to Reconstruct and Segment 3D Objects [PDF] Back to Contents
  Bo Yang
Abstract: To endow machines with the ability to perceive the real world in a three-dimensional representation, as we humans do, is a fundamental and long-standing topic in Artificial Intelligence. Given different types of visual inputs such as images or point clouds acquired by 2D/3D sensors, one important goal is to understand the geometric structure and semantics of the 3D environment. Traditional approaches usually leverage hand-crafted features to estimate the shape and semantics of objects or scenes. However, they are difficult to generalize to novel objects and scenarios, and struggle to overcome critical issues caused by visual occlusions. By contrast, we aim to understand scenes and the objects within them by learning general and robust representations using deep neural networks, trained on large-scale real-world 3D data. To achieve these aims, this thesis makes three core contributions, from object-level 3D shape estimation from single or multiple views to scene-level semantic understanding.

11. Teacher-Student Competition for Unsupervised Domain Adaption [PDF] Back to Contents
  Ruixin Xiao, Zhilei Liu, Baoyuan Wu
Abstract: With the supervision from source domain only in class-level, existing unsupervised domain adaption (UDA) methods mainly learn the domain-invariant representations from a shared feature extractor, which causes the source-bias problem. This paper proposes an unsupervised domain adaption approach with Teacher-Student Competition (TSC). In particular, a student network is introduced to learn the target-specific feature space, and we design a novel competition mechanism to select more credible pseudo-labels for the training of student network. We introduce a teacher network with the structure of existing conventional UDA method, and both teacher and student networks compete to provide target pseudo-labels to constrain every target sample's training in student network. Extensive experiments demonstrate that our proposed TSC framework significantly outperforms the state-of-the-art domain adaption methods on Office-31 and ImageCLEF-DA benchmarks.

12. On the Generalisation Capabilities of Fingerprint Presentation Attack Detection Methods in the Short Wave Infrared Domain [PDF] Back to Contents
  Jascha Kolberg, Marta Gomez-Barrero, Christoph Busch
Abstract: Nowadays, fingerprint-based biometric recognition systems are becoming increasingly popular. However, in spite of their numerous advantages, biometric capture devices are usually exposed to the public and thus vulnerable to presentation attacks (PAs). Therefore, presentation attack detection (PAD) methods are of utmost importance in order to distinguish between bona fide and attack presentations. Due to the nearly unlimited possibilities to create new presentation attack instruments (PAIs), unknown attacks are a threat to existing PAD algorithms. This fact motivates research on generalisation capabilities in order to find PAD methods that are resilient to new attacks. In this context, we evaluate the generalisability of multiple PAD algorithms on a dataset of 19,711 bona fide and 4,339 PA samples, including 45 different PAI species. The PAD data is captured in the short wave infrared domain and the results discuss the advantages and drawbacks of this PAD technique regarding unknown attacks.

13. Domain Generalized Person Re-Identification via Cross-Domain Episodic Learning [PDF] Back to Contents
  Ci-Siang Lin, Yuan-Chia Cheng, Yu-Chiang Frank Wang
Abstract: Aiming at recognizing images of the same person across distinct camera views, person re-identification (re-ID) has been among active research topics in computer vision. Most existing re-ID works require collection of a large amount of labeled image data from the scenes of interest. When the data to be recognized are different from the source-domain training ones, a number of domain adaptation approaches have been proposed. Nevertheless, one still needs to collect labeled or unlabelled target-domain data during training. In this paper, we tackle an even more challenging and practical setting, domain generalized (DG) person re-ID. That is, while a number of labeled source-domain datasets are available, we do not have access to any target-domain training data. In order to learn domain-invariant features without knowing the target domain of interest, we present an episodic learning scheme which advances meta learning strategies to exploit the observed source-domain labeled data. The learned features would exhibit sufficient domain-invariant properties while not overfitting the source-domain data or ID labels. Our experiments on four benchmark datasets confirm the superiority of our method over the state-of-the-arts.

14. A Versatile Crack Inspection Portable System based on Classifier Ensemble and Controlled Illumination [PDF] Back to Contents
  Milind G. Padalkar, Carlos Beltrán-González, Matteo Bustreo, Alessio Del Bue, Vittorio Murino
Abstract: This paper presents a novel setup for automatic visual inspection of cracks in ceramic tile as well as studies the effect of various classifiers and height-varying illumination conditions for this task. The intuition behind this setup is that cracks can be better visualized under specific lighting conditions than others. Our setup, which is designed for field work with constraints in its maximum dimensions, can acquire images for crack detection with multiple lighting conditions using the illumination sources placed at multiple heights. Crack detection is then performed by classifying patches extracted from the acquired images in a sliding window fashion. We study the effect of lights placed at various heights by training classifiers both on customized as well as state-of-the-art architectures and evaluate their performance both at patch-level and image-level, demonstrating the effectiveness of our setup. More importantly, ours is the first study that demonstrates how height-varying illumination conditions can affect crack detection with the use of existing state-of-the-art classifiers. We provide an insight about the illumination conditions that can help in improving crack detection in a challenging real-world industrial environment.

15. RONELD: Robust Neural Network Output Enhancement for Active Lane Detection [PDF] Back to Contents
  Zhe Ming Chng, Joseph Mun Hung Lew, Jimmy Addison Lee
Abstract: Accurate lane detection is critical for navigation in autonomous vehicles, particularly the active lane which demarcates the single road space that the vehicle is currently traveling on. Recent state-of-the-art lane detection algorithms utilize convolutional neural networks (CNNs) to train deep learning models on popular benchmarks such as TuSimple and CULane. While each of these models works particularly well on train and test inputs obtained from the same dataset, the performance drops significantly on unseen datasets of different environments. In this paper, we present a real-time robust neural network output enhancement for active lane detection (RONELD) method to identify, track, and optimize active lanes from deep learning probability map outputs. We first adaptively extract lane points from the probability map outputs, followed by detecting curved and straight lanes before using weighted least squares linear regression on straight lanes to fix broken lane edges resulting from fragmentation of edge maps in real images. Lastly, we hypothesize true active lanes through tracking preceding frames. Experimental results demonstrate an up to two-fold increase in accuracy using RONELD on cross-dataset validation tests.
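
The straight-lane correction can be illustrated with a weighted least-squares line fit (NumPy; using probability-map values as point weights is an assumption):

import numpy as np

def fit_straight_lane(xs, ys, weights):
    # Fit y = m*x + b over candidate lane points, down-weighting low-confidence
    # points so that fragmented edge responses do not break the fitted lane edge.
    m, b = np.polyfit(xs, ys, deg=1, w=weights)
    return m, b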

16. Weakly-supervised Learning For Catheter Segmentation in 3D Frustum Ultrasound [PDF] Back to Contents
  Hongxu Yang, Caifeng Shan, Alexander F. Kolen, Peter H. N. de With
Abstract: Accurate and efficient catheter segmentation in 3D ultrasound (US) is essential for cardiac intervention. Currently, the state-of-the-art segmentation algorithms are based on convolutional neural networks (CNNs), which achieve remarkable performance on standard Cartesian volumetric data. Nevertheless, these approaches suffer from low efficiency and GPU-unfriendly image sizes. Such difficulties and expensive hardware requirements therefore become a bottleneck for building accurate and efficient segmentation models for real clinical application. In this paper, we propose a novel Frustum ultrasound based catheter segmentation method. Specifically, Frustum ultrasound is a polar-coordinate representation that contains the same information as the standard Cartesian image but is much smaller in size, which overcomes the efficiency bottleneck of conventional Cartesian images. Nevertheless, the irregular and deformed Frustum images require more effort for accurate voxel-level annotation. To address this limitation, a weakly supervised learning framework is proposed, which only needs 3D bounding box annotations overlaying the region-of-interest to train the CNNs. Although the bounding box annotations include noise and inaccuracies that can mislead the model, this is addressed by the proposed pseudo-label generation scheme. The labels of training voxels are generated by combining class activation maps with line filtering, and are iteratively updated during training. Our experimental results show the proposed method achieves state-of-the-art performance with an efficiency of 0.25 seconds per volume. More crucially, Frustum image segmentation provides a much faster and cheaper solution for segmentation in 3D US images, meeting the demands of clinical applications.

17. Deep Multi-path Network Integrating Incomplete Biomarker and Chest CT Data for Evaluating Lung Cancer Risk [PDF] Back to Contents
  Riqiang Gao, Yucheng Tang, Kaiwen Xu, Michael N. Kammer, Sanja L. Antic, Steve Deppen, Kim L. Sandler, Pierre P. Massion, Yuankai Huo, Bennett A. Landman
Abstract: Clinical data elements (CDEs) (e.g., age, smoking history), blood markers and chest computed tomography (CT) structural features have been regarded as effective means for assessing lung cancer risk. These independent variables can provide complementary information and we hypothesize that combining them will improve the prediction accuracy. In practice, not all patients have all these variables available. In this paper, we propose a new network design, termed as multi-path multi-modal missing network (M3Net), to integrate the multi-modal data (i.e., CDEs, biomarker and CT image) considering missing modality with multiple paths neural network. Each path learns discriminative features of one modality, and different modalities are fused in a second stage for an integrated prediction. The network can be trained end-to-end with both medical image features and CDEs/biomarkers, or make a prediction with single modality. We evaluate M3Net with datasets including three sites from the Consortium for Molecular and Cellular Characterization of Screen-Detected Lesions (MCL) project. Our method is cross validated within a cohort of 1291 subjects (383 subjects with complete CDEs/biomarkers and CT images), and externally validated with a cohort of 99 subjects (99 with complete CDEs/biomarkers and CT images). Both cross-validation and external-validation results show that combining multiple modality significantly improves the predicting performance of single modality. The results suggest that integrating subjects with missing either CDEs/biomarker or CT imaging features can contribute to the discriminatory power of our model (p < 0.05, bootstrap two-tailed test). In summary, the proposed M3Net framework provides an effective way to integrate image and non-image data in the context of missing information.

18. Emerging Trends of Multimodal Research in Vision and Language [PDF] Back to Contents
  Shagun Uppal, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumdar, Soujanya Poria, Roger Zimmermann, Amir Zadeh
Abstract: Deep Learning and its applications have cascaded impactful research and development with a diverse range of modalities present in the real-world data. More recently, this has enhanced research interests in the intersection of the Vision and Language arena with its numerous applications and fast-paced growth. In this paper, we present a detailed overview of the latest trends in research pertaining to visual and language modalities. We look at its applications in their task formulations and how to solve various problems related to semantic perception and content generation. We also address task-specific trends, along with their evaluation strategies and upcoming challenges. Moreover, we shed some light on multi-disciplinary patterns and insights that have emerged in the recent past, directing this field towards more modular and transparent intelligent systems. This survey identifies key trends gravitating recent literature in VisLang research and attempts to unearth directions that the field is heading towards.

19. A Backbone Replaceable Fine-tuning Network for Stable Face Alignment [PDF] Back to Contents
  Xu Sun, Yingjie Guo, Shihong Xia
Abstract: Heatmap regression based face alignment algorithms have achieved prominent performance on static images. However, when applying these methods on videos or sequential images, the stability and accuracy are remarkably discounted. The reason lies in temporal informations are not considered, which is mainly reflected in network structure and loss function. This paper presents a novel backbone replaceable fine-tuning framework, which can swiftly convert facial landmark detector designed for single image level into a better performing one that suitable for videos. On this basis, we proposed the Jitter loss, an innovative temporal information based loss function devised to impose strong penalties on prediction landmarks that jitter around the ground truth. Our framework provides capabilities to achieve at least 40% performance improvement on stability evaluation metrices while enhancing accuracy without re-training the entire model versus state-of-the-art methods.
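
One plausible form of such a temporal penalty is sketched below; this illustrates the idea rather than the paper's exact formulation. Frame-to-frame changes of the prediction error are penalized, so a constant offset costs nothing while flicker between frames is punished.

def jitter_loss(pred, gt):
    # pred, gt: (T, N, 2) landmark trajectories over T consecutive video frames
    err = pred - gt
    return (err[1:] - err[:-1]).norm(dim=-1).mean()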

20. Softer Pruning, Incremental Regularization [PDF] Back to Contents
  Linhang Cai, Zhulin An, Chuanguang Yang, Yongjun Xu
Abstract: Network pruning is widely used to compress Deep Neural Networks (DNNs). The Soft Filter Pruning (SFP) method zeroizes the pruned filters during training while updating them in the next training epoch. Thus the trained information of the pruned filters is completely dropped. To utilize the trained pruned filters, we proposed a SofteR Filter Pruning (SRFP) method and its variant, Asymptotic SofteR Filter Pruning (ASRFP), simply decaying the pruned weights with a monotonic decreasing parameter. Our methods perform well across various networks, datasets and pruning rates, also transferable to weight pruning. On ILSVRC-2012, ASRFP prunes 40% of the parameters on ResNet-34 with 1.63% top-1 and 0.68% top-5 accuracy improvement. In theory, SRFP and ASRFP are an incremental regularization of the pruned filters. Besides, We note that SRFP and ASRFP pursue better results while slowing down the speed of convergence.
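
The core change relative to SFP fits in a few lines (a sketch; how the prune mask is chosen and how the scaling factor alpha decays over epochs are assumptions):

import torch

def soften_pruned_filters(conv_weight, prune_mask, alpha):
    # conv_weight: (out_ch, in_ch, k, k); prune_mask: (out_ch,) bool over filters.
    # SFP would set the pruned filters to zero; SRFP instead scales them by
    # alpha, a factor that decreases monotonically toward 0 during training.
    with torch.no_grad():
        conv_weight[prune_mask] *= alpha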

21. Noisy-LSTM: Improving Temporal Awareness for Video Semantic Segmentation [PDF] Back to Contents
  Bowen Wang, Liangzhi Li, Yuta Nakashima, Ryo Kawasaki, Hajime Nagahara, Yasushi Yagi
Abstract: Semantic video segmentation is a key challenge for various applications. This paper presents a new model named Noisy-LSTM, which is trainable in an end-to-end manner, with convolutional LSTMs (ConvLSTMs) to leverage the temporal coherency in video frames. We also present a simple yet effective training strategy, which replaces a frame in video sequence with noises. This strategy spoils the temporal coherency in video frames during training and thus makes the temporal links in ConvLSTMs unreliable, which may consequently improve feature extraction from video frames, as well as serve as a regularizer to avoid overfitting, without requiring extra data annotation or computational costs. Experimental results demonstrate that the proposed model can achieve state-of-the-art performances in both the CityScapes and EndoVis2018 datasets.
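
The training strategy itself is simple to state (a sketch; using Gaussian noise and a uniformly chosen frame are assumptions):

import torch

def replace_random_frame_with_noise(clip):
    # clip: (T, C, H, W) frame sequence; one frame is swapped for noise so that
    # the ConvLSTM cannot rely blindly on temporal links between neighbors.
    t = torch.randint(clip.size(0), (1,)).item()
    noisy = clip.clone()
    noisy[t] = torch.randn_like(clip[t])
    return noisy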

22. Synthesizing the Unseen for Zero-shot Object Detection [PDF] Back to Contents
  Nasir Hayat, Munawar Hayat, Shafin Rahman, Salman Khan, Syed Waqas Zamir, Fahad Shahbaz Khan
Abstract: The existing zero-shot detection approaches project visual features to the semantic domain for seen objects, hoping to map unseen objects to their corresponding semantics during inference. However, since the unseen objects are never visualized during training, the detection model is skewed towards seen content, thereby labeling unseen as background or a seen class. In this work, we propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain. Consequently, the major challenge becomes, how to accurately synthesize unseen objects merely using their class semantics? Towards this ambitious goal, we propose a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them. Further, using a unified model, we ensure the synthesized features have high diversity that represents the intra-class differences and variable localization precision in the detected bounding boxes. We test our approach on three object detection benchmarks, PASCAL VOC, MSCOCO, and ILSVRC detection, under both conventional and generalized settings, showing impressive gains over the state-of-the-art methods. Our codes are available at this https URL.
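
A minimal sketch of semantics-conditioned feature synthesis (layer sizes, dimensions, and the noise-concatenation design are assumptions; the paper's generator additionally enforces discriminative separation between classes):

import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    # Maps a class-semantic vector plus random noise to a synthetic visual
    # feature, so the detector can be trained on "visualized" unseen classes.
    def __init__(self, sem_dim=300, noise_dim=128, feat_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, semantics, z):
        return self.net(torch.cat([semantics, z], dim=1))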

23. Comprehensive evaluation of no-reference image quality assessment algorithms on KADID-10k database [PDF] Back to Contents
  Domonkos Varga
Abstract: The main goal of objective image quality assessment is to devise computational, mathematical models which are able to predict perceptual image quality consistently with subjective evaluations. The evaluation of objective image quality assessment algorithms is based on experiments conducted on publicly available benchmark databases. In this study, our goal is to give a comprehensive evaluation of no-reference image quality assessment algorithms whose original source codes are available online, using the recently published KADID-10k database, which is one of the largest available benchmark databases. Specifically, average PLCC, SROCC, and KROCC values are reported, measured over 100 random train-test splits. Furthermore, the database was divided into a train set (approx. 80% of images) and a test set (approx. 20% of images) with respect to the reference images, so there was no semantic content overlap between the two sets. Our evaluation results may be helpful to obtain a clear understanding of the status of state-of-the-art no-reference image quality assessment methods.
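
The reported metrics are the standard linear and rank correlations between predicted scores and subjective ratings; one split's evaluation looks like this (SciPy):

from scipy.stats import pearsonr, spearmanr, kendalltau

def correlation_metrics(predicted, mos):
    # predicted: objective quality scores; mos: subjective mean opinion scores
    plcc, _ = pearsonr(predicted, mos)
    srocc, _ = spearmanr(predicted, mos)
    krocc, _ = kendalltau(predicted, mos)
    return plcc, srocc, krocc  # then averaged over the 100 random train-test splits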

24. Image Captioning with Visual Object Representations Grounded in the Textual Modality [PDF] Back to Contents
  Dušan Variš, Katsuhito Sudoh, Satoshi Nakamura
Abstract: We present our work in progress exploring the possibilities of a shared embedding space between textual and visual modality. Leveraging the textual nature of object detection labels and the hypothetical expressiveness of extracted visual object representations, we propose an approach opposite to the current trend, grounding of the representations in the word embedding space of the captioning system instead of grounding words or sentences in their associated images. Based on the previous work, we apply additional grounding losses to the image captioning training objective aiming to force visual object representations to create more heterogeneous clusters based on their class label and copy a semantic structure of the word embedding space. In addition, we provide an analysis of the learned object vector space projection and its impact on the IC system performance. With only slight change in performance, grounded models reach the stopping criterion during training faster than the unconstrained model, needing about two to three times less training updates. Additionally, an improvement in structural correlation between the word embeddings and both original and projected object vectors suggests that the grounding is actually mutual.

25. SD-DefSLAM: Semi-Direct Monocular SLAM for Deformable and Intracorporeal Scenes [PDF] Back to Contents
  Juan J. Gómez Rodríguez, José Lamarca, Javier Morlana, Juan D. Tardós, José M. M. Montiel
Abstract: Conventional SLAM techniques strongly rely on scene rigidity to solve data association, ignoring dynamic parts of the scene. In this work we present Semi-Direct DefSLAM (SD-DefSLAM), a novel monocular deformable SLAM method able to map highly deforming environments, built on top of DefSLAM. To robustly solve data association in challenging deforming scenes, SD-DefSLAM combines direct and indirect methods: an enhanced illumination-invariant Lucas-Kanade tracker for data association, geometric Bundle Adjustment for pose and deformable map estimation, and bag-of-words based on feature descriptors for camera relocation. Dynamic objects are detected and segmented-out using a CNN trained for the specific application domain. We thoroughly evaluate our system in two public datasets. The mandala dataset is a SLAM benchmark with increasingly aggressive deformations. The Hamlyn dataset contains intracorporeal sequences that pose serious real-life challenges beyond deformation like weak texture, specular reflections, surgical tools and occlusions. Our results show that SD-DefSLAM outperforms DefSLAM in point tracking, reconstruction accuracy and scale drift thanks to the improvement in all the data association steps, being the first system able to robustly perform SLAM inside the human body.

26. A combined full-reference image quality assessment method based on convolutional activation maps [PDF] Back to Contents
  Domonkos Varga
Abstract: The goal of full-reference image quality assessment (FR-IQA) is to predict the quality of an image as perceived by human observers using its pristine, reference counterpart. In this study, we explore a novel, combined approach which predicts the perceptual quality of a distorted image by compiling a feature vector from convolutional activation maps. More specifically, a reference-distorted image pair is run through a pretrained convolutional neural network and the activation maps are compared with a traditional image similarity metric. Subsequently, the resulting feature vector is mapped onto perceptual quality scores with the help of a trained support vector regressor. A detailed parameter study is also presented in which the design choices of the proposed method are justified. Furthermore, we study the relationship between the amount of training images and the prediction performance. Specifically, it is demonstrated that the proposed method can be trained with a small amount of data to reach high prediction performance. Our best proposal - ActMapFeat - is compared to the state-of-the-art on six publicly available benchmark IQA databases, namely KADID-10k, TID2013, TID2008, MDID, CSIQ, and VCL-FER. Specifically, our method is able to significantly outperform the state-of-the-art on these benchmark databases. This paper is accompanied by the source code of the proposed method: this https URL.
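
A rough sketch of such a pipeline under stated assumptions (VGG16 backbone, cosine similarity between activation maps, and regressing quality with sklearn.svm.SVR; the paper's layer choices and similarity metric may differ):

import torch
import torchvision.models as models

backbone = models.vgg16(pretrained=True).features.eval()

def activation_similarity_features(ref, dist, layers=(3, 8, 15, 22, 29)):
    # ref, dist: (1, 3, H, W) reference/distorted pair; compare activation maps
    # at selected ReLU layers and collect the similarities as a feature vector.
    feats, x_r, x_d = [], ref, dist
    with torch.no_grad():
        for i, layer in enumerate(backbone):
            x_r, x_d = layer(x_r), layer(x_d)
            if i in layers:
                feats.append(torch.cosine_similarity(x_r.flatten(1), x_d.flatten(1)))
    return torch.cat(feats).numpy()  # feed these vectors to a trained SVR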

27. SHREC 2020 track: 6D Object Pose Estimation [PDF] Back to Contents
  Honglin Yuan, Remco C. Veltkamp, Georgios Albanis, Nikolaos Zioulis, Dimitrios Zarpalas, Petros Daras
Abstract: 6D pose estimation is crucial for augmented reality, virtual reality, robotic manipulation and visual navigation. However, the problem is challenging due to the variety of objects in the real world. They have varying 3D shape and their appearances in captured images are affected by sensor noise, changing lighting conditions and occlusions between objects. Different pose estimation methods have different strengths and weaknesses, depending on feature representations and scene contents. At the same time, existing 3D datasets that are used for data-driven methods to estimate 6D poses have limited view angles and low resolution. To address these issues, we organize the Shape Retrieval Challenge benchmark on 6D pose estimation and create a physically accurate simulator that is able to generate photo-realistic color-and-depth image pairs with corresponding ground truth 6D poses. From captured color and depth images, we use this simulator to generate a 3D dataset which has 400 photo-realistic synthesized color-and-depth image pairs with various view angles for training, and another 100 captured and synthetic images for testing. Five research groups register in this track and two of them submitted their results. Data-driven methods are the current trend in 6D object pose estimation and our evaluation results show that approaches which fully exploit the color and geometric features are more robust for 6D pose estimation of reflective and texture-less objects and occlusion. This benchmark and comparative evaluation results have the potential to further enrich and boost the research of 6D object pose estimation and its applications.

28. The efficacy of Neural Planning Metrics: A meta-analysis of PKL on nuScenes [PDF] Back to Contents
  Yiluan Guo, Holger Caesar, Oscar Beijbom, Jonah Philion, Sanja Fidler
Abstract: A high-performing object detection system plays a crucial role in autonomous driving (AD). The performance, typically evaluated in terms of mean Average Precision, does not take into account orientation and distance of the actors in the scene, which are important for safe AD. It also ignores environmental context. Recently, Philion et al. proposed a neural planning metric (PKL), based on the KL divergence of a planner's trajectory and the groundtruth route, to accommodate these requirements. In this paper, we use this neural planning metric to score all submissions of the nuScenes detection challenge and analyze the results. We find that while somewhat correlated with mAP, the PKL metric shows different behavior with respect to increased traffic density, ego velocity, road curvature and intersections. Finally, we propose ideas to extend the neural planning metric.
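
Conceptually, PKL compares the planner's output distribution given ground-truth objects with its output given detected objects; a sketch (the use of log-probability maps over future ego positions, and the shapes, are assumptions):

import torch.nn.functional as F

def pkl(log_p_gt, log_p_det):
    # Returns D_KL(p_gt || p_det) averaged over the batch: the cost, in planner
    # terms, of seeing the detector's world instead of the ground-truth world.
    return F.kl_div(log_p_det, log_p_gt, log_target=True, reduction='batchmean')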
摘要 Summary: Object detection for autonomous driving (AD) is typically evaluated by mean Average Precision, which ignores actor orientation and distance as well as environmental context, all of which matter for safe AD. Using the neural planning metric (PKL) recently proposed by Philion et al., based on the KL divergence between a planner's trajectory and the ground-truth route, the authors score every submission to the nuScenes detection challenge and analyze the results: PKL is somewhat correlated with mAP but responds differently to increased traffic density, ego velocity, road curvature, and intersections. Finally, ideas for extending neural planning metrics are proposed.
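
As a toy illustration of what a PKL-style score measures, the sketch below compares two per-timestep distributions over future ego positions with a KL divergence. The grid-based categorical form is an assumption for illustration, not the nuScenes implementation.

# Toy PKL-style comparison of planner output distributions.
import numpy as np

def pkl_style_score(p_gt, p_det, eps=1e-12):
    """p_gt, p_det: (T, H, W) per-timestep location distributions, each
    summing to 1 over the grid. Returns KL(p_gt || p_det) summed over time."""
    p = np.clip(p_gt, eps, None)
    q = np.clip(p_det, eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Identical plans give ~0; plans that diverge give a positive penalty.
T, H, W = 4, 16, 16
p = np.random.rand(T, H, W)
p /= p.sum(axis=(1, 2), keepdims=True)
print(pkl_style_score(p, p))  # ~0.0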

29. SelfVoxeLO: Self-supervised LiDAR Odometry with Voxel-based Deep Neural Networks [PDF]
  Yan Xu, Zhaoyang Huang, Kwan-Yee Lin, Xinge Zhu, Jianping Shi, Hujun Bao, Guofeng Zhang, Hongsheng Li
Abstract: Recent learning-based LiDAR odometry methods have demonstrated their competitiveness. However, most methods still face two substantial challenges: 1) the 2D projection representation of LiDAR data cannot effectively encode 3D structures from the point clouds; 2) the needs for a large amount of labeled data for training limit the application scope of these methods. In this paper, we propose a self-supervised LiDAR odometry method, dubbed SelfVoxeLO, to tackle these two difficulties. Specifically, we propose a 3D convolution network to process the raw LiDAR data directly, which extracts features that better encode the 3D geometric patterns. To suit our network to self-supervised learning, we design several novel loss functions that utilize the inherent properties of LiDAR point clouds. Moreover, an uncertainty-aware mechanism is incorporated in the loss functions to alleviate the interference of moving objects/noises. We evaluate our method's performances on two large-scale datasets, i.e., KITTI and Apollo-SouthBay. Our method outperforms state-of-the-art unsupervised methods by 27%/32% in terms of translational/rotational errors on the KITTI dataset and also performs well on the Apollo-SouthBay dataset. By including more unlabelled training data, our method can further improve performance comparable to the supervised methods.
摘要 Summary: Learning-based LiDAR odometry still faces two substantial challenges: 2D projection representations of LiDAR data cannot effectively encode 3D structure from point clouds, and the need for large amounts of labeled training data limits applicability. SelfVoxeLO is a self-supervised method that processes raw LiDAR data directly with a 3D convolutional network to extract features that better encode 3D geometric patterns, designs novel loss functions that exploit the inherent properties of LiDAR point clouds, and incorporates an uncertainty-aware mechanism into the losses to alleviate interference from moving objects and noise. It outperforms state-of-the-art unsupervised methods by 27%/32% in translational/rotational error on KITTI, performs well on Apollo-SouthBay, and further approaches supervised performance as more unlabelled training data is included.

30. SMA-STN: Segmented Movement-Attending Spatiotemporal Network for Micro-Expression Recognition [PDF]
  Jiateng Liu, Wenming Zheng, Yuan Zong
Abstract: Correctly perceiving micro-expression is difficult since micro-expression is an involuntary, repressed, and subtle facial expression, and efficiently revealing the subtle movement changes and capturing the significant segments in a micro-expression sequence is the key to micro-expression recognition (MER). To handle the crucial issue, in this paper, we firstly propose a dynamic segmented sparse imaging module (DSSI) to compute dynamic images as local-global spatiotemporal descriptors under a unique sampling protocol, which reveals the subtle movement changes visually in an efficient way. Secondly, a segmented movement-attending spatiotemporal network (SMA-STN) is proposed to further unveil imperceptible small movement changes, which utilizes a spatiotemporal movement-attending module (STMA) to capture long-distance spatial relation for facial expression and weigh temporal segments. Besides, a deviation enhancement loss (DE-Loss) is embedded in the SMA-STN to enhance the robustness of SMA-STN to subtle movement changes in feature level. Extensive experiments on three widely used benchmarks, i.e., CASME II, SAMM, and SHIC, show that the proposed SMA-STN achieves better MER performance than other state-of-the-art methods, which proves that the proposed method is effective to handle the challenging MER problem.
摘要 Summary: Micro-expressions are involuntary, repressed, and subtle, so efficiently revealing faint movement changes and capturing the significant segments of a sequence is the key to micro-expression recognition (MER). The authors first propose a dynamic segmented sparse imaging module (DSSI) that computes dynamic images as local-global spatiotemporal descriptors under a unique sampling protocol, revealing subtle movement changes visually and efficiently. They then propose a segmented movement-attending spatiotemporal network (SMA-STN) whose spatiotemporal movement-attending module (STMA) captures long-range spatial relations for facial expressions and weighs temporal segments, with a deviation enhancement loss (DE-Loss) embedded to strengthen robustness to subtle movement changes at the feature level. Extensive experiments on three widely used benchmarks, CASME II, SAMM, and SHIC, show that SMA-STN achieves better MER performance than other state-of-the-art methods.

31. Semantic-Guided Inpainting Network for Complex Urban Scenes Manipulation [PDF]
  Pierfrancesco Ardino, Yahui Liu, Elisa Ricci, Bruno Lepri, Marco De Nadai
Abstract: Manipulating images of complex scenes to reconstruct, insert and/or remove specific object instances is a challenging task. Complex scenes contain multiple semantics and objects, which are frequently cluttered or ambiguous, thus hampering the performance of inpainting models. Conventional techniques often rely on structural information such as object contours in multi-stage approaches that generate unreliable results and boundaries. In this work, we propose a novel deep learning model to alter a complex urban scene by removing a user-specified portion of the image and coherently inserting a new object (e.g. car or pedestrian) in that scene. Inspired by recent works on image inpainting, our proposed method leverages the semantic segmentation to model the content and structure of the image, and learn the best shape and location of the object to insert. To generate reliable results, we design a new decoder block that combines the semantic segmentation and generation task to guide better the generation of new objects and scenes, which have to be semantically consistent with the image. Our experiments, conducted on two large-scale datasets of urban scenes (Cityscapes and Indian Driving), show that our proposed approach successfully address the problem of semantically-guided inpainting of complex urban scene.
摘要 Summary: Manipulating images of complex scenes to reconstruct, insert, or remove specific object instances is challenging: such scenes contain multiple, frequently cluttered or ambiguous semantics and objects that hamper inpainting models, and conventional multi-stage pipelines relying on structural cues such as object contours produce unreliable results and boundaries. The proposed deep model removes a user-specified portion of a complex urban scene and coherently inserts a new object (e.g. a car or pedestrian), leveraging semantic segmentation to model the image's content and structure and to learn the best shape and location of the object to insert, with a new decoder block that combines the segmentation and generation tasks so that generated objects and scenes stay semantically consistent with the image. Experiments on two large-scale urban datasets (Cityscapes and Indian Driving) show the approach successfully addresses semantically-guided inpainting of complex urban scenes.

32. A Two-stage Unsupervised Approach for Low light Image Enhancement [PDF]
  Junjie Hu, Xiyue Guo, Junfeng Chen, Guanqi Liang, Fuqin Deng
Abstract: As vision based perception methods are usually built on the normal light assumption, there will be a serious safety issue when deploying them into low light environments. Recently, deep learning based methods have been proposed to enhance low light images by penalizing the pixel-wise loss of low light and normal light images. However, most of them suffer from the following problems: 1) the need of pairs of low light and normal light images for training, 2) the poor performance for dark images, 3) the amplification of noise. To alleviate these problems, in this paper, we propose a two-stage unsupervised method that decomposes the low light image enhancement into a pre-enhancement and a post-refinement problem. In the first stage, we pre-enhance a low light image with a conventional Retinex based method. In the second stage, we use a refinement network learned with adversarial training for further improvement of the image quality. The experimental results show that our method outperforms previous methods on four benchmark datasets. In addition, we show that our method can significantly improve feature points matching and simultaneous localization and mapping in low light conditions.
摘要 Summary: Vision-based perception built on the normal-light assumption raises serious safety issues when deployed in low-light environments, and existing deep enhancement methods, which penalize pixel-wise losses between low-light and normal-light images, need paired training images, perform poorly on dark images, and amplify noise. The proposed two-stage unsupervised method decomposes low-light enhancement into a pre-enhancement problem, solved with a conventional Retinex-based method, and a post-refinement problem, solved with a refinement network learned through adversarial training. It outperforms previous methods on four benchmark datasets and significantly improves feature-point matching and simultaneous localization and mapping in low-light conditions.

33. Language and Visual Entity Relationship Graph for Agent Navigation [PDF]
  Yicong Hong, Cristian Rodriguez-Opazo, Yuankai Qi, Qi Wu, Stephen Gould
Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions. From both the textual and visual perspectives, we find that the relationships among the scene, its objects,and directional clues are essential for the agent to interpret complex instructions and correctly perceive the environment. To capture and utilize the relationships, we propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision, and the intra-modal relationships among visual entities. We propose a message passing algorithm for propagating information between language elements and visual entities in the graph, which we then combine to determine the next action to take. Experiments show that by taking advantage of the relationships we are able to improve over state-of-the-art. On the Room-to-Room (R2R) benchmark, our method achieves the new best performance on the test unseen split with success rate weighted by path length (SPL) of 52%. On the Room-for-Room (R4R) dataset, our method significantly improves the previous best from 13% to 34% on the success weighted by normalized dynamic time warping (SDTW). Code is available at: this https URL.
摘要 Summary: Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions in a real-world environment, where the relationships among the scene, its objects, and directional clues are essential for interpreting complex instructions and correctly perceiving the environment. The proposed Language and Visual Entity Relationship Graph models the inter-modal relationships between text and vision and the intra-modal relationships among visual entities, and a message-passing algorithm propagates information between language elements and visual entities in the graph, which is then combined to determine the next action. The method sets a new best on the Room-to-Room (R2R) test unseen split with 52% success rate weighted by path length (SPL), and on Room-for-Room (R4R) significantly improves the previous best from 13% to 34% success weighted by normalized dynamic time warping (SDTW). Code is available at: this https URL.
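
For reference, SPL has a standard closed form (success weighted by path length, as defined by Anderson et al.): SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i indicates success, l_i is the shortest-path length, and p_i the length of the path the agent took. A small helper with illustrative variable names:

def spl(successes, shortest_lengths, taken_lengths):
    """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_lengths, taken_lengths)]
    return sum(terms) / len(terms)

# An agent that succeeds but walks twice the shortest path scores 0.5;
# a failed episode scores 0, so the average below is 0.25.
print(spl([1, 0], [10.0, 5.0], [20.0, 5.0]))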

34. Double-Uncertainty Weighted Method for Semi-supervised Learning [PDF]
  Yixin Wang, Yao Zhang, Jiang Tian, Cheng Zhong, Zhongchao Shi, Yang Zhang, Zhiqiang He
Abstract: Though deep learning has achieved advanced performance recently, it remains a challenging task in the field of medical imaging, as obtaining reliable labeled training data is time-consuming and expensive. In this paper, we propose a double-uncertainty weighted method for semi-supervised segmentation based on the teacher-student model. The teacher model provides guidance for the student model by penalizing their inconsistent prediction on both labeled and unlabeled data. We train the teacher model using Bayesian deep learning to obtain double-uncertainty, i.e. segmentation uncertainty and feature uncertainty. It is the first to extend segmentation uncertainty estimation to feature uncertainty, which reveals the capability to capture information among channels. A learnable uncertainty consistency loss is designed for the unsupervised learning process in an interactive manner between prediction and uncertainty. With no ground-truth for supervision, it can still incentivize more accurate teacher's predictions and facilitate the model to reduce uncertain estimations. Furthermore, our proposed double-uncertainty serves as a weight on each inconsistency penalty to balance and harmonize supervised and unsupervised training processes. We validate the proposed feature uncertainty and loss function through qualitative and quantitative analyses. Experimental results show that our method outperforms the state-of-the-art uncertainty-based semi-supervised methods on two public medical datasets.
摘要 Summary: Obtaining reliable labeled training data in medical imaging is time-consuming and expensive, so the authors propose a double-uncertainty weighted method for semi-supervised segmentation based on the teacher-student model, in which the teacher guides the student by penalizing inconsistent predictions on both labeled and unlabeled data. The teacher is trained with Bayesian deep learning to obtain double uncertainty, i.e. segmentation uncertainty and feature uncertainty; this is the first extension of segmentation uncertainty estimation to the feature level, revealing the capability to capture information among channels. A learnable uncertainty consistency loss drives the unsupervised learning process interactively between prediction and uncertainty, incentivizing more accurate teacher predictions without ground truth, and the double uncertainty weights each inconsistency penalty to balance and harmonize supervised and unsupervised training. Validated through qualitative and quantitative analyses, the method outperforms state-of-the-art uncertainty-based semi-supervised methods on two public medical datasets.
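
One common Bayesian-deep-learning recipe for the kind of uncertainty the teacher needs is Monte Carlo dropout; the sketch below uses predictive entropy as a per-pixel uncertainty map. This is an assumed stand-in: the paper's exact estimator, and its feature-level counterpart, may differ.

# Sketch: MC-dropout segmentation uncertainty for a teacher model.
import torch

def mc_dropout_uncertainty(model, x, passes=8):
    """Keep dropout stochastic at inference, average several forward passes
    (model is assumed to return logits of shape (B, C, H, W)), and use the
    predictive entropy as a per-pixel uncertainty map."""
    model.train()  # leaves dropout layers active
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=1) for _ in range(passes)]
        ).mean(0)                                          # (B, C, H, W)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)  # (B, H, W)
    return probs, entropy

# A consistency loss can then down-weight pixels with high entropy, so the
# student is not penalized for disagreeing on uncertain regions.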

35. Bi-Real Net V2: Rethinking Non-linearity for 1-bit CNNs and Going Beyond [PDF]
  Zhuo Su, Linpu Fang, Deke Guo, Duwen Hu, Matti Pietikäinen, Li Liu
Abstract: Binary neural networks (BNNs), where both weights and activations are binarized into 1 bit, have been widely studied in recent years due to its great benefit of highly accelerated computation and substantially reduced memory footprint that appeal to the development of resource constrained devices. In contrast to previous methods tending to reduce the quantization error for training BNN structures, we argue that the binarized convolution process owns an increasing linearity towards the target of minimizing such error, which in turn hampers BNN's discriminative ability. In this paper, we re-investigate and tune proper non-linear modules to fix that contradiction, leading to a strong baseline which achieves state-of-the-art performance on the large-scale ImageNet dataset in terms of accuracy and training efficiency. To go further, we find that the proposed BNN model still has much potential to be compressed by making a better use of the efficient binary operations, without losing accuracy. In addition, the limited capacity of the BNN model can also be increased with the help of group execution. Based on these insights, we are able to improve the baseline with an additional 4~5% top-1 accuracy gain even with less computational cost. Our code will be made public at this https URL.
摘要 Summary: Binary neural networks (BNNs), whose weights and activations are binarized to 1 bit, offer highly accelerated computation and substantially reduced memory footprint, which appeals to resource-constrained devices. In contrast to previous methods that keep reducing the quantization error, the authors argue that the binarized convolution process becomes increasingly linear as that error is minimized, which hampers the BNN's discriminative ability; they therefore re-investigate and tune proper non-linear modules to fix this contradiction, yielding a strong baseline with state-of-the-art accuracy and training efficiency on large-scale ImageNet. Going further, they show the model can be compressed considerably by better exploiting efficient binary operations without losing accuracy, and that its limited capacity can be increased with group execution, gaining an additional 4-5% top-1 accuracy at even lower computational cost. Code will be made public at this https URL.
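
The standard building block behind such 1-bit networks is sign binarization trained with a straight-through estimator (STE); here is a minimal PyTorch sketch of that generic BNN machinery (not the Bi-Real Net V2 code):

import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)          # forward: values -> {-1, 0, +1}

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Straight-through: pass gradients only where |x| <= 1.
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)

x = torch.randn(4, requires_grad=True)
BinarizeSTE.apply(x).sum().backward()
print(x.grad)  # 1 where |x| <= 1, else 0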

36. Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition [PDF]
  Fangtao Li, Wenzhe Wang, Zihe Liu, Haoran Wang, Chenghao Yan, Bin Wu
Abstract: Video-based person recognition is challenging due to persons being blocked and blurred, and the variation of shooting angle. Previous research always focused on person recognition on still images, ignoring similarity and continuity between video frames. To tackle the challenges above, we propose a novel Frame Aggregation and Multi-Modal Fusion (FAMF) framework for video-based person recognition, which aggregates face features and incorporates them with multi-modal information to identify persons in videos. For frame aggregation, we propose a novel trainable layer based on NetVLAD (named AttentionVLAD), which takes arbitrary number of features as input and computes a fixed-length aggregation feature based on feature quality. We show that introducing an attention mechanism to NetVLAD can effectively decrease the impact of low-quality frames. For the multi-model information of videos, we propose a Multi-Layer Multi-Modal Attention (MLMA) module to learn the correlation of multi-modality by adaptively updating Gram matrix. Experimental results on iQIYI-VID-2019 dataset show that our framework outperforms other state-of-the-art methods.
摘要 Summary: Video-based person recognition must cope with persons that are blocked or blurred and with varying shooting angles, whereas previous research focused on still images and ignored the similarity and continuity between video frames. The Frame Aggregation and Multi-Modal Fusion (FAMF) framework aggregates face features and fuses them with multi-modal information to identify persons in videos: AttentionVLAD, a novel trainable layer based on NetVLAD, takes an arbitrary number of features as input and computes a fixed-length aggregate weighted by feature quality, with the attention mechanism effectively decreasing the impact of low-quality frames, while a Multi-Layer Multi-Modal Attention (MLMA) module learns cross-modality correlation by adaptively updating a Gram matrix. Experiments on the iQIYI-VID-2019 dataset show FAMF outperforms other state-of-the-art methods.

37. Modality-Pairing Learning for Brain Tumor Segmentation [PDF]
  Yixin Wang, Yao Zhang, Feng Hou, Yang Liu, Jiang Tian, Cheng Zhong, Yang Zhang, Zhiqiang He
Abstract: Automatic brain tumor segmentation from multi-modality Magnetic Resonance Images (MRI) using deep learning methods plays an important role in assisting the diagnosis and treatment of brain tumor. However, previous methods mostly ignore the latent relationship among different modalities. In this work, we propose a novel end-to-end Modality-Pairing learning method for brain tumor segmentation. Paralleled branches are designed to exploit different modality features and a series of layer connections are utilized to capture complex relationships and abundant information among modalities. We also use a consistency loss to minimize the prediction variance between two branches. Besides, learning rate warmup strategy is adopted to solve the problem of the training instability and early over-fitting. Lastly, we use average ensemble of multiple models and some post-processing techniques to get final results. Our method is tested on the BraTS 2020 validation dataset, obtaining promising segmentation performance, with average dice scores of $0.908, 0.856, 0.787$ for the whole tumor, tumor core and enhancing tumor, respectively. We won the second place of the BraTS 2020 Challenge for the tumor segmentation on the testing dataset.
摘要 Summary: Automatic brain tumor segmentation from multi-modality MRI with deep learning plays an important role in assisting diagnosis and treatment, but previous methods mostly ignore the latent relationships among modalities. The proposed end-to-end Modality-Pairing learning uses paralleled branches to exploit different modality features, a series of layer connections to capture complex relationships and abundant information among modalities, a consistency loss to minimize the prediction variance between branches, and learning-rate warmup to counter training instability and early over-fitting, with an average ensemble of multiple models and post-processing for the final results. On the BraTS 2020 validation dataset it obtains average Dice scores of 0.908, 0.856, and 0.787 for whole tumor, tumor core, and enhancing tumor respectively, and the method won second place in the BraTS 2020 Challenge for tumor segmentation on the testing dataset.
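
For reference, the Dice score reported above and the simple probability-averaging ensemble are easy to state in code (a generic sketch, not the authors' pipeline):

import numpy as np

def dice(pred, target, eps=1e-8):
    """pred, target: binary masks of equal shape."""
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def ensemble(prob_maps, threshold=0.5):
    """Average the per-model probability maps, then threshold to a mask."""
    return np.mean(prob_maps, axis=0) > threshold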

38. DeepReflecs: Deep Learning for Automotive Object Classification with Radar Reflections [PDF]
  Michael Ulrich, Claudius Gläser, Fabian Timm
Abstract: This paper presents a novel object type classification method for automotive applications which uses deep learning with radar reflections. The method provides object class information such as pedestrian, cyclist, car, or non-obstacle. The method is both powerful and efficient, by using a light-weight deep learning approach on reflection level radar data. It fills the gap between low-performant methods of handcrafted features and high-performant methods with convolutional neural networks. The proposed network exploits the specific characteristics of radar reflection data: It handles unordered lists of arbitrary length as input and it combines both extraction of local and global features. In experiments with real data the proposed network outperforms existing methods of handcrafted or learned features. An ablation study analyzes the impact of the proposed global context layer.
摘要 Summary: The paper presents a novel object-type classification method for automotive applications that applies light-weight deep learning to reflection-level radar data, providing object class information such as pedestrian, cyclist, car, or non-obstacle. It fills the gap between low-performing handcrafted-feature methods and high-performing convolutional neural networks by exploiting the specific characteristics of radar reflection data: it handles unordered lists of arbitrary length as input and combines extraction of local and global features. In experiments with real data the network outperforms existing methods based on handcrafted or learned features, and an ablation study analyzes the impact of the proposed global context layer.

39. Extraction of Discrete Spectra Modes from Video Data Using a Deep Convolutional Koopman Network [PDF]
  Scott Leask, Vincent McDonell
Abstract: Recent deep learning extensions in Koopman theory have enabled compact, interpretable representations of nonlinear dynamical systems which are amenable to linear analysis. Deep Koopman networks attempt to learn the Koopman eigenfunctions which capture the coordinate transformation to globally linearize system dynamics. These eigenfunctions can be linked to underlying system modes which govern the dynamical behavior of the system. While many related techniques have demonstrated their efficacy on canonical systems and their associated state variables, in this work the system dynamics are observed optically (i.e. in video format). We demonstrate the ability of a deep convolutional Koopman network (CKN) in automatically identifying independent modes for dynamical systems with discrete spectra. Practically, this affords flexibility in system data collection as the data are easily obtainable observable variables. The learned models are able to successfully and robustly identify the underlying modes governing the system, even with a redundantly large embedding space. Modal disaggregation is encouraged using a simple masking procedure. All of the systems analyzed in this work use an identical network architecture.
摘要 Summary: Recent deep learning extensions of Koopman theory produce compact, interpretable representations of nonlinear dynamical systems that are amenable to linear analysis: deep Koopman networks learn the Koopman eigenfunctions, i.e. the coordinate transformation that globally linearizes the system dynamics, and these eigenfunctions can be linked to the underlying modes governing the system's behavior. Whereas related techniques have been demonstrated on canonical systems and their state variables, here the dynamics are observed optically (in video format), and a deep convolutional Koopman network (CKN) automatically identifies independent modes of dynamical systems with discrete spectra, affording flexible data collection since the observables are easily obtainable. The learned models robustly identify the governing modes even with a redundantly large embedding space, modal disaggregation is encouraged with a simple masking procedure, and all systems analyzed share an identical network architecture.

40. MCGKT-Net: Multi-level Context Gating Knowledge Transfer Network for Single Image Deraining [PDF]
  Kohei Yamamichi, Xian-Hua Han
Abstract: Rain streak removal in a single image is a very challenging task due to its ill-posed nature in essence. Recently, the end-to-end learning techniques with deep convolutional neural networks (DCNN) have made great progress in this task. However, the conventional DCNN-based deraining methods have struggled to exploit deeper and more complex network architectures for pursuing better performance. This study proposes a novel MCGKT-Net for boosting deraining performance, which is a naturally multi-scale learning framework being capable of exploring multi-scale attributes of rain streaks and different semantic structures of the clear images. In order to obtain high representative features inside MCGKT-Net, we explore internal knowledge transfer module using ConvLSTM unit for conducting interaction learning between different layers and investigate external knowledge transfer module for leveraging the knowledge already learned in other task domains. Furthermore, to dynamically select useful features in learning procedure, we propose a multi-scale context gating module in the MCGKT-Net using squeeze-and-excitation block. Experiments on three benchmark datasets: Rain100H, Rain100L, and Rain800, manifest impressive performance compared with state-of-the-art methods.
摘要 Summary: Removing rain streaks from a single image is ill-posed in essence, and conventional DCNN-based deraining has struggled toward ever deeper and more complex architectures in pursuit of better performance. MCGKT-Net is a naturally multi-scale learning framework that explores multi-scale attributes of rain streaks and the different semantic structures of clear images: an internal knowledge-transfer module built on ConvLSTM units conducts interaction learning between layers, an external knowledge-transfer module leverages knowledge already learned in other task domains, and a multi-scale context gating module using squeeze-and-excitation blocks dynamically selects useful features during learning. Experiments on three benchmark datasets, Rain100H, Rain100L, and Rain800, show impressive performance compared with state-of-the-art methods.

41. Continual Unsupervised Domain Adaptation with Adversarial Learning [PDF]
  Joonhyuk Kim, Sahng-Min Yoo, Gyeong-Moon Park, Jong-Hwan Kim
Abstract: Unsupervised Domain Adaptation (UDA) is essential for autonomous driving due to a lack of labeled real-world road images. Most of the existing UDA methods, however, have focused on a single-step domain adaptation (Synthetic-to-Real). These methods overlook a change in environments in the real world as time goes by. Thus, developing a domain adaptation method for sequentially changing target domains without catastrophic forgetting is required for real-world applications. To deal with the problem above, we propose Continual Unsupervised Domain Adaptation with Adversarial learning (CUDA^2) framework, which can generally be applicable to other UDA methods conducting adversarial learning. CUDA^2 framework generates a sub-memory, called Target-specific Memory (TM) for each new target domain guided by Double Hinge Adversarial (DHA) loss. TM prevents catastrophic forgetting by storing target-specific information, and DHA loss induces a synergy between the existing network and the expanded TM. To the best of our knowledge, we consider realistic autonomous driving scenarios (Synthetic-to-Real-to-Real) in UDA research for the first time. The model with our framework outperforms other state-of-the-art models under the same settings. Besides, extensive experiments are conducted as ablation studies for in-depth analysis.
摘要 Summary: Unsupervised Domain Adaptation (UDA) is essential for autonomous driving given the lack of labeled real-world road images, yet most existing methods focus on a single-step Synthetic-to-Real adaptation and overlook how real-world environments change over time, so real applications need adaptation to sequentially changing target domains without catastrophic forgetting. The Continual Unsupervised Domain Adaptation with Adversarial learning (CUDA^2) framework, generally applicable to other adversarial UDA methods, generates a sub-memory called Target-specific Memory (TM) for each new target domain, guided by a Double Hinge Adversarial (DHA) loss: TM prevents catastrophic forgetting by storing target-specific information, while DHA induces synergy between the existing network and the expanded TM. Considering realistic Synthetic-to-Real-to-Real autonomous driving scenarios for the first time in UDA research, the model outperforms state-of-the-art models under the same settings, and extensive ablation studies provide in-depth analysis.

42. Intelligent Reference Curation for Visual Place Recognition via Bayesian Selective Fusion [PDF]
  Timothy L. Molloy, Tobias Fischer, Michael Milford, Girish N. Nair
Abstract: The key challenge of visual place recognition (VPR) lies in recognizing places despite drastic visual appearance changes due to factors such as time of day, season, or weather or lighting conditions. Numerous approaches based on deep-learnt image descriptors, sequence matching, domain translation, and probabilistic localization have had success in addressing this challenge, but most rely on the availability of carefully curated representative reference images of the possible places. In this paper, we propose a novel approach, dubbed Bayesian Selective Fusion, for actively selecting and fusing informative reference images to determine the best place match for a given query image. The selective element of our approach avoids the counterproductive fusion of every reference image and enables the dynamic selection of informative reference images in environments with changing visual conditions (such as indoors with flickering lights, outdoors during sunshowers or over the day-night cycle). The probabilistic element of our approach provides a means of fusing multiple reference images that accounts for their varying uncertainty via a novel training-free likelihood function for VPR. On difficult query images from two benchmark datasets, we demonstrate that our approach matches and exceeds the performance of several alternative fusion approaches along with state-of-the-art techniques that are provided with a priori (unfair) knowledge of the best reference images. Our approach is well suited for long-term robot autonomy where dynamic visual environments are commonplace since it is training-free, descriptor-agnostic, and complements existing techniques such as sequence matching.
摘要 Summary: The key challenge of visual place recognition (VPR) is recognizing places despite drastic appearance changes from time of day, season, weather, or lighting, and while approaches based on deep-learnt descriptors, sequence matching, domain translation, and probabilistic localization have had success, most rely on carefully curated representative reference images. Bayesian Selective Fusion actively selects and fuses informative reference images to determine the best place match for a query image: the selective element avoids the counterproductive fusion of every reference image and enables dynamic selection under changing visual conditions (such as flickering indoor lights, sun showers outdoors, or the day-night cycle), while the probabilistic element fuses multiple references of varying uncertainty through a novel training-free likelihood function for VPR. On difficult queries from two benchmark datasets the approach matches and exceeds several alternative fusion approaches and state-of-the-art techniques that are given a priori (unfair) knowledge of the best reference images; being training-free and descriptor-agnostic, it suits long-term robot autonomy in dynamic visual environments and complements existing techniques such as sequence matching.
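
A toy sketch of the probabilistic fusion idea: treat each selected reference as an independent observation and accumulate per-place log-likelihoods under a uniform prior. The Gaussian likelihood and the similarity-matrix layout below are illustrative assumptions; the paper's likelihood function is training-free and VPR-specific.

import numpy as np

def fuse_references(sim, selected, sigma=0.1):
    """sim: (n_refs, n_places) similarity of the query to each candidate
    reference image of each place; selected: indices of the informative
    references chosen for fusion. Returns a posterior over places."""
    log_post = np.zeros(sim.shape[1])
    for r in selected:
        log_post += -((1.0 - sim[r]) ** 2) / (2 * sigma ** 2)
    post = np.exp(log_post - log_post.max())   # stabilize before normalizing
    return post / post.sum()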

43. Self-Supervised Visual Attention Learning for Vehicle Re-Identification [PDF]
  Ming Li, Yiming Zhao, Yecheng Lyu, Ziming Zhang
Abstract: Visual attention learning (VAL) aims to produce a confidence map as weights to detect discriminative features in each image for certain task such as vehicle re-identification (ReID) where the same vehicle instance needs to be identified across different cameras. In contrast to the literature, in this paper we propose utilizing self-supervised learning to regularize VAL to improving the performance for vehicle ReID. Mathematically using lifting we can factorize the two functions of VAL and self-supervised regularization through another shared function. We implement such factorization using a deep learning framework consisting of three branches: (1) a global branch as backbone for image feature extraction, (2) an attentional branch for producing attention masks, and (3) a self-supervised branch for regularizing the attention learning. Our network design naturally leads to an end-to-end multi-task joint optimization. We conduct comprehensive experiments on three benchmark datasets for vehicle ReID, i.e., VeRi-776, CityFlow-ReID, and VehicleID. We demonstrate the state-of-the-art (SOTA) performance of our approach with the capability of capturing informative vehicle parts with no corresponding manual labels. We also demonstrate the good generalization of our approach in other ReID tasks such as person ReID and multi-target multi-camera tracking.
摘要 Summary: Visual attention learning (VAL) produces a confidence map that weights discriminative features in each image for tasks such as vehicle re-identification (ReID), where the same vehicle instance must be identified across different cameras. In contrast to the literature, the authors regularize VAL with self-supervised learning: mathematically, using lifting, the two functions of VAL and self-supervised regularization are factorized through another shared function, implemented as a deep framework with three branches, a global branch as the backbone for image feature extraction, an attentional branch producing attention masks, and a self-supervised branch regularizing the attention learning, which naturally leads to end-to-end multi-task joint optimization. Comprehensive experiments on VeRi-776, CityFlow-ReID, and VehicleID show state-of-the-art (SOTA) performance, with the capability of capturing informative vehicle parts without corresponding manual labels, and the approach generalizes well to other ReID tasks such as person ReID and multi-target multi-camera tracking.

44. Unsupervised Domain Adaptation for Spatio-Temporal Action Localization [PDF]
  Nakul Agarwal, Yi-Ting Chen, Behzad Dariush, Ming-Hsuan Yang
Abstract: Spatio-temporal action localization is an important problem in computer vision that involves detecting where and when activities occur, and therefore requires modeling of both spatial and temporal features. This problem is typically formulated in the context of supervised learning, where the learned classifiers operate on the premise that both training and test data are sampled from the same underlying distribution. However, this assumption does not hold when there is a significant domain shift, leading to poor generalization performance on the test data. To address this, we focus on the hard and novel task of generalizing training models to test samples without access to any labels from the latter for spatio-temporal action localization by proposing an end-to-end unsupervised domain adaptation algorithm. We extend the state-of-the-art object detection framework to localize and classify actions. In order to minimize the domain shift, three domain adaptation modules at image level (temporal and spatial) and instance level (temporal) are designed and integrated. We design a new experimental setup and evaluate the proposed method and different adaptation modules on the UCF-Sports, UCF-101 and JHMDB benchmark datasets. We show that significant performance gain can be achieved when spatial and temporal features are adapted separately, or jointly for the most effective results.
摘要 Summary: Spatio-temporal action localization must detect where and when activities occur, requiring modeling of both spatial and temporal features, but its usual supervised formulation assumes training and test data come from the same underlying distribution and generalizes poorly under significant domain shift. The authors tackle the hard, novel task of generalizing trained models to unlabeled test samples by proposing an end-to-end unsupervised domain adaptation algorithm: a state-of-the-art object detection framework is extended to localize and classify actions, and three adaptation modules, at the image level (temporal and spatial) and the instance level (temporal), are designed and integrated to minimize the domain shift. With a new experimental setup on the UCF-Sports, UCF-101, and JHMDB benchmarks, significant performance gains are achieved when spatial and temporal features are adapted separately, or jointly for the most effective results.

45. Rotation Invariant Aerial Image Retrieval with Group Convolutional Metric Learning [PDF]
  Hyunseung Chung, Woo-Jeoung Nam, Seong-Whan Lee
Abstract: Remote sensing image retrieval (RSIR) is the process of ranking database images depending on the degree of similarity compared to the query image. As the complexity of RSIR increases due to the diversity in shooting range, angle, and location of remote sensors, there is an increasing demand for methods to address these issues and improve retrieval performance. In this work, we introduce a novel method for retrieving aerial images by merging group convolution with attention mechanism and metric learning, resulting in robustness to rotational variations. For refinement and emphasis on important features, we applied channel attention in each group convolution stage. By utilizing the characteristics of group convolution and channel-wise attention, it is possible to acknowledge the equality among rotated but identically located images. The training procedure has two main steps: (i) training the network with Aerial Image Dataset (AID) for classification, (ii) fine-tuning the network with triplet-loss for retrieval with Google Earth South Korea and NWPU-RESISC45 datasets. Results show that the proposed method performance exceeds other state-of-the-art retrieval methods in both rotated and original environments. Furthermore, we utilize class activation maps (CAM) to visualize the distinct difference of main features between our method and baseline, resulting in better adaptability in rotated environments.
摘要 Summary: Remote sensing image retrieval (RSIR) ranks database images by their similarity to a query image, and grows harder as the shooting range, angle, and location of remote sensors become more diverse. The proposed method merges group convolution with an attention mechanism and metric learning to retrieve aerial images robustly under rotational variation: channel attention is applied at each group convolution stage to refine and emphasize important features, and the combination of group convolution and channel-wise attention acknowledges the equality among rotated but identically located images. Training has two steps, classification on the Aerial Image Dataset (AID) followed by triplet-loss fine-tuning for retrieval on the Google Earth South Korea and NWPU-RESISC45 datasets; the method exceeds other state-of-the-art retrieval methods in both rotated and original environments, and class activation maps (CAM) visualize the distinct difference in main features versus the baseline, showing better adaptability in rotated environments.

46. MaskNet: A Fully-Convolutional Network to Estimate Inlier Points [PDF]
  Vinit Sarode, Animesh Dhagat, Rangaprasad Arun Srivatsan, Nicolas Zevallos, Simon Lucey, Howie Choset
Abstract: Point clouds have grown in importance in the way computers perceive the world. From LIDAR sensors in autonomous cars and drones to the time of flight and stereo vision systems in our phones, point clouds are everywhere. Despite their ubiquity, point clouds in the real world are often missing points because of sensor limitations or occlusions, or contain extraneous points from sensor noise or artifacts. These problems challenge algorithms that require computing correspondences between a pair of point clouds. Therefore, this paper presents a fully-convolutional neural network that identifies which points in one point cloud are most similar (inliers) to the points in another. We show improvements in learning-based and classical point cloud registration approaches when retrofitted with our network. We demonstrate these improvements on synthetic and real-world datasets. Finally, our network produces impressive results on test datasets that were unseen during training, thus exhibiting generalizability. Code and videos are available at this https URL
摘要 Summary: Point clouds are everywhere, from LiDAR sensors in autonomous cars and drones to the time-of-flight and stereo vision systems in phones, yet real-world clouds often miss points because of sensor limitations or occlusion, or contain extraneous points from sensor noise and artifacts, which challenges algorithms that compute correspondences between point-cloud pairs. The paper presents a fully-convolutional neural network that identifies which points in one cloud are most similar (inliers) to points in another; retrofitting it improves both learning-based and classical point cloud registration approaches, as demonstrated on synthetic and real-world datasets, and the network produces impressive results on test datasets unseen during training, exhibiting generalizability. Code and videos are available at this https URL.

47. Gaussian Constrained Attention Network for Scene Text Recognition [PDF]
  Zhi Qiao, Xugong Qin, Yu Zhou, Fei Yang, Weiping Wang
Abstract: Scene text recognition has been a hot topic in computer vision. Recent methods adopt the attention mechanism for sequence prediction which achieve convincing results. However, we argue that the existing attention mechanism faces the problem of attention diffusion, in which the model may not focus on a certain character area. In this paper, we propose Gaussian Constrained Attention Network to deal with this problem. It is a 2D attention-based method integrated with a novel Gaussian Constrained Refinement Module, which predicts an additional Gaussian mask to refine the attention weights. Different from adopting an additional supervision on the attention weights simply, our proposed method introduces an explicit refinement. In this way, the attention weights will be more concentrated and the attention-based recognition network achieves better performance. The proposed Gaussian Constrained Refinement Module is flexible and can be applied to existing attention-based methods directly. The experiments on several benchmark datasets demonstrate the effectiveness of our proposed method. Our code has been available at this https URL.
摘要 Summary: Recent scene text recognizers adopt attention mechanisms for sequence prediction with convincing results, but the existing attention mechanism suffers from attention diffusion, where the model fails to focus on a particular character region. The proposed Gaussian Constrained Attention Network is a 2D attention-based method integrated with a novel Gaussian Constrained Refinement Module that predicts an additional Gaussian mask to refine the attention weights; different from simply adding supervision on the weights, this introduces an explicit refinement, making the attention more concentrated and the recognition network perform better. The module is flexible and can be applied directly to existing attention-based methods, with effectiveness demonstrated on several benchmark datasets; code is available at this https URL.
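
A rough sketch of the constraint: predict a 2D Gaussian, multiply it into the raw attention map, and renormalize. The parameterization below (a per-sample center and standard deviations) is an assumption for illustration, not the paper's exact module.

import torch

def gaussian_constrained_attention(attn, mu, sigma):
    """attn: (B, H, W) raw attention weights; mu: (B, 2) predicted center
    in pixel coordinates (y, x); sigma: (B, 2) predicted std devs."""
    B, H, W = attn.shape
    ys = torch.arange(H, dtype=attn.dtype).view(1, H, 1)
    xs = torch.arange(W, dtype=attn.dtype).view(1, 1, W)
    g = torch.exp(
        -((ys - mu[:, 0].view(B, 1, 1)) ** 2 / (2 * sigma[:, 0].view(B, 1, 1) ** 2)
          + (xs - mu[:, 1].view(B, 1, 1)) ** 2 / (2 * sigma[:, 1].view(B, 1, 1) ** 2))
    )
    refined = attn * g  # suppress weights far from the predicted character
    return refined / refined.sum(dim=(1, 2), keepdim=True).clamp_min(1e-8)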

48. Localized Interactive Instance Segmentation [PDF]
  Soumajit Majumder, Angela Yao
Abstract: In current interactive instance segmentation works, the user is granted a free hand when providing clicks to segment an object; clicks are allowed on background pixels and other object instances far from the target object. This form of interaction is highly inconsistent with the end goal of efficiently isolating objects of interest. In our work, we propose a clicking scheme wherein user interactions are restricted to the proximity of the object. In addition, we propose a novel transformation of the user-provided clicks to generate a weak localization prior on the object which is consistent with image structures such as edges, textures etc. We demonstrate the effectiveness of our proposed clicking scheme and localization strategy through detailed experimentation in which we raise state-of-the-art on several standard interactive segmentation benchmarks.
摘要 Summary: Current interactive instance segmentation grants the user a free hand when providing clicks, allowing clicks on background pixels and on object instances far from the target, which is highly inconsistent with the end goal of efficiently isolating objects of interest. The authors propose a clicking scheme that restricts user interactions to the proximity of the object, together with a novel transformation of the user-provided clicks into a weak localization prior that is consistent with image structures such as edges and textures. Detailed experiments show the proposed clicking scheme and localization strategy raise the state-of-the-art on several standard interactive segmentation benchmarks.

49. Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering [PDF]
  Yuxuan Zhang, Wenzheng Chen, Huan Ling, Jun Gao, Yinan Zhang, Antonio Torralba, Sanja Fidler
Abstract: Differentiable rendering has paved the way to training neural networks to perform "inverse graphics" tasks such as predicting 3D geometry from monocular photographs. To train high performing models, most of the current approaches rely on multi-view imagery which are not readily available in practice. Recent Generative Adversarial Networks (GANs) that synthesize images, in contrast, seem to acquire 3D knowledge implicitly during training: object viewpoints can be manipulated by simply manipulating the latent codes. However, these latent codes often lack further physical interpretation and thus GANs cannot easily be inverted to perform explicit 3D reasoning. In this paper, we aim to extract and disentangle 3D knowledge learned by generative models by utilizing differentiable renderers. Key to our approach is to exploit GANs as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and the trained inverse graphics network as a teacher to disentangle the GAN's latent code into interpretable 3D properties. The entire architecture is trained iteratively using cycle consistency losses. We show that our approach significantly outperforms state-of-the-art inverse graphics networks trained on existing datasets, both quantitatively and via user studies. We further showcase the disentangled GAN as a controllable 3D "neural renderer", complementing traditional graphics renderers.
摘要 Summary: Differentiable rendering has paved the way to training neural networks for "inverse graphics" tasks such as predicting 3D geometry from monocular photographs, but high-performing models mostly rely on multi-view imagery that is not readily available in practice, while GANs that synthesize images seem to acquire 3D knowledge implicitly during training, object viewpoints can be manipulated through the latent codes, yet those codes lack physical interpretation, so GANs cannot easily be inverted for explicit 3D reasoning. The key idea is to exploit a GAN as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and then use the trained inverse graphics network as a teacher to disentangle the GAN's latent code into interpretable 3D properties, training the entire architecture iteratively with cycle-consistency losses. The approach significantly outperforms state-of-the-art inverse graphics networks trained on existing datasets, both quantitatively and in user studies, and the disentangled GAN is showcased as a controllable 3D "neural renderer" complementing traditional graphics renderers.

50. Movement-induced Priors for Deep Stereo [PDF]
  Yuxin Hou, Muhammad Kamran Janjua, Juho Kannala, Arno Solin
Abstract: We propose a method for fusing stereo disparity estimation with movement-induced prior information. Instead of independent inference frame-by-frame, we formulate the problem as a non-parametric learning task in terms of a temporal Gaussian process prior with a movement-driven kernel for inter-frame reasoning. We present a hierarchy of three Gaussian process kernels depending on the availability of motion information, where our main focus is on a new gyroscope-driven kernel for handheld devices with low-quality MEMS sensors, thus also relaxing the requirement of having full 6D camera poses available. We show how our method can be combined with two state-of-the-art deep stereo methods. The method either works in a plug-and-play fashion with pre-trained deep stereo networks, or is further improved by jointly training the kernels together with encoder-decoder architectures, leading to consistent improvement.
摘要 Summary: The paper fuses stereo disparity estimation with movement-induced prior information: instead of independent frame-by-frame inference, the problem is formulated as a non-parametric learning task under a temporal Gaussian process prior with a movement-driven kernel for inter-frame reasoning. A hierarchy of three Gaussian process kernels is presented depending on the motion information available, the main focus being a new gyroscope-driven kernel for handheld devices with low-quality MEMS sensors, which also relaxes the requirement of having full 6D camera poses. Combined with two state-of-the-art deep stereo methods, the approach either works plug-and-play with pre-trained deep stereo networks or improves further when the kernels are trained jointly with encoder-decoder architectures, leading to consistent improvement.

51. Variational Capsule Encoder [PDF]
  Harish RaviPrakash, Syed Muhammad Anwar, Ulas Bagci
Abstract: We propose a novel capsule network based variational encoder architecture, called Bayesian capsules (B-Caps), to modulate the mean and standard deviation of the sampling distribution in the latent space. We hypothesized that this approach can learn a better representation of features in the latent space than traditional approaches. Our hypothesis was tested by using the learned latent variables for image reconstruction task, where for MNIST and Fashion-MNIST datasets, different classes were separated successfully in the latent space using our proposed model. Our experimental results have shown improved reconstruction and classification performances for both datasets adding credence to our hypothesis. We also showed that by increasing the latent space dimension, the proposed B-Caps was able to learn a better representation when compared to the traditional variational auto-encoders (VAE). Hence our results indicate the strength of capsule networks in representation learning which has never been examined under the VAE settings before.
摘要 Summary: Bayesian capsules (B-Caps) form a novel capsule-network-based variational encoder architecture that modulates the mean and standard deviation of the sampling distribution in latent space, hypothesized to learn a better representation of features there than traditional approaches. Using the learned latent variables for image reconstruction on the MNIST and Fashion-MNIST datasets, different classes are separated successfully in latent space, and experiments show improved reconstruction and classification performance on both datasets, adding credence to the hypothesis; as the latent dimension increases, B-Caps learns better representations than traditional variational auto-encoders (VAE), indicating a strength of capsule networks in representation learning that had not previously been examined under VAE settings.
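
Since B-Caps modulates the mean and standard deviation of the latent sampling distribution, it rests on the standard VAE reparameterization trick, sketched below (generic VAE machinery; the capsule-specific routing is not shown):

import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. mu, sigma."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + eps * std

def kl_divergence(mu, log_var):
    """KL term of the usual ELBO against a unit-Gaussian prior."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())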

52. View-Invariant Gait Recognition with Attentive Recurrent Learning of Partial Representations [PDF]
  Alireza Sepas-Moghaddam, Ali Etemad
Abstract: Gait recognition refers to the identification of individuals based on features acquired from their body movement during walking. Despite the recent advances in gait recognition with deep learning, variations in data acquisition and appearance, namely camera angles, subject pose, occlusions, and clothing, are challenging factors that need to be considered for achieving accurate gait recognition systems. In this paper, we propose a network that first learns to extract gait convolutional energy maps (GCEM) from frame-level convolutional features. It then adopts a bidirectional recurrent neural network to learn from split bins of the GCEM, thus exploiting the relations between learned partial spatiotemporal representations. We then use an attention mechanism to selectively focus on important recurrently learned partial representations as identity information in different scenarios may lie in different GCEM bins. Our proposed model has been extensively tested on two large-scale CASIA-B and OU-MVLP gait datasets using four different test protocols and has been compared to a number of state-of-the-art and baseline solutions. Additionally, a comprehensive experiment has been performed to study the robustness of our model in the presence of six different synthesized occlusions. The experimental results show the superiority of our proposed method, outperforming the state-of-the-art, especially in scenarios where different clothing and carrying conditions are encountered. The results also revealed that our model is more robust against different occlusions as compared to the state-of-the-art methods.
摘要 Summary: Gait recognition identifies individuals from features of their body movement during walking, but variations in data acquisition and appearance, namely camera angle, subject pose, occlusion, and clothing, must be handled for accurate recognition. The proposed network first learns to extract gait convolutional energy maps (GCEM) from frame-level convolutional features, then adopts a bidirectional recurrent neural network to learn from split bins of the GCEM, exploiting the relations between learned partial spatiotemporal representations, and uses an attention mechanism to focus selectively on the important recurrently learned partial representations, since identity information may lie in different GCEM bins in different scenarios. Extensive tests on the large-scale CASIA-B and OU-MVLP gait datasets under four test protocols show superiority over state-of-the-art and baseline solutions, especially where different clothing and carrying conditions are encountered, and a comprehensive experiment with six synthesized occlusions shows the model is more robust to occlusion than state-of-the-art methods.

53. Gait Recognition using Multi-Scale Partial Representation Transformation with Capsules [PDF]
  Alireza Sepas-Moghaddam, Saeed Ghorbani, Nikolaus F. Troje, Ali Etemad
Abstract: Gait recognition, referring to the identification of individuals based on the manner in which they walk, can be very challenging due to the variations in the viewpoint of the camera and the appearance of individuals. Current methods for gait recognition have been dominated by deep learning models, notably those based on partial feature representations. In this context, we propose a novel deep network, learning to transfer multi-scale partial gait representations using capsules to obtain more discriminative gait features. Our network first obtains multi-scale partial representations using a state-of-the-art deep partial feature extractor. It then recurrently learns the correlations and co-occurrences of the patterns among the partial features in forward and backward directions using Bi-directional Gated Recurrent Units (BGRU). Finally, a capsule network is adopted to learn deeper part-whole relationships and assigns more weights to the more relevant features while ignoring the spurious dimensions. That way, we obtain final features that are more robust to both viewing and appearance changes. The performance of our method has been extensively tested on two gait recognition datasets, CASIA-B and OU-MVLP, using four challenging test protocols. The results of our method have been compared to the state-of-the-art gait recognition solutions, showing the superiority of our model, notably when facing challenging viewing and carrying conditions.
摘要 Summary: Gait recognition, identifying individuals by the manner in which they walk, is very challenging due to variations in camera viewpoint and individual appearance, and current methods are dominated by deep models based on partial feature representations. The proposed deep network learns to transfer multi-scale partial gait representations using capsules to obtain more discriminative gait features: a state-of-the-art deep partial feature extractor yields multi-scale partial representations, Bi-directional Gated Recurrent Units (BGRU) recurrently learn the correlations and co-occurrences of patterns among partial features in forward and backward directions, and a capsule network learns deeper part-whole relationships, assigning more weight to the relevant features while ignoring spurious dimensions, making the final features more robust to viewing and appearance changes. Evaluated on CASIA-B and OU-MVLP under four challenging test protocols, the model outperforms state-of-the-art gait recognition solutions, notably when facing challenging viewing and carrying conditions.

54. Graphite: GRAPH-Induced feaTure Extraction for Point Cloud Registration [PDF]
  Mahdi Saleh, Shervin Dehghani, Benjamin Busam, Nassir Navab, Federico Tombari
Abstract: 3D Point clouds are a rich source of information that enjoy growing popularity in the vision community. However, due to the sparsity of their representation, learning models based on large point clouds is still a challenge. In this work, we introduce Graphite, a GRAPH-Induced feaTure Extraction pipeline, a simple yet powerful feature transform and keypoint detector. Graphite enables intensive down-sampling of point clouds with keypoint detection accompanied by a descriptor. We construct a generic graph-based learning scheme to describe point cloud regions and extract salient points. To this end, we take advantage of 6D pose information and metric learning to learn robust descriptions and keypoints across different scans. We Reformulate the 3D keypoint pipeline with graph neural networks which allow efficient processing of the point set while boosting its descriptive power which ultimately results in more accurate 3D registrations. We demonstrate our lightweight descriptor on common 3D descriptor matching and point cloud registration benchmarks and achieve comparable results with the state of the art. Describing 100 patches of a point cloud and detecting their keypoints takes only ~0.018 seconds with our proposed network.
摘要 Summary: 3D point clouds are a rich source of information enjoying growing popularity in the vision community, but their sparse representation makes learning on large clouds challenging. Graphite, a GRAPH-Induced feaTure Extraction pipeline, is a simple yet powerful feature transform and keypoint detector enabling intensive down-sampling of point clouds through keypoint detection accompanied by descriptors: a generic graph-based learning scheme describes point-cloud regions and extracts salient points, and 6D pose information with metric learning yields robust descriptions and keypoints across different scans. Reformulating the 3D keypoint pipeline with graph neural networks allows efficient processing of the point set while boosting descriptive power, ultimately resulting in more accurate 3D registrations; the lightweight descriptor achieves results comparable to the state-of-the-art on common 3D descriptor matching and point cloud registration benchmarks, describing 100 patches of a cloud and detecting their keypoints in only ~0.018 seconds.

55. RADIATE: A Radar Dataset for Automotive Perception [PDF]
  Marcel Sheeny, Emanuele De Pellegrin, Saptarshi Mukherjee, Alireza Ahrabian, Sen Wang, Andrew Wallace
Abstract: Datasets for autonomous cars are essential for the development and benchmarking of perception systems. However, most existing datasets are captured with camera and LiDAR sensors in good weather conditions. In this paper, we present the RAdar Dataset In Adverse weaThEr (RADIATE), aiming to facilitate research on object detection, tracking and scene understanding using radar sensing for safe autonomous driving. RADIATE includes 3 hours of annotated radar images with more than 200K labelled road actors in total, on average about 4.6 instances per radar image. It covers 8 different categories of actors in a variety of weather conditions (e.g., sun, night, rain, fog and snow) and driving scenarios (e.g., parked, urban, motorway and suburban), representing different levels of challenge. To the best of our knowledge, this is the first public radar dataset which provides high-resolution radar images on public roads with a large amount of road actors labelled. The data collected in adverse weather, e.g., fog and snowfall, is unique. Some baseline results of radar based object detection and recognition are given to show that the use of radar data is promising for automotive applications in bad weather, where vision and LiDAR can fail. RADIATE also has stereo images, 32-channel LiDAR and GPS data, directed at other applications such as sensor fusion, localisation and mapping. The public dataset can be accessed at this http URL.
摘要 Summary: Datasets are essential for developing and benchmarking autonomous-car perception systems, yet most are captured with camera and LiDAR in good weather. The RAdar Dataset In Adverse weaThEr (RADIATE) is presented to facilitate research on object detection, tracking, and scene understanding using radar sensing for safe autonomous driving: it includes 3 hours of annotated radar images with more than 200K labelled road actors in total (on average about 4.6 instances per radar image), covering 8 actor categories across varied weather (sun, night, rain, fog, snow) and driving scenarios (parked, urban, motorway, suburban) at different levels of challenge. To the authors' knowledge it is the first public radar dataset providing high-resolution radar images on public roads with a large amount of labelled road actors, and its adverse-weather data (e.g. fog and snowfall) is unique. Baseline radar-based detection and recognition results show radar is promising for automotive applications in bad weather, where vision and LiDAR can fail, and RADIATE also supplies stereo images, 32-channel LiDAR, and GPS data for applications such as sensor fusion, localisation, and mapping. The dataset is accessible at this http URL.

56. Multimodal semantic forecasting based on conditional generation of future features [PDF]
  Kristijan Fugošić, Josip Šarić, Siniša Šegvić
Abstract: This paper considers semantic forecasting in road-driving scenes. Most existing approaches address this problem as deterministic regression of future features or future predictions given observed frames. However, such approaches ignore the fact that future can not always be guessed with certainty. For example, when a car is about to turn around a corner, the road which is currently occluded by buildings may turn out to be either free to drive, or occupied by people, other vehicles or roadworks. When a deterministic model confronts such situation, its best guess is to forecast the most likely outcome. However, this is not acceptable since it defeats the purpose of forecasting to improve security. It also throws away valuable training data, since a deterministic model is unable to learn any deviation from the norm. We address this problem by providing more freedom to the model through allowing it to forecast different futures. We propose to formulate multimodal forecasting as sampling of a multimodal generative model conditioned on the observed frames. Experiments on the Cityscapes dataset reveal that our multimodal model outperforms its deterministic counterpart in short-term forecasting while performing slightly worse in the mid-term case.
摘要 Summary: Most existing approaches treat semantic forecasting in road-driving scenes as deterministic regression of future features or predictions given observed frames, ignoring that the future cannot always be guessed with certainty: when a car is about to turn a corner, the road currently occluded by buildings may turn out to be free to drive, or occupied by people, other vehicles, or roadworks. Confronted with such situations, a deterministic model can only forecast the most likely outcome, which defeats the safety purpose of forecasting and throws away valuable training data, since the model cannot learn deviations from the norm. The authors instead give the model freedom to forecast different futures, formulating multimodal forecasting as sampling from a multimodal generative model conditioned on the observed frames. Experiments on Cityscapes show the multimodal model outperforms its deterministic counterpart in short-term forecasting while performing slightly worse in the mid-term case.

57. Exploiting Context for Robustness to Label Noise in Active Learning [PDF]
  Sudipta Paul, Shivkumar Chandrasekaran, B.S. Manjunath, Amit K. Roy-Chowdhury
Abstract: Several works in computer vision have demonstrated the effectiveness of active learning for adapting the recognition model when new unlabeled data becomes available. Most of these works consider that labels obtained from the annotator are correct. However, in a practical scenario, as the quality of the labels depends on the annotator, some of the labels might be wrong, which results in degraded recognition performance. In this paper, we address the problems of i) how a system can identify which of the queried labels are wrong and ii) how a multi-class active learning system can be adapted to minimize the negative impact of label noise. Towards solving the problems, we propose a noisy label filtering based learning approach where the inter-relationship (context) that is quite common in natural data is utilized to detect the wrong labels. We construct a graphical representation of the unlabeled data to encode these relationships and obtain new beliefs on the graph when noisy labels are available. Comparing the new beliefs with the prior relational information, we generate a dissimilarity score to detect the incorrect labels and update the recognition model with correct labels which result in better recognition performance. This is demonstrated in three different applications: scene classification, activity classification, and document classification.
摘要 Summary: Active learning works in computer vision usually assume that labels obtained from the annotator are correct, but in practice label quality depends on the annotator, and wrong labels degrade recognition performance. The paper addresses how a system can identify which queried labels are wrong and how a multi-class active learning system can minimize the negative impact of label noise: a noisy-label filtering learning approach exploits the inter-relationships (context) common in natural data, constructing a graphical representation of the unlabeled data to encode these relationships and obtaining new beliefs on the graph when noisy labels arrive. Comparing the new beliefs with the prior relational information yields a dissimilarity score that detects incorrect labels, and updating the recognition model with corrected labels improves performance, as demonstrated in three applications: scene classification, activity classification, and document classification.

58. Deep Structured Prediction for Facial Landmark Detection [PDF]
  Lisha Chen, Hui Su, Qiang Ji
Abstract: Existing deep learning based facial landmark detection methods have achieved excellent performance. These methods, however, do not explicitly embed the structural dependencies among landmark points. They hence cannot preserve the geometric relationships between landmark points or generalize well to challenging conditions or unseen data. This paper proposes a method for deep structured facial landmark detection based on combining a deep Convolutional Network with a Conditional Random Field. We demonstrate its superior performance to existing state-of-the-art techniques in facial landmark detection, especially a better generalization ability on challenging datasets that include large pose and occlusion.
摘要 Summary: Existing deep facial landmark detectors achieve excellent performance but do not explicitly embed the structural dependencies among landmark points, so they cannot preserve the geometric relationships between landmarks or generalize well to challenging conditions and unseen data. The proposed method combines a deep convolutional network with a Conditional Random Field for deep structured facial landmark detection, demonstrating superior performance over existing state-of-the-art techniques, especially better generalization on challenging datasets with large pose and occlusion.

59. Covapixels [PDF]
  Jeffrey Uhlmann
Abstract: We propose and discuss the summarization of superpixel-type image tiles/patches using mean and covariance information. We refer to the resulting objects as covapixels.
摘要 Summary: The paper proposes and discusses summarizing superpixel-type image tiles/patches using mean and covariance information, referring to the resulting objects as covapixels.
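
The summary itself is easy to state in code: collect the pixel vectors belonging to one superpixel and keep their mean and covariance. This is a direct reading of the abstract; the paper may add normalization or representation details.

import numpy as np

def covapixel(image, labels, superpixel_id):
    """image: (H, W, C) array; labels: (H, W) superpixel segmentation.
    Returns the (C,) mean and (C, C) covariance of the member pixels."""
    pixels = image[labels == superpixel_id]      # (n_pixels, C)
    mean = pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)
    return mean, cov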

60. FGAGT: Flow-Guided Adaptive Graph Tracking [PDF]
  Chaobing Shan, Chunbo Wei, Bing Deng, Jianqiang Huang, Xian-Sheng Hua, Xiaoliang Cheng, Kewei Liang
Abstract: Multi-object tracking (MOT) has always been a very important research direction in computer vision and has great applications in autonomous driving, video object behavior prediction, traffic management, and accident prevention. Recently, some methods have made great progress on MOT, such as CenterTrack, which predicts the trajectory position based on optical flow then tracks it, and FairMOT, which uses higher resolution feature maps to extract Re-id features. In this article, we propose the FGAGT tracker. Different from FairMOT, we use Pyramid Lucas Kanade optical flow method to predict the position of the historical objects in the current frame, and use ROI Pooling\cite{He2015} and fully connected layers to extract the historical objects' appearance feature vectors on the feature maps of the current frame. Next, input them and new objects' feature vectors into the adaptive graph neural network to update the feature vectors. The adaptive graph network can update the feature vectors of the objects by combining historical global position and appearance information. Because the historical information is preserved, it can also re-identify the occluded objects. In the training phase, we propose the Balanced MSE LOSS to balance the sample distribution. In the Inference phase, we use the Hungarian algorithm for data association. Our method reaches the level of state-of-the-art, where the MOTA index exceeds FairMOT by 2.5 points, and CenterTrack by 8.4 points on the MOT17 dataset, exceeds FairMOT by 7.2 points on the MOT16 dataset.
摘要 Summary: Multi-object tracking (MOT) is an important research direction in computer vision, with applications in autonomous driving, video object behavior prediction, traffic management, and accident prevention; recent progress includes CenterTrack, which predicts trajectory positions from optical flow and then tracks them, and FairMOT, which extracts Re-ID features from higher-resolution feature maps. Unlike FairMOT, the proposed FGAGT tracker predicts the positions of historical objects in the current frame with pyramidal Lucas-Kanade optical flow, extracts their appearance feature vectors from the current frame's feature maps via ROI Pooling (He2015) and fully connected layers, and feeds them, together with new objects' feature vectors, into an adaptive graph neural network that updates the features by combining historical global position and appearance information; because historical information is preserved, occluded objects can be re-identified. Training uses a proposed Balanced MSE loss to balance the sample distribution, and inference uses the Hungarian algorithm for data association. FGAGT reaches the state-of-the-art, exceeding FairMOT by 2.5 and CenterTrack by 8.4 MOTA points on the MOT17 dataset, and FairMOT by 7.2 points on MOT16.
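
The Hungarian data-association step is standard and fits in a few lines with SciPy; the cosine-distance cost below is illustrative, whereas FGAGT builds its costs from the graph-updated feature vectors.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, det_feats, max_cost=0.7):
    """Cosine-distance cost between track and detection feature vectors,
    solved as a minimum-cost bipartite matching; pairs above max_cost
    are rejected (left unmatched)."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]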

61. Image-based Automated Species Identification: Can Virtual Data Augmentation Overcome Problems of Insufficient Sampling? [PDF]
  Morris Klasen, Dirk Ahrens, Jonas Eberle, Volker Steinhage
Abstract: Automated species identification and delimitation is challenging, particularly in rare and thus often scarcely sampled species, which do not allow sufficient discrimination of infraspecific versus interspecific variation. Typical problems arising from either low or exaggerated interspecific morphological differentiation are best met by automated methods of machine learning that learn efficient and effective species identification from training samples. However, limited infraspecific sampling remains a key challenge also in machine learning. In this study, we assessed whether a two-level data augmentation approach may help to overcome the problem of scarce training data in automated visual species identification. The first level of visual data augmentation applies classic approaches of data augmentation and generation of faked images using a GAN approach. Descriptive feature vectors are derived from bottleneck features of a VGG-16 convolutional neural network (CNN) that are then stepwise reduced in dimensionality using Global Average Pooling and PCA to prevent overfitting. The second level of data augmentation employs synthetic additional sampling in feature space by an oversampling algorithm in vector space (SMOTE). Applied on two challenging datasets of scarab beetles (Coleoptera), our augmentation approach outperformed a non-augmented deep learning baseline approach as well as a traditional 2D morphometric approach (Procrustes analysis).
摘要:自动物种鉴定和定界是具有挑战性的,特别是在罕见,因此往往几乎不采样种,其中不允许infraspecific与种间变异的足够的鉴别力。从任何一个低引起的或典型的问题夸大间形态分化最好通过学习效率和有效的物种训练样本识别的机器学习的自动化方法满足。然而,有限的infraspecific抽样仍然是一个关键的挑战也是机器学习。 1在这项研究中,我们评估了两级数据增强方法是否可能有助于克服自动化视觉物种鉴定稀缺的训练数据的问题。视觉数据增强的第一层施加数据增加和产生使用GAN方法伪造图像的经典方法。描述性特征向量从VGG-16卷积神经网络(CNN),其然后在维数使用全球平均池和PCA,以防止过度拟合逐步降低的瓶颈特征的。数据扩张的第二级由在向量空间中的过采样算法(SMOTE)采用在特征空间合成附加采样。在应用金龟子两点有挑战性的数据集(鞘翅目),我们的增强方法优于非增强深度学习基线的方法,以及一个传统的2D形态的方法(普鲁克分析)。
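A minimal sketch of the second augmentation level, under stated assumptions: the CNN bottleneck features are taken as already pooled into vectors, the PCA dimensionality and class counts are made up, and imbalanced-learn's SMOTE stands in for the paper's feature-space oversampling.

    # Feature-space augmentation sketch: PCA reduction followed by SMOTE.
    import numpy as np
    from sklearn.decomposition import PCA
    from imblearn.over_sampling import SMOTE

    X = np.random.rand(60, 512)          # pooled bottleneck features (placeholder)
    y = np.array([0] * 45 + [1] * 15)    # scarcely sampled class 1

    X_red = PCA(n_components=20).fit_transform(X)    # guard against overfitting
    X_bal, y_bal = SMOTE(k_neighbors=5).fit_resample(X_red, y)
    print(X_bal.shape, np.bincount(y_bal))           # classes now balanced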

62. Multiple Future Prediction Leveraging Synthetic Trajectories [PDF]
  Lorenzo Berlincioni, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo
Abstract: Trajectory prediction is an important task, especially in autonomous driving. The ability to forecast the position of other moving agents can yield effective planning, ensuring safety for the autonomous vehicle as well as for the observed entities. In this work we propose a data-driven approach based on Markov Chains to generate synthetic trajectories, which are useful for training a multiple-future trajectory predictor. The advantages are twofold: on the one hand, synthetic samples can be used to augment existing datasets and train more effective predictors; on the other hand, they allow generating samples with multiple ground truths, corresponding to diverse, equally likely outcomes of the observed trajectory. We define a trajectory prediction model and a loss that explicitly address the multimodality of the problem, and we show that combining synthetic and real data leads to prediction improvements, obtaining state-of-the-art results.
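A toy sketch of the Markov-chain idea, assuming displacements are quantized into a small set of discrete states; the quantize/dequantize mappings and the state design are hypothetical placeholders.

    # Fit a displacement-state transition matrix from real tracks, then sample.
    import numpy as np

    def fit_chain(tracks, n_states, quantize):
        T = np.ones((n_states, n_states))           # Laplace-smoothed counts
        for tr in tracks:
            s = [quantize(b - a) for a, b in zip(tr[:-1], tr[1:])]
            for u, v in zip(s[:-1], s[1:]):
                T[u, v] += 1
        return T / T.sum(axis=1, keepdims=True)     # row-stochastic matrix

    def sample_track(P, start_xy, s0, steps, dequantize, rng=np.random):
        xy, s, out = np.array(start_xy, float), s0, [np.array(start_xy, float)]
        for _ in range(steps):
            s = rng.choice(len(P), p=P[s])          # next displacement state
            xy = xy + dequantize(s)
            out.append(xy.copy())
        return np.stack(out)                        # one synthetic trajectory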

63. Temporal Binary Representation for Event-Based Action Recognition [PDF]
  Simone Undri Innocenti, Federico Becattini, Federico Pernici, Alberto Del Bimbo
Abstract: In this paper we present an event aggregation strategy to convert the output of an event camera into frames processable by traditional Computer Vision algorithms. The proposed method first generates sequences of intermediate binary representations, which are then losslessly transformed into a compact format by simply applying a binary-to-decimal conversion. This strategy allows us to encode temporal information directly into pixel values, which are then interpreted by deep learning models. We apply our strategy, called Temporal Binary Representation, to the task of Gesture Recognition, obtaining state of the art results on the popular DVS128 Gesture Dataset. To underline the effectiveness of the proposed method compared to existing ones, we also collect an extension of the dataset under more challenging conditions on which to perform experiments.
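The binary-to-decimal aggregation can be sketched in a few lines of NumPy; the bit-ordering convention and the normalization below are assumptions for illustration, and the number of stacked frames should stay small (e.g., 8 or 16 bits).

    # N consecutive binary event frames become one frame whose pixel values
    # encode the recent activation history.
    import numpy as np

    def temporal_binary_representation(binary_frames):
        # binary_frames: (N, H, W) array of 0/1 event frames, oldest first.
        n = binary_frames.shape[0]
        weights = 2 ** np.arange(n, dtype=np.int64)  # bit order is a convention
        tbr = np.tensordot(weights, binary_frames.astype(np.int64), axes=1)
        return tbr / float(2 ** n - 1)               # normalize to [0, 1] for a CNN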

64. Distortion-aware Monocular Depth Estimation for Omnidirectional Images [PDF]
  Hong-Xiang Chen, Kunhong Li, Yulan Guo, Zhiheng Fu, Mengyi Liu
Abstract: A main challenge for tasks on panoramas lies in the distortion of objects in the images. In this work, we propose a Distortion-Aware Monocular Omnidirectional (DAMO) dense depth estimation network to address this challenge on indoor panoramas in two steps. First, we introduce a distortion-aware module to extract calibrated semantic features from omnidirectional images. Specifically, we exploit deformable convolution to adjust its sampling grids to the geometric variations of distorted objects on panoramas, and then utilize a strip pooling module to sample against the horizontal distortion introduced by the inverse gnomonic projection. Second, we further introduce a plug-and-play spherical-aware weight matrix for our objective function to handle the uneven distribution of areas projected from a sphere. Experiments on the 360D dataset show that the proposed method can effectively extract semantic features from distorted panoramas and alleviate the supervision bias caused by distortion. It achieves state-of-the-art performance on the 360D dataset with high efficiency.
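A generic strip-pooling block in PyTorch, sketched under the assumption of a simple gated fusion; this is not the paper's exact module, only the pooling pattern it refers to, which captures the long horizontal structures produced by panoramic distortion.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StripPool(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.conv_h = nn.Conv2d(ch, ch, 1)
            self.conv_w = nn.Conv2d(ch, ch, 1)

        def forward(self, x):
            h, w = x.shape[2:]
            row = F.adaptive_avg_pool2d(x, (h, 1))   # pool across width
            col = F.adaptive_avg_pool2d(x, (1, w))   # pool across height
            row = F.interpolate(self.conv_h(row), size=(h, w))
            col = F.interpolate(self.conv_w(col), size=(h, w))
            return x * torch.sigmoid(row + col)      # gated fusion (assumption)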

65. Boosting High-Level Vision with Joint Compression Artifacts Reduction and Super-Resolution [PDF]
  Xiaoyu Xiang, Qian Lin, Jan P. Allebach
Abstract: Due to the limits of bandwidth and storage space, digital images are usually down-scaled and compressed when transmitted over networks, resulting in loss of details and jarring artifacts that can lower the performance of high-level visual tasks. In this paper, we aim to generate an artifact-free high-resolution image from a low-resolution one compressed with an arbitrary quality factor by exploring joint compression artifacts reduction (CAR) and super-resolution (SR) tasks. First, we propose a context-aware joint CAR and SR neural network (CAJNN) that integrates both local and non-local features to solve CAR and SR in one stage. Finally, a deep reconstruction network is adopted to predict high-quality and high-resolution images. Evaluation on CAR and SR benchmark datasets shows that our CAJNN model outperforms previous methods while requiring a 26.2% shorter runtime. Based on this model, we explore addressing two critical challenges in high-level computer vision: optical character recognition of low-resolution texts, and extremely tiny face detection. We demonstrate that CAJNN can serve as an effective image preprocessing method and improve the accuracy for real-scene text recognition (from 85.30% to 85.75%) and the average precision for tiny face detection (from 0.317 to 0.611).

66. Sensitivity and Specificity Evaluation of Deep Learning Models for Detection of Pneumoperitoneum on Chest Radiographs [PDF]
  Manu Goyal, Judith Austin-Strohbehn, Sean J. Sun, Karen Rodriguez, Jessica M. Sin, Yvonne Y. Cheung, Saeed Hassanpour
Abstract: Background: Deep learning has great potential to assist with detecting and triaging critical findings such as pneumoperitoneum on medical images. To be clinically useful, the performance of this technology still needs to be validated for generalizability across different types of imaging systems. Materials and Methods: This retrospective study included 1,287 chest X-ray images of patients who underwent initial chest radiography at 13 different hospitals between 2011 and 2019. The chest X-ray images were labelled independently by four radiologist experts as positive or negative for pneumoperitoneum. State-of-the-art deep learning models (ResNet101, InceptionV3, DenseNet161, and ResNeXt101) were trained on a subset of this dataset, and the automated classification performance was evaluated on the rest of the dataset by measuring the AUC, sensitivity, and specificity for each model. Furthermore, the generalizability of these deep learning models was assessed by stratifying the test dataset according to the type of the utilized imaging systems. Results: All deep learning models performed well for identifying radiographs with pneumoperitoneum, while DenseNet161 achieved the highest AUC of 95.7%, Specificity of 89.9%, and Sensitivity of 91.6%. DenseNet161 model was able to accurately classify radiographs from different imaging systems (Accuracy: 90.8%), while it was trained on images captured from a specific imaging system from a single institution. This result suggests the generalizability of our model for learning salient features in chest X-ray images to detect pneumoperitoneum, independent of the imaging system.
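For reference, the three reported metrics can be computed from a binary classifier's outputs as follows (scikit-learn; the toy labels and probabilities are made up):

    import numpy as np
    from sklearn.metrics import roc_auc_score, confusion_matrix

    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
    y_prob = np.array([.1, .4, .8, .7, .9, .2, .6, .3])   # model probabilities
    y_pred = (y_prob >= 0.5).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("AUC        ", roc_auc_score(y_true, y_prob))
    print("Sensitivity", tp / (tp + fn))   # recall on positive cases
    print("Specificity", tn / (tn + fp))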

67. Finding Physical Adversarial Examples for Autonomous Driving with Fast and Differentiable Image Compositing [PDF]
  Jinghan Yang, Adith Boloor, Ayan Chakrabarti, Xuan Zhang, Yevgeniy Vorobeychik
Abstract: There is considerable evidence that deep neural networks are vulnerable to adversarial perturbations applied directly to their digital inputs. However, it remains an open question whether this translates to vulnerabilities in real-world systems. Specifically, in the context of image inputs to autonomous driving systems, an attack can be achieved only by modifying the physical environment, so as to ensure that the resulting stream of video inputs to the car's controller leads to incorrect driving decisions. Inducing this effect on the video inputs indirectly through the environment requires accounting for system dynamics and tracking viewpoint changes. We propose a scalable and efficient approach for finding adversarial physical modifications, using a differentiable approximation for the mapping from environmental modifications (namely, rectangles drawn on the road) to the corresponding video inputs to the controller network. Given the color, location, position, and orientation parameters of the rectangles, our mapping composites them onto pre-recorded video streams of the original environment. Our mapping accounts for geometric and color variations, is differentiable with respect to the rectangle parameters, and uses multiple original video streams obtained by varying the driving trajectory. When combined with a neural network-based controller, our approach allows the design of adversarial modifications through end-to-end gradient-based optimization. We evaluate our approach using the Carla autonomous driving simulator, and show that it is significantly more scalable and far more effective at generating attacks than a prior black-box approach based on Bayesian Optimization.
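The differentiability trick can be illustrated with a soft rectangle mask whose position, size and color parameters receive gradients; the paper's full mapping additionally models geometry, viewpoint and color variation, which this toy sketch omits.

    # Composite a soft-edged rectangle onto an image so gradients flow back
    # to the rectangle parameters (all names and values are illustrative).
    import torch

    def soft_rect(img, cx, cy, w, h, color, sharp=50.0):
        # img: (3, H, W); cx, cy, w, h: scalar tensors in [0, 1] with gradients.
        H, W = img.shape[1:]
        ys = torch.linspace(0, 1, H).view(H, 1)
        xs = torch.linspace(0, 1, W).view(1, W)
        inside_x = torch.sigmoid(sharp * (w / 2 - (xs - cx).abs()))
        inside_y = torch.sigmoid(sharp * (h / 2 - (ys - cy).abs()))
        mask = inside_x * inside_y                   # soft rectangle mask
        return img * (1 - mask) + color.view(3, 1, 1) * mask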

68. A Grid-based Representation for Human Action Recognition [PDF]
  Soufiane Lamghari, Guillaume-Alexandre Bilodeau, Nicolas Saunier
Abstract: Human action recognition (HAR) in videos is a fundamental research topic in computer vision. It consists mainly in understanding actions performed by humans based on a sequence of visual observations. In recent years, HAR has witnessed significant progress, especially with the emergence of deep learning models. However, most existing approaches for action recognition rely on information that is not always relevant for the task, and are limited in the way they fuse temporal information. In this paper, we propose a novel method for human action recognition that efficiently encodes the most discriminative appearance information of an action, with explicit attention on representative pose features, into a new compact grid representation. Our GRAR (Grid-based Representation for Action Recognition) method is tested on several benchmark datasets that demonstrate that our model can accurately recognize human actions, despite intra-class appearance variations and occlusion challenges.

69. Efficient and Compact Convolutional Neural Network Architectures for Non-temporal Real-time Fire Detection [PDF]
  William Thomson, Neelanjan Bhowmik, Toby P. Breckon
Abstract: Automatic visual fire detection is used to complement traditional fire detection sensor systems (smoke/heat). In this work, we investigate different Convolutional Neural Network (CNN) architectures and their variants for the non-temporal real-time bounds detection of fire pixel regions in video (or still) imagery. Two reduced-complexity compact CNN architectures (NasNet-A-OnFire and ShuffleNetV2-OnFire) are proposed through experimental analysis to optimise the computational efficiency for this task. The results improve upon the current state-of-the-art solution for fire detection, achieving an accuracy of 95% for full-frame binary classification and 97% for superpixel localisation. We notably achieve a speed-up by a factor of 2.3x for binary classification and 1.3x for superpixel localisation, with runtimes of 40 fps and 18 fps respectively, outperforming prior work in the field and presenting an efficient, robust and real-time solution for fire region detection. Subsequent implementation on low-powered devices (Nvidia Xavier-NX, achieving 49 fps for full-frame classification via ShuffleNetV2-OnFire) demonstrates our architectures are suitable for various real-world deployment applications.

70. Directed Variational Cross-encoder Network for Few-shot Multi-image Co-segmentation [PDF]
  Sayan Banerjee, S Divakar Bhat, Subhasis Chaudhuri, Rajbabu Velmurugan
Abstract: In this paper, we propose a novel framework for multi-image co-segmentation using class agnostic meta-learning strategy by generalizing to new classes given only a small number of training samples for each new class. We have developed a novel encoder-decoder network termed as DVICE (Directed Variational Inference Cross Encoder), which learns a continuous embedding space to ensure better similarity learning. We employ a combination of the proposed DVICE network and a novel few-shot learning approach to tackle the small sample size problem encountered in co-segmentation with small datasets like iCoseg and MSRC. Furthermore, the proposed framework does not use any semantic class labels and is entirely class agnostic. Through exhaustive experimentation over multiple datasets using only a small volume of training data, we have demonstrated that our approach outperforms all existing state-of-the-art techniques.

71. Discovering Pattern Structure Using Differentiable Compositing [PDF]
  Pradyumna Reddy, Paul Guerrero, Matt Fisher, Wilmot Li, Niloy J. Mitra
Abstract: Patterns, which are collections of elements arranged in regular or near-regular arrangements, are an important graphic art form and widely used due to their elegant simplicity and aesthetic appeal. When a pattern is encoded as a flat image without the underlying structure, manually editing the pattern is tedious and challenging, as one has to both preserve the individual element shapes and their original relative arrangements. State-of-the-art deep learning frameworks that operate at the pixel level are unsuitable for manipulating such patterns. Specifically, these methods can easily disturb the shapes of the individual elements or their arrangement, and thus fail to preserve the latent structures of the input patterns. We present a novel differentiable compositing operator using pattern elements and use it to discover structures, in the form of a layered representation of graphical objects, directly from raw pattern images. This operator allows us to adapt current deep learning based image methods to effectively handle patterns. We evaluate our method on a range of patterns and demonstrate superiority in the context of pattern manipulations when compared against state-of-the-art methods.

72. The NVIDIA PilotNet Experiments [PDF]
  Mariusz Bojarski, Chenyi Chen, Joyjit Daw, Alperen Değirmenci, Joya Deri, Bernhard Firner, Beat Flepp, Sachin Gogri, Jesse Hong, Lawrence Jackel, Zhenhua Jia, BJ Lee, Bo Liu, Fei Liu, Urs Muller, Samuel Payne, Nischal Kota Nagendra Prasad, Artem Provodin, John Roach, Timur Rvachov, Neha Tadimeti, Jesper van Engelen, Haiguang Wen, Eric Yang, Zongyi Yang
Abstract: Four years ago, an experimental system known as PilotNet became the first NVIDIA system to steer an autonomous car along a roadway. This system represents a departure from the classical approach for self-driving in which the process is manually decomposed into a series of modules, each performing a different task. In PilotNet, on the other hand, a single deep neural network (DNN) takes pixels as input and produces a desired vehicle trajectory as output; there are no distinct internal modules connected by human-designed interfaces. We believe that handcrafted interfaces ultimately limit performance by restricting information flow through the system and that a learned approach, in combination with other artificial intelligence systems that add redundancy, will lead to better overall performing systems. We continue to conduct research toward that goal. This document describes the PilotNet lane-keeping effort, carried out over the past five years by our NVIDIA PilotNet group in Holmdel, New Jersey. Here we present a snapshot of system status in mid-2020 and highlight some of the work done by the PilotNet group.

73. DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement [PDF]
  Mohamed Ali Souibgui, Yousri Kessentini
Abstract: Documents often exhibit various forms of degradation, which make them hard to read and substantially deteriorate the performance of an OCR system. In this paper, we propose an effective end-to-end framework named Document Enhancement Generative Adversarial Networks (DE-GAN) that uses conditional GANs (cGANs) to restore severely degraded document images. To the best of our knowledge, this practice has not been studied within the context of generative adversarial deep networks. We demonstrate that, in different tasks (document clean-up, binarization, deblurring and watermark removal), DE-GAN can produce an enhanced version of the degraded document with high quality. In addition, our approach provides consistent improvements compared to state-of-the-art methods over the widely used DIBCO 2013, DIBCO 2017 and H-DIBCO 2018 datasets, proving its ability to restore a degraded document image to its ideal condition. The results obtained on a wide variety of degradations reveal the flexibility of the proposed model to be exploited in other document enhancement problems.

74. Gradient Aware Cascade Network for Multi-Focus Image Fusion [PDF]
  Boyuan Ma, Xiang Yin, Di Wu, Xiaojuan Ban, Haiyou Huang
Abstract: The general aim of multi-focus image fusion is to gather focused regions of different images to generate a unique all-in-focus fused image. Deep learning based methods have become the mainstream of image fusion by virtue of their powerful feature representation ability. However, most existing deep learning structures fail to balance fusion quality and end-to-end implementation convenience. End-to-end decoder design often leads to poor performance because of its non-linear mapping mechanism. On the other hand, generating an intermediate decision map achieves better quality for the fused image, but relies on rectification with empirical post-processing parameter choices. In this work, to handle the requirements of both output image quality and comprehensive simplicity of structure implementation, we propose a cascade network to simultaneously generate the decision map and the fused result with an end-to-end training procedure. It avoids the dependence on empirical post-processing methods in the inference stage. To improve the fusion quality, we introduce a gradient-aware loss function to preserve gradient information in the output fused image. In addition, we design a decision calibration strategy to decrease the time consumption in the application of multiple image fusion. Extensive experiments are conducted to compare with 16 different state-of-the-art multi-focus image fusion structures on 6 assessment metrics. The results prove that our designed structure can generally ameliorate the output fused image quality, while implementation efficiency increases by over 30% for multiple image fusion.
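A minimal sketch of a gradient-aware loss of this kind (PyTorch), assuming simple finite-difference image gradients and an L1 penalty; the weighting lam is a hypothetical hyperparameter.

    import torch.nn.functional as F

    def gradient_loss(pred, target):
        # pred, target: (B, C, H, W); finite differences along x and y
        dx_p = pred[..., :, 1:] - pred[..., :, :-1]
        dy_p = pred[..., 1:, :] - pred[..., :-1, :]
        dx_t = target[..., :, 1:] - target[..., :, :-1]
        dy_t = target[..., 1:, :] - target[..., :-1, :]
        return F.l1_loss(dx_p, dx_t) + F.l1_loss(dy_p, dy_t)

    def fusion_loss(pred, target, lam=0.5):
        # intensity term plus a term that preserves gradient information
        return F.l1_loss(pred, target) + lam * gradient_loss(pred, target)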

75. Self-Selective Context for Interaction Recognition [PDF]
  Mert Kilickaya, Noureldien Hussein, Efstratios Gavves, Arnold Smeulders
Abstract: Human-object interaction recognition aims at identifying the relationship between a human subject and an object. Researchers incorporate global scene context into the early layers of deep Convolutional Neural Networks as a solution. They report a significant increase in performance, since interactions are generally correlated with the scene (i.e., riding a bicycle on the city street). However, this approach leads to the following problems. It increases the network size in the early layers, and is therefore not efficient. It leads to noisy filter responses when the scene is irrelevant, and is therefore not accurate. It only leverages scene context, whereas human-object interactions offer a multitude of contexts, and is therefore incomplete. To circumvent these issues, in this work, we propose Self-Selective Context (SSC). SSC operates on the joint appearance of human-objects and context to bring the most discriminative context(s) into play for recognition. We devise novel contextual features that model the locality of human-object interactions and show that SSC can seamlessly integrate with state-of-the-art interaction recognition models. Our experiments show that SSC leads to an important increase in interaction recognition performance, while using much fewer parameters.

76. Picture-to-Amount (PITA): Predicting Relative Ingredient Amounts from Food Images [PDF]
  Jiatong Li, Fangda Han, Ricardo Guerrero, Vladimir Pavlovic
Abstract: Increased awareness of the impact of food consumption on health and lifestyle today has given rise to novel data-driven food analysis systems. Although these systems may recognize the ingredients, a detailed analysis of their amounts in the meal, which is paramount for estimating the correct nutrition, is usually ignored. In this paper, we study the novel and challenging problem of predicting the relative amount of each ingredient from a food image. We propose PITA, the Picture-to-Amount deep learning architecture to solve the problem. More specifically, we predict the ingredient amounts using a domain-driven Wasserstein loss from image-to-recipe cross-modal embeddings learned to align the two views of food data. Experiments on a dataset of recipes collected from the Internet show the model generates promising results and improves the baselines on this challenging task. A demo of our system and our data is available at: this http URL.

77. Robust Face Alignment by Multi-order High-precision Hourglass Network [PDF]
  Jun Wan, Zhihui Lai, Jun Liu, Jie Zhou, Can Gao
Abstract: Heatmap regression (HR) has become one of the mainstream approaches for face alignment and has obtained promising results under constrained environments. However, when a face image suffers from large pose variations, heavy occlusions and complicated illuminations, the performances of HR methods degrade greatly due to the low resolutions of the generated landmark heatmaps and the exclusion of important high-order information that can be used to learn more discriminative features. To address the alignment problem for faces with extremely large poses and heavy occlusions, this paper proposes a heatmap subpixel regression (HSR) method and a multi-order cross geometry-aware (MCG) model, which are seamlessly integrated into a novel multi-order high-precision hourglass network (MHHN). The HSR method is proposed to achieve high-precision landmark detection by a well-designed subpixel detection loss (SDL) and subpixel detection technology (SDT). At the same time, the MCG model is able to use the proposed multi-order cross information to learn more discriminative representations for enhancing facial geometric constraints and context information. To the best of our knowledge, this is the first study to explore heatmap subpixel regression for robust and high-precision face alignment. The experimental results from challenging benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in the literature.

78. PolarDet: A Fast, More Precise Detector for Rotated Target in Aerial Images [PDF]
  Pengbo Zhao, Zhenshen Qu, Yingjia Bu, Wenming Tan, Ye Ren, Shiliang Pu
Abstract: Fast and precise object detection for high-resolution aerial images has been a challenging task over the years. Due to the sharp variations in object scale, rotation, and aspect ratio, most existing methods are inefficient and imprecise. In this paper, we represent oriented objects by a polar method in polar coordinates and propose PolarDet, a fast and accurate one-stage object detector based on that representation. Our detector introduces a sub-pixel center semantic structure to further improve classification veracity. PolarDet achieves nearly all SOTA performance in aerial object detection tasks with faster inference speed. In detail, our approach obtains SOTA results on DOTA, UCAS-AOD, and HRSC with 76.64% mAP, 97.01% mAP, and 90.46% mAP respectively. Most noticeably, our PolarDet gets the best performance and reaches the fastest speed (32 fps) on the UCAS-AOD dataset.

79. A Self-supervised Cascaded Refinement Network for Point Cloud Completion [PDF]
  Xiaogang Wang, Marcelo H Ang Jr, Gim Hee Lee
Abstract: Point clouds are often sparse and incomplete, which imposes difficulties for real-world applications, such as 3D object classification, detection and segmentation. Existing shape completion methods tend to generate coarse shapes of objects without fine-grained details. Moreover, current approaches require fully-complete ground truth, which are difficult to obtain in real-world applications. In view of these, we propose a self-supervised object completion method, which optimizes the training procedure solely on the partial input without utilizing the fully-complete ground truth. In order to generate high-quality objects with detailed geometric structures, we propose a cascaded refinement network (CRN) with a coarse-to-fine strategy to synthesize the complete objects. Considering the local details of partial input together with the adversarial training, we are able to learn the complicated distributions of point clouds and generate the object details as realistic as possible. We verify our self-supervised method on both unsupervised and supervised experimental settings and show superior performances. Quantitative and qualitative experiments on different datasets demonstrate that our method achieves more realistic outputs compared to existing state-of-the-art approaches on the 3D point cloud completion task.

80. CQ-VAE: Coordinate Quantized VAE for Uncertainty Estimation with Application to Disk Shape Analysis from Lumbar Spine MRI Images [PDF]
  Linchen Qian, Jiasong Chen, Timur Urakov, Weiyong Gu, Liang Liang
Abstract: Ambiguity is inevitable in medical images, which often results in different image interpretations (e.g. object boundaries or segmentation maps) from different human experts. Thus, a model that learns the ambiguity and outputs a probability distribution of the target, would be valuable for medical applications to assess the uncertainty of diagnosis. In this paper, we propose a powerful generative model to learn a representation of ambiguity and to generate probabilistic outputs. Our model, named Coordinate Quantization Variational Autoencoder (CQ-VAE) employs a discrete latent space with an internal discrete probability distribution by quantizing the coordinates of a continuous latent space. As a result, the output distribution from CQ-VAE is discrete. During training, Gumbel-Softmax sampling is used to enable backpropagation through the discrete latent space. A matching algorithm is used to establish the correspondence between model-generated samples and "ground-truth" samples, which makes a trade-off between the ability to generate new samples and the ability to represent training samples. Besides these probabilistic components to generate possible outputs, our model has a deterministic path to output the best estimation. We demonstrated our method on a lumbar disk image dataset, and the results show that our CQ-VAE can learn lumbar disk shape variation and uncertainty.
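The mechanism of backpropagating through a discrete latent can be sketched with torch.nn.functional.gumbel_softmax; the tiny encoder/decoder below are placeholders, not the CQ-VAE architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyDiscreteVAE(nn.Module):
        def __init__(self, d_in=64, n_codes=32):
            super().__init__()
            self.enc = nn.Linear(d_in, n_codes)   # logits over discrete codes
            self.dec = nn.Linear(n_codes, d_in)

        def forward(self, x, tau=1.0):
            logits = self.enc(x)
            # hard=True: one-hot samples in the forward pass, soft gradients
            # in the backward pass, enabling training through a discrete space
            z = F.gumbel_softmax(logits, tau=tau, hard=True)
            return self.dec(z), logits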

81. Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering [PDF]
  Hantao Huang, Tao Han, Wei Han, Deep Yap, Cheng-Ming Chiang
Abstract: Visual Question Answering (VQA) is challenging due to the complex cross-modal relations. It has received extensive attention from the research community. From the human perspective, to answer a visual question, one needs to read the question and then refer to the image to generate an answer. This answer is then checked against the question and image again for the final confirmation. In this paper, we mimic this process and propose a fully attention based VQA architecture. Moreover, an answer-checking module is proposed to perform a unified attention on the joint answer, question and image representation to update the answer. This mimics the human answer-checking process of considering the answer in context. With answer-checking modules and transferred BERT layers, our model achieves a state-of-the-art accuracy of 71.57% using fewer parameters on the VQA-v2.0 test-standard split.

82. DEAL: Difficulty-aware Active Learning for Semantic Segmentation [PDF]
  Shuai Xie, Zunlei Feng, Ying Chen, Songtao Sun, Chao Ma, Mingli Song
Abstract: Active learning aims to address the paucity of labeled data by finding the most informative samples. However, when applying to semantic segmentation, existing methods ignore the segmentation difficulty of different semantic areas, which leads to poor performance on those hard semantic areas such as tiny or slender objects. To deal with this problem, we propose a semantic Difficulty-awarE Active Learning (DEAL) network composed of two branches: the common segmentation branch and the semantic difficulty branch. For the latter branch, with the supervision of segmentation error between the segmentation result and GT, a pixel-wise probability attention module is introduced to learn the semantic difficulty scores for different semantic areas. Finally, two acquisition functions are devised to select the most valuable samples with semantic difficulty. Competitive results on semantic segmentation benchmarks demonstrate that DEAL achieves state-of-the-art active learning performance and improves the performance of the hard semantic areas in particular.

83. Automatic Tree Ring Detection using Jacobi Sets [PDF]
  Kayla Makela, Tim Ophelders, Michelle Quigley, Elizabeth Munch, Daniel Chitwood, Asia Dowtin
Abstract: Tree ring widths are an important source of climatic and historical data, but measuring these widths typically requires extensive manual work. Computer vision techniques provide promising directions towards the automation of tree ring detection, but most automated methods still require a substantial amount of user interaction to obtain high accuracy. We perform analysis on 3D X-ray CT images of a cross-section of a tree trunk, known as a tree disk. We present novel automated methods for locating the pith (center) of a tree disk, and ring boundaries. Our methods use a combination of standard image processing techniques and tools from topological data analysis. We evaluate the efficacy of our method for two different CT scans by comparing its results to manually located rings and centers and show that it is better than current automatic methods in terms of correctly counting each ring and its location. Our methods have several parameters, which we optimize experimentally by minimizing edit distances to the manually obtained locations.

84. MeshMVS: Multi-View Stereo Guided Mesh Reconstruction [PDF]
  Rakesh Shrestha, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Qingkun Su, Ping Tan
Abstract: Deep learning based 3D shape generation methods generally utilize latent features extracted from color images to encode the objects' semantics and guide the shape generation process. These color image semantics only implicitly encode 3D information, potentially limiting the accuracy of the generated shapes. In this paper we propose a multi-view mesh generation method which incorporates geometry information in the color images explicitly, by using the features from intermediate 2.5D depth representations of the input images and regularizing the 3D shapes against these depth images. Our system first predicts a coarse 3D volume from the color images by probabilistically merging voxel occupancy grids from individual views. Depth images corresponding to the multi-view color images are predicted, which along with the rendered depth images of the coarse shape are used as a contrastive input whose features guide the refinement of the coarse shape through a series of graph convolution networks. Attention-based multi-view feature pooling is proposed to fuse the contrastive depth features from different viewpoints, which are fed to the graph convolution networks. We validate the proposed multi-view mesh generation method on ShapeNet, where we obtain a significant improvement, with a 34% decrease in Chamfer distance to ground truth and a 14% increase in F1-score compared with the state-of-the-art multi-view shape generation method.

85. Long-Term Face Tracking for Crowded Video-Surveillance Scenarios [PDF]
  Germán Barquero, Carles Fernández, Isabelle Hupont
Abstract: Most current multi-object trackers focus on short-term tracking, and are based on deep and complex systems that do not operate in real-time, often making them impractical for video-surveillance. In this paper, we present a long-term multi-face tracking architecture conceived for working in crowded contexts, particularly unconstrained in terms of movement and occlusions, and where the face is often the only visible part of the person. Our system benefits from advances in the fields of face detection and face recognition to achieve long-term tracking. It follows a tracking-by-detection approach, combining a fast short-term visual tracker with a novel online tracklet reconnection strategy grounded on face verification. Additionally, a correction module is included to correct past track assignments with no extra computational cost. We present a series of experiments introducing novel, specialized metrics for the evaluation of long-term tracking capabilities and a video dataset that we publicly release. Findings demonstrate that, in this context, our approach allows to obtain up to 50% longer tracks than state-of-the-art deep learning trackers.
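A sketch of the face-verification-based reconnection idea, assuming unit-normalized face embeddings and a made-up cosine-similarity threshold; the actual reconnection strategy in the paper is more elaborate.

    import numpy as np

    def reconnect(det_emb, inactive, thresh=0.6):
        # inactive: dict track_id -> mean face embedding of a lost track
        best_id, best_sim = None, thresh
        v = det_emb / np.linalg.norm(det_emb)
        for tid, emb in inactive.items():
            sim = float(v @ (emb / np.linalg.norm(emb)))   # cosine similarity
            if sim > best_sim:
                best_id, best_sim = tid, sim
        return best_id        # None -> start a new track instead of reviving one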

86. Active Domain Adaptation via Clustering Uncertainty-weighted Embeddings [PDF]
  Viraj Prabhu, Arjun Chandrasekaran, Kate Saenko, Judy Hoffman
Abstract: Generalizing deep neural networks to new target domains is critical to their real-world utility. In practice, it may be feasible to get some target data labeled, but to be cost-effective it is desirable to select a maximally-informative subset via active learning (AL). We study this problem of AL under a domain shift. We empirically demonstrate how existing AL approaches based solely on model uncertainty or representative sampling are suboptimal for active domain adaptation. Our algorithm, Active Domain Adaptation via CLustering Uncertainty-weighted Embeddings (ADA-CLUE), i) identifies diverse datapoints for labeling that are both uncertain under the model and representative of unlabeled target data, and ii) leverages the available source and target data for adaptation by optimizing a semi-supervised adversarial entropy loss that is complementary to our active sampling objective. On standard image classification benchmarks for domain adaptation, ADA-CLUE consistently performs as well or better than competing active adaptation, active learning, and domain adaptation methods across shift severities, model initializations, and labeling budgets.
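One simplified reading of uncertainty-weighted clustering for batch selection, sketched with scikit-learn; the entropy weighting and centroid-nearest querying below are assumptions for illustration, not the exact ADA-CLUE objective.

    import numpy as np
    from scipy.stats import entropy
    from sklearn.cluster import KMeans

    def select_batch(feats, probs, budget):
        # probs: (N, C) predictive distributions on unlabeled target points
        w = entropy(probs.T)                      # per-sample predictive entropy
        km = KMeans(n_clusters=budget, n_init=10).fit(feats, sample_weight=w)
        d = np.linalg.norm(feats[:, None] - km.cluster_centers_[None], axis=2)
        return np.unique(d.argmin(axis=0))        # diverse, uncertain points to label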

87. Generalized Intersection Algorithms with Fixpoints for Image Decomposition Learning [PDF]
  Robin Richter, Duy H. Thai, Stephan F. Huckemann
Abstract: In image processing, classical methods minimize a suitable functional that balances between computational feasibility (convexity of the functional is ideal) and suitable penalties reflecting the desired image decomposition. The fact that algorithms derived from such minimization problems can be used to construct (deep) learning architectures has spurred the development of algorithms that can be trained for a specifically desired image decomposition, e.g. into cartoon and texture. While many such methods are very successful, theoretical guarantees are only scarcely available. To this end, in this contribution, we formalize a general class of intersection point problems encompassing a wide range of (learned) image decomposition models, and we give an existence result for a large subclass of such problems, i.e. giving the existence of a fixpoint of the corresponding algorithm. This class generalizes classical model-based variational problems, such as the TV-l2 model or the more general TV-Hilbert model. To illustrate the potential for learned algorithms, novel (non-learned) choices within our class show comparable results in denoising and texture removal.

88. Ensembling Low Precision Models for Binary Biomedical Image Segmentation [PDF]
  Tianyu Ma, Hang Zhang, Hanley Ong, Amar Vora, Thanh D. Nguyen, Ajay Gupta, Yi Wang, Mert Sabuncu
Abstract: Segmentation of anatomical regions of interest such as vessels or small lesions in medical images is still a difficult problem that is often tackled with manual input by an expert. One of the major challenges for this task is that the appearance of foreground (positive) regions can be similar to background (negative) regions. As a result, many automatic segmentation algorithms tend to exhibit asymmetric errors, typically producing more false positives than false negatives. In this paper, we aim to leverage this asymmetry and train a diverse ensemble of models with very high recall, while sacrificing their precision. Our core idea is straightforward: A diverse ensemble of low precision and high recall models are likely to make different false positive errors (classifying background as foreground in different parts of the image), but the true positives will tend to be consistent. Thus, in aggregate the false positive errors will cancel out, yielding high performance for the ensemble. Our strategy is general and can be applied with any segmentation model. In three different applications (carotid artery segmentation in a neck CT angiography, myocardium segmentation in a cardiovascular MRI and multiple sclerosis lesion segmentation in a brain MRI), we show how the proposed approach can significantly boost the performance of a baseline segmentation method.
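The core intuition, that uncorrelated false positives cancel in aggregate while consistent true positives survive, can be sketched as a vote over binary masks (the majority threshold is an illustrative choice):

    import numpy as np

    def ensemble_masks(masks, min_votes=None):
        # masks: (M, H, W) binary predictions from M high-recall models
        votes = masks.sum(axis=0)
        k = min_votes if min_votes is not None else masks.shape[0] // 2 + 1
        return (votes >= k).astype(np.uint8)   # pixels kept only where models agree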

89. Zoom-CAM: Generating Fine-grained Pixel Annotations from Image Labels [PDF]
  Xiangwei Shi, Seyran Khademi, Yunqiang Li, Jan van Gemert
Abstract: Current weakly supervised object localization and segmentation rely on class-discriminative visualization techniques to generate pseudo-labels for pixel-level training. Such visualization methods, including class activation mapping (CAM) and Grad-CAM, use only the deepest, lowest-resolution convolutional layer, missing all information in intermediate layers. We propose Zoom-CAM: going beyond the last, lowest-resolution layer by integrating the importance maps over all activations in intermediate layers. Zoom-CAM captures fine-grained small-scale objects for various discriminative class instances, which are commonly missed by the baseline visualization methods. We focus on generating pixel-level pseudo-labels from class labels. The quality of our pseudo-labels evaluated on the ImageNet localization task exhibits more than 2.8% improvement in top-1 error. For weakly supervised semantic segmentation, our generated pseudo-labels improve a state-of-the-art model by 1.1%.
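A simplified sketch of integrating importance maps across layers: each intermediate map is upsampled to a common resolution and aggregated. How Zoom-CAM weights the individual layers is more involved; the plain mean used here is an assumption.

    import torch
    import torch.nn.functional as F

    def fuse_cams(cams, out_size):
        # cams: list of (1, 1, h_i, w_i) importance maps from different layers
        up = [F.interpolate(c, size=out_size, mode="bilinear",
                            align_corners=False) for c in cams]
        fused = torch.relu(torch.stack(up).mean(dim=0))
        return fused / (fused.max() + 1e-8)   # normalized pseudo-label map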

90. A Survey of Machine Learning Techniques in Adversarial Image Forensics [PDF]
  Ehsan Nowroozi, Ali Dehghantanha, Reza M. Parizi, Kim-Kwang Raymond Choo
Abstract: Image forensics plays a crucial role in both criminal investigations (e.g., dissemination of fake images to spread racial hate or false narratives about specific ethnicity groups) and civil litigation (e.g., defamation). Increasingly, machine learning approaches are also utilized in image forensics. However, there are also a number of limitations and vulnerabilities associated with machine learning-based approaches, for example how to detect adversarial (image) examples, with real-world consequences (e.g., inadmissible evidence, or wrongful conviction). Therefore, with a focus on image forensics, this paper surveys techniques that can be used to enhance the robustness of machine learning-based binary manipulation detectors in various adversarial scenarios.

91. RobustBench: a standardized adversarial robustness benchmark [PDF]
  Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Nicolas Flammarion, Mung Chiang, Prateek Mittal, Matthias Hein
Abstract: Evaluation of adversarial robustness is often error-prone, leading to overestimation of the true robustness of models. While adaptive attacks designed for a particular defense are a way out of this, there are only approximate guidelines on how to perform them. Moreover, adaptive evaluations are highly customized for particular models, which makes it difficult to compare different defenses. Our goal is to establish a standardized benchmark of adversarial robustness, which as accurately as possible reflects the robustness of the considered models within a reasonable computational budget. This requires imposing some restrictions on the admitted models to rule out defenses that only make gradient-based attacks ineffective without improving actual robustness. We evaluate the robustness of models for our benchmark with AutoAttack, an ensemble of white- and black-box attacks which was recently shown in a large-scale study to improve almost all robustness evaluations compared to the original publications. Our leaderboard, hosted at this http URL, aims at reflecting the current state of the art on a set of well-defined tasks in $\ell_\infty$- and $\ell_2$-threat models, with possible extensions in the future. Additionally, we open-source the library this http URL that provides unified access to state-of-the-art robust models to facilitate their downstream applications. Finally, based on the collected models, we analyze general trends in $\ell_p$-robustness and its impact on other tasks such as robustness to various distribution shifts and out-of-distribution detection.

92. Agent-based Simulation Model and Deep Learning Techniques to Evaluate and Predict Transportation Trends around COVID-19 [PDF]
  Ding Wang, Fan Zuo, Jingqin Gao, Yueshuai He, Zilin Bian, Suzana Duran Bernardes, Chaekuk Na, Jingxing Wang, John Petinos, Kaan Ozbay, Joseph Y.J. Chow, Shri Iyer, Hani Nassif, Xuegang Jeff Ban
Abstract: The COVID-19 pandemic has affected travel behaviors and transportation system operations, and cities are grappling with what policies can be effective for a phased reopening shaped by social distancing. This edition of the white paper updates travel trends and highlights an agent-based simulation model's results to predict the impact of proposed phased reopening strategies. It also introduces a real-time video processing method to measure social distancing through cameras on city streets.

93. Optimism in the Face of Adversity: Understanding and Improving Deep Learning through Adversarial Robustness [PDF]
  Guillermo Ortiz-Jimenez, Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard
Abstract: Driven by massive amounts of data and important advances in computational resources, new deep learning systems have achieved outstanding results in a large spectrum of applications. Nevertheless, our current theoretical understanding of the mathematical foundations of deep learning lags far behind its empirical success. Towards solving the vulnerability of neural networks, however, the field of adversarial robustness has recently become one of the main sources of explanations of our deep models. In this article, we provide an in-depth review of the field of adversarial robustness in deep learning, and give a self-contained introduction to its main notions. But, in contrast to the mainstream pessimistic perspective on adversarial robustness, we focus on the main positive aspects that it entails. We highlight the intuitive connection between adversarial examples and the geometry of deep neural networks, and eventually explore how the geometric study of adversarial examples can serve as a powerful tool to understand deep learning. Furthermore, we demonstrate the broad applicability of adversarial robustness, providing an overview of the main emerging applications of adversarial robustness beyond security. The goal of this article is to provide readers with a set of new perspectives to understand deep learning, and to supply them with intuitive tools and insights on how to use adversarial robustness to improve it.

94. Practical Frank-Wolfe algorithms [PDF]
  Vladimir Kolmogorov
Abstract: In the last decade there has been a resurgence of interest in Frank-Wolfe (FW) style methods for optimizing a smooth convex function over a polytope. Examples of recently developed techniques include the Decomposition-invariant Conditional Gradient (DiCG), Blended Conditional Gradient (BCG), and Frank-Wolfe with in-face directions (IF-FW) methods. We introduce two extensions of these techniques. First, we augment DiCG with the working set strategy, and show how to optimize over the working set using shadow simplex steps. Second, we generalize in-face Frank-Wolfe directions to polytopes in which faces cannot be efficiently computed, and also describe a generic recursive procedure that can be used in conjunction with several FW-style techniques. Experimental results indicate that these extensions are capable of speeding up original algorithms by orders of magnitude for certain applications.
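For orientation, the classical FW loop that these variants build on is only a few lines when the polytope is the probability simplex, where the linear minimization oracle reduces to picking the best vertex:

    import numpy as np

    def frank_wolfe_simplex(grad, x0, steps=100):
        x = x0.copy()
        for t in range(steps):
            g = grad(x)
            s = np.zeros_like(x)
            s[np.argmin(g)] = 1.0          # LMO over the simplex: best vertex
            gamma = 2.0 / (t + 2.0)        # standard diminishing step size
            x = (1 - gamma) * x + gamma * s
        return x

    # Example: minimize ||x - b||^2 over the simplex (x converges to b)
    b = np.array([0.1, 0.7, 0.2])
    x = frank_wolfe_simplex(lambda x: 2 * (x - b), np.ones(3) / 3)

DiCG, BCG and IF-FW all refine exactly this loop, by choosing better directions (away steps, in-face moves) or by restricting the LMO to a smaller active region.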

95. GASNet: Weakly-supervised Framework for COVID-19 Lesion Segmentation [PDF]
  Zhanwei Xu, Yukun Cao, Cheng Jin, Guozhu Shao, Xiaoqing Liu, Jie Zhou, Heshui Shi, Jianjiang Feng
Abstract: Segmentation of infected areas in chest CT volumes is of great significance for further diagnosis and treatment of COVID-19 patients. Due to the complex shapes and varied appearances of lesions, a large number of voxel-level labeled samples are generally required to train a lesion segmentation network, which is a main bottleneck for developing deep learning based medical image segmentation algorithms. In this paper, we propose a weakly-supervised lesion segmentation framework by embedding the Generative Adversarial training process into the Segmentation Network, which is called GASNet. GASNet is optimized to segment the lesion areas of a COVID-19 CT by the segmenter, and to replace the abnormal appearance with a generated normal appearance by the generator, so that the restored CT volumes are indistinguishable from healthy CT volumes by the discriminator. GASNet is supervised by chest CT volumes of many healthy and COVID-19 subjects without voxel-level annotations. Experiments on three public databases show that when using as few as one voxel-level labeled sample, the performance of GASNet is comparable to fully-supervised segmentation algorithms trained on dozens of voxel-level labeled samples.
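The training signal described in the abstract can be paraphrased in a few lines. In the sketch below, all three networks are placeholders and the loss is a generic non-saturating GAN objective; this is an illustration of the segment-restore-discriminate idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def restore_and_score(segmenter, generator, discriminator, covid_ct):
    """Schematic GASNet-style signal: replace the (soft) lesion region
    with synthesized 'normal' tissue, then ask the discriminator to
    score the restored volume as healthy."""
    mask = segmenter(covid_ct)                   # soft lesion mask in [0, 1]
    restored = (1 - mask) * covid_ct + mask * generator(covid_ct)
    d_out = discriminator(restored)
    # The discriminator is trained separately to output 1 on real
    # healthy CTs and 0 on restored ones; this term pushes the
    # segmenter/generator pair to make restored volumes look healthy.
    return F.binary_cross_entropy(d_out, torch.ones_like(d_out))
```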

96. Measuring breathing induced oesophageal motion and its dosimetric impact [PDF]
  Tobias Fechter, Sonja Adebahr, Anca-Ligia Grosu, Dimos Baltas
Abstract: Stereotactic body radiation therapy allows for precise and accurate dose delivery. Organ motion during treatment bears the risk of undetected high-dose exposure of healthy tissue. An organ very susceptible to high dose is the oesophagus. Its low contrast on CT and its oblong shape render motion estimation difficult. We tackle this issue with modern algorithms that measure oesophageal motion voxel-wise and estimate the motion-related dosimetric impact. Oesophageal motion was measured using deformable image registration and 4DCT of 11 internal and 5 public datasets. The current clinical practice of contouring the organ on 3DCT was compared to temporally resolved 4DCT contours. The dosimetric impact of the motion was estimated by analysing the trajectory of each voxel in the 4D dose distribution. Finally, an organ motion model was built, allowing for easier patient-wise comparisons. Motion analysis showed mean absolute maximal motion amplitudes of 4.24 +/- 2.71 mm left-right, 4.81 +/- 2.58 mm anterior-posterior and 10.21 +/- 5.13 mm superior-inferior. Motion between the cohorts differed significantly. In around 50 % of the cases the dosimetric passing criteria were violated. Contours created on 3DCT did not cover 14 % of the organ for 50 % of the respiratory cycle, and the 3D contour is around 38 % smaller than the union of all 4D contours. The motion model revealed that the maximal motion is not limited to the lower part of the organ. Our results showed motion amplitudes higher than most values reported in the literature and that motion is very heterogeneous across patients. Therefore, individual motion information should be considered in contouring and planning.
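The voxel-wise motion measurement admits a compact illustration: given displacement fields obtained by registering every 4DCT phase to a reference phase (an assumption about the pipeline; the paper's registration details are not reproduced here), the maximal motion amplitude per axis is simply the range of each voxel's trajectory.

```python
import numpy as np

def motion_amplitude(displacements):
    """displacements: (n_phases, Z, Y, X, 3) array of per-phase
    displacement vectors in mm, one field per respiratory phase.
    Returns the per-voxel, per-axis amplitude (trajectory range)."""
    return displacements.max(axis=0) - displacements.min(axis=0)

# Synthetic usage: 10 phases over a small volume.
disp = np.random.randn(10, 4, 4, 4, 3)
amp = motion_amplitude(disp)  # (4, 4, 4, 3): LR, AP, SI amplitude per voxel
```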

97. Semantic Histogram Based Graph Matching for Real-Time Multi-Robot Global Localization in Large Scale Environment [PDF]
  Xiyue Guo, Junjie Hu, Junfeng Chen, Fuqin Deng, Tin Lun Lam
Abstract: The core problem of visual multi-robot simultaneous localization and mapping (MR-SLAM) is how to efficiently and accurately perform multi-robot global localization (MR-GL). The difficulties are two-fold. The first is global localization under significant viewpoint differences: appearance-based localization methods tend to fail under large viewpoint changes. Recently, semantic graphs have been utilized to overcome the viewpoint variation problem. However, such methods are highly time-consuming, especially in large-scale environments. This leads to the second difficulty, which is how to perform global localization in real time. In this paper, we propose a semantic histogram-based graph matching method that is robust to viewpoint variation and can achieve real-time global localization. Based on that, we develop a system that can accurately and efficiently perform MR-GL for both homogeneous and heterogeneous robots. The experimental results show that our approach is about 30 times faster than Random Walk based semantic descriptors. Moreover, it achieves an accuracy of 95% for global localization, while the accuracy of the state-of-the-art method is 85%.
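The descriptor named in the title can be sketched directly: every node of the semantic graph is described by a histogram over the semantic labels of itself and its neighbours, and nodes are matched by descriptor similarity. The one-hop neighbourhood and the greedy matching below are simplifying assumptions.

```python
import numpy as np

def semantic_histograms(adj, labels, n_classes):
    """adj: (N, N) binary adjacency of the semantic graph;
    labels: (N,) integer semantic class per node.
    Descriptor = normalized label histogram of a node + its neighbours."""
    one_hot = np.eye(n_classes)[labels]     # (N, n_classes)
    hist = one_hot + adj @ one_hot          # self plus 1-hop neighbours
    return hist / np.linalg.norm(hist, axis=1, keepdims=True)

def greedy_match(hist_a, hist_b):
    """Match each node in graph A to its most similar node in graph B."""
    sim = hist_a @ hist_b.T                 # cosine similarity matrix
    return sim.argmax(axis=1), sim.max(axis=1)
```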

98. Meta-learning the Learning Trends Shared Across Tasks [PDF]
  Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Mubarak Shah
Abstract: Meta-learning stands for 'learning to learn' such that generalization to new tasks is achieved. Among these methods, Gradient-based meta-learning algorithms are a specific sub-class that excel at quick adaptation to new tasks with limited data. This demonstrates their ability to acquire transferable knowledge, a capability that is central to human learning. However, the existing meta-learning approaches only depend on the current task information during the adaptation, and do not share the meta-knowledge of how a similar task has been adapted before. To address this gap, we propose a 'Path-aware' model-agnostic meta-learning approach. Specifically, our approach not only learns a good initialization for adaptation, it also learns an optimal way to adapt these parameters to a set of task-specific parameters, with learnable update directions, learning rates and, most importantly, the way updates evolve over different time-steps. Compared to the existing meta-learning methods, our approach offers: (a) The ability to learn gradient-preconditioning at different time-steps of the inner-loop, thereby modeling the dynamic learning behavior shared across tasks, and (b) The capability of aggregating the learning context through the provision of direct gradient-skip connections from the old time-steps, thus avoiding overfitting and improving generalization. In essence, our approach not only learns a transferable initialization, but also models the optimal update directions, learning rates, and task-specific learning trends. Specifically, in terms of learning trends, our approach determines the way update directions shape up as the task-specific learning progresses and how the previous update history helps in the current update. Our approach is simple to implement and demonstrates faster convergence. We report significant performance improvements on a number of FSL datasets.
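The ingredient the abstract emphasizes, update rules that are themselves meta-learned per inner-loop step, fits in a short sketch. The per-step learning rates below stand in for the paper's full learnable update directions and gradient skip connections; the outer meta-update is omitted.

```python
import torch

def inner_loop(params, loss_fn, step_lrs):
    """MAML-style adaptation where each inner step t uses its own
    meta-learned learning rate step_lrs[t] instead of a fixed scalar."""
    adapted = [p.clone() for p in params]
    for lr in step_lrs:
        loss = loss_fn(adapted)
        # create_graph=True keeps the computation graph so the outer
        # loop can differentiate through the adaptation trajectory.
        grads = torch.autograd.grad(loss, adapted, create_graph=True)
        adapted = [p - lr * g for p, g in zip(adapted, grads)]
    return adapted
```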

99. MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch Normalization [PDF]
  Wen Fei, Wenrui Da, Chenglin Li, Junni Zou, Hongkai Xiong
Abstract: Substantial experiments have validated the success of the Batch Normalization (BN) layer in improving convergence and generalization. However, BN requires extra memory and floating-point computation. Moreover, BN is inaccurate on micro-batches, as it depends on batch statistics. In this paper, we address these problems by simplifying BN regularization while keeping two fundamental impacts of BN layers, i.e., data decorrelation and adaptive learning rates. We propose a novel normalization method, named MimicNorm, to improve convergence and efficiency in network training. MimicNorm consists of only two light operations: a modified weight mean operation (subtracting the mean from the weight parameter tensor) and one BN layer before the loss function (the last BN layer). We leverage neural tangent kernel (NTK) theory to prove that our weight mean operation whitens activations and transitions the network into the chaotic regime like a BN layer, and consequently leads to enhanced convergence. The last BN layer provides an autotuned learning rate and also improves accuracy. Experimental results show that MimicNorm achieves similar accuracy for various network structures, including ResNets and lightweight networks like ShuffleNet, with a reduction of about 20% in memory consumption. The code is publicly available at this https URL.
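The weight mean operation is a one-line change to a convolution, sketched below; whether the mean is taken globally or per output channel is an assumption here (the sketch uses the global mean), and the tiny network only illustrates where the single "last BN" layer sits.

```python
import torch.nn as nn
import torch.nn.functional as F

class MeanSubtractedConv(nn.Conv2d):
    """Convolution whose weights are re-centred to zero mean on every
    forward pass -- the 'weight mean operation' from the abstract."""
    def forward(self, x):
        w = self.weight - self.weight.mean()
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# BN-free body; a single BN layer sits just before the loss function.
net = nn.Sequential(
    MeanSubtractedConv(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10), nn.BatchNorm1d(10))   # the "last BN" layer
```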

100. Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders [PDF]
  Masha Itkina, Boris Ivanovic, Ransalu Senanayake, Mykel J. Kochenderfer, Marco Pavone
Abstract: Discrete latent spaces in variational autoencoders have been shown to effectively capture the data distribution for many real-world problems such as natural language understanding, human intent prediction, and visual scene representation. However, discrete latent spaces need to be sufficiently large to capture the complexities of real-world data, rendering downstream tasks computationally challenging. For instance, performing motion planning in a high-dimensional latent representation of the environment could be intractable. We consider the problem of sparsifying the discrete latent space of a trained conditional variational autoencoder, while preserving its learned multimodality. As a post hoc latent space reduction technique, we use evidential theory to identify the latent classes that receive direct evidence from a particular input condition and filter out those that do not. Experiments on diverse tasks, such as image generation and human behavior prediction, demonstrate the effectiveness of our proposed technique at reducing the discrete latent sample space size of a model while maintaining its learned multimodality.
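The sparsify-then-renormalize pattern at the heart of the method can be caricatured as follows. The scalar evidence values are placeholders; the paper derives them from evidential (Dempster-Shafer) theory applied to the network, which is not reproduced here.

```python
import numpy as np

def sparsify_latents(logits, evidence):
    """Zero out latent classes that receive no direct evidence for the
    current input condition, then renormalize the remaining mass.
    Assumes at least one latent class is supported."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    probs[evidence <= 0] = 0.0        # post hoc filtering step
    return probs / probs.sum()

# Usage: 8 latent classes, three of which receive no evidence.
logits = np.random.randn(8)
evidence = np.array([1.2, 0.0, 0.5, -0.3, 2.0, 0.0, 0.7, 0.1])
p = sparsify_latents(logits, evidence)   # sparse, still sums to 1
```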

101. Unsupervised Foveal Vision Neural Networks with Top-Down Attention [PDF]
  Ryan Burt, Nina N. Thigpen, Andreas Keil, Jose C. Principe
Abstract: Deep learning architectures are an extremely powerful tool for recognizing and classifying images. However, they require supervised learning, normally operate on vectors the size of the image pixels, and produce the best results when trained on millions of object images. To help mitigate these issues, we propose the fusion of bottom-up saliency and top-down attention employing only unsupervised learning techniques, which helps the object recognition module to focus on relevant data and learn important features that can later be fine-tuned for a specific task. In addition, by utilizing only relevant portions of the data, the training speed can be greatly improved. We test the performance of the proposed Gamma saliency technique on the Toronto and CAT2000 databases, and the foveated vision on the Street View House Numbers (SVHN) database. The results on foveated vision show that Gamma saliency is comparable to the best methods and computationally faster. The results on SVHN show that our unsupervised cognitive architecture is comparable to fully supervised methods and that Gamma saliency also improves CNN performance if desired. We also develop a top-down attention mechanism based on Gamma saliency applied to the top layer of CNNs to improve scene understanding in multi-object images or images with strong background clutter. When we compare the results with human observers on an image dataset of animals occluded in natural scenes, we show that top-down attention is capable of disambiguating objects from background and improves system performance beyond the level of human observers.

102. Feature Importance Ranking for Deep Learning [PDF]
  Maksymilian Wojtas, Ke Chen
Abstract: Feature importance ranking has become a powerful tool for explainable AI. However, its combinatorial-optimization nature poses a great challenge for deep learning. In this paper, we propose a novel dual-net architecture consisting of an operator and a selector for discovering an optimal feature subset of a fixed size and simultaneously ranking the importance of the features in that subset. During learning, the operator is trained for a supervised learning task via optimal feature subset candidates generated by the selector, which in turn learns to predict the learning performance of the operator on different candidate subsets. We develop an alternate learning algorithm that trains the two nets jointly and incorporates a stochastic local search procedure into learning to address the combinatorial optimization challenge. In deployment, the selector generates an optimal feature subset and ranks feature importance, while the operator makes predictions based on the optimal subset for test data. A thorough evaluation on synthetic, benchmark and real data sets suggests that our approach outperforms several state-of-the-art feature importance ranking and supervised feature selection methods. (Our source code is available: this https URL)
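The selector-operator interaction reduces to scoring candidate feature masks, as in the sketch below; both networks and the loss are placeholders, and the stochastic local search over masks is omitted.

```python
import torch

def subset_score(operator, x, y, mask, loss_fn):
    """Score one candidate feature subset: keep the selected features,
    zero out the rest, and measure the operator's loss. The selector
    is trained to regress these scores, which is how it learns which
    features matter."""
    with torch.no_grad():
        return loss_fn(operator(x * mask), y)   # mask: 0/1 per feature
```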

103. Shape Constrained CNN for Cardiac MR Segmentation with Simultaneous Prediction of Shape and Pose Parameters [PDF]
  Sofie Tilborghs, Tom Dresselaers, Piet Claus, Jan Bogaert, Frederik Maes
Abstract: Semantic segmentation using convolutional neural networks (CNNs) is the state-of-the-art for many medical segmentation tasks including left ventricle (LV) segmentation in cardiac MR images. However, a drawback is that these CNNs lack explicit shape constraints, occasionally resulting in unrealistic segmentations. In this paper, we perform LV and myocardial segmentation by regression of pose and shape parameters derived from a statistical shape model. The integrated shape model regularizes predicted segmentations and guarantees realistic shapes. Furthermore, in contrast to semantic segmentation, it allows direct calculation of regional measures such as myocardial thickness. We enforce robustness of shape and pose prediction by simultaneously constructing a segmentation distance map during training. We evaluated the proposed method in a fivefold cross validation on an in-house clinical dataset of 75 subjects containing a total of 1539 delineated short-axis slices covering the LV from apex to base, and achieved a correlation of 99% for LV area, 94% for myocardial area, 98% for LV dimensions and 88% for regional wall thicknesses. The method was additionally validated on the LVQuan18 and LVQuan19 public datasets and achieved state-of-the-art results.
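Because the network predicts shape and pose parameters rather than a label map, the contour is decoded from a statistical shape model. A sketch of that decoding step (mean shape and modes would come from a PCA over training contours; names and shapes are illustrative):

```python
import numpy as np

def decode_shape(mean_shape, modes, b, scale, theta, t):
    """mean_shape: (N, 2) mean contour; modes: (K, N, 2) PCA modes;
    b: (K,) predicted shape coefficients; scale, theta, t: predicted
    similarity pose. Returns the contour in image coordinates."""
    shape = mean_shape + np.tensordot(b, modes, axes=1)   # (N, 2)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return scale * shape @ R.T + t
```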

104. Light Stage Super-Resolution: Continuous High-Frequency Relighting [PDF]
  Tiancheng Sun, Zexiang Xu, Xiuming Zhang, Sean Fanello, Christoph Rhemann, Paul Debevec, Yun-Ta Tsai, Jonathan T. Barron, Ravi Ramamoorthi
Abstract: The light stage has been widely used in computer graphics for the past two decades, primarily to enable the relighting of human faces. By capturing the appearance of the human subject under different light sources, one obtains the light transport matrix of that subject, which enables image-based relighting in novel environments. However, due to the finite number of lights in the stage, the light transport matrix only represents a sparse sampling on the entire sphere. As a consequence, relighting the subject with a point light or a directional source that does not coincide exactly with one of the lights in the stage requires interpolation and resampling the images corresponding to nearby lights, and this leads to ghosting shadows, aliased specularities, and other artifacts. To ameliorate these artifacts and produce better results under arbitrary high-frequency lighting, this paper proposes a learning-based solution for the "super-resolution" of scans of human faces taken from a light stage. Given an arbitrary "query" light direction, our method aggregates the captured images corresponding to neighboring lights in the stage, and uses a neural network to synthesize a rendering of the face that appears to be illuminated by a "virtual" light source at the query location. This neural network must circumvent the inherent aliasing and regularity of the light stage data that was used for training, which we accomplish through the use of regularized traditional interpolation methods within our network. Our learned model is able to produce renderings for arbitrary light directions that exhibit realistic shadows and specular highlights, and is able to generalize across a wide variety of subjects.
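The artifacts the paper targets come from the naive baseline it replaces: blending the captured images of the nearest stage lights by angular proximity. That baseline fits in a few lines (the cosine weighting and k are illustrative choices):

```python
import numpy as np

def relight_naive(images, light_dirs, query_dir, k=4):
    """images: (L, H, W, 3) captures under L stage lights with unit
    directions light_dirs (L, 3); query_dir: unit query direction.
    Blends the k nearest lights -- the resampling that produces the
    ghosting and aliasing the learned model is trained to avoid."""
    cos = light_dirs @ query_dir
    idx = np.argsort(-cos)[:k]          # k most aligned stage lights
    w = np.clip(cos[idx], 0.0, None)
    w /= w.sum()                        # assumes query is inside coverage
    return np.tensordot(w, images[idx], axes=1)
```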

105. Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning [PDF]
  Chenjia Bai, Peng Liu, Zhaoran Wang, Kaiyu Liu, Lingxiao Wang, Yingnan Zhao
Abstract: Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where extrinsic rewards from environments are sparse or even totally disregarded. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on conditional variational inference to model the multimodality and stochasticity. We treat the environmental state-action transition as a conditional generative process, generating the next-state prediction conditioned on the current state, the action, and a latent variable. We derive an upper bound on the negative log-likelihood of the environmental transition and use this upper bound as the intrinsic reward for exploration, which allows the agent to learn skills by self-supervised exploration without observing extrinsic rewards. We evaluate the proposed method on several image-based simulation tasks and a real robotic manipulation task. Our method outperforms several state-of-the-art environment model-based exploration approaches.
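The intrinsic reward is the upper bound on the transition's negative log-likelihood, i.e. the negative ELBO of a conditional VAE dynamics model. A schematic with unit-variance Gaussian reconstruction (a simplifying assumption):

```python
import torch

def intrinsic_reward(pred_next, next_state, mu, logvar):
    """Negative ELBO of the observed transition under the variational
    dynamic model: reconstruction error plus KL to the standard normal
    prior. Transitions the model cannot yet explain get a large bonus."""
    recon = ((pred_next - next_state) ** 2).sum(dim=-1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
    return recon + kl        # upper bound on -log p(s' | s, a)
```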

106. A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences [PDF]
  Andrew Merritt, Chenhui Chu, Yuki Arase
Abstract: Multimodal neural machine translation (NMT) has become an increasingly important area of research over the years because additional modalities, such as image data, can provide more context to textual data. Furthermore, the viability of training multimodal NMT models without a large parallel corpus continues to be investigated due to low availability of parallel sentences with images, particularly for English-Japanese data. However, this void can be filled with comparable sentences that contain bilingual terms and parallel phrases, which are naturally created through media such as social network posts and e-commerce product descriptions. In this paper, we propose a new multimodal English-Japanese corpus with comparable sentences that are compiled from existing image captioning datasets. In addition, we supplement our comparable sentences with a smaller parallel corpus for validation and test purposes. To test the performance of this comparable sentence translation scenario, we train several baseline NMT models with our comparable corpus and evaluate their English-Japanese translation performance. Due to low translation scores in our baseline experiments, we believe that current multimodal NMT models are not designed to effectively utilize comparable sentence data. Despite this, we hope for our corpus to be used to further research into multimodal NMT with comparable sentences.

107. Class-incremental Learning with Pre-allocated Fixed Classifiers [PDF]
  Federico Pernici, Matteo Bruni, Claudio Baecchi, Francesco Turchini, Alberto Del Bimbo
Abstract: In class-incremental learning, a learning agent faces a stream of data with the goal of learning new classes while not forgetting previous ones. Neural networks are known to suffer under this setting, as they forget previously acquired knowledge. To address this problem, effective methods exploit past data stored in an episodic memory while expanding the final classifier nodes to accommodate the new classes. In this work, we substitute the expanding classifier with a novel fixed classifier in which a number of pre-allocated output nodes are subject to the classification loss right from the beginning of the learning phase. In contrast to the standard expanding classifier, this allows: (a) the output nodes of future unseen classes to see negative samples from the very beginning of learning, together with the positive samples that arrive incrementally; (b) learning features that do not change their geometric configuration as novel classes are incorporated in the learning model. Experiments with public datasets show that the proposed approach is as effective as the expanding classifier while exhibiting novel intriguing properties of the internal feature representation that are otherwise non-existent. Our ablation study on pre-allocating a large number of classes further validates the approach.
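The difference from an expanding classifier is a construction-time choice: allocate output nodes for the maximum number of classes up front, so the classification loss touches future classes' nodes from the first task. A PyTorch sketch (dimensions are placeholders):

```python
import torch.nn as nn

feature_dim, max_classes = 512, 100

# Expanding classifier: replaced/grown each time a task adds classes.
# Pre-allocated fixed classifier: all max_classes outputs exist from
# the start; classes not yet seen simply have no positive samples.
classifier = nn.Linear(feature_dim, max_classes)

# Training on task 0 (say, classes 0..9) already computes the loss
# over all 100 logits, so future classes accumulate negative evidence
# from the very beginning of learning.
```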

108. A general approach to compute the relevance of middle-level input features [PDF]
  Andrea Apicella, Salvatore Giugliano, Francesco Isgrò, Roberto Prevete
Abstract: This work proposes a novel general framework, in the context of eXplainable Artificial Intelligence (XAI), to construct explanations for the behaviour of Machine Learning (ML) models in terms of middle-level features. One can isolate two different ways to provide explanations in the context of XAI: low-level and middle-level explanations. Middle-level explanations have been introduced to alleviate some deficiencies of low-level explanations, such as the fact that, in the context of image classification, human users are left with a significant interpretive burden: starting from low-level explanations, one has to identify properties of the overall input that are perceptually salient for the human visual system. However, a general approach to correctly evaluate the elements of middle-level explanations with respect to ML model responses has never been proposed in the literature.

109. CT Image Segmentation for Inflamed and Fibrotic Lungs Using a Multi-Resolution Convolutional Neural Network [PDF]
  Sarah E. Gerard, Jacob Herrmann, Yi Xin, Kevin T. Martin, Emanuele Rezoagli, Davide Ippolito, Giacomo Bellani, Maurizio Cereda, Junfeng Guo, Eric A. Hoffman, David W. Kaczka, Joseph M. Reinhardt
Abstract: The purpose of this study was to develop a fully-automated segmentation algorithm, robust to various density-enhancing lung abnormalities, to facilitate rapid quantitative analysis of computed tomography images. A polymorphic training approach is proposed, in which both specifically labeled left and right lungs of humans with COPD, and nonspecifically labeled lungs of animals with acute lung injury, were incorporated into training a single neural network. The resulting network is intended for predicting left and right lung regions in humans with or without diffuse opacification and consolidation. Performance of the proposed lung segmentation algorithm was extensively evaluated on CT scans of subjects with COPD, confirmed COVID-19, lung cancer, and IPF, despite the absence of labeled training data for the latter three diseases. Lobar segmentations were obtained using the left and right lung segmentations as input to the LobeNet algorithm. Regional lobar analysis was performed using hierarchical clustering to identify radiographic subtypes of COVID-19. The proposed lung segmentation algorithm was quantitatively evaluated using semi-automated and manually-corrected segmentations in 87 COVID-19 CT images, achieving an average symmetric surface distance of $0.495 \pm 0.309$ mm and a Dice coefficient of $0.985 \pm 0.011$. Hierarchical clustering identified four radiographic phenotypes of COVID-19 based on lobar fractions of consolidated and poorly aerated tissue. The lower left and lower right lobes were consistently more afflicted by poor aeration and consolidation. However, the most severe cases demonstrated involvement of all lobes. The polymorphic training approach was able to accurately segment COVID-19 cases with diffuse consolidation without requiring COVID-19 cases for training.
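The headline metric is the Dice coefficient; for reference, its standard computation on binary masks:

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks of any equal shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)
```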
