Abstracts

1. ANR: Articulated Neural Rendering for Virtual Avatars
Amit Raj, Julian Tanke, James Hays, Minh Vo, Carsten Stoll, Christoph Lassner
Abstract: The combination of traditional rendering with neural networks in Deferred Neural Rendering (DNR) provides a compelling balance between computational complexity and realism of the resulting images. Using skinned meshes for rendering articulating objects is a natural extension for the DNR framework and would open it up to a plethora of applications. However, in this case the neural shading step must account for deformations that are possibly not captured in the mesh, as well as alignment inaccuracies and dynamics -- which can confound the DNR pipeline. We present Articulated Neural Rendering (ANR), a novel framework based on DNR which explicitly addresses its limitations for virtual human avatars. We show the superiority of ANR not only with respect to DNR but also with methods specialized for avatar creation and animation. In two user studies, we observe a clear preference for our avatar model and we demonstrate state-of-the-art performance on quantitative evaluation metrics. Perceptually, we observe better temporal stability, level of detail and plausibility.

2. Vid2Actor: Free-viewpoint Animatable Person Synthesis from Video in the Wild
Chung-Yi Weng, Brian Curless, Ira Kemelmacher-Shlizerman
Abstract: Given an "in-the-wild" video of a person, we reconstruct an animatable model of the person in the video. The output model can be rendered in any body pose to any camera view, via the learned controls, without explicit 3D mesh reconstruction. At the core of our method is a volumetric 3D human representation reconstructed with a deep network trained on input video, enabling novel pose/view synthesis. Our method is an advance over GAN-based image-to-image translation since it allows image synthesis for any pose and camera via the internal 3D representation, while at the same time it does not require a pre-rigged model or ground truth meshes for training, as in mesh-based learning. Experiments validate the design choices and yield results on synthetic data and on real videos of diverse people performing unconstrained activities (e.g. dancing or playing tennis). Finally, we demonstrate motion re-targeting and bullet-time rendering with the learned models.

3. Exploring Instance-Level Uncertainty for Medical Detection
Jiawei Yang, Yuan Liang, Yao Zhang, Weinan Song, Kun Wang, Lei He
Abstract: The ability of deep learning to predict with uncertainty is recognized as key for its adoption in clinical routines. Moreover, empirical evidence shows that modelling uncertainty can yield performance gains. While previous work has widely discussed uncertainty estimation in segmentation and classification tasks, its application to bounding-box-based detection has been limited, mainly due to the challenge of bounding box alignment. In this work, we explore augmenting a 2.5D detection CNN with two different bounding-box-level (or instance-level) uncertainty estimates, i.e., predictive variance and Monte Carlo (MC) sample variance. Experiments are conducted for lung nodule detection on the LUNA16 dataset, a task where significant semantic ambiguities can exist between nodules and non-nodules. Results show that our method improves the evaluation score from 84.57% to 88.86% by utilizing a combination of both types of variances. Moreover, we show that the generated uncertainty enables superior operating points compared to using the probability threshold only, and can further boost the performance to 89.52%. Example nodule detections are visualized to further illustrate the advantages of our method.
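
The Monte Carlo (MC) sample variance mentioned in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical function `stochastic_box_score` that stands in for one dropout-enabled forward pass of a detector on a single candidate box; the noise model and all names are illustrative assumptions, not the authors' network.

```python
import random
import statistics


def stochastic_box_score(base_score, rng, noise):
    """Hypothetical stand-in for one stochastic (dropout-enabled) forward pass.

    Returns a noisy confidence score for one candidate box, clamped to [0, 1].
    """
    return min(1.0, max(0.0, base_score + rng.gauss(0.0, noise)))


def mc_sample_variance(base_score, n_samples=20, noise=0.05, seed=0):
    """Run several stochastic passes and summarize them.

    The mean is used as the box score; the sample variance across passes
    is the MC uncertainty estimate for that box.
    """
    rng = random.Random(seed)
    samples = [stochastic_box_score(base_score, rng, noise) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.variance(samples)


mean_score, mc_var = mc_sample_variance(0.8)
```

A box whose score changes a lot across passes gets a large `mc_var`, which could then be thresholded alongside the probability itself.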

4. Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou
Abstract: Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption by the larger community. In this work, with an adequate training scheme, we produce a competitive convolution-free transformer by training on ImageNet only. We train it on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. We share our code and models to accelerate community advances on this line of research. Additionally, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the value of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets both on ImageNet (where we obtain up to 84.4% accuracy) and when transferring to other tasks.

5. Focal Frequency Loss for Generative Models
Liming Jiang, Bo Dai, Wayne Wu, Chen Change Loy
Abstract: Despite the remarkable success of generative models in creating photorealistic images using deep neural networks, gaps could still exist between the real and generated images, especially in the frequency domain. In this study, we find that narrowing the frequency domain gap can ameliorate the image synthesis quality further. To this end, we propose the focal frequency loss, a novel objective function that brings optimization of generative models into the frequency domain. The proposed loss allows the model to dynamically focus on the frequency components that are hard to synthesize by down-weighting the easy frequencies. This objective function is complementary to existing spatial losses, offering great impedance against the loss of important frequency information due to the inherent crux of neural networks. We demonstrate the versatility and effectiveness of focal frequency loss to improve various baselines in both perceptual quality and quantitative performance.

6. Warping of Radar Data into Camera Image for Cross-Modal Supervision in Automotive Applications
Christopher Grimm, Tai Fei, Ernst Warsitz, Ridha Farhoud, Tobias Breddermann, Reinhold Haeb-Umbach
Abstract: In this paper, we present a novel framework to project automotive radar range-Doppler (RD) spectrum into camera image. The utilized warping operation is designed to be fully differentiable, which allows error backpropagation through the operation. This enables the training of neural networks (NN) operating exclusively on RD spectrum by utilizing labels provided from camera vision models. As the warping operation relies on accurate scene flow, additionally, we present a novel scene flow estimation algorithm fed from camera, lidar and radar, enabling us to improve the accuracy of the warping operation. We demonstrate the framework in multiple applications like direction-of-arrival (DoA) estimation, target detection, semantic segmentation and estimation of radar power from camera data. Extensive evaluations have been carried out for the DoA application and suggest superior quality for NN based estimators compared to classical estimators. The novel scene flow estimation approach is benchmarked against state-of-the-art scene flow algorithms and outperforms them by roughly a third.

7. Coarse-to-Fine Object Tracking Using Deep Features and Correlation Filters
Ahmed Zgaren, Wassim Bouachir, Riadh Ksantini
Abstract: In recent years, deep learning trackers have achieved stimulating results while bringing interesting ideas to the tracking problem. This progress is mainly due to the use of learned deep features obtained by training deep convolutional neural networks (CNNs) on large image databases. But since CNNs were originally developed for image classification, the appearance modeling provided by their deep layers might not be discriminative enough for the tracking task. In fact, such features represent high-level information that is more related to the object category than to a specific instance of the object. Motivated by this observation, and by the fact that discriminative correlation filters (DCFs) may provide complementary low-level information, we present a novel tracking algorithm taking advantage of both approaches. We formulate the tracking task as a two-stage procedure. First, we exploit the generalization ability of deep features to coarsely estimate target translation, while ensuring invariance to appearance change. Then, we capitalize on the discriminative power of correlation filters to precisely localize the tracked object. Furthermore, we designed an update control mechanism to learn appearance change while avoiding model drift. We evaluated the proposed tracker on object tracking benchmarks. Experimental results show the robustness of our algorithm, which performs favorably against CNN and DCF-based trackers. Code is available at: this https URL

8. Principled network extraction from images
Diego Baptista, Caterina De Bacco
Abstract: Images of natural systems may represent patterns of network-like structure, which could reveal important information about the topological properties of the underlying subject. However, the image itself does not automatically provide a formal definition of a network in terms of sets of nodes and edges. Instead, this information should be suitably extracted from the raw image data. Motivated by this, we present a principled model to extract network topologies from images that is scalable and efficient. We map this goal into solving a routing optimization problem where the solution is a network that minimizes an energy function which can be interpreted in terms of an operational and infrastructural cost. Our method relies on recent results from optimal transport theory and is a principled alternative to standard image-processing techniques that are based on heuristics. We test our model on real images of the retinal vascular system, slime mold and river networks and compare with routines combining image-processing techniques. Results are tested in terms of a similarity measure related to the amount of information preserved in the extraction. We find that our model extracts networks from retinal vascular images that are more similar to hand-labeled ones, while also performing well in extracting networks from images of rivers and slime mold, for which no ground truth is available. While no single method fits all images best, our approach performs consistently across datasets, and its algorithmic implementation is efficient and can be fully automated to run on several datasets with little supervision.

9. Estimation of Driver's Gaze Region from Head Position and Orientation using Probabilistic Confidence Regions
Sumit Jha, Carlos Busso
Abstract: A smart vehicle should be able to understand human behavior and predict human actions to avoid hazardous situations. Specific traits in human behavior can be automatically predicted, which can help the vehicle make decisions, increasing safety. One of the most important aspects pertaining to the driving task is the driver's visual attention. Predicting the driver's visual attention can help a vehicle understand the awareness state of the driver, providing important contextual information. While estimating the exact gaze direction is difficult in the car environment, a coarse estimation of the visual attention can be obtained by tracking the position and orientation of the head. Since the relation between head pose and gaze direction is not one-to-one, this paper proposes a formulation based on probabilistic models to create salient regions describing the visual attention of the driver. The area of the predicted region is small when the model has high confidence in the prediction, which is directly learned from the data. We use Gaussian process regression (GPR) to implement the framework, comparing the performance with different regression formulations such as linear regression and neural network based methods. We evaluate these frameworks by studying the tradeoff between spatial resolution and accuracy of the probability map using naturalistic recordings collected with the UTDrive platform. We observe that the GPR method produces the best result, creating accurate predictions with localized salient regions. For example, the 95% confidence region is defined by an area that covers 3.77% of a sphere surrounding the driver.
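
To give a feel for the 3.77% figure, the following sketch converts an area fraction of a sphere into a solid angle and into the half-angle of an equally sized circular cap. Treating the region as a circular cap is purely an assumption for illustration; the abstract does not state the shape of the confidence region, and the function name is hypothetical.

```python
import math


def cap_half_angle_deg(area_fraction):
    """Half-angle in degrees of a spherical cap covering `area_fraction` of a sphere.

    Uses cap area / sphere area = (1 - cos(theta)) / 2.
    """
    if not 0.0 <= area_fraction <= 1.0:
        raise ValueError("area_fraction must lie in [0, 1]")
    return math.degrees(math.acos(1.0 - 2.0 * area_fraction))


fraction = 0.0377                              # 3.77% of the sphere, from the abstract
solid_angle_sr = fraction * 4.0 * math.pi      # a full sphere is 4*pi steradians
half_angle = cap_half_angle_deg(fraction)      # roughly 22 degrees
```

Under that assumption, the region is comparable to a cone of gaze directions a little over 22 degrees around its center.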

10. Multi-Modality Cut and Paste for 3D Object Detection
Wenwei Zhang, Zhe Wang, Chen Change Loy
Abstract: Three-dimensional (3D) object detection is essential in autonomous driving. There are observations that multi-modality methods based on both point cloud and imagery features perform only marginally better or sometimes worse than approaches that solely use single-modality point cloud. This paper investigates the reason behind this counter-intuitive phenomenon through a careful comparison between augmentation techniques used by single modality and multi-modality methods. We found that existing augmentations practiced in single-modality detection are equally useful for multi-modality detection. Then we further present a new multi-modality augmentation approach, Multi-mOdality Cut and pAste (MoCa). MoCa boosts detection performance by cutting point cloud and imagery patches of ground-truth objects and pasting them into different scenes in a consistent manner while avoiding collision between objects. We also explore beneficial architecture design and optimization practices in implementing a good multi-modality detector. Without using ensemble of detectors, our multi-modality detector achieves new state-of-the-art performance on nuScenes dataset and competitive performance on KITTI 3D benchmark. Our method also wins the best PKL award in the 3rd nuScenes detection challenge. Code and models will be released at this https URL.

11. SWA Object Detection
Haoyang Zhang, Ying Wang, Feras Dayoub, Niko Sünderhauf
Abstract: Do you want to gain 1.0 AP for your object detector without any inference cost or any change to your detector? Let us tell you such a recipe. It is surprisingly simple: train your detector for an extra 12 epochs using cyclical learning rates and then average these 12 checkpoints as your final detection model. This potent recipe is inspired by Stochastic Weights Averaging (SWA), which is proposed in arXiv:1803.05407 for improving generalization in deep neural networks. We found it also very effective in object detection. In this technical report, we systematically investigate the effects of applying SWA to object detection as well as instance segmentation. Through extensive experiments, we discover a good policy of performing SWA in object detection, and we consistently achieve $\sim$1.0 AP improvement over various popular detectors on the challenging COCO benchmark. We hope this work will make more researchers in object detection know this technique and help them train better object detectors. Code is available at: this https URL.
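
The checkpoint-averaging step of this recipe can be sketched in a few lines. This is a minimal illustration, assuming each checkpoint is a plain dictionary mapping a parameter name to a flat list of floats; the toy checkpoints and the function name `average_checkpoints` are hypothetical and do not come from the authors' code.

```python
def average_checkpoints(checkpoints):
    """Element-wise average of a list of {parameter_name: [floats]} dictionaries."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged


# Twelve toy checkpoints, one per extra training epoch (values are made up).
checkpoints = [{"w": [float(epoch), 2.0 * epoch], "b": [1.0]} for epoch in range(1, 13)]
swa_weights = average_checkpoints(checkpoints)
```

Because only the stored weights change, the averaged model has the same architecture and inference cost as any single checkpoint.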

12. On Calibration of Scene-Text Recognition Models
Ron Slossberg, Oron Anschel, Amir Markovitz, Ron Litman, Aviad Aberdam, Shahar Tsiper, Shai Mazor, Jon Wu, R. Manmatha
Abstract: In this work, we study the problem of word-level confidence calibration for scene-text recognition (STR). Although the topic of confidence calibration has been an active research area for the last several decades, the case of structured and sequence prediction calibration has been scarcely explored. We analyze several recent STR methods and show that they are consistently overconfident. We then focus on the calibration of STR models on the word rather than the character level. In particular, we demonstrate that for attention based decoders, calibration of individual character predictions increases word-level calibration error compared to an uncalibrated model. In addition, we apply existing calibration methodologies as well as new sequence-based extensions to numerous STR models, demonstrating reduced calibration error by up to a factor of nearly 7. Finally, we show consistently improved accuracy results by applying our proposed sequence calibration method as a preprocessing step to beam-search.
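
Word-level calibration error can be made concrete with a small sketch. It assumes, as a common convention not stated in the abstract, that a word's confidence is the product of its per-character probabilities, and it uses the standard binned expected calibration error (ECE); both function names are illustrative.

```python
import math


def word_confidence(char_probs):
    """Assumed convention: word confidence = product of character probabilities."""
    return math.prod(char_probs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |average confidence - accuracy| over bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Two words predicted with 0.95 confidence, only one of them correct.
ece = expected_calibration_error([0.95, 0.95], [True, False])
```

An overconfident model, as the abstract reports for recent STR methods, shows average confidence well above accuracy in the high-confidence bins.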

13. Direct Estimation of Spinal Cobb Angles by Structured Multi-Output Regression
Haoliang Sun, Xiantong Zhen, Chris Bailey, Parham Rasoulinejad, Yilong Yin, Shuo Li
Abstract: The Cobb angle, which quantitatively evaluates the spinal curvature, plays an important role in scoliosis diagnosis and treatment. Conventional measurement of these angles suffers from huge variability and low reliability due to intensive manual intervention. However, since there is high ambiguity and variability around the boundaries of vertebrae, it is challenging to obtain Cobb angles automatically. In this paper, we formulate the estimation of the Cobb angles from spinal X-rays as a multi-output regression task. We propose structured support vector regression (S^2VR) to jointly estimate Cobb angles and landmarks of the spine in X-rays in one single framework. The proposed S^2VR can faithfully handle the nonlinear relationship between input images and quantitative outputs, while explicitly capturing the intrinsic correlation of outputs. We introduce manifold regularization to exploit the geometry of the output space. We propose learning the kernel in S^2VR by kernel target alignment to enhance its discriminative ability. The proposed method is evaluated on a spinal X-ray dataset of 439 scoliosis subjects, where it achieves an encouraging correlation coefficient of 92.76% with ground truth obtained manually by human experts and outperforms two baseline methods. Our method achieves the direct estimation of Cobb angles with high accuracy, which indicates its great potential in clinical use.

14. ConvMath: A Convolutional Sequence Network for Mathematical Expression Recognition
Zuoyu Yan, Xiaode Zhang, Liangcai Gao, Ke Yuan, Zhi Tang
Abstract: Despite the recent advances in optical character recognition (OCR), mathematical expressions remain very challenging to recognize due to their two-dimensional graphical layout. In this paper, we propose a convolutional sequence modeling network, ConvMath, which converts the mathematical expression description in an image into a LaTeX sequence in an end-to-end way. The network combines an image encoder for feature extraction and a convolutional decoder for sequence generation. Compared with other Long Short-Term Memory (LSTM) based encoder-decoder models, ConvMath is entirely based on convolution, thus it is easy to perform parallel computation. Besides, the network adopts a multi-layer attention mechanism in the decoder, which allows the model to align output symbols with source feature vectors automatically, and alleviates the problem of lacking coverage while training the model. The performance of ConvMath is evaluated on an open dataset named IM2LATEX-100K, including 103556 samples. The experimental results demonstrate that the proposed network achieves state-of-the-art accuracy and much better efficiency than previous methods.

15. ICMSC: Intra- and Cross-modality Semantic Consistency for Unsupervised Domain Adaptation on Hip Joint Bone Segmentation
Guodong Zeng, Till D. Lerch, Florian Schmaranzer, Guoyan Zheng, Juergen Burger, Kate Gerber, Moritz Tannast, Klaus Siebenrock, Nicolas Gerber
Abstract: Unsupervised domain adaptation (UDA) for cross-modality medical image segmentation has shown great progress through domain-invariant feature learning or image appearance translation. Adapted feature learning usually cannot detect domain shifts at the pixel level and is not able to achieve good results in dense semantic segmentation tasks. Image appearance translation, e.g. CycleGAN, translates images into different styles with good appearance. Despite its popularity, its semantic consistency is hard to maintain, which results in poor cross-modality segmentation. In this paper, we propose intra- and cross-modality semantic consistency (ICMSC) for UDA; our key insight is that the segmentation of synthesised images in different styles should be consistent. Specifically, our model consists of an image translation module and a domain-specific segmentation module. The image translation module is a standard CycleGAN, while the segmentation module contains two domain-specific segmentation networks. The intra-modality semantic consistency (IMSC) forces the reconstructed image after a cycle to be segmented in the same way as the original input image, while the cross-modality semantic consistency (CMSC) encourages the synthesized images after translation to be segmented exactly the same as before translation. Comprehensive experimental results on cross-modality hip joint bone segmentation show the effectiveness of our proposed method, which achieves an average DICE of 81.61% on the acetabulum and 88.16% on the proximal femur, outperforming other state-of-the-art methods. It is worth noting that without UDA, a model trained on CT for hip joint bone segmentation is non-transferable to MRI and achieves an almost zero DICE score.
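
The DICE figures reported here use the standard overlap measure, Dice = 2|A ∩ B| / (|A| + |B|). A minimal sketch on flattened binary masks follows; the function name and the toy masks are illustrative assumptions, not the authors' evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


# Prediction marks two pixels, ground truth marks one of them.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

A score near zero, as reported for the CT-trained model applied to MRI without UDA, means the predicted and reference masks barely overlap.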

16. Multi-grained Trajectory Graph Convolutional Networks for Habit-unrelated Human Motion Prediction
Jin Liu, Jianqin Yin
Abstract: Human motion prediction is an essential part of human-robot collaboration. Unlike most existing methods, which mainly focus on improving the effectiveness of spatiotemporal modeling for accurate prediction, we take both effectiveness and efficiency into consideration, aiming at prediction quality, computational efficiency and a lightweight model. A lightweight framework based on multi-grained trajectory graph convolutional networks is proposed for habit-unrelated human motion prediction. Specifically, we represent human motion as multi-grained trajectories, including joint trajectory and sub-joint trajectory. Based on this representation, multi-grained trajectory graph convolutional networks are proposed to explore the spatiotemporal dependencies at multiple granularities. Moreover, considering the right-handedness habit of the vast majority of people, a new motion generation method is proposed to generate left-handed motion, to better model motion with less bias toward human habit. Experimental results on challenging datasets, including Human3.6M and CMU Mocap, show that the proposed model outperforms the state of the art with less than 0.12 times the parameters, which demonstrates the effectiveness and efficiency of our proposed method.

17. A Survey on Visual Transformer
Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, Dacheng Tao
Abstract: Transformer is a type of deep neural network mainly based on the self-attention mechanism, which was originally applied in the field of natural language processing. Inspired by the strong representation ability of the transformer, researchers propose extending it to computer vision tasks. Transformer-based models show competitive and even better performance on various visual benchmarks compared to other network types such as convolutional networks and recurrent networks. In this paper, we provide a literature review of these visual transformer models by categorizing them by task and analyzing the advantages and disadvantages of these methods. In particular, the main categories include basic image classification, high-level vision, low-level vision and video processing. Self-attention in computer vision is also briefly revisited, as self-attention is the base component of the transformer. Efficient transformer methods are included for pushing transformers into real applications. Finally, we discuss further research directions for visual transformers.

18. Efficient video annotation with visual interpolation and frame selection guidance
A. Kuznetsova, A. Talati, Y. Luo, K. Simmons, V. Ferrari
Abstract: We introduce a unified framework for generic video annotation with bounding boxes. Video annotation is a longstanding problem, as it is a tedious and time-consuming process. We tackle two important challenges of video annotation: (1) automatic temporal interpolation and extrapolation of bounding boxes provided by a human annotator on a subset of all frames, and (2) automatic selection of frames to annotate manually. Our contribution is two-fold: first, we propose a model that has both interpolating and extrapolating capabilities; second, we propose a guiding mechanism that sequentially generates suggestions for what frame to annotate next, based on the annotations made previously. We extensively evaluate our approach on several challenging datasets in simulation and demonstrate a reduction in terms of the number of manual bounding boxes drawn by 60% over linear interpolation and by 35% over an off-the-shelf tracker. Moreover, we also show 10% annotation time improvement over a state-of-the-art method for video annotation with bounding boxes [25]. Finally, we run human annotation experiments and provide extensive analysis of the results, showing that our approach reduces actual measured annotation time by 50% compared to commonly used linear interpolation.
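
For reference, the linear interpolation baseline that this abstract compares against can be sketched as follows. The box layout `(x_min, y_min, x_max, y_max)` and the function name are assumptions made for illustration; the abstract does not specify them, and this is the baseline, not the proposed model.

```python
def interpolate_box(box_a, frame_a, box_b, frame_b, frame):
    """Linearly interpolate a box between two manually annotated keyframes.

    Boxes are (x_min, y_min, x_max, y_max) tuples (an assumed layout).
    """
    if frame_a == frame_b:
        raise ValueError("keyframes must be at different frame indices")
    t = (frame - frame_a) / (frame_b - frame_a)
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))


# Keyframes at frames 0 and 10; estimate the box at frame 5.
mid_box = interpolate_box((0, 0, 10, 10), 0, (10, 20, 30, 50), 10, 5)
```

This baseline ignores image content entirely, which is why an annotator must add keyframes whenever the object's motion is not close to linear.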

19. Unsupervised Domain Adaptation for Semantic Segmentation by Content Transfer
Suhyeon Lee, Junhyuk Hyun, Hongje Seong, Euntai Kim
Abstract: In this paper, we tackle unsupervised domain adaptation (UDA) for semantic segmentation, which aims to segment unlabeled real data using labeled synthetic data. The main problem of UDA for semantic segmentation lies in reducing the domain gap between real and synthetic images. To solve this problem, we focus on separating the information in an image into content and style. Here, only the content has cues for semantic segmentation, and the style causes the domain gap. Thus, precise separation of content and style in an image acts as supervision on real data even when learning with synthetic data. To make the best of this effect, we propose a zero-style loss. Even if we perfectly extract content for semantic segmentation in the real domain, another main challenge, the class imbalance problem, still exists in UDA for semantic segmentation. We address this problem by transferring the contents of tail classes from the synthetic to the real domain. Experimental results show that the proposed method achieves state-of-the-art performance in semantic segmentation on the two major UDA settings.

20. The Translucent Patch: A Physical and Universal Attack on Object Detectors
Alon Zolfi, Moshe Kravchik, Yuval Elovici, Asaf Shabtai
Abstract: Physical adversarial attacks against object detectors have seen increasing success in recent years. However, these attacks require direct access to the object of interest in order to apply a physical patch. Furthermore, to hide multiple objects, an adversarial patch must be applied to each object. In this paper, we propose a contactless translucent physical patch containing a carefully constructed pattern, which is placed on the camera's lens, to fool state-of-the-art object detectors. The primary goal of our patch is to hide all instances of a selected target class. In addition, the optimization method used to construct the patch aims to ensure that the detection of other (untargeted) classes remains unharmed. Therefore, in our experiments, which are conducted on state-of-the-art object detection models used in autonomous driving, we study the effect of the patch on the detection of both the selected target class and the other classes. We show that our patch was able to prevent the detection of 42.27% of all stop sign instances while maintaining high (nearly 80%) detection of the other classes.

21. Vehicle Re-identification Based on Dual Distance Center Loss
Zhijun Hu, Yong Xu, Jie Wen, Lilei Sun, Raja S P
Abstract: Recently, deep learning has been widely used in the field of vehicle re-identification. When training a deep model, softmax loss is usually used as a supervision tool. However, softmax loss performs well for closed-set tasks but not very well for open-set tasks. In this paper, we sum up five shortcomings of center loss and solve all of them by proposing a dual distance center loss (DDCL). In particular, we remove the requirement that center loss be combined with softmax loss to supervise model training, which provides a new perspective from which to examine center loss. In addition, we verify the inconsistency between the proposed DDCL and softmax loss in the feature space, which means that center loss is no longer limited by softmax loss in the feature space once softmax loss is removed. Specifically, we add the Pearson distance on top of the Euclidean distance to the same center, which confines all features of the same class to the intersection of a hypersphere and a hypercube in the feature space. The proposed Pearson distance strengthens the intra-class compactness of center loss and enhances its generalization ability. Moreover, designing a Euclidean distance threshold between all center pairs not only strengthens the inter-class separability of center loss but also makes center loss (or DDCL) work well without being combined with softmax loss. We apply DDCL to vehicle re-identification on the VeRi-776 and VehicleID datasets. To verify its generalization ability, we also evaluate it on two datasets commonly used in person re-identification, MSMT17 and Market1501.
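The abstract for entry 21 describes adding a Pearson distance to the Euclidean distance between a feature and its class center. The exact formulation is not given in the abstract, so the following is only a sketch under stated assumptions: the Pearson distance is taken to be one minus the Pearson correlation, the two terms are combined as a weighted sum, and the names `pearson_distance`, `dual_distance`, and `weight` are all hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(feature: Sequence[float], center: Sequence[float]) -> float:
    """Straight-line distance between a feature vector and its class center."""
    return math.sqrt(sum((f - c) ** 2 for f, c in zip(feature, center)))


def pearson_distance(feature: Sequence[float], center: Sequence[float]) -> float:
    """Assumed definition: 1 - Pearson correlation, so the value lies in [0, 2].

    Raises ZeroDivisionError if either vector has zero variance.
    """
    mean_f = sum(feature) / len(feature)
    mean_c = sum(center) / len(center)
    cov = sum((f - mean_f) * (c - mean_c) for f, c in zip(feature, center))
    norm_f = math.sqrt(sum((f - mean_f) ** 2 for f in feature))
    norm_c = math.sqrt(sum((c - mean_c) ** 2 for c in center))
    return 1.0 - cov / (norm_f * norm_c)


def dual_distance(feature: Sequence[float], center: Sequence[float],
                  weight: float = 1.0) -> float:
    """Hypothetical combination: Euclidean term plus a weighted Pearson term."""
    return euclidean_distance(feature, center) + weight * pearson_distance(feature, center)
```

In this sketch the Euclidean term penalizes how far a feature lies from its center, while the Pearson term penalizes features whose pattern across dimensions differs from the center's, regardless of scale. The inter-class threshold between center pairs mentioned in the abstract is not modeled here.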

22. Towards Overcoming False Positives in Visual Relationship Detection
Daisheng Jin, Xiao Ma, Chongzhi Zhang, Yizhuo Zhou, Jiashu Tao, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, Zhoujun Li, Xianglong Li, Hongsheng Li
Abstract: In this paper, we investigate the cause of the high false positive rate in Visual Relationship Detection (VRD). We observe that during training, the relationship proposal distribution is highly imbalanced: most of the negative relationship proposals are easy to identify, e.g., those caused by inaccurate object detection, which leads to under-fitting of the low-frequency difficult proposals. This paper presents Spatially-Aware Balanced negative pRoposal sAmpling (SABRA), a robust VRD framework that alleviates the influence of false positives. To effectively optimize the model under an imbalanced distribution, SABRA adopts a Balanced Negative Proposal Sampling (BNPS) strategy for mini-batch sampling. BNPS divides proposals into 5 well-defined sub-classes and generates a balanced training distribution according to the inverse frequency. BNPS gives an easier optimization landscape and significantly reduces the number of false positives. To further resolve the low-frequency challenging false positive proposals with high spatial ambiguity, we improve the spatial modeling ability of SABRA in two aspects: a simple and efficient multi-head heterogeneous graph attention network (MH-GAT) that models the global spatial interactions of objects, and a spatial mask decoder that learns the local spatial configuration. SABRA outperforms SOTA methods by a large margin on two human-object interaction (HOI) datasets and one general VRD dataset.
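The abstract for entry 22 says BNPS builds a balanced training distribution from the inverse frequency of proposal sub-classes. The sketch below illustrates generic inverse-frequency sampling only. The sub-class labels, the function names, and the use of sampling with replacement are assumptions, not details taken from the paper.

```python
import random
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(labels: Sequence[str]) -> List[float]:
    """Weight each item by 1 / (size of its sub-class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def sample_balanced_batch(items: Sequence, labels: Sequence[str],
                          batch_size: int, rng: random.Random) -> list:
    """Draw a mini-batch (with replacement) so each sub-class is equally likely."""
    weights = inverse_frequency_weights(labels)
    return rng.choices(items, weights=weights, k=batch_size)


# Hypothetical example: 8 easy negatives and 2 hard negatives.
labels = ["easy"] * 8 + ["hard"] * 2
batch = sample_balanced_batch(list(range(10)), labels, batch_size=4, rng=random.Random(0))
```

With these weights each sub-class carries the same total probability mass, so rare, difficult proposals are drawn as often as the abundant easy ones.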

23. Deep Semantic Dictionary Learning for Multi-label Image Classification
Fengtao Zhou, Sheng Huang, Yun Xing
Abstract: Compared with single-label image classification, multi-label image classification is more practical and challenging. Some recent studies attempted to leverage the semantic information of categories to improve multi-label image classification performance. However, these semantic-based methods only take semantic information as a complement to the visual representation without further exploitation. In this paper, we present an innovative path towards solving multi-label image classification, which treats it as a dictionary learning task. A novel end-to-end model named Deep Semantic Dictionary Learning (DSDL) is designed. In DSDL, an auto-encoder is applied to generate the semantic dictionary from class-level semantics, and this dictionary is then used to represent the visual features extracted by a Convolutional Neural Network (CNN) with label embeddings. DSDL provides a simple but elegant way to exploit and reconcile the label, semantic and visual spaces simultaneously by conducting dictionary learning among them. Moreover, inspired by the iterative optimization of traditional dictionary learning, we further devise a novel training strategy named Alternately Parameters Update Strategy (APUS) for optimizing DSDL, which alternately optimizes the representation coefficients and the semantic dictionary in forward and backward propagation. Extensive experimental results on three popular benchmarks demonstrate that our method achieves promising performance in comparison with the state of the art. Our codes and models are available at this https URL.

24. Blur More To Deblur Better: Multi-Blur2Deblur For Efficient Video Deblurring
Dongwon Park, Dong Un Kang, Se Young Chun
Abstract: One of the key components of video deblurring is how to exploit neighboring frames. Recent state-of-the-art methods either align adjacent frames to the center frame or recurrently propagate information from past frames to the current frame. Here we propose multi-blur-to-deblur (MB2D), a novel concept for exploiting neighboring frames for efficient video deblurring. Firstly, inspired by unsharp masking, we argue that using more blurred images with long exposures as additional inputs significantly improves performance. Secondly, we propose a multi-blurring recurrent neural network (MBRNN) that can synthesize more blurred images from neighboring frames, yielding substantially improved performance with existing video deblurring methods. Lastly, we propose multi-scale deblurring with connecting recurrent feature maps from MBRNN (MSDR) to achieve state-of-the-art performance on the popular GoPro and Su datasets in fast and memory-efficient ways.

25. Active Sampling for Accelerated MRI with Low-Rank Tensors
Zichang He, Bo Zhao, Zheng Zhang
Abstract: Magnetic resonance imaging (MRI) is a powerful imaging modality that has revolutionized medicine and biology. The imaging speed of high-dimensional MRI is often limited, which constrains its practical utility. Recently, low-rank tensor models have been exploited to enable fast MR imaging with sparse sampling. Most existing methods use some pre-defined sampling design, and active sensing has not been explored for low-rank tensor imaging. In this paper, we introduce an active low-rank tensor model for fast MR imaging. We propose an active sampling method based on a Query-by-Committee model, making use of the benefits of low-rank tensor structure. Numerical experiments on a 3-D MRI data set demonstrate the effectiveness of the proposed method.

26. Localization in the Crowd with Topological Constraints
Shahira Abousamra, Minh Hoai, Dimitris Samaras, Chao Chen
Abstract: We address the problem of crowd localization, i.e., the prediction of dots corresponding to people in a crowded scene. Due to various challenges, a localization method is prone to spatial semantic errors, i.e., predicting multiple dots within the same person or collapsing multiple dots in a cluttered region. We propose a topological approach targeting these semantic errors. We introduce a topological constraint that teaches the model to reason about the spatial arrangement of dots. To enforce this constraint, we define a persistence loss based on the theory of persistent homology. The loss compares the topographic landscape of the likelihood map and the topology of the ground truth. Topological reasoning improves the quality of the localization algorithm, especially near cluttered regions. On multiple public benchmarks, our method outperforms previous localization methods. Additionally, we demonstrate the potential of our method in improving the performance in the crowd counting task.

27. IIRC: Incremental Implicitly-Refined Classification
Mohamed Abdelsalam, Mojtaba Faramarzi, Shagun Sodhani, Sarath Chandar
Abstract: We introduce the "Incremental Implicitly-Refined Classification (IIRC)" setup, an extension to the class incremental learning setup where the incoming batches of classes have two granularity levels, i.e., each sample could have a high-level (coarse) label like "bear" and a low-level (fine) label like "polar bear". Only one label is provided at a time, and the model has to figure out the other label if it has already learned it. This setup is more aligned with real-life scenarios, where a learner usually interacts with the same family of entities multiple times and discovers more granularity about them, while still trying not to forget previous knowledge. Moreover, this setup enables evaluating models for some important lifelong learning challenges that cannot be easily addressed under the existing setups. These challenges can be motivated by the example "if a model was trained on the class bear in one task and on polar bear in another task, will it forget the concept of bear, will it rightfully infer that a polar bear is still a bear? and will it wrongfully associate the label of polar bear to other breeds of bear?". We develop a standardized benchmark that enables evaluating models on the IIRC setup. We evaluate several state-of-the-art lifelong learning algorithms and highlight their strengths and limitations. For example, distillation-based methods perform relatively well but are prone to incorrectly predicting too many labels per image. We hope that the proposed setup, along with the benchmark, would provide a meaningful problem setting to the practitioners.

28. CholecSeg8k: A Semantic Segmentation Dataset for Laparoscopic Cholecystectomy Based on Cholec80
W.-Y. Hong, C.-L. Kao, Y.-H. Kuo, J.-R. Wang, W.-L. Chang, C.-S. Shih
Abstract: Computer-assisted surgery has been developed to enhance surgery correctness and safety. However, researchers and engineers suffer from limited annotated data to develop and train better algorithms. Consequently, the development of fundamental algorithms such as Simultaneous Localization and Mapping (SLAM) is limited. This article elaborates on the efforts of preparing the dataset for semantic segmentation, which is the foundation of many computer-assisted surgery mechanisms. Based on the Cholec80 dataset [3], we extracted 8,080 laparoscopic cholecystectomy image frames from 17 video clips in Cholec80 and annotated the images. The dataset is named CholecSeg8k and its total size is 3GB. Each of these images is annotated at pixel level for thirteen classes, which are commonly found in laparoscopic cholecystectomy surgery. CholecSeg8k is released under the license CC BY-NC-SA 4.0.

29. Skeleton-based Approaches based on Machine Vision: A Survey
Jie Li, Binglin Li, Min Gao
Abstract: Recently, skeleton-based approaches have made rapid progress, building on the great success of skeleton representation. Much research focuses on solving specific problems using skeleton features. Some skeleton-based approaches have been mentioned in several overviews of object detection as a non-essential part. Nevertheless, there has not been a thorough analysis dedicated to skeleton-based approaches. Instead of describing these techniques in terms of theoretical constructs, we devote ourselves to summarizing skeleton-based approaches with regard to application fields and given tasks as comprehensively as possible. This paper is conducive to a further understanding of skeleton-based applications and to dealing with particular issues.

30. MG-SAGC: A multiscale graph and its self-adaptive graph convolution network for 3D point clouds
Bo Wu, Bo Lang
Abstract: To enhance the ability of neural networks to extract local point cloud features and improve their quality, in this paper, we propose a multiscale graph generation method and a self-adaptive graph convolution method. First, we propose a multiscale graph generation method for point clouds. This approach transforms point clouds into a structured multiscale graph form that supports multiscale analysis of point clouds in the scale space and can obtain the dimensional features of point cloud data at different scales, thus making it easier to obtain the best point cloud features. Because traditional convolutional neural networks are not applicable to graph data with irregular vertex neighborhoods, this paper presents a self-adaptive graph convolution kernel that uses the Chebyshev polynomial to fit an irregular convolution filter based on the theory of optimal approximation. In this paper, we adopt max pooling to synthesize the features of different scale maps and generate the point cloud features. In experiments conducted on three widely used public datasets, the proposed method significantly outperforms other state-of-the-art models, demonstrating its effectiveness and generalizability.
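The abstract for entry 30 mentions a graph convolution kernel that uses Chebyshev polynomials to fit a filter. The paper's self-adaptive kernel is not specified in the abstract, so the sketch below shows only the standard Chebyshev recurrence applied to a graph signal, T_0 = x, T_1 = Lx, T_k = 2 L T_{k-1} - T_{k-2}. The scaled Laplacian `L` is assumed to be precomputed with eigenvalues in [-1, 1], and the coefficients stand in for learned parameters. All names here are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def matvec(matrix: Matrix, vector: Sequence[float]) -> List[float]:
    """Dense matrix-vector product using plain lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def chebyshev_filter(scaled_laplacian: Matrix, signal: Sequence[float],
                     coefficients: Sequence[float]) -> List[float]:
    """Return sum_k coefficients[k] * T_k(L) x using the Chebyshev recurrence."""
    if not coefficients:
        raise ValueError("at least one coefficient is required")
    t_prev = list(signal)                        # T_0(L) x = x
    output = [coefficients[0] * v for v in t_prev]
    if len(coefficients) == 1:
        return output
    t_curr = matvec(scaled_laplacian, signal)    # T_1(L) x = L x
    output = [o + coefficients[1] * v for o, v in zip(output, t_curr)]
    for theta in coefficients[2:]:
        # T_k = 2 L T_{k-1} - T_{k-2}
        t_next = [2.0 * a - b
                  for a, b in zip(matvec(scaled_laplacian, t_curr), t_prev)]
        output = [o + theta * v for o, v in zip(output, t_next)]
        t_prev, t_curr = t_curr, t_next
    return output


# Two connected nodes: normalized Laplacian minus identity gives this matrix.
L = [[0.0, -1.0], [-1.0, 0.0]]
filtered = chebyshev_filter(L, [1.0, 0.0], [0.5, 0.25, 0.125])
```

Because the recurrence needs only repeated multiplication by the Laplacian, a filter of order K touches at most K-hop neighbors, which is why such polynomial filters suit graphs with irregular vertex neighborhoods.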

31. Correspondence Learning for Controllable Person Image Generation
Shilong Shen
Abstract: We present a generative model for controllable person image synthesis, as shown in Figure , which can be applied to pose-guided person image synthesis, i.e., converting the pose of a source person image to the target pose while preserving the texture of that source person image, and clothing-guided person image synthesis, i.e., changing the clothing texture of a source person image to the desired clothing texture. By explicitly establishing the dense correspondence between the target pose and the source image, we can effectively address the misalignment introduced by pose transfer and generate high-quality images. Specifically, we first generate the target semantic map under the guidance of the target pose, which can provide a more accurate pose representation and structural constraints during the generation process. Then, a decomposed attribute encoder is used to extract the component features, which not only helps to establish a more accurate dense correspondence, but also enables clothing-guided person generation. After that, we establish a dense correspondence between the target pose and the source image within the shared domain. The source image feature is warped according to the dense correspondence to flexibly account for deformations. Finally, the network renders the image based on the warped source image feature and the target pose. Experimental results show that our method is superior to state-of-the-art methods in pose-guided person generation and demonstrate its effectiveness in clothing-guided person generation.

32. Pit30M: A Benchmark for Global Localization in the Age of Self-Driving Cars
Julieta Martinez, Sasha Doubov, Jack Fan, Ioan Andrei Bârsan, Shenlong Wang, Gellért Máttyus, Raquel Urtasun
Abstract: We are interested in understanding whether retrieval-based localization approaches are good enough in the context of self-driving vehicles. Towards this goal, we introduce Pit30M, a new image and LiDAR dataset with over 30 million frames, which is 10 to 100 times larger than those used in previous work. Pit30M is captured under diverse conditions (i.e., season, weather, time of the day, traffic), and provides accurate localization ground truth. We also automatically annotate our dataset with historical weather and astronomical data, as well as with image and LiDAR semantic segmentation as a proxy measure for occlusion. We benchmark multiple existing methods for image and LiDAR retrieval and, in the process, introduce a simple, yet effective convolutional network-based LiDAR retrieval method that is competitive with the state of the art. Our work provides, for the first time, a benchmark for sub-metre retrieval-based localization at city scale. The dataset, additional experimental results, as well as more information about the sensors, calibration, and metadata, are available on the project website: this https URL

33. Optical Braille Recognition Using Object Detection CNN
Ilya G. Ovodov
Abstract: This paper proposes an optical Braille recognition method that uses an object detection convolutional neural network to detect whole Braille characters at once. The proposed algorithm is robust to deformation of the page shown in the image and to perspective distortions. This makes it usable for recognizing Braille texts shot with a smartphone camera, including bowed pages and perspective-distorted images. The proposed algorithm shows high performance and accuracy compared to existing methods. We also introduce a new "Angelina Braille Images Dataset" containing 240 annotated photos of Braille texts. The proposed algorithm and dataset are available on GitHub.

34. Open source software for automatic subregional assessment of knee cartilage degradation using quantitative T2 relaxometry and deep learning
Kevin A. Thomas, Dominik Krzemiński, Łukasz Kidziński, Rohan Paul, Elka B. Rubin, Eni Halilaj, Marianne S. Black, Akshay Chaudhari, Garry E. Gold, Scott L. Delp
Abstract: Objective: We evaluate a fully-automated femoral cartilage segmentation model for measuring T2 relaxation values and longitudinal changes using multi-echo spin echo (MESE) MRI. We have open sourced this model and corresponding segmentations. Methods: We trained a neural network to segment femoral cartilage from MESE MRIs. Cartilage was divided into 12 subregions along medial-lateral, superficial-deep, and anterior-central-posterior boundaries. Subregional T2 values and four-year changes were calculated using a musculoskeletal radiologist's segmentations (Reader 1) and the model's segmentations. These were compared using 28 held out images. A subset of 14 images were also evaluated by a second expert (Reader 2) for comparison. Results: Model segmentations agreed with Reader 1 segmentations with a Dice score of 0.85 +/- 0.03. The model's estimated T2 values for individual subregions agreed with those of Reader 1 with an average Spearman correlation of 0.89 and average mean absolute error (MAE) of 1.34 ms. The model's estimated four-year change in T2 for individual regions agreed with Reader 1 with an average correlation of 0.80 and average MAE of 1.72 ms. The model agreed with Reader 1 at least as closely as Reader 2 agreed with Reader 1 in terms of Dice score (0.85 vs 0.75) and subregional T2 values. Conclusions: We present a fast, fully-automated model for segmentation of MESE MRIs. Assessments of cartilage health using its segmentations agree with those of an expert as closely as experts agree with one another. This has the potential to accelerate osteoarthritis research.
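The abstract for entry 34 reports agreement between segmentations as a Dice score (for example 0.85 between the model and Reader 1). For readers unfamiliar with the metric, here is a minimal sketch of the Dice coefficient, 2|A ∩ B| / (|A| + |B|), for two binary masks. It assumes flattened masks of equal length holding 0/1 values. Returning 1.0 when both masks are empty is a convention chosen for this sketch, not something stated in the abstract.

```python
from typing import Sequence


def dice_score(mask_a: Sequence[int], mask_b: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks of equal length."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        return 1.0  # Convention for this sketch: two empty masks agree fully
    return 2.0 * intersection / total


# One overlapping pixel out of two foreground pixels in each mask.
score = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

The score ranges from 0 (no overlap) to 1 (identical masks), so the reported 0.85 versus 0.75 comparison means the model overlaps with Reader 1 more than Reader 2 does.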

35. Learning Joint 2D-3D Representations for Depth Completion
Yun Chen, Bin Yang, Ming Liang, Raquel Urtasun
Abstract: In this paper, we tackle the problem of depth completion from RGBD data. Towards this goal, we design a simple yet effective neural network block that learns to extract joint 2D and 3D features. Specifically, the block consists of two domain-specific sub-networks that apply 2D convolution on image pixels and continuous convolution on 3D points, with their output features fused in image space. We build the depth completion network simply by stacking the proposed block, which has the advantage of learning hierarchical representations that are fully fused between 2D and 3D spaces at multiple levels. We demonstrate the effectiveness of our approach on the challenging KITTI depth completion benchmark and show that our approach outperforms the state-of-the-art.

36. Multi-Task Multi-Sensor Fusion for 3D Object Detection
Ming Liang, Bin Yang, Yun Chen, Rui Hu, Raquel Urtasun
Abstract: In this paper we propose to exploit multiple related tasks for accurate multi-sensor 3D object detection. Towards this goal we present an end-to-end learnable architecture that reasons about 2D and 3D object detection as well as ground estimation and depth completion. Our experiments show that all these tasks are complementary and help the network learn better representations by fusing information at various levels. Importantly, our approach leads the KITTI benchmark on 2D, 3D and BEV object detection, while being real time.

37. Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net
Wenjie Luo, Bin Yang, Raquel Urtasun
Abstract: In this paper we propose a novel deep neural network that is able to jointly reason about 3D detection, tracking and motion forecasting given data captured by a 3D sensor. By jointly reasoning about these tasks, our holistic approach is more robust to occlusion as well as sparse data at range. Our approach performs 3D convolutions across space and time over a bird's eye view representation of the 3D world, which is very efficient in terms of both memory and computation. Our experiments on a new, very large-scale dataset captured in several North American cities show that we can outperform the state of the art by a large margin. Importantly, by sharing computation we can perform all tasks in as little as 30 ms.

38. DAGMapper: Learning to Map by Discovering Lane Topology
Namdar Homayounfar, Wei-Chiu Ma, Justin Liang, Xinyu Wu, Jack Fan, Raquel Urtasun
Abstract: One of the fundamental challenges in scaling self-driving is being able to create accurate high definition maps (HD maps) at low cost. Current attempts to automate this process typically focus on simple scenarios, estimate independent maps per frame or do not have the level of precision required by modern self-driving vehicles. In contrast, in this paper we focus on drawing the lane boundaries of complex highways with many lanes that contain topology changes due to forks and merges. Towards this goal, we formulate the problem as inference in a directed acyclic graphical model (DAG), where the nodes of the graph encode geometric and topological properties of the local regions of the lane boundaries. Since we do not know a priori the topology of the lanes, we also infer the DAG topology (i.e., nodes and edges) for each region. We demonstrate the effectiveness of our approach on two major North American highways in two different states and show high precision and recall as well as 89% correct topology.

39. A Structure-Aware Method for Direct Pose Estimation
Hunter Blanton, Scott Workman, Nathan Jacobs
Abstract: Estimating camera pose from a single image is a fundamental problem in computer vision. Existing methods for solving this task fall into two distinct categories, which we refer to as direct and indirect. Direct methods, such as PoseNet, regress pose from the image as a fixed function, for example using a feed-forward convolutional network. Such methods are desirable because they are deterministic and run in constant time. Indirect methods for pose regression are often non-deterministic, with various external dependencies such as image retrieval and hypothesis sampling. We propose a direct method that takes inspiration from structure-based approaches to incorporate explicit 3D constraints into the network. Our approach maintains the desirable qualities of other direct methods while achieving much lower error in general.
摘要：从单个图像估计相机姿势是计算机视觉中的一个基本问题。解决此任务的现有方法分为两个不同的类别，我们将其称为直接和间接。诸如PoseNet之类的直接方法，例如使用前馈卷积网络，将图像的姿态作为固定函数回归。这样的方法是可取的，因为它们是确定性的并且在恒定时间内运行。姿势回归的间接方法通常是不确定的，具有各种外部依赖性，例如图像检索和假设采样。我们提出一种直接方法，该方法从基于结构的方法中汲取灵感，将明确的3D约束合并到网络中。我们的方法保持了其他直接方法的理想质量，同时总体上实现了更低的误差。

40. Seeing past words: Testing the cross-modal capabilities of pretrained V&L models
Letitia Parcalabescu, Albert Gatt, Anette Frank, Iacer Calixto
Abstract: We investigate the ability of general-purpose pretrained vision-and-language (V&L) models to perform reasoning in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models use task (1) for pretraining. However, none of the pretrained V&L models are able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. Our investigations suggest that pretrained V&L representations are less successful than expected at integrating the two modalities. We propose a number of explanations for these findings: LXMERT's results on the image-sentence alignment task (and to a lesser extent those obtained by ViLBERT 12-in-1) indicate that the model may exhibit catastrophic forgetting. As for our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input.
摘要：我们研究了通用的预训练视觉和语言V＆L模型在需要多模式集成的两项任务中执行推理的能力：（1）将正确的图像-句子对与错误的图像-句子对区分开，以及（2）对图像中的实体进行计数。我们在零样本和微调设置下，在这些任务上评估了三个预训练的V＆L模型：ViLBERT，ViLBERT 12合1和LXMERT。我们的结果表明，由于所有模型都使用任务（1）进行预训练，因此模型可以很好地解决任务（1）。但是，没有一个预训练的V＆L模型能够充分解决任务（2）（我们的计数探针），并且不能将其推广到分布外的数量。我们的研究表明，在整合这两种模式时，预训练的V＆L表示不如预期的成功。对于这些发现，我们提出了许多解释：LXMERT在图像句子对齐任务上的结果（在较小程度上是通过ViLBERT 12合1获得的结果）表明该模型可能表现出灾难性的遗忘。至于我们在计数探针上的结果，我们发现证据表明所有模型都受到数据集偏差的影响，并且也无法在可视输入中区分实体。

41. Deep Unsupervised Image Hashing by Maximizing Bit Entropy
Yunqiang Li, Jan van Gemert
Abstract: Unsupervised hashing is important for indexing huge image or video collections without having expensive annotations available. Hashing aims to learn short binary codes for compact storage and efficient semantic retrieval. We propose an unsupervised deep hashing layer called Bi-half Net that maximizes entropy of the binary codes. Entropy is maximal when both possible values of the bit are uniformly (half-half) distributed. To maximize bit entropy, we do not add a term to the loss function as this is difficult to optimize and tune. Instead, we design a new parameter-free network layer to explicitly force continuous image features to approximate the optimal half-half bit distribution. This layer is shown to minimize a penalized term of the Wasserstein distance between the learned continuous image features and the optimal half-half bit distribution. Experimental results on the image datasets Flickr25k, NUS-WIDE, CIFAR-10, MS COCO, MNIST and the video datasets UCF-101 and HMDB-51 show that our approach leads to compact codes and compares favorably to the current state of the art.
摘要：无监督哈希对于索引巨大的图像或视频集合而没有可用的昂贵注释很重要。哈希的目的是学习简短的二进制代码，以实现紧凑存储和有效的语义检索。我们提出了一种称为Bi-half Net的无监督深度哈希层，该层使二进制代码的熵最大化。当该位的两个可能值均匀（一半-一半）分布时，熵最大。为了使比特熵最大化，我们不对损失函数添加项，因为这很难优化和调整。取而代之的是，我们设计了一个新的无参数网络层，以明确强制连续图像特征逼近最佳的半半比特分布。该层显示为使学习的连续图像特征与最佳的半半比特分布之间的Wasserstein距离的惩罚项最小。在Flickr25k，NUS-WIDE，CIFAR-10，MS COCO，MNIST图像数据集以及视频数据集UCF-101和HMDB-51上的实验结果表明，我们的方法可生成紧凑的代码，并且优于当前最先进的方法。
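
The half-half idea in this abstract can be illustrated with a small sketch. This is only one plausible reading of the forward binarization, not the authors' implementation: for each bit, the half of the batch with the largest feature values is assigned +1 and the other half -1, which makes every bit exactly balanced. The function names, the even batch size, and the toy batch are assumptions made for illustration; how gradients pass through such a layer is not covered here.

```python
import math


def half_half_binarize(features):
    """Binarize a batch so that every bit is balanced across the batch.

    features: list of N rows, each a list of K floats (N is assumed even).
    Returns N rows of K codes in {+1, -1}. For each bit (column), the N/2
    largest values map to +1 and the remaining values map to -1.
    """
    n, k = len(features), len(features[0])
    codes = [[-1] * k for _ in range(n)]
    for j in range(k):
        # Rank the samples by their value on bit j, largest first.
        order = sorted(range(n), key=lambda i: features[i][j], reverse=True)
        for i in order[: n // 2]:
            codes[i][j] = 1
    return codes


def bit_entropy(column):
    """Shannon entropy, in bits, of a column of +1/-1 codes."""
    p = sum(1 for c in column if c == 1) / len(column)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Toy batch: 4 samples, 2 continuous features (hypothetical values).
batch = [[0.9, -0.2], [0.1, 0.4], [0.5, 0.3], [-0.7, 0.8]]
codes = half_half_binarize(batch)
entropies = [bit_entropy([row[j] for row in codes]) for j in range(2)]
```

Because each column ends up with as many +1 as -1 entries, the per-bit entropy reaches its maximum of one bit without any extra loss term.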

42. Hierarchical Recurrent Attention Networks for Structured Online Maps
Namdar Homayounfar, Wei-Chiu Ma, Shrinidhi Kowshika Lakshmikanth, Raquel Urtasun
Abstract: In this paper, we tackle the problem of online road network extraction from sparse 3D point clouds. Our method is inspired by how an annotator builds a lane graph, by first identifying how many lanes there are and then drawing each one in turn. We develop a hierarchical recurrent network that attends to initial regions of a lane boundary and traces them out completely by outputting a structured polyline. We also propose a novel differentiable loss function that measures the deviation of the edges of the ground truth polylines and their predictions. This is more suitable than distances on vertices, as there exist many ways to draw equivalent polylines. We demonstrate the effectiveness of our method on a 90 km stretch of highway, and show that we can recover the right topology 92% of the time.
摘要：在本文中，我们解决了从稀疏3D点云中提取在线道路网络的问题。我们的方法的灵感来自于注释器如何构建车道图，首先确定存在多少车道，然后依次绘制每个车道。我们开发了一个层次化的递归网络，该网络连接到车道边界的初始区域，并通过输出结构化折线来完全追踪它们。我们还提出了一种新颖的可微损失函数，该函数可测量地面真线折线及其预测的边缘偏差。这比在顶点上的距离更合适，因为存在许多绘制等效折线的方法。我们证明了该方法在一段90公里长的高速公路上的有效性，并表明我们可以在92％的时间内恢复正确的拓扑。

43. Randomized RX for target detection
Fatih Nar, Adrián Pérez-Suay, José Antonio Padrón, Gustau Camps-Valls
Abstract: This work tackles the target detection problem through the well-known global RX method. The RX method models the clutter as a multivariate Gaussian distribution, and has been extended to nonlinear distributions using kernel methods. While the kernel RX can cope with complex clutter, it requires a considerable amount of computational resources as the number of clutter pixels gets larger. Here we propose random Fourier features to approximate the Gaussian kernel in kernel RX; consequently, our development keeps the accuracy of the nonlinearity while reducing the computational cost, which is now controlled by a hyperparameter. Results over both synthetic and real-world image target detection problems show the space and time efficiency of the proposed method while providing high detection performance.
摘要：这项工作通过众所周知的全局RX方法解决了目标检测问题。 RX方法将杂波建模为多元高斯分布，并已使用核方法扩展为非线性分布。尽管内核RX可以处理复杂的杂波，但是随着杂波像素的数量变大，它需要大量的计算资源。在这里，我们提出了随机傅里叶特征来近似内核RX中的高斯内核，因此我们的开发在保持非线性精度的同时降低了由超参数控制的计算成本。在合成图像和现实世界中的图像目标检测问题上的结果都表明了该方法的时空效率，同时提供了较高的检测性能。
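
The kernel approximation mentioned in this abstract can be sketched as follows. This is a minimal illustration of random Fourier features for the Gaussian kernel only; it does not implement the RX detector itself. The function names, the bandwidth `sigma`, the seed, and the sample vectors are assumptions chosen for the example. The number of random features plays the role of the hyperparameter that trades accuracy for cost.

```python
import math
import random


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_rff(dim, num_features, sigma, seed=0):
    """Return a feature map z such that z(x) . z(y) approximates the kernel."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 1/sigma^2); phases are uniform on [0, 2*pi).
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map


# Hypothetical 3-band pixels; 4000 features is an arbitrary illustrative choice.
phi = make_rff(dim=3, num_features=4000, sigma=1.5, seed=0)
x = [0.2, -1.0, 0.5]
y = [0.0, -0.4, 1.1]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = gaussian_kernel(x, y, sigma=1.5)
```

Once pixels are mapped through `phi`, a linear method applied to the mapped features approximates its kernel counterpart, with a cost that grows with the number of features rather than with the number of clutter pixels.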

44. Nonlinear Cook distance for Anomalous Change Detection
José A. Padrón Hidalgo, Adrián Pérez-Suay, Fatih Nar, Gustau Camps-Valls
Abstract: In this work we propose a method to find anomalous changes in remote sensing images based on the chronochrome approach. A regressor between images is used to discover the most influential points in the observed data. Typically, the pixels with largest residuals are decided to be anomalous changes. In order to find the anomalous pixels we consider the Cook distance and propose its nonlinear extension using random Fourier features as an efficient nonlinear measure of impact. Good empirical performance is shown over different multispectral images both visually and quantitatively evaluated with ROC curves.
摘要：在这项工作中，我们提出了一种基于chronochrome方法来发现遥感影像中异常变化的方法。使用图像之间的回归器来发现观测数据中最具影响力的点。通常，将残差最大的像素确定为异常变化。为了找到异常像素，我们考虑了库克距离，并使用随机傅立叶特征作为影响的有效非线性度量来提出其非线性扩展。通过ROC曲线在视觉和定量评估上，在不同的多光谱图像上均显示出良好的经验性能。
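
To make the notion of influence concrete, the sketch below computes the classical Cook distance for a one-feature least-squares regression, where each observation could stand for a pixel value in the first image (input) and the second image (output). It covers only the textbook linear case, D_i = e_i^2 / (p s^2) * h_ii / (1 - h_ii)^2; the nonlinear extension with random Fourier features proposed in the paper is not shown. The function name and toy data are assumptions for illustration.

```python
def cook_distances(xs, ys):
    """Cook distance of each observation in a simple linear regression.

    Fits y = intercept + slope * x by least squares, then combines each
    residual with its leverage h_ii = 1/n + (x_i - mean)^2 / Sxx.
    """
    n = len(xs)
    p = 2  # fitted parameters: intercept and slope
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    distances = []
    for x, e in zip(xs, residuals):
        h = 1.0 / n + (x - x_mean) ** 2 / sxx
        distances.append((e * e) / (p * s2) * h / (1.0 - h) ** 2)
    return distances


# Ten pixels following y = 2x + 1, with one simulated change at index 9.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[9] += 10.0
d = cook_distances(xs, ys)
most_influential = max(range(len(d)), key=lambda i: d[i])
```

Unlike a plain residual ranking, the distance also weighs leverage, so a point far from the bulk of the inputs counts as more influential for the same residual.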

45. Pattern Recognition Scheme for Large-Scale Cloud Detection over Landmarks
Adrián Pérez-Suay, Julia Amorós-López, Luis Gómez-Chova, Jordi Muñoz-Marí, Dieter Just, Gustau Camps-Valls
Abstract: Landmark recognition and matching is a critical step in many Image Navigation and Registration (INR) models for geostationary satellite services, as well as to maintain the geometric quality assessment (GQA) in the instrument data processing chain of Earth observation satellites. Matching the landmark accurately is of paramount relevance, and the process can be strongly impacted by the cloud contamination of a given landmark. This paper introduces a complete pattern recognition methodology able to detect the presence of clouds over landmarks using Meteosat Second Generation (MSG) data. The methodology is based on the ensemble combination of dedicated support vector machines (SVMs) dependent on the particular landmark and illumination conditions. This divide-and-conquer strategy is motivated by the data complexity and follows a physically-based strategy that considers variability both in seasonality and in illumination conditions over the course of the day to split observations. In addition, it allows training the classification scheme with millions of samples at an affordable computational cost. The image archive was composed of 200 landmark test sites with nearly 7 million multispectral images that correspond to MSG acquisitions during 2010. Results are analyzed in terms of cloud detection accuracy and computational cost. We provide illustrative source code and a portion of the huge training data to the community.
摘要：地标识别和匹配是许多用于对地静止卫星服务的图像导航和注册（INR）模型中的关键步骤，也是在地球观测卫星的仪器数据处理链中维护几何质量评估（GQA）的关键步骤。准确匹配地标至关重要，并且给定地标的云污染会严重影响该过程。本文介绍了一种完整的模式识别方法，该方法能够使用Meteosat Second Generation（MSG）数据检测地标上云的存在。该方法基于依赖于特定地标和照明条件的专用支持向量机（SVM）的整体组合。这种分而治之的策略是由数据的复杂性引起的，并且遵循一种基于物理的策略，该策略考虑了一天中季节性和光照条件的可变性以拆分观测值。此外，它还允许以可承受的计算成本训练数百万个样本的分类方案。该图像档案由200个地标性测试站点组成，其中包含近700万幅多光谱图像，对应于2010年期间的MSG采集。对结果进行了云探测准确性和计算成本方面的分析。我们向社区提供了说明性的源代码和部分庞大的培训数据。

46. Flexible deep transfer learning by separate feature embeddings and manifold alignment
Samuel Rivera, Joel Klipfel, Deborah Weeks
Abstract: Object recognition is a key enabler across industry and defense. As technology changes, algorithms must keep pace with new requirements and data. New modalities and higher resolution sensors should allow for increased algorithm robustness. Unfortunately, algorithms trained on existing labeled datasets do not directly generalize to new data because the data distributions do not match. Transfer learning (TL) or domain adaptation (DA) methods have established the groundwork for transferring knowledge from existing labeled source data to new unlabeled target datasets. However, current DA approaches assume similar source and target feature spaces and suffer in the case of massive domain shifts or changes in the feature space. Existing methods assume the data are either the same modality, or can be aligned to a common feature space. Therefore, most methods are not designed to support a fundamental domain change such as visual to auditory data. We propose a novel deep learning framework that overcomes this limitation by learning separate feature extractions for each domain while minimizing the distance between the domains in a latent lower-dimensional space. The alignment is achieved by considering the data manifold along with an adversarial training procedure. We demonstrate the effectiveness of the approach versus traditional methods with several ablation experiments on synthetic, measured, and satellite image datasets. We also provide practical guidelines for training the network while overcoming vanishing gradients which inhibit learning in some adversarial training settings.
摘要：对象识别是整个工业和国防领域的关键推动力。随着技术的变化，算法必须与新的要求和数据保持同步。新的形式和更高分辨率的传感器应允许增加算法的鲁棒性。不幸的是，在现有的标记数据集上训练的算法不能直接推广到新数据，因为数据分布不匹配。转移学习（TL）或领域适应（DA）方法已经为将知识从现有标记的源数据转移到新的未标记的目标数据集奠定了基础。但是，当前的DA方法采用相似的源特征空间和目标特征空间，并且在特征域发生大规模的域移动或变化的情况下遭受损失。现有方法假定数据要么是相同的模态，要么可以与公共特征空间对齐。因此，大多数方法都不旨在支持基本域的更改，例如从视觉到听觉数据。我们提出了一种新颖的深度学习框架，该框架通过为每个域学习单独的特征提取，同时最小化潜在的低维空间中域之间的距离，克服了这一限制。通过考虑数据流形以及对抗训练过程来实现对齐。我们通过在合成，测量和卫星图像数据集上进行的多次消融实验，证明了该方法相对于传统方法的有效性。我们还提供了一些实用的准则来训练网络，同时克服了逐渐消失的梯度，这种梯度在某些对抗性训练环境中会抑制学习。

47. Generative Interventions for Causal Learning
Chengzhi Mao, Amogh Gupta, Augustine Cha, Hao Wang, Junfeng Yang, Carl Vondrick
Abstract: We introduce a framework for learning robust visual representations that generalize to new viewpoints, backgrounds, and scene contexts. Discriminative models often learn naturally occurring spurious correlations, which cause them to fail on images outside of the training distribution. In this paper, we show that we can steer generative models to manufacture interventions on features caused by confounding factors. Experiments, visualizations, and theoretical results show this method learns robust representations more consistent with the underlying causal relationships. Our approach improves performance on multiple datasets demanding out-of-distribution generalization, and we demonstrate state-of-the-art performance generalizing from ImageNet to ObjectNet dataset.
摘要：我们引入了一个框架来学习强大的视觉表示形式，该构架可以概括为新的视点，背景和场景上下文。判别模型通常会学习自然发生的虚假相关性，这会导致它们在训练分布之外的图像上失败。在本文中，我们表明可以指导生成模型对由混杂因素引起的特征进行干预。实验，可视化和理论结果表明，该方法学习的鲁棒表示与潜在因果关系更加一致。我们的方法提高了对需要分布外概括的多个数据集的性能，并且我们展示了从ImageNet到ObjectNet数据集的最新性能概括。

48. Noisy Labels Can Induce Good Representations
Jingling Li, Mozhi Zhang, Keyulu Xu, John P. Dickerson, Jimmy Ba
Abstract: The current success of deep learning depends on large-scale labeled datasets. In practice, high-quality annotations are expensive to collect, but noisy annotations are more affordable. Previous works report mixed empirical results when training with noisy labels: neural networks can easily memorize random labels, but they can also generalize from noisy labels. To explain this puzzle, we study how architecture affects learning with noisy labels. We observe that if an architecture "suits" the task, training with noisy labels can induce useful hidden representations, even when the model generalizes poorly; i.e., the last few layers of the model are more negatively affected by noisy labels. This finding leads to a simple method to improve models trained on noisy labels: replacing the final dense layers with a linear model, whose weights are learned from a small set of clean data. We empirically validate our findings across three architectures (Convolutional Neural Networks, Graph Neural Networks, and Multi-Layer Perceptrons) and two domains (graph algorithmic tasks and image classification). Furthermore, we achieve state-of-the-art results on image classification benchmarks by combining our method with existing approaches on noisy label training.
摘要：深度学习的当前成功取决于大规模的标记数据集。实际上，高质量的注释收集起来很昂贵，但是嘈杂的注释则更便宜。以前的工作报告了使用嘈杂标签进行训练时的混合经验结果：神经网络可以轻松记住随机标签，但也可以从嘈杂标签中进行概括。为了解释这个难题，我们研究了体系结构如何影响带有嘈杂标签的学习。我们观察到，如果架构“适合”任务，那么即使模型泛化不佳，带噪声标签的训练也可能会引起有用的隐藏表示。也就是说，模型的最后几层受到嘈杂标签的负面影响更大。这一发现导致了一种简单的方法来改进在嘈杂标签上训练的模型：用线性模型替换最终的密集层，线性模型的权重是从一小组干净的数据中学习的。我们通过经验验证了我们在三种架构（卷积神经网络，图神经网络和多层感知器）和两个领域（图形算法任务和图像分类）中的发现。此外，通过将我们的方法与噪声标签训练的现有方法相结合，我们在图像分类基准上获得了最新的结果。

49. Deep manifold learning reveals hidden dynamics of proteasome autoregulation
Zhaolong Wu, Shuwen Zhang, Wei Li Wang, Yinping Ma, Yuanchen Dong, Youdong Mao
Abstract: The 2.5-MDa 26S proteasome maintains proteostasis and regulates myriad cellular processes. How polyubiquitylated substrate interactions regulate proteasome activity is not understood. Here we introduce a deep manifold learning framework, named AlphaCryo4D, which enables atomic-level cryogenic electron microscopy (cryo-EM) reconstructions of nonequilibrium conformational continuum and reconstitutes hidden dynamics of proteasome autoregulation in the act of substrate degradation. AlphaCryo4D integrates 3D deep residual learning with manifold embedding of free-energy landscapes, which directs 3D clustering via an energy-based particle-voting algorithm. In blind assessments using simulated heterogeneous cryo-EM datasets, AlphaCryo4D achieved 3D classification accuracy three times that of conventional methods and reconstructed continuous conformational changes of a 130-kDa protein at sub-3-angstrom resolution. By using AlphaCryo4D to analyze a single experimental cryo-EM dataset, we identified 64 conformers of the substrate-bound human 26S proteasome, revealing conformational entanglement of two regulatory particles in the doubly capped holoenzymes and their energetic differences with singly capped ones. Novel ubiquitin-binding sites are discovered on the RPN2, RPN10 and Alpha5 subunits to remodel polyubiquitin chains for deubiquitylation and recycling. Importantly, AlphaCryo4D choreographs single-nucleotide-exchange dynamics of proteasomal AAA-ATPase motor during translocation initiation, which upregulates proteolytic activity by allosterically promoting nucleophilic attack. Our systemic analysis illuminates a grand hierarchical allostery for proteasome autoregulation.
摘要：2.5-MDa 26S蛋白酶体维持蛋白稳定并调节无数细胞过程。还不清楚多泛素化的底物相互作用如何调节蛋白酶体活性。在这里，我们介绍了一个称为AlphaCryo4D的深层流形学习框架，该框架可实现原子级低温电子显微镜（cryo-EM）重建非平衡构象连续体，并在底物降解行为中重构蛋白酶体自动调节的隐藏动力学。 AlphaCryo4D将3D深度残差学习与自由能景观的多种嵌入集成在一起，后者通过基于能量的粒子投票算法指导3D聚类。在使用模拟异质冷冻EM数据集进行的盲法评估中，AlphaCryo4D达到了3D分类精度，是传统方法的三倍，并以低于3埃的分辨率重建了130 kDa蛋白的连续构象变化。通过使用AlphaCryo4D分析单个实验冷冻EM数据集，我们鉴定了与底物结合的人26S蛋白酶体的64个构象异构体，揭示了两个加盖的全酶中两个调节颗粒的构象缠结以及它们与单加盖的酶的能量差异。在RPN2，RPN10和Alpha5亚基上发现了新的泛素结合位点，以重塑聚泛素链以进行去泛素化和再循环。重要的是，AlphaCryo4D编排记录了蛋白酶体AAA-ATPase马达在移位启动过程中的单核苷酸交换动力学，该动力学通过变构促进亲核攻击来上调蛋白水解活性。我们的系统分析阐明了蛋白酶体自动调节的宏大分层别构。

50. Multiclass Spinal Cord Tumor Segmentation on MRI with Deep Learning
Andreanne Lemay, Charley Gros, Zhizheng Zhuo, Jie Zhang, Yunyun Duan, Julien Cohen-Adad, Yaou Liu
Abstract: Spinal cord tumors lead to neurological morbidity and mortality. Being able to obtain morphometric quantification (size, location, growth rate) of the tumor, edema, and cavity can result in improved monitoring and treatment planning. Such quantification requires the segmentation of these structures into three separate classes. However, manual segmentation of 3-dimensional structures is time-consuming and tedious, motivating the development of automated methods. Here, we tailor a model adapted to the spinal cord tumor segmentation task. Data were obtained from 343 patients using gadolinium-enhanced T1-weighted and T2-weighted MRI scans with cervical, thoracic, and/or lumbar coverage. The dataset includes the three most common intramedullary spinal cord tumor types: astrocytomas, ependymomas, and hemangioblastomas. The proposed approach is a cascaded architecture with U-Net-based models that segments tumors in a two-stage process: locate and label. The model first finds the spinal cord and generates bounding box coordinates. The images are cropped according to this output, leading to a reduced field of view, which mitigates class imbalance. The tumor is then segmented. The segmentation of the tumor, cavity, and edema (as a single class) reached a Dice score of 76.7 ± 1.5% and the segmentation of tumors alone reached a Dice score of 61.8 ± 4.0%. The true positive detection rate was above 87% for tumor, edema, and cavity. To the best of our knowledge, this is the first fully automatic deep learning model for spinal cord tumor segmentation. The multiclass segmentation pipeline is available in the Spinal Cord Toolbox (this https URL). It can be run with custom data on a regular computer within seconds.
摘要：脊髓肿瘤导致神经系统疾病和死亡。能够获得肿瘤，水肿和腔的形态计量学量化（大小，位置，生长率）可以改善监测和治疗计划。这种量化要求将这些结构分为三个单独的类别。但是，手动分割3维结构既费时又乏味，从而刺激了自动方法的发展。在这里，我们为脊髓肿瘤分割任务定制了一个模型。使用钆增强的T1加权和T2加权MRI扫描从343例患者中获得数据，并进行颈，胸和/或腰椎覆盖。该数据集包括三种最常见的髓内脊髓肿瘤类型：星形细胞瘤，室管膜瘤和血管母细胞瘤。所提出的方法是具有基于U-Net的模型的级联体系结构，该模型在两个阶段的过程中分割肿瘤：定位和标记。该模型首先找到脊髓并生成边界框坐标。根据该输出裁剪图像，从而导致视野减小，从而减轻了类的不平衡。然后对肿瘤进行分割。肿瘤，腔和水肿的分割（作为单个类别）的Dice分数达到76.7 ± 1.5％，而仅肿瘤的分割的Dice分数达到61.8 ± 4.0％。肿瘤，水肿和腔的真实阳性检出率高于87％。据我们所知，这是第一个用于脊髓肿瘤分割的全自动深度学习模型。脊髓工具箱（此https URL）中提供了多类细分管道。它可以在几秒钟之内在常规计算机上与自定义数据一起运行。
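
Two ingredients named in this abstract, cropping to a bounding box and scoring with Dice, can be sketched on small 2D binary masks. This is a simplified illustration with hypothetical helper names and toy data; it is not the Spinal Cord Toolbox API, and the real pipeline works on 3D MRI volumes with trained U-Net models.

```python
def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks (lists of 0/1 rows)."""
    intersection = total = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            intersection += p * t
            total += p + t
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def bounding_box(mask):
    """Tightest (row_start, row_stop, col_start, col_stop) around a non-empty mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop(image, box):
    """Restrict an image to the box, reducing the field of view."""
    r0, r1, c0, c1 = box
    return [row[c0:c1] for row in image[r0:r1]]


# Stage 1 (locate): a toy "spinal cord" mask yields the crop region.
cord_mask = [[0, 0, 0, 0],
             [0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]]
image = [[10, 11, 12, 13],
         [14, 15, 16, 17],
         [18, 19, 20, 21],
         [22, 23, 24, 25]]
box = bounding_box(cord_mask)
cropped = crop(image, box)

# Stage 2 (label): compare a toy prediction with a toy reference mask.
score = dice_score([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

Cropping first means the second stage sees a smaller region in which the structures of interest occupy a larger share of the pixels, which is how the reduced field of view eases class imbalance.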

51. Chest x-ray automated triage: a semiologic approach designed for clinical implementation, exploiting different types of labels through a combination of four Deep Learning architectures
Candelaria Mosquera, Facundo Nahuel Diaz, Fernando Binder, Jose Martin Rabellino, Sonia Elizabeth Benitez, Alejandro Daniel Beresñak, Alberto Seehaus, Gabriel Ducrey, Jorge Alberto Ocantos, Daniel Roberto Luna
Abstract: BACKGROUND AND OBJECTIVES: The multiple chest x-ray datasets released in recent years have ground-truth labels intended for different computer vision tasks, suggesting that performance in automated chest x-ray interpretation might improve by using a method that can exploit diverse types of annotations. This work presents a Deep Learning method based on the late fusion of different convolutional architectures, which allows training with heterogeneous data with a simple implementation, and evaluates its performance on independent test data. We focused on obtaining a clinically useful tool that could be successfully integrated into a hospital workflow. MATERIALS AND METHODS: Based on expert opinion, we selected four target chest x-ray findings, namely lung opacities, fractures, pneumothorax and pleural effusion. For each finding we defined the most adequate type of ground-truth label, and built four training datasets combining images from public chest x-ray datasets and our institutional archive. We trained four different Deep Learning architectures and combined their outputs with a late fusion strategy, obtaining a unified tool. Performance was measured on two test datasets: an external openly-available dataset, and a retrospective institutional dataset, to estimate performance on the local population. RESULTS: The external and local test sets had 4376 and 1064 images, respectively, for which the model showed an area under the Receiver Operating Characteristics curve of 0.75 (95%CI: 0.74-0.76) and 0.87 (95%CI: 0.86-0.89) in the detection of abnormal chest x-rays. For the local population, a sensitivity of 86% (95%CI: 84-90), and a specificity of 88% (95%CI: 86-90) were obtained, with no significant differences between demographic subgroups. We present examples of heatmaps to show the level of interpretability achieved, examining true and false positives.
摘要：背景与目的：近年来发布的多个胸部X射线数据集具有用于不同计算机视觉任务的地面标签，表明通过使用可利用多种类型的方法，自动胸部X射线解释的性能可能会提高批注。这项工作提出了一种基于不同卷积架构后期融合的深度学习方法，该方法允许通过简单的实现对异构数据进行训练，并评估其在独立测试数据上的性能。我们专注于获得可成功整合到医院工作流程中的临床有用工具。材料与方法：基于专家意见，我们选择了四个目标胸部X线检查结果，即肺部混浊，骨折，气胸和胸腔积液。对于每个发现，我们定义了最适当的地面真相标签类型，并构建了四个训练数据集，这些数据集结合了来自公共胸部X射线数据集和我们机构档案的图像。我们训练了四种不同的深度学习架构，并将其输出与后期融合策略结合在一起，从而获得了统一的工具。在两个测试数据集上对性能进行了测量：一个外部公开可用数据集和一个回顾性机构数据集，以估计本地人口的性能。结果：外部和本地测试集分别具有4376和1064张图像，其模型在接收器工作特性曲线下的面积分别为0.75（95％CI：0.74-0.76）和0.87（95％CI：0.86-0.89）），以检测异常的胸部X射线。对于当地人群，灵敏度为86％（95％CI：84-90），特异性为88％（95％CI：86-90），人口统计学亚组之间无显着差异。我们提供了一些热图示例，以显示可解释性的水平，并检查真假肯定。

52. Prognostic Power of Texture Based Morphological Operations in a Radiomics Study for Lung Cancer
Paul Desbordes, Diksha, Benoit Macq
Abstract: The importance of radiomics features for predicting patient outcome is now well-established. Early study of prognostic features can lead to a more efficient treatment personalisation. For this reason, new radiomics features obtained through mathematical morphology-based operations are proposed. Their study is conducted on an open database of patients suffering from Non-Small Cell Lung Carcinoma (NSCLC). The tumor features are extracted from the CT images and analyzed via PCA and a Kaplan-Meier survival analysis in order to select the most relevant ones. Among the 1,589 studied features, 32 are found relevant to predict patient survival: 27 classical radiomics features and five MM features (including both granularity and morphological covariance features). These features will contribute towards the prognostic models, and eventually to clinical decision making and the course of treatment for patients.
摘要：放射学功能对预测患者预后的重要性现已得到公认。对预后特征的早期研究可导致更有效的个性化治疗。因此，提出了通过基于数学形态学的运算获得的新放射学特征。他们的研究是在一个开放的非小细胞肺癌（NSCLC）患者数据库中进行的。从CT图像中提取肿瘤特征，并通过PCA和Kaplan-Meier生存分析进行分析，以选择最相关的特征。在研究的1589个特征中，有32个与预测患者的生存有关：27个经典放射学特征和5个MM特征（包括粒度和形态协方差特征）。这些特征将有助于建立预后模型，并最终有助于临床决策和患者的治疗过程。

53. GANDA: A deep generative adversarial network predicts the spatial distribution of nanoparticles in tumor pixelly
Jiulou Zhang, Yuxia Tang, Shouju Wang
Abstract: The intratumoral distribution of nanoparticles (NPs) is critical for the diagnostic and therapeutic effect, but methods to predict the distribution remain unavailable due to the complex bio-nano interactions. Here, we developed a Generative Adversarial Network for Distribution Analysis (GANDA) to make pixel-to-pixel predictions of the NPs distribution across tumors. This predictive model used deep learning approaches to automatically learn the features of tumor vessels and cell nuclei from whole-slide images of tumor sections. We showed that the GANDA could generate images of NPs distribution with the same spatial resolution as original images of tumor vessels and nuclei. The GANDA enabled quantitative analysis of NPs distribution (R² = 0.93) and extravasation without knowing their real distribution. This model provides opportunities to investigate how influencing factors affect NPs distribution in individual tumors and may guide nanomedicine optimization for personalized treatments.
摘要：肿瘤内纳米颗粒（NPs）的分布对于诊断和治疗效果至关重要，但是由于复杂的生物-纳米相互作用，无法预测分布的方法。在这里，我们开发了一种用于分布分析的生成对抗网络（GANDA），可以对各个肿瘤中NP的分布进行逐像素预测。该预测模型使用深度学习方法从肿瘤切片的全幻灯片图像中自动学习肿瘤血管和细胞核的特征。我们表明，GANDA可以生成具有与肿瘤血管和细胞核原始图像相同的空间分辨率的NPs分布图像。 GANDA可以对NP分布（R2 = 0.93）和外渗进行定量分析，而无需知道其真实分布。该模型为研究影响因素如何影响单个肿瘤中NPs分布提供了机会，并可指导针对个性化治疗的纳米药物优化。

54. StainNet: a fast and robust stain normalization network
Hongtao Kang, Die Luo, Weihua Feng, Li Chen, Junbo Hu, Shaoqun Zeng, Tingwei Quan, Xiuli Liu
Abstract: Pathological images may have large variabilities in color intensities due to differences in staining process, operator ability, and scanner specifications. These variations hamper the performance of computer-aided diagnosis (CAD) systems. Stain normalization is used to reduce the variability in color intensities and increase the prediction accuracy. However, the conventional methods depend heavily on a reference image, and the current deep learning based methods may alter color intensities or texture incorrectly. In this paper, a fully 1x1 convolutional stain normalization network with only 1.28K parameters is proposed. Our StainNet can learn the color mapping relation from the whole dataset and adjust the color value depending on a single pixel. The proposed method outperforms the state-of-the-art methods and achieves better accuracy and image quality.
摘要：由于染色过程，操作员能力和扫描仪规格的差异，病理图像的颜色强度可能会有较大差异。这些变化妨碍了计算机辅助诊断（CAD）系统的性能。染色归一化用于减少颜色强度的变化并提高预测精度。然而，常规方法高度依赖于参考图像，并且当前基于深度学习的方法可能在颜色强度或纹理上具有错误的变化。在本文中，提出了仅具有1.28K参数的完全1x1卷积染色归一化网络。我们的StainNet可以从整个数据集中学习颜色映射关系，并根据单个像素调整颜色值。所提出的方法优于最先进的方法，并获得更好的准确性和图像质量。
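
A network built only from 1x1 convolutions maps each pixel independently of its neighbours, which is why the color of an output pixel depends on a single input pixel. The sketch below shows this per-pixel view and counts parameters. The layer widths `[3, 32, 32, 3]` are an assumption, chosen because they give 1,283 parameters and so are consistent with the roughly 1.28K stated in the abstract; the abstract itself does not specify the layout, and the ReLU activations are likewise assumed.

```python
def count_params(channels):
    """Weights plus biases of a stack of 1x1 convolutions with these widths."""
    return sum(c_in * c_out + c_out
               for c_in, c_out in zip(channels, channels[1:]))


def apply_pointwise(pixel, layers):
    """Apply 1x1 conv layers to one pixel.

    layers: list of (weights, biases), where weights has one row per output
    channel. A ReLU follows every layer except the last (assumed design).
    """
    x = list(pixel)
    for idx, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


ASSUMED_WIDTHS = [3, 32, 32, 3]  # RGB in, two hidden layers, RGB out (assumption)
total = count_params(ASSUMED_WIDTHS)

# A single identity layer leaves a pixel unchanged, as a sanity check.
identity = [([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
             [0.0, 0.0, 0.0])]
out = apply_pointwise([0.2, 0.5, 0.7], identity)
```

Running the same function over every pixel of an image is equivalent to applying the 1x1 convolution stack to the whole image, so the parameter count does not depend on image size.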

55. Analyzing Representations inside Convolutional Neural Networks
Uday Singh Saini, Evangelos E. Papalexakis
Abstract: How can we discover and succinctly summarize the concepts that a neural network has learned? Such a task is of great importance in applications of networks in areas of inference that involve classification, such as medical diagnosis based on fMRI or x-ray. In this work, we propose a framework to categorize the concepts a network learns based on the way it clusters a set of input examples, clusters neurons based on the examples they activate for, and input features all in the same latent space. This framework is unsupervised and can work without any labels for input features; it only needs access to internal activations of the network for each input example, thereby making it widely applicable. We extensively evaluate the proposed method and demonstrate that it produces human-understandable and coherent concepts that a ResNet-18 has learned on the CIFAR-100 dataset.
摘要：我们如何发现并简洁地总结神经网络已学到的概念？这项任务在涉及分类的推理领域中的网络应用中非常重要，例如基于fMRI / x射线的医学诊断等。在这项工作中，我们提出了一个框架来对网络学习的概念进行分类它对一组输入示例进行聚类，根据它们为之激活的示例对神经元进行聚类，并在相同的潜在空间中输入所有特征。该框架不受监督，可以在没有任何输入功能标签的情况下工作，对于每个输入示例，只需要访问网络的内部激活，即可广泛使用。我们对所提出的方法进行了广泛的评估，并证明它产生了ResNet-18在CIFAR-100数据集上学到的易于理解和一致的概念。

56. Diabetic Retinopathy Grading System Based on Transfer Learning
Eman AbdelMaksoud, Sherif Barakat, Mohammed Elmogy
Abstract: Researchers are making considerable efforts to detect and diagnose diabetic retinopathy (DR) accurately and automatically. The disease is very dangerous, as it can cause sudden blindness if it is not continuously screened. Therefore, many computer-aided diagnosis (CAD) systems have been developed to diagnose the various DR grades. Recently, many CAD systems based on deep learning (DL) methods have been adopted to exploit the merits of deep learning in diagnosing the pathological abnormalities of DR. In this paper, we present a fully DL-based CAD system built on multi-label classification. In the proposed DL CAD system, we present a customized EfficientNet model to diagnose the early and advanced grades of DR. Transfer learning is very useful when training on small datasets. We use the IDRiD dataset, which is a multi-label dataset. The experiments show that the proposed DL CAD system is robust and reliable, and yields promising results in detecting and grading DR. The proposed system achieved an accuracy (ACC) of 86% and a Dice similarity coefficient (DSC) of 78.45.
摘要：为了自动准确地检测和诊断糖尿病性视网膜病（DR），研究人员付出了很多努力。这种疾病非常危险，因为如果不持续筛查，它可能会突然导致失明。因此，已经开发了许多计算机辅助诊断（CAD）系统来诊断各种DR等级。近年来，已经采用了许多基于深度学习（DL）方法的CAD系统来获得深度学习在诊断DR疾病的病理异常方面的优点。在本文中，我们提出了一种基于多标签分类的基于全功能的DL CAD系统。在提出的DL CAD系统中，我们提出了一个定制的有效网络模型，以诊断DR疾病的早期和晚期等级。学习转移对于训练小型数据集非常有用。我们利用了IDRiD数据集。它是一个多标签数据集。实验表明，所提出的DL CAD系统是健壮，可靠的，并在DR的检测和分级中取得了可喜的成果。拟议的系统达到的准确度（ACC）等于86％，而骰子相似系数（DSC）等于78.45。

57. Small-Group Learning, with Application to Neural Architecture Search
Xuefeng Du, Pengtao Xie
Abstract: Small-group learning is a broadly used methodology in human learning and shows great effectiveness in improving learning outcomes: a small group of students work together towards the same learning objective, where they express their understanding of a topic to their peers, compare their ideas, and help each other to trouble-shoot problems. We are interested in investigating whether this powerful learning technique can be borrowed from humans to improve the learning abilities of machines. We propose a novel learning approach called small-group learning (SGL). In our approach, each learner uses its intermediately trained model to generate a pseudo-labeled dataset and re-trains its model using pseudo-labeled datasets generated by other learners. We propose a multi-level optimization framework to formulate SGL which involves three learning stages: learners train their network weights independently; learners train their network weights collaboratively via mutual pseudo-labeling; learners improve their architectures by minimizing validation losses. We develop an efficient algorithm to solve the SGL problem. We apply our approach to neural architecture search and achieve significant improvement on CIFAR-100, CIFAR-10, and ImageNet.
摘要：小组学习是人类学习中广泛使用的方法，在改善学习成果方面显示出巨大的效果：一小组学生共同努力实现相同的学习目标，他们向同龄人表达对某个主题的理解，比较同龄人的想法，互相帮助解决问题。我们有兴趣研究这种强大的学习技术是否可以从人身上借鉴来提高机器的学习能力。我们提出了一种称为小组学习（SGL）的新颖学习方法。在我们的方法中，每个学习者都使用其经过中间训练的模型来生成伪标记的数据集，并使用其他学习者生成的伪标记的数据集来对其模型进行重新训练。我们提出了一个多层次的优化框架来制定SGL，其中涉及三个学习阶段：学习者独立地训练其网络权重；学习者通过相互伪标记协作训练网络权重；学习者通过最小化验证损失来改进其体系结构。我们开发了一种有效的算法来解决SGL问题。我们将我们的方法应用于神经体系结构搜索，并在CIFAR-100，CIFAR-10和ImageNet上实现了重大改进。

58. Towards Boosting the Channel Attention in Real Image Denoising: Sub-band Pyramid Attention
Huayu Li, Haiyu Wu, Xiwen Chen, Hanning Zhang, Abolfazl Razi
Abstract: Convolutional layers in Artificial Neural Networks (ANN) treat the channel features equally without feature selection flexibility. When using ANNs for image denoising in real-world applications with unknown noise distributions, particularly structured noise with learnable patterns, modeling informative features can substantially boost the performance. Channel attention methods in real image denoising tasks exploit dependencies between the feature channels, hence acting as a frequency component filtering mechanism. Existing channel attention modules typically use global statistics as descriptors to learn the inter-channel correlations. This approach is inefficient at learning representative coefficients for re-scaling the channels at the frequency level. This paper proposes a novel Sub-band Pyramid Attention (SPA) based on the wavelet sub-band pyramid to recalibrate the frequency components of the extracted features in a more fine-grained fashion. We equip a network designed for real image denoising with the SPA blocks. Experimental results show that the proposed method achieves a remarkable improvement over the benchmark naive channel attention block. Furthermore, our results show how the pyramid level affects the performance of the SPA blocks and demonstrate favorable generalization capability for the SPA blocks.
摘要：人工神经网络（ANN）中的卷积层在没有特征选择灵活性的情况下平等对待通道特征。在具有未知噪声分布（尤其是具有可学习模式的结构化噪声）的实际应用中，使用ANN进行图像降噪时，对信息功能进行建模可以显着提高性能。实际图像去噪任务中的通道注意方法利用特征通道之间的依赖性，因此成为频率分量滤波机制。现有的频道关注模块通常使用全局静态变量作为描述符来学习频道间相关性。该方法认为在学习用于重新缩放频率水平的信道的代表系数方面效率低下。本文提出了一种基于小波子带金字塔的新型子带金字塔注意（SPA），以更细粒度的方式重新校准提取特征的频率分量。我们在为实像降噪设计的网络上配备了SPA块。实验结果表明，所提出的方法比基准的朴素频道关注块获得了显着的改进。此外，我们的结果表明金字塔等级如何影响SPA块的性能并展现出对SPA块有利的泛化能力。

59. Comparison of Classification Algorithms Towards Subject-Specific and Subject-Independent BCI
Parisa Ghane, Ulisses Braga-Neto
Abstract: Motor imagery brain-computer interface (BCI) designs are considered difficult due to limitations in subject-specific (SS) data collection and calibration, as well as demanding system adaptation requirements. Recently, subject-independent (SI) designs have received attention because of their possible applicability to multiple users without prior calibration and rigorous system adaptation. SI designs are challenging and have shown low accuracy in the literature. Two major factors in system performance are the classification algorithm and the quality of available data. This paper presents a comparative study of classification performance for both SS and SI paradigms. Our results show that classification algorithms for SS models display large variance in performance. Therefore, distinct classification algorithms per subject may be required. SI models display lower variance in performance but should only be used if a relatively large sample size is available. For SI models, LDA and CART had the highest accuracy for small and moderate sample sizes, respectively, whereas we hypothesize that SVM would be superior to the other classifiers if a large training sample were available. Additionally, one should choose the design approach with the users in mind. While the SS design sounds more promising for a specific subject, an SI approach can be more convenient for mentally or physically challenged users.
摘要：由于受特定对象的数据收集和校准的限制以及对系统适应性的苛刻要求，运动图像脑计算机接口设计被认为是困难的。近来，与主题无关的（SI）设计受到关注，因为它们可能适用于多个用户，而无需事先进行校准和严格的系统调整。 SI设计具有挑战性，并且在文献中显示出较低的准确性。系统性能的两个主要因素是分类算法和可用数据的质量。本文对SS和SI范例的分类性能进行了比较研究。我们的结果表明，SS模型的分类算法显示出较大的性能差异。因此，可能需要每个主题不同的分类算法。 SI模型显示出较低的性能差异，但仅在有相对较大的样本量时才应使用。对于SI模型，LDA和CART分别在小样本量和中等样本量时具有最高的准确度，而我们假设，如果有大量训练样本量，则SVM将优于其他分类器。另外，应该考虑用户的选择设计方法。虽然SS设计在特定主题上听起来更有希望，但SI方法对于精神或身体有障碍的用户可能更方便。

60. Multi-Contrast Computed Tomography Healthy Kidney Atlas
Ho Hin Lee, Yucheng Tang, Kaiwen Xu, Shunxing Bao, Agnes B. Fogo, Raymond Harris, Mark P. de Caestecker, Mattias Heinrich, Jeffrey M. Spraggins, Yuankai Huo, Bennett A. Landman
Abstract: The construction of three-dimensional multi-modal tissue maps provides an opportunity to spur interdisciplinary innovations across temporal and spatial scales through information integration. While the preponderance of effort is allocated to the cellular level and to exploring the changes in cell interactions and organizations, contextualizing findings within organs and systems is essential to visualize and interpret higher resolution linkage across scales. There is substantial normal variation of kidney morphometry and appearance across body size, sex, and imaging protocols in abdominal computed tomography (CT). A volumetric atlas framework is needed to integrate and visualize the variability across scales. However, there is no abdominal and retroperitoneal organ atlas framework for multi-contrast CT. Hence, we propose a high-resolution CT retroperitoneal atlas specifically optimized for the kidney across non-contrast CT and early arterial, late arterial, venous and delayed contrast-enhanced CT. Briefly, we introduce a deep learning-based volume of interest extraction method and an automated two-stage hierarchical registration pipeline to register abdominal volumes to a high-resolution CT atlas template. To generate and evaluate the atlas, multi-contrast modality CT scans of 500 subjects (without reported history of renal disease, age: 15-50 years, 250 males & 250 females) were processed. We demonstrate a stable generalizability of the atlas template for integrating the normal kidney variation from small to large, across contrast modalities and populations with great variability of demographics. The linkage of atlas and demographics provided a better understanding of the variation of kidney anatomy across populations.
摘要：三维多模式组织图的构建为通过信息集成促进跨时空尺度的跨学科创新提供了机会。尽管大部分精力都分配给了细胞水平，并探索了细胞相互作用和组织的变化，但器官和系统内的背景信息化对于可视化和解释跨尺度的高分辨率链接至关重要。在腹部计算机断层扫描（CT）中，肾脏的形态和外观在人体大小，性别和成像方案之间存在很大的正常变化。需要一个体积图集框架来整合和可视化跨尺度的可变性。但是，没有用于多对比度CT的腹部和腹膜后器官图谱框架。因此，我们提出了一种高分辨率CT腹膜后地图集，专门针对跨非对比CT和早期动脉，晚期动脉，静脉和延迟对比增强CT的肾脏进行了优化。简而言之，我们介绍了一种基于深度学习的兴趣量提取方法和一个自动的两阶段层次注册管道，以将腹部体积注册到高分辨率的CT图集模板中。为了生成和评估图集，我们对500名受试者（没有报道的肾脏疾病病史，年龄：15至50岁，男性250例，女性250例）进行了多种对比CT扫描。我们证明了Atlas模板具有稳定的概化性，可以整合正常的肾脏变异（从小到大），跨对比方式和人口统计学差异很大的人群。地图集和人口统计学的联系提供了对不同人群肾脏解剖结构变化的更好理解。

61. RAP-Net: Coarse-to-Fine Multi-Organ Segmentation with Single Random Anatomical Prior
Ho Hin Lee, Yucheng Tang, Shunxing Bao, Richard G. Abramson, Yuankai Huo, Bennett A. Landman
Abstract: Performing coarse-to-fine abdominal multi-organ segmentation makes it possible to extract high-resolution segmentations while minimizing the loss of spatial contextual information. However, current coarse-to-fine approaches require a significant number of models to perform single-organ refined segmentation corresponding to the extracted organ region of interest (ROI). We propose a coarse-to-fine pipeline, which starts from the extraction of the global prior context of multiple organs from 3D volumes using a low-resolution coarse network, followed by a fine phase that uses a single refined model to segment all abdominal organs instead of multiple organ-specific models. We combine the anatomical prior with corresponding extracted patches to preserve the anatomical locations and boundary information for performing high-resolution segmentation across all organs in a single model. To train and evaluate our method, a clinical research cohort consisting of 100 patient volumes with 13 well-annotated organs is used. We tested our algorithms with 4-fold cross-validation and computed the Dice score to evaluate the segmentation performance of the 13 organs. Our proposed method using a single auto-context outperforms the state-of-the-art on 13 models with an average Dice score of 84.58% versus 81.69% (p<0.0001).
摘要：进行从粗到细的腹部多器官分割有助于提取高分辨率分割，从而最大程度地减少空间上下文信息的丢失。但是，当前的从粗到精的方法需要大量模型来执行与提取的目标器官区域（ROI）相对应的单个器官的精化分割。我们提出了从粗到细的管道，该管道从使用低分辨率的粗网络从3D体积中提取多个器官的全局先验上下文开始，然后是使用单个精细模型来分割所有腹部器官的精细阶段而不是多个器官对应的模型。我们将解剖学先验与相应的提取斑块相结合，以保留解剖学位置和边界信息，以便在单个模型中跨所有器官执行高分辨率分割。为了训练和评估我们的方法，我们使用了一个临床研究队列，该队列由100位患者体积和13个经过充分注释的器官组成。我们使用4倍交叉验证测试了我们的算法，并计算了Dice得分以评估13种器官的分割性能。我们提出的使用单一自动上下文的方法在13个模型上的最新技术表现均优于最新模型，平均Dice得分为84.58％，而Dice得分为81.69％（p <0.0001）。
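
The Dice score reported in this abstract measures the overlap between a predicted mask A and a reference mask B as 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that formula, assuming binary masks given as flat 0/1 sequences. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are illustrative assumptions, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks.

    Both masks are flat sequences of 0/1 values of equal length.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_score(predicted, reference))
```

For multi-organ evaluation, one would typically compute this per organ label and then average, which is presumably how an "average Dice score" over 13 organs is obtained.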

62. Stochastic Gradient Variance Reduction by Solving a Filtering Problem
Xingyi Yang
Abstract: Deep neural networks (DNN) are typically optimized using stochastic gradient descent (SGD). However, the estimation of the gradient using stochastic samples tends to be noisy and unreliable, resulting in large gradient variance and poor convergence. In this paper, we propose Filter Gradient Descent (FGD), an efficient stochastic optimization algorithm that produces a consistent estimate of the local gradient by solving an adaptive filtering problem with different filter designs. Our method reduces variance in stochastic gradient descent by incorporating the historical states to enhance the current estimation. It is able to correct noisy gradient direction as well as to accelerate the convergence of learning. We demonstrate the effectiveness of the proposed Filter Gradient Descent on numerical optimization and training neural networks, where it achieves superior and robust performance compared with traditional momentum-based methods. To the best of our knowledge, we are the first to provide a practical solution that integrates filtering into gradient estimation by making the analogy between gradient estimation and filtering problems in signal processing. (The code is provided in this https URL)
摘要：通常使用随机梯度下降（SGD）对深度神经网络（DNN）进行优化。但是，使用随机样本进行的梯度估计往往比较嘈杂且不可靠，导致梯度变化较大且收敛性较差。在本文中，我们提出了Filter Gradient Descent（FGD），这是一种有效的随机优化算法，它通过解决不同滤波器设计的自适应滤波问题来对局部梯度进行一致的估计。我们的方法通过合并历史状态来增强当前估计，从而减少了随机梯度下降的方差。它能够校正嘈杂的梯度方向，并加速学习的收敛。我们在数值优化和训练神经网络上证明了拟议的过滤器梯度下降法的有效性，与传统的基于动量的方法相比，该方法可实现卓越而强大的性能。据我们所知，我们是第一个通过在信号处理中将梯度估计与滤波问题进行类比的方法，提供将滤波集成到梯度估计中的实用解决方案。（此https URL中提供了代码）
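
The idea of treating gradient estimation as a filtering problem can be illustrated with a textbook scalar Kalman filter applied to a stream of noisy gradient observations. This is only a sketch under stated assumptions: it is a generic filter swapped in for illustration, not the FGD algorithm from the paper, and the class name, the random-walk state model, and the variance parameters `process_var` and `obs_var` are all hypothetical choices.

```python
import random


class ScalarKalmanGradientFilter:
    """Generic scalar Kalman filter over noisy gradient observations.

    Assumed model (illustrative only): the true gradient follows a slow
    random walk and each stochastic gradient is that value plus noise.
    """

    def __init__(self, process_var=1e-4, obs_var=1.0):
        self.estimate = 0.0   # current filtered gradient estimate
        self.variance = 1.0   # uncertainty of that estimate
        self.process_var = process_var
        self.obs_var = obs_var

    def update(self, noisy_grad):
        # Predict: the true gradient may have drifted a little.
        self.variance += self.process_var
        # Correct: blend the history with the new observation.
        gain = self.variance / (self.variance + self.obs_var)
        self.estimate += gain * (noisy_grad - self.estimate)
        self.variance *= (1.0 - gain)
        return self.estimate


def compare_errors(true_grad=1.0, steps=200, seed=0):
    """Return (raw MSE, filtered MSE) against a fixed true gradient."""
    rng = random.Random(seed)
    grad_filter = ScalarKalmanGradientFilter()
    raw_sq = 0.0
    filtered_sq = 0.0
    for _ in range(steps):
        noisy = true_grad + rng.gauss(0.0, 1.0)
        raw_sq += (noisy - true_grad) ** 2
        filtered_sq += (grad_filter.update(noisy) - true_grad) ** 2
    return raw_sq / steps, filtered_sq / steps


if __name__ == "__main__":
    raw_mse, filtered_mse = compare_errors()
    print("raw MSE:", raw_mse, "filtered MSE:", filtered_mse)
```

In this toy setting the filtered estimate uses past observations to damp the noise in the current one, which is the variance-reduction effect the abstract describes in general terms.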

63. Towards Histopathological Stain Invariance by Unsupervised Domain Augmentation using Generative Adversarial Networks
Jelica Vasiljević, Friedrich Feuerhake, Cédric Wemmert, Thomas Lampert
Abstract: The application of supervised deep learning methods in digital pathology is limited due to their sensitivity to domain shift. Digital pathology is an area prone to high variability due to many sources, including the common practice of evaluating several consecutive tissue sections stained with different staining protocols. Obtaining labels for each stain is very expensive and time-consuming as it requires a high level of domain knowledge. In this article, we propose an unsupervised augmentation approach based on adversarial image-to-image translation, which facilitates the training of stain-invariant supervised convolutional neural networks. By training the network on one commonly used staining modality and applying it to images that include corresponding, but differently stained, tissue structures, the presented method demonstrates significant improvements over other approaches. These benefits are illustrated in the problem of glomeruli segmentation in seven different staining modalities (PAS, Jones H&E, CD68, Sirius Red, CD34, H&E and CD3), and analysis of the learned representations demonstrates their stain invariance.
摘要：受监督的深度学习方法在数字病理学中的应用受到限制，因为它们对域移位敏感。由于许多原因，数字病理学是易于发生高度变化的区域，包括评估使用不同染色方案染色的多个连续组织切片的常规做法。为每个污渍获取标签非常昂贵且耗时，因为它需要高水平的领域知识。在本文中，我们提出了一种基于对抗性图像到图像转换的无监督增强方法，该方法有助于训练不变染色监督的卷积神经网络。通过在一种常用的染色方式上训练网络并将其应用于包含相应但染色不同的组织结构的图像，该方法证明了与其他方法相比的显着改进。这些益处在以下七个不同的染色方式（PAS，Jones H＆E，CD68，Sirius Red，CD34，H＆E和CD3）的肾小球分割问题中得到了说明，对所学表示法的分析表明它们的染色不变性。

64. QuickTumorNet: Fast Automatic Multi-Class Segmentation of Brain Tumors
Benjamin Maas, Erfan Zabeh, Soroush Arabshahi
Abstract: Non-invasive techniques such as magnetic resonance imaging (MRI) are widely employed in brain tumor diagnostics. However, manual segmentation of brain tumors from 3D MRI volumes is a time-consuming task that requires trained expert radiologists. Due to the subjectivity of manual segmentation, there is low inter-rater reliability, which can result in diagnostic discrepancies. As the success of many brain tumor treatments depends on early intervention, early detection is paramount. In this context, a fully automated method for brain tumor segmentation is necessary as an efficient and reliable means of brain tumor detection and quantification. In this study, we propose an end-to-end approach for brain tumor segmentation, capitalizing on a modified version of QuickNAT, a brain tissue type segmentation deep convolutional neural network (CNN). Our method was evaluated on a data set of 233 patients' T1-weighted images annotated with three tumor type classes (meningioma, glioma, and pituitary). Our model, QuickTumorNet, demonstrated fast, reliable, and accurate brain tumor segmentation that can be utilized to assist clinicians in diagnosis and treatment.
摘要：磁共振成像（MRI）等非侵入性技术已广泛应用于脑肿瘤诊断。但是，从3D MRI体积手动分割脑肿瘤是一项耗时的工作，需要训练有素的放射线专家。由于手动分割的主观性，评估者之间的可靠性较低，可能会导致诊断差异。由于许多脑肿瘤治疗的成功取决于早期干预，因此早期发现至关重要。在这种情况下，需要一种用于脑肿瘤分割的全自动分割方法，作为一种有效且可靠的脑肿瘤检测和定量方法。在这项研究中，我们提出了一种端到端的脑肿瘤分割方法，它利用了QuickNAT（脑组织类型分割深层卷积神经网络（CNN））的改进版本。我们的方法在233例患者的T1加权图像的数据集上进行了评估，其中包含三种注释的肿瘤类型类别（脑膜瘤，神经胶质瘤和垂体）。我们的模型QuickTumorNet展示了快速，可靠和准确的脑肿瘤分割技术，可用于协助临床医生进行诊断和治疗。

65. Turn Signal Prediction: A Federated Learning Case Study
Sonal Doomra, Naman Kohli, Shounak Athavale
Abstract: Driving etiquette takes on a different flavor in each locality, as drivers not only comply with rules/laws but also abide by local unspoken conventions. When to have the turn signal (indicator) on/off is one such etiquette, which does not have a definitive right or wrong answer. Learning this behavior from the abundance of data generated by various sensor modalities integrated in the vehicle is a suitable candidate for deep learning. But what makes it a prime candidate for Federated Learning are privacy concerns and bandwidth limitations for any data aggregation. This paper presents a long short-term memory (LSTM) based Turn Signal Prediction (on or off) model using vehicle controller area network (CAN) signal data. The model is trained using two approaches, one by centrally aggregating the data and the other in a federated manner. Centrally trained models and federated models are compared under similar hyperparameter settings. This research demonstrates the efficacy of federated learning, paving the way for in-vehicle learning of driving etiquette.
摘要：驾驶员的礼节在每个地方都具有不同的风格，因为驾驶员不仅要遵守规则/法律，而且要遵守当地的潜规则。什么时候打开/关闭转向信号灯（指示器）是这样一种礼节，没有明确的正确或错误答案。从车辆中集成的各种传感器模式生成的大量数据中学习这种行为是深度学习的合适人选。但是，使之成为联合学习的主要候选人的原因是隐私问题和任何数据聚合的带宽限制。本文提出了一种基于长期短期记忆（LSTM）的转向信号预测（开或关）模型，该模型使用了车辆控制局域网（CAN）信号数据。使用两种方法对模型进行训练，一种方法是通过集中汇总数据，另一种则采用联合方式。在类似的超参数设置下比较集中训练的模型和联合模型。这项研究证明了联合学习的有效性，为驾驶礼节的车载学习铺平了道路。
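
The contrast between central and federated training can be made concrete with the server-side aggregation step of federated averaging (FedAvg), in which clients train locally and only their model parameters are combined. The abstract does not say which aggregation rule the authors use, so FedAvg is an assumption here, and the function name and the flat-list parameter format are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Sample-size-weighted average of per-client parameter vectors.

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes:   number of local training samples for each client.
    Raw data never leaves the clients; only parameters are aggregated.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one size per client and at least one client")
    dim = len(client_weights[0])
    if any(len(weights) != dim for weights in client_weights):
        raise ValueError("all clients must report the same number of parameters")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i, value in enumerate(weights):
            # Clients with more local data contribute proportionally more.
            averaged[i] += value * size / total
    return averaged


if __name__ == "__main__":
    # Two hypothetical vehicles with 1 and 3 local samples.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In a full system this step would be repeated over many rounds, with the averaged parameters sent back to the vehicles before the next round of local training.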

66. Out-distribution aware Self-training in an Open World Setting
Maximilian Augustin, Matthias Hein
Abstract: Deep Learning heavily depends on large labeled datasets, which limits further improvements. While unlabeled data is available in large amounts, in particular in image recognition, it does not fulfill the closed world assumption of semi-supervised learning that all unlabeled data are task-related. The goal of this paper is to leverage unlabeled data in an open world setting to further improve prediction performance. For this purpose, we introduce out-distribution aware self-training, which includes a careful sample selection strategy based on the confidence of the classifier. While normal self-training deteriorates prediction performance, our iterative scheme improves using up to 15 times the amount of originally labeled data. Moreover, our classifiers are by design out-distribution aware and can thus distinguish task-related inputs from unrelated ones.
摘要：深度学习在很大程度上依赖于大型标签数据集，这限制了进一步的改进。尽管可以大量获取未标记的数据，尤其是在图像识别中，但是它不能满足半监督学习的封闭世界假设，即所有未标记的数据都与任务相关。本文的目的是在开放的环境中利用未标记的数据来进一步提高预测性能。为此，我们引入了具有分布感知能力的自我训练，其中包括基于分类器置信度的谨慎样本选择策略。虽然正常的自我训练会降低预测性能，但我们的迭代方案可以使用多达15倍于原始标记数据的量来进行改进。此外，我们的分类器通过设计可以识别出分布，因此可以将与任务相关的输入与不相关的输入区分开。
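
Confidence-based sample selection for self-training can be sketched as a simple filter: keep only those unlabeled samples whose highest predicted class probability clears a threshold, and use the predicted class as a pseudo-label. This is a minimal illustration rather than the paper's selection strategy; the function name and the threshold value of 0.9 are hypothetical.

```python
def select_confident(probabilities, threshold=0.9):
    """Pick unlabeled samples the classifier is confident about.

    probabilities: list of per-sample class-probability lists.
    Returns a list of (sample_index, pseudo_label) pairs for samples whose
    top class probability is at least `threshold` (an assumed cutoff).
    """
    selected = []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            # The pseudo-label is the index of the most probable class.
            selected.append((index, probs.index(confidence)))
    return selected


if __name__ == "__main__":
    predictions = [
        [0.95, 0.05],  # confident: kept with pseudo-label 0
        [0.50, 0.50],  # ambiguous: rejected
        [0.02, 0.98],  # confident: kept with pseudo-label 1
    ]
    print(select_confident(predictions))
```

In an open world setting, low-confidence samples include inputs unrelated to the task, so rejecting them keeps such data out of the pseudo-labeled training set, provided the classifier's confidence is itself reliable on out-distribution inputs.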

67. On Frank-Wolfe Optimization for Adversarial Robustness and Interpretability
Theodoros Tsiligkaridis, Jay Roberts
Abstract: Deep neural networks are easily fooled by small perturbations known as adversarial attacks. Adversarial Training (AT) is a technique that approximately solves a robust optimization problem to minimize the worst-case loss and is widely regarded as the most effective defense against such attacks. While projected gradient descent (PGD) has received the most attention for approximately solving the inner maximization of AT, Frank-Wolfe (FW) optimization is projection-free and can be adapted to any $L^p$ norm. A Frank-Wolfe adversarial training approach is presented and is shown to provide a level of robustness competitive with PGD-AT without much tuning for a variety of architectures. We empirically show that robustness is strongly connected to the $L^2$ magnitude of the adversarial perturbation and that more locally linear loss landscapes tend to have larger $L^2$ distortions despite having the same $L^\infty$ distortion. We provide theoretical guarantees on the magnitude of the distortion for FW that depend on local geometry which FW-AT exploits. It is empirically shown that FW-AT achieves strong robustness to white-box attacks and black-box attacks and offers improved resistance to gradient masking. Further, FW-AT allows networks to learn high-quality human-interpretable features which are then used to generate counterfactual explanations to model predictions by using dense and sparse adversarial perturbations.
摘要：深度神经网络很容易被称为对抗攻击的小扰动所欺骗。对抗训练（AT）是一种技术，可以近似解决强大的优化问题，以最大程度地减少最坏情况的损失，并且被广泛认为是针对此类攻击的最有效防御措施。虽然投影梯度下降（PGD）在近似解决AT的内部最大化问题上受到了最多的关注，但Frank-Wolfe（FW）优化是无投影的，并且可以适应任何$L^p$范数。提出了一种Frank-Wolfe对抗训练方法，该方法显示出与PGD-AT一样的竞争优势，而没有为多种体系结构做太多调整。我们根据经验表明，鲁棒性与对抗性扰动的$L^2$大小密切相关，并且尽管具有相同的$L^\infty$失真，但局部线性损失较大的趋势倾向于具有较大的$L^2$失真。我们为FW的失真程度提供了理论上的保证，这取决于FW-AT利用的局部几何形状。从经验上可以看出，FW-AT对白盒和黑盒攻击具有强大的鲁棒性，并提高了对梯度掩膜的抵抗力。此外，FW-AT允许网络学习高质量的人类可解释特征，然后使用密集和稀疏的对抗性扰动将其用于生成反事实解释，以对预测进行建模。
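
The "projection-free" property of Frank-Wolfe can be seen in a small sketch of maximizing an objective over an $L^\infty$ ball of radius eps. Each step moves toward the ball vertex eps * sign(gradient) by a convex combination, so the iterate never leaves the ball and no projection is needed. This is a generic Frank-Wolfe loop on a toy concave objective, not the FW-AT training procedure; the function names, the toy objective, and the classic step size 2/(k+2) are assumptions made for illustration.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def frank_wolfe_linf(grad_fn, dim, eps, steps=20):
    """Maximize an objective over the L-infinity ball {d : |d_i| <= eps}.

    grad_fn(delta) must return the gradient of the objective at delta.
    Assumed step-size schedule: gamma_k = 2 / (k + 2).
    """
    delta = [0.0] * dim  # start at the center of the ball
    for k in range(steps):
        grad = grad_fn(delta)
        # Linear maximization oracle: the best point of the ball for a
        # linear objective is the vertex eps * sign(grad).
        vertex = [eps * sign(g) for g in grad]
        gamma = 2.0 / (k + 2.0)
        # Convex combination of two feasible points stays feasible,
        # so no projection step is required.
        delta = [(1.0 - gamma) * d + gamma * v for d, v in zip(delta, vertex)]
    return delta


if __name__ == "__main__":
    # Toy concave objective: -||delta - target||^2, gradient 2 * (target - delta).
    target = [0.5, -0.02]

    def grad_fn(delta):
        return [2.0 * (t - d) for t, d in zip(target, delta)]

    print(frank_wolfe_linf(grad_fn, dim=2, eps=0.1, steps=200))
```

In adversarial training the objective would be the network loss as a function of the input perturbation. For other $L^p$ balls only the linear maximization oracle changes, which is what makes the method adaptable across norms.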

68. Video Influencers: Unboxing the Mystique
Prashant Rajaram, Puneet Manchanda
Abstract: Influencer marketing is being used increasingly as a tool to reach customers because of the growing popularity of social media stars who primarily reach their audience(s) via custom videos. Despite the rapid growth in influencer marketing, there has been little research on the design and effectiveness of influencer videos. Using publicly available data on YouTube influencer videos, we implement novel interpretable deep learning architectures, supported by transfer learning, to identify significant relationships between advertising content in videos (across text, audio, and images) and video views, interaction rates and sentiment. By avoiding ex-ante feature engineering and instead using ex-post interpretation, our approach avoids making a trade-off between interpretability and predictive ability. We filter out relationships that are affected by confounding factors unassociated with an increase in attention to video elements, thus facilitating the generation of plausible causal relationships between video elements and marketing outcomes which can be tested in the field. A key finding is that brand mentions in the first 30 seconds of a video are on average associated with a significant increase in attention to the brand but a significant decrease in sentiment expressed towards the video. We illustrate the learnings from our approach for both influencers and brands.
摘要：由于社交媒体明星主要通过自定义视频吸引受众，因此越来越多的网红营销被用作吸引客户的工具。尽管影响者营销迅速增长，但对影响者视频的设计和有效性的研究很少。利用YouTube影响者视频上的公开数据，我们实施了新颖的，可解释的深度学习架构，并在迁移学习的支持下，确定了视频中的广告内容（跨文本，音频和图像）与视频观看次数，互动率和情感之间的重要关系。通过避免事前特征工程，而使用事后解释，我们的方法避免了在可解释性和预测能力之间进行权衡。我们筛选出不受混杂因素影响的关系，这些混杂因素与对视频元素的关注度增加无关，从而有助于在视频元素和营销成果之间建立合理的因果关系，从而可以在现场进行测试。一个关键发现是，在视频的前30秒中，品牌提及平均与对品牌的关注度显着提高但对视频表达的情感显着下降有关。我们说明了从我们的方法中为有影响力的人和品牌所获得的经验。

注：中文为机器翻译结果！封面为论文标题词云图！

【arxiv论文】 Computer Vision and Pattern Recognition 2020-12-24

目录

摘要