
[arXiv papers] Computer Vision and Pattern Recognition 2020-11-03

Contents

1. Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors [PDF] Abstract
2. Pushing the Envelope of Rotation Averaging for Visual SLAM [PDF] Abstract
3. Reducing the Annotation Effort for Video Object Segmentation Datasets [PDF] Abstract
4. SLAM in the Field: An Evaluation of Monocular Mapping and Localization on Challenging Dynamic Agricultural Environment [PDF] Abstract
5. Facial Keypoint Sequence Generation from Audio [PDF] Abstract
6. Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information [PDF] Abstract
7. Image Inpainting with Learnable Feature Imputation [PDF] Abstract
8. CABiNet: Efficient Context Aggregation Network for Low-Latency Semantic Segmentation [PDF] Abstract
9. PBP-Net: Point Projection and Back-Projection Network for 3D Point Cloud Segmentation [PDF] Abstract
10. 3D Multi-bodies: Fitting Sets of Plausible 3D Human Models to Ambiguous Image Data [PDF] Abstract
11. Refactoring Policy for Compositional Generalizability using Self-Supervised Object Proposals [PDF] Abstract
12. Diverse Image Captioning with Context-Object Split Latent Spaces [PDF] Abstract
13. Learning a Deep Reinforcement Learning Policy Over the Latent Space of a Pre-trained GAN for Semantic Age Manipulation [PDF] Abstract
14. Point Transformer [PDF] Abstract
15. Boost Image Captioning with Knowledge Reasoning [PDF] Abstract
16. MARNet: Multi-Abstraction Refinement Network for 3D Point Cloud Analysis [PDF] Abstract
17. Facial UV Map Completion for Pose-invariant Face Recognition: A Novel Adversarial Approach based on Coupled Attention Residual UNets [PDF] Abstract
18. Efficient texture mapping via a non-iterative global texture alignment [PDF] Abstract
19. Receptive Field Size Optimization with Continuous Time Pooling [PDF] Abstract
20. Do 2D GANs Know 3D Shape? Unsupervised 3D shape reconstruction from 2D Image GANs [PDF] Abstract
21. Predicting Brain Degeneration with a Multimodal Siamese Neural Network [PDF] Abstract
22. PV-NAS: Practical Neural Architecture Search for Video Recognition [PDF] Abstract
23. Data-free Knowledge Distillation for Segmentation using Data-Enriching GAN [PDF] Abstract
24. CaCL: Class-aware Codebook Learning for Weakly Supervised Segmentation on Diffuse Image Patterns [PDF] Abstract
25. A topological approach to exploring convolutional neural networks [PDF] Abstract
26. Deep Representation Decomposition for Feature Disentanglement [PDF] Abstract
27. Actor and Action Modular Network for Text-based Video Segmentation [PDF] Abstract
28. Context-based Image Segment Labeling (CBISL) [PDF] Abstract
29. Set Augmented Triplet Loss for Video Person Re-Identification [PDF] Abstract
30. Deep Feature Augmentation for Occluded Image Classification [PDF] Abstract
31. Mutual Information-based Disentangled Neural Networks for Classifying Unseen Categories in Different Domains: Application to Fetal Ultrasound Imaging [PDF] Abstract
32. CNN-Driven Quasiconformal Model for Large Deformation Image Registration [PDF] Abstract
33. Road Damage Detection using Deep Ensemble Learning [PDF] Abstract
34. Multi-Modal Active Learning for Automatic Liver Fibrosis Diagnosis based on Ultrasound Shear Wave Elastography [PDF] Abstract
35. Highway Driving Dataset for Semantic Video Segmentation [PDF] Abstract
36. Multi-View Adaptive Fusion Network for 3D Object Detection [PDF] Abstract
37. Unsupervised Metric Relocalization Using Transform Consistency Loss [PDF] Abstract
38. COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning [PDF] Abstract
39. FusiformNet: Extracting Discriminative Facial Features on Different Levels [PDF] Abstract
40. Human Leg Motion Tracking by Fusing IMUs and RGB Camera Data Using Extended Kalman Filter [PDF] Abstract
41. DeepOpht: Medical Report Generation for Retinal Images via Deep Models and Visual Explanation [PDF] Abstract
42. LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud-based Deep Networks [PDF] Abstract
43. Memory Group Sampling Based Online Action Recognition Using Kinetic Skeleton Features [PDF] Abstract
44. Adversarial Self-Supervised Scene Flow Estimation [PDF] Abstract
45. Autonomous Extraction of Gleason Patterns for Grading Prostate Cancer using Multi-Gigapixel Whole Slide Images [PDF] Abstract
46. HM4: Hidden Markov Model with Memory Management for Visual Place Recognition [PDF] Abstract
47. A Parallel Approach for Real-Time Face Recognition from a Large Database [PDF] Abstract
48. Efficient Pipelines for Vision-Based Context Sensing [PDF] Abstract
49. Dark Reciprocal-Rank: Boosting Graph-Convolutional Self-Localization Network via Teacher-to-student Knowledge Transfer [PDF] Abstract
50. Temporally-Continuous Probabilistic Prediction using Polynomial Trajectory Parameterization [PDF] Abstract
51. IndRNN Based Long-term Temporal Recognition in the Spatial and Frequency Domain [PDF] Abstract
52. Real-Time Text Detection and Recognition [PDF] Abstract
53. A Survey on Contrastive Self-supervised Learning [PDF] Abstract
54. TartanVO: A Generalizable Learning-based VO [PDF] Abstract
55. Unsupervised Deep Persistent Monocular Visual Odometry and Depth Estimation in Extreme Environments [PDF] Abstract
56. Self-paced and self-consistent co-training for semi-supervised image segmentation [PDF] Abstract
57. Scene Flow from Point Clouds with or without Learning [PDF] Abstract
58. General Data Analytics With Applications To Visual Information Analysis: A Provable Backward-Compatible Semisimple Paradigm Over T-Algebra [PDF] Abstract
59. Pose Randomization for Weakly Paired Image Style Translation [PDF] Abstract
60. LandmarkGAN: Synthesizing Faces from Landmarks [PDF] Abstract
61. ProxylessKD: Direct Knowledge Distillation with Inherited Classifier for Face Recognition [PDF] Abstract
62. Temporal Smoothing for 3D Human Pose Estimation and Localization for Occluded People [PDF] Abstract
63. Enhanced Balancing GAN: Minority-class Image Generation [PDF] Abstract
64. Self-supervised Representation Learning for Evolutionary Neural Architecture Search [PDF] Abstract
65. Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution [PDF] Abstract
66. Learning Open Set Network with Discriminative Reciprocal Points [PDF] Abstract
67. Multimodal and self-supervised representation learning for automatic gesture recognition in surgical robotics [PDF] Abstract
68. Automatic Chronic Degenerative Diseases Identification Using Enteric Nervous System Images [PDF] Abstract
69. Weakly Supervised 3D Classification of Chest CT using Aggregated Multi-Resolution Deep Segmentation Features [PDF] Abstract
70. Leveraging Adaptive Color Augmentation in Convolutional Neural Networks for Deep Skin Lesion Segmentation [PDF] Abstract
71. Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation [PDF] Abstract
72. (Un)Masked COVID-19 Trends from Social Media [PDF] Abstract
73. Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation [PDF] Abstract
74. A Deep Learning Study on Osteosarcoma Detection from Histological Images [PDF] Abstract
75. Perceive, Attend, and Drive: Learning Spatial Attention for Safe Self-Driving [PDF] Abstract
76. Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds [PDF] Abstract
77. U-Net and its variants for medical image segmentation: theory and applications [PDF] Abstract
78. Top 10 BraTS 2020 challenge solution: Brain tumor segmentation with self-ensembled, deeply-supervised 3D-Unet like neural networks [PDF] Abstract
79. Depth Ranging Performance Evaluation and Improvement for RGB-D Cameras on Field-Based High-Throughput Phenotyping Robots [PDF] Abstract
80. AVECL-UMONS database for audio-visual event classification and localization [PDF] Abstract
81. ASIST: Annotation-free synthetic instance segmentation and tracking for microscope video analysis [PDF] Abstract
82. Deep Learning in Computer-Aided Diagnosis and Treatment of Tumors: A Survey [PDF] Abstract
83. nnU-Net for Brain Tumor Segmentation [PDF] Abstract
84. Bifurcated Autoencoder for Segmentation of COVID-19 Infected Regions in CT Images [PDF] Abstract
85. Brain Tumor Classification Using Medial Residual Encoder Layers [PDF] Abstract
86. Tracking Partially-Occluded Deformable Objects while Enforcing Geometric Constraints [PDF] Abstract
87. Triage of Potential COVID-19 Patients from Chest X-ray Images using Hierarchical Convolutional Networks [PDF] Abstract
88. Learning Euler's Elastica Model for Medical Image Segmentation [PDF] Abstract
89. Generating Correct Answers for Progressive Matrices Intelligence Tests [PDF] Abstract
90. Dynamic radiomics: a new methodology to extract quantitative time-related features from tomographic images [PDF] Abstract
91. Two-layer clustering-based sparsifying transform learning for low-dose CT reconstruction [PDF] Abstract
92. Using Monte Carlo dropout and bootstrap aggregation for uncertainty estimation in radiation therapy dose prediction with deep learning neural networks [PDF] Abstract
93. Segmentation of Infrared Breast Images Using MultiResUnet Neural Network [PDF] Abstract
94. Pose Estimation of Specular and Symmetrical Objects [PDF] Abstract
95. DL-Reg: A Deep Learning Regularization Technique using Linear Regression [PDF] Abstract
96. Deep learning in the ultrasound evaluation of neonatal respiratory status [PDF] Abstract
97. Encoding Clinical Priori in 3D Convolutional Neural Networks for Prostate Cancer Detection in bpMRI [PDF] Abstract
98. Meta-Learning with Adaptive Hyperparameters [PDF] Abstract
99. Combining Domain-Specific Meta-Learners in the Parameter Space for Cross-Domain Few-Shot Classification [PDF] Abstract
100. Evaluation of Inference Attack Models for Deep Learning on Medical Data [PDF] Abstract
101. Dense Pixel-wise Micro-motion Estimation of Object Surface by using Low Dimensional Embedding of Laser Speckle Pattern [PDF] Abstract
102. Integer Programming-based Error-Correcting Output Code Design for Robust Classification [PDF] Abstract
103. EDCNN: Edge enhancement-based Densely Connected Network with Compound Loss for Low-Dose CT Denoising [PDF] Abstract
104. Multi-stage transfer learning for lung segmentation using portable X-ray devices for patients with COVID-19 [PDF] Abstract
105. C-Net: A Reliable Convolutional Neural Network for Biomedical Image Classification [PDF] Abstract
106. 83% ImageNet Accuracy in One Hour [PDF] Abstract
107. Adversarial Robust Training in MRI Reconstruction [PDF] Abstract

Abstracts

1. Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors [PDF] Back to Contents
  Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Siwei Ma, Ming-Hsuan Yang
Abstract: Recent image-to-image (I2I) translation algorithms focus on learning the mapping from a source to a target domain. However, the continuous translation problem that synthesizes intermediate results between the two domains has not been well-studied in the literature. Generating a smooth sequence of intermediate results bridges the gap between two different domains, facilitating the morphing effect across domains. Existing I2I approaches are limited to either intra-domain or deterministic inter-domain continuous translation. In this work, we present an effective signed attribute vector, which enables continuous translation on diverse mapping paths across various domains. In particular, utilizing the sign operation to encode the domain information, we introduce a unified attribute space shared by all domains, thereby allowing interpolation between attribute vectors of different domains. To enhance the visual quality of continuous translation results, we generate a trajectory between two sign-symmetrical attribute vectors and leverage the domain information of the interpolated results along the trajectory for adversarial training. We evaluate the proposed method on a wide range of I2I translation tasks. Both qualitative and quantitative results demonstrate that the proposed framework generates higher-quality continuous translation results than state-of-the-art methods.
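A rough sketch of the sign-encoding idea (hypothetical code, not the authors' implementation): linearly interpolating between an attribute vector and its sign-symmetric counterpart yields the cross-domain trajectory whose intermediate vectors would condition the generator.

```python
import numpy as np

def signed_attribute_trajectory(attr, steps=8):
    """Walk from the source-domain vector (+attr) to its sign-symmetric
    target-domain counterpart (-attr); each intermediate vector would
    condition the generator on one frame of the morphing sequence."""
    return [(1.0 - t) * attr + t * (-attr) for t in np.linspace(0.0, 1.0, steps)]

# toy usage: an 8-dim attribute vector whose sign encodes the domain
z = np.random.randn(8)
path = signed_attribute_trajectory(z)
```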

2. Pushing the Envelope of Rotation Averaging for Visual SLAM [PDF] Back to Contents
  Xinyi Li, Lin Yuan, Longin Jan Latecki, Haibin Ling
Abstract: As an essential part of structure from motion (SfM) and Simultaneous Localization and Mapping (SLAM) systems, motion averaging has been extensively studied in the past years and continues to attract surging research attention. While canonical approaches such as bundle adjustment are predominantly inherited by most state-of-the-art SLAM systems to estimate and update the trajectory during robot navigation, the practical implementation of bundle adjustment in SLAM systems is intrinsically limited by high computational complexity, unreliable convergence, and strict requirements on initialization. In this paper, we lift these limitations and propose a novel optimization backbone for visual SLAM systems, where we leverage rotation averaging to improve the accuracy, efficiency and robustness of conventional monocular SLAM pipelines. In our approach, we first decouple the rotational and translational parameters in the camera rigid body transformation and convert the high-dimensional non-convex nonlinear problem into tractable linear subproblems in lower dimensions, and show that the subproblems can be solved independently with proper constraints. We apply the scale parameter with an $l_1$-norm in the pose-graph optimization to make rotation averaging robust against outliers. We further validate the global optimality of our proposed approach, and revisit and address the initialization schemes, pure rotational scene handling and outlier treatments. We demonstrate that our approach can run up to 10x faster than the state of the art on public benchmarks, with comparable accuracy.
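To make the decoupling concrete, below is a minimal chordal-relaxation sketch of the rotation subproblem: each column of the absolute rotations is solved as an independent linear least-squares system, then the estimates are projected back onto SO(3). This is a textbook baseline written under assumed conventions, not the paper's $l_1$-robust pose-graph optimizer.

```python
import numpy as np

def average_rotations(edges, n):
    """edges: list of (i, j, R_ij) with measured relative rotations,
    assuming the convention R_j ~ R_ij @ R_i. Solves each of the three
    matrix columns by linear least squares (gauge fixed by anchoring
    node 0 at the identity), then projects back onto SO(3) via SVD."""
    R = np.zeros((n, 3, 3))
    for c in range(3):
        A, b = [], []
        for k in range(3):                # gauge: column c of R_0 equals e_c
            row = np.zeros(3 * n)
            row[k] = 1.0
            A.append(row)
            b.append(1.0 if k == c else 0.0)
        for i, j, Rij in edges:           # R_j[:, c] - R_ij @ R_i[:, c] = 0
            for k in range(3):
                row = np.zeros(3 * n)
                row[3 * j + k] = 1.0
                row[3 * i:3 * i + 3] -= Rij[k]
                A.append(row)
                b.append(0.0)
        x, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
        R[:, :, c] = x.reshape(n, 3)
    for i in range(n):                    # nearest rotation matrix per node
        U, _, Vt = np.linalg.svd(R[i])
        R[i] = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
    return R
```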

3. Reducing the Annotation Effort for Video Object Segmentation Datasets [PDF] Back to Contents
  Paul Voigtlaender, Lishu Luo, Chun Yuan, Yong Jiang, Bastian Leibe
Abstract: For further progress in video object segmentation (VOS), larger, more diverse, and more challenging datasets will be necessary. However, densely labeling every frame with pixel masks does not scale to large datasets. We use a deep convolutional network to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations and investigate how far such pseudo-labels can carry us for training state-of-the-art VOS approaches. A very encouraging result of our study is that adding a manually annotated mask in only a single video frame for each object is sufficient to generate pseudo-labels which can be used to train a VOS method to reach almost the same performance level as when training with fully segmented videos. We use this workflow to create pixel pseudo-labels for the training set of the challenging tracking dataset TAO, and we manually annotate a subset of the validation set. Together, we obtain the new TAO-VOS benchmark, which we make publicly available at this http URL. While the performance of state-of-the-art methods on existing datasets starts to saturate, TAO-VOS remains very challenging for current algorithms and reveals their shortcomings.

4. SLAM in the Field: An Evaluation of Monocular Mapping and Localization on Challenging Dynamic Agricultural Environment [PDF] Back to Contents
  Fangwen Shu, Paul Lesur, Yaxu Xie, Alain Pagani, Didier Stricker
Abstract: This paper demonstrates a system capable of combining a sparse, indirect, monocular visual SLAM, with both offline and real-time Multi-View Stereo (MVS) reconstruction algorithms. This combination overcomes many obstacles encountered by autonomous vehicles or robots employed in agricultural environments, such as overly repetitive patterns, need for very detailed reconstructions, and abrupt movements caused by uneven roads. Furthermore, the use of a monocular SLAM makes our system much easier to integrate with an existing device, as we do not rely on a LiDAR (which is expensive and power consuming), or stereo camera (whose calibration is sensitive to external perturbation e.g. camera being displaced). To the best of our knowledge, this paper presents the first evaluation results for monocular SLAM, and our work further explores unsupervised depth estimation on this specific application scenario by simulating RGB-D SLAM to tackle the scale ambiguity, and shows our approach produces reconstructions that are helpful to various agricultural tasks. Moreover, we highlight that our experiments provide meaningful insight to improve monocular SLAM systems under agricultural settings.

5. Facial Keypoint Sequence Generation from Audio [PDF] Back to Contents
  Prateek Manocha, Prithwijit Guha
Abstract: Whenever we speak, our voice is accompanied by facial movements and expressions. Several recent works have shown the synthesis of highly photo-realistic videos of talking faces, but they either require a source video to drive the target face or only generate videos with a fixed head pose. This lack of facial movement is because most of these works focus on the lip movement in sync with the audio while assuming the remaining facial keypoints' fixed nature. To address this, a unique audio-keypoint dataset of over 150,000 videos at 224p and 25fps is introduced that relates facial keypoint movement to the given audio. This dataset is then further used to train the model, Audio2Keypoint, a novel approach for synthesizing facial keypoint movement to go with the audio. Given a single image of the target person and an audio sequence (in any language), Audio2Keypoint generates a plausible keypoint movement sequence in sync with the input audio, conditioned on the input image to preserve the target person's facial characteristics. To the best of our knowledge, this is the first work that proposes an audio-keypoint dataset and learns a model to output a plausible keypoint sequence to go with audio of any arbitrary length. Audio2Keypoint generalizes across unseen people with different facial structures, allowing us to generate the sequence with the voice from any source or even synthetic voices. Instead of learning a direct mapping from the audio to the video domain, this work aims to learn the audio-keypoint mapping that allows for in-plane and out-of-plane head rotations, while preserving the person's identity using a Pose Invariant (PIV) Encoder.

6. Multi-Task Learning for Calorie Prediction on a Novel Large-Scale Recipe Dataset Enriched with Nutritional Information [PDF] Back to Contents
  Robin Ruede, Verena Heusser, Lukas Frank, Alina Roitberg, Monica Haurilet, Rainer Stiefelhagen
Abstract: A rapidly growing amount of content posted online, such as food recipes, opens doors to new exciting applications at the intersection of vision and language. In this work, we aim to estimate the calorie amount of a meal directly from an image by learning from recipes people have published on the Internet, thus skipping time-consuming manual data annotation. Since there are few large-scale publicly available datasets captured in unconstrained environments, we propose the pic2kcal benchmark comprising 308,000 images from over 70,000 recipes including photographs, ingredients and instructions. To obtain nutritional information of the ingredients and automatically determine the ground-truth calorie value, we match the items in the recipes with structured information from a food item database. We evaluate various neural networks for regression of the calorie quantity and extend them with the multi-task paradigm. Our learning procedure combines the calorie estimation with prediction of proteins, carbohydrates, and fat amounts as well as a multi-label ingredient classification. Our experiments demonstrate clear benefits of multi-task learning for calorie estimation, surpassing the single-task calorie regression by 9.9%. To encourage further research on this task, we make the code for generating the dataset and the models publicly available.
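A minimal sketch of what such a multi-task setup could look like in PyTorch; the feature dimension, ingredient vocabulary size (1000) and loss weight below are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NutritionHead(nn.Module):
    """One backbone feature vector feeds regression heads for calories,
    protein, carbohydrates and fat, plus a multi-label ingredient classifier."""
    def __init__(self, feat_dim=2048, n_ingredients=1000):
        super().__init__()
        self.regress = nn.Linear(feat_dim, 4)  # kcal, protein, carbs, fat
        self.ingredients = nn.Linear(feat_dim, n_ingredients)

    def forward(self, feats):
        return self.regress(feats), self.ingredients(feats)

def multitask_loss(reg_pred, ing_logits, reg_target, ing_target, w_ing=0.1):
    # regression on the four nutritional values + weighted multi-label BCE
    return (F.mse_loss(reg_pred, reg_target)
            + w_ing * F.binary_cross_entropy_with_logits(ing_logits, ing_target))
```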

7. Image Inpainting with Learnable Feature Imputation [PDF] Back to Contents
  Håkon Hukkelås, Frank Lindseth, Rudolf Mester
Abstract: A regular convolution layer applying a filter in the same way over known and unknown areas causes visual artifacts in the inpainted image. Several studies address this issue with feature re-normalization on the output of the convolution. However, these models use a significant amount of learnable parameters for feature re-normalization, or assume a binary representation of the certainty of an output. We propose (layer-wise) feature imputation of the missing input values to a convolution. In contrast to learned feature re-normalization, our method is efficient and introduces a minimal number of parameters. Furthermore, we propose a revised gradient penalty for image inpainting, and a novel GAN architecture trained exclusively on adversarial loss. Our quantitative evaluation on the FDF dataset reflects that our revised gradient penalty and alternative convolution improves generated image quality significantly. We present comparisons on CelebA-HQ and Places2 to current state-of-the-art to validate our model.
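One way to read the abstract's layer-wise imputation is sketched below: hole positions are filled with a learnable per-channel value before an ordinary convolution, adding only one parameter per input channel. This is an interpretation for illustration, not the authors' reference code.

```python
import torch
import torch.nn as nn

class ImputedConv2d(nn.Module):
    """Replace unknown inputs (mask == 0) with a learnable per-channel
    value instead of zeros, then apply a regular convolution."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.fill = nn.Parameter(torch.zeros(1, in_ch, 1, 1))  # learnable imputation
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x, mask):
        # mask: (N, 1, H, W), 1 for known pixels, 0 for holes
        x = mask * x + (1.0 - mask) * self.fill
        return self.conv(x)
```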

8. CABiNet: Efficient Context Aggregation Network for Low-Latency Semantic Segmentation [PDF] Back to Contents
  Saumya Kumaar, Ye Lyu, Francesco Nex, Michael Ying Yang
Abstract: With the increasing demand of autonomous machines, pixel-wise semantic segmentation for visual scene understanding needs to be not only accurate but also efficient for any potential real-time applications. In this paper, we propose CABiNet (Context Aggregated Bi-lateral Network), a dual branch convolutional neural network (CNN), with significantly lower computational costs as compared to the state-of-the-art, while maintaining a competitive prediction accuracy. Building upon the existing multi-branch architectures for high-speed semantic segmentation, we design a cheap high resolution branch for effective spatial detailing and a context branch with light-weight versions of global aggregation and local distribution blocks, potent to capture both long-range and local contextual dependencies required for accurate semantic segmentation, with low computational overheads. Specifically, we achieve 76.6% and 75.9% mIOU on Cityscapes validation and test sets respectively, at 76 FPS on an NVIDIA RTX 2080Ti and 8 FPS on a Jetson Xavier NX. Codes and training models will be made publicly available.

9. PBP-Net: Point Projection and Back-Projection Network for 3D Point Cloud Segmentation [PDF] Back to Contents
  JuYoung Yang, Chanho Lee, Pyunghwan Ahn, Haeil Lee, Eojindl Yi, Junmo Kim
Abstract: Following considerable development in 3D scanning technologies, many studies have recently been proposed with various approaches for 3D vision tasks, including some methods that utilize 2D convolutional neural networks (CNNs). However, even though 2D CNNs have achieved high performance in many 2D vision tasks, existing works have not effectively applied them to 3D vision tasks. In particular, segmentation has not been well studied because of the difficulty of dense prediction for each point, which requires rich feature representation. In this paper, we propose a simple and efficient architecture named the point projection and back-projection network (PBP-Net), which leverages 2D CNNs for 3D point cloud segmentation. Three modules are introduced, each of which projects the 3D point cloud onto 2D planes, extracts features using a 2D CNN backbone, and back-projects the features onto the original 3D point cloud. To demonstrate effective 3D feature extraction using a 2D CNN, we perform various experiments including comparison to recent methods. We analyze the proposed modules through ablation studies and perform experiments on object part segmentation (ShapeNet-Part dataset) and indoor scene semantic segmentation (S3DIS dataset). The experimental results show that the proposed PBP-Net achieves performance comparable to existing state-of-the-art methods.
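A toy version of the projection step, under assumed details (a single XY plane and a fixed 64x64 grid): per-point features are scatter-averaged into grid cells so a 2D CNN can process the plane, and back-projection simply reads each point's own cell.

```python
import torch

def project_and_backproject(points, feats, H=64, W=64):
    """points: (P, 3), feats: (P, D). Returns the (H, W, D) plane image
    (to be permuted to (D, H, W) for a 2D CNN) and per-point features
    read back from each point's cell."""
    xy = points[:, :2]
    xy = (xy - xy.min(0).values) / (xy.max(0).values - xy.min(0).values + 1e-8)
    u = (xy[:, 0] * (W - 1)).long()
    v = (xy[:, 1] * (H - 1)).long()
    cell = v * W + u                                  # flat cell index per point
    D = feats.size(1)
    grid = torch.zeros(H * W, D).index_add_(0, cell, feats)
    count = torch.zeros(H * W).index_add_(0, cell, torch.ones(points.size(0)))
    grid = grid / count.clamp(min=1).unsqueeze(1)     # average points per cell
    return grid.view(H, W, D), grid[cell]
```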

10. 3D Multi-bodies: Fitting Sets of Plausible 3D Human Models to Ambiguous Image Data [PDF] Back to Contents
  Benjamin Biggs, Sébastien Ehrhardt, Hanbyul Joo, Benjamin Graham, Andrea Vedaldi, David Novotny
Abstract: We consider the problem of obtaining dense 3D reconstructions of humans from single and partially occluded views. In such cases, the visual evidence is usually insufficient to identify a 3D reconstruction uniquely, so we aim at recovering several plausible reconstructions compatible with the input data. We suggest that ambiguities can be modelled more effectively by parametrizing the possible body shapes and poses via a suitable 3D model, such as SMPL for humans. We propose to learn a multi-hypothesis neural network regressor using a best-of-M loss, where each of the M hypotheses is constrained to lie on a manifold of plausible human poses by means of a generative model. We show that our method outperforms alternative approaches in ambiguous pose recovery on standard benchmarks for 3D humans, and in heavily occluded versions of these benchmarks.
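The best-of-M objective itself is compact enough to sketch: score every hypothesis against the ground truth but back-propagate only through the best one, so the remaining hypotheses stay free to cover other plausible poses. The squared-error hypothesis loss here is an assumption for illustration.

```python
import torch

def best_of_m_loss(hypotheses, target):
    """hypotheses: (N, M, D) pose hypotheses, target: (N, D) ground truth.
    Only the hypothesis closest to the ground truth receives gradient."""
    per_hyp = ((hypotheses - target.unsqueeze(1)) ** 2).mean(dim=-1)  # (N, M)
    return per_hyp.min(dim=1).values.mean()
```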

11. Refactoring Policy for Compositional Generalizability using Self-Supervised Object Proposals [PDF] Back to Contents
  Tongzhou Mu, Jiayuan Gu, Zhiwei Jia, Hao Tang, Hao Su
Abstract: We study how to learn a policy with compositional generalizability. We propose a two-stage framework, which refactorizes a high-reward teacher policy into a generalizable student policy with strong inductive bias. Particularly, we implement an object-centric GNN-based student policy, whose input objects are learned from images through self-supervised learning. Empirically, we evaluate our approach on four difficult tasks that require compositional generalizability, and achieve superior performance compared to baselines.

12. Diverse Image Captioning with Context-Object Split Latent Spaces [PDF] Back to Contents
  Shweta Mahajan, Stefan Roth
Abstract: Diverse image captioning models aim to learn one-to-many mappings that are innate to cross-domain datasets, such as images and texts. Current methods for this task are based on generative latent variable models, e.g. VAEs with structured latent spaces. Yet, the amount of multimodality captured by prior work is limited to that of the paired training data -- the true diversity of the underlying generative process is not fully captured. To address this limitation, we leverage the contextual descriptions in the dataset that explain similar contexts in different visual scenes. To this end, we introduce a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts within the dataset. Our framework not only enables diverse captioning through context-based pseudo supervision, but extends this to images with novel objects and without paired captions in the training data. We evaluate our COS-CVAE approach on the standard COCO dataset and on the held-out COCO dataset consisting of images with novel objects, showing significant gains in accuracy and diversity.

13. Learning a Deep Reinforcement Learning Policy Over the Latent Space of a Pre-trained GAN for Semantic Age Manipulation [PDF] Back to Contents
  Kumar Shubham, Gopalakrishnan Venkatesh, Reijul Sachdev, Akshi, Dinesh Babu Jayagopi, G. Srinivasaraghavan
Abstract: Learning a disentangled representation of the latent space has become one of the most fundamental problems studied in computer vision. Recently, many generative adversarial networks (GANs) have shown promising results in generating high fidelity images. However, studies to understand the semantic layout of the latent space of pre-trained models are still limited. Several works train conditional GANs to generate faces with required semantic attributes. Unfortunately, in these attempts the generated output is often not as photo-realistic as that of state-of-the-art models. Besides, they also require large computational resources and specific datasets to generate high fidelity images. In our work, we have formulated a Markov Decision Process (MDP) over the rich latent space of a pre-trained GAN model to learn a conditional policy for semantic manipulation along specific attributes under defined identity bounds. Further, we have defined a semantic age manipulation scheme using a locally linear approximation over the latent space. Results show that our learned policy can sample high fidelity images with the required age variations, while at the same time preserving the identity of the person.

14. Point Transformer [PDF] Back to Contents
  Nico Engel, Vasileios Belagiannis, Klaus Dietmayer
Abstract: In this work, we present Point Transformer, a deep neural network that operates directly on unordered and unstructured point sets. We design Point Transformer to extract local and global features and relate both representations by introducing the local-global attention mechanism, which aims to capture spatial point relations and shape information. For that purpose, we propose SortNet, as part of the Point Transformer, which induces input permutation invariance by selecting points based on a learned score. The output of Point Transformer is a sorted and permutation invariant feature list that can directly be incorporated into common computer vision applications. We evaluate our approach on standard classification and part segmentation benchmarks to demonstrate competitive results compared to the prior work.
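The score-and-select mechanism described for SortNet can be sketched in a few lines (the MLP width and k are assumptions): because the k surviving points are returned in score order, the output is invariant to the input permutation.

```python
import torch
import torch.nn as nn

class ScoreTopK(nn.Module):
    """Score each point with a small MLP, keep the top-k, and return them
    sorted by score so the result is permutation invariant."""
    def __init__(self, in_dim=3, k=64):
        super().__init__()
        self.k = k
        self.score = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, points):                        # points: (N, P, in_dim)
        s = self.score(points).squeeze(-1)            # (N, P) learned scores
        idx = s.topk(self.k, dim=1).indices           # sorted by descending score
        idx = idx.unsqueeze(-1).expand(-1, -1, points.size(-1))
        return points.gather(1, idx)                  # (N, k, in_dim)
```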

15. Boost Image Captioning with Knowledge Reasoning [PDF] Back to Contents
  Feicheng Huang, Zhixin Li, Haiyang Wei, Canlong Zhang, Huifang Ma
Abstract: Automatically generating a human-like description for a given image is a promising research direction in artificial intelligence, which has attracted a great deal of attention recently. Most of the existing attention methods explore the mapping relationships between words in the sentence and regions in the image; such an unpredictable matching manner sometimes causes inharmonious alignments that may reduce the quality of the generated captions. In this paper, we make our efforts to reason about more accurate and meaningful captions. We first propose word attention to improve the correctness of visual attention when generating sequential descriptions word-by-word. The special word attention emphasizes word importance when focusing on different regions of the input image, and makes full use of the internal annotation knowledge to assist the calculation of visual attention. Then, in order to reveal those incomprehensible intentions that cannot be expressed straightforwardly by machines, we introduce a new strategy to inject external knowledge extracted from a knowledge graph into the encoder-decoder framework to facilitate meaningful captioning. Finally, we validate our model on two freely available captioning benchmarks: the Microsoft COCO dataset and the Flickr30k dataset. The results demonstrate that our approach achieves state-of-the-art performance and outperforms many of the existing approaches.

16. MARNet: Multi-Abstraction Refinement Network for 3D Point Cloud Analysis [PDF] Back to Contents
  Rahul Chakwate, Arulkumar Subramaniam, Anurag Mittal
Abstract: Representation learning from 3D point clouds is challenging due to their inherent nature of permutation invariance and irregular distribution in space. Existing deep learning methods follow a hierarchical feature extraction paradigm in which high-level abstract features are derived from low-level features. However, they fail to exploit different granularity of information due to the limited interaction between these features. To this end, we propose Multi-Abstraction Refinement Network (MARNet) that ensures an effective exchange of information between multi-level features to gain local and global contextual cues while effectively preserving them till the final layer. We empirically show the effectiveness of MARNet in terms of state-of-the-art results on two challenging tasks: Shape classification and Coarse-to-fine grained semantic segmentation. MARNet significantly improves the classification performance by 2% over the baseline and outperforms the state-of-the-art methods on semantic segmentation task.

17. Facial UV Map Completion for Pose-invariant Face Recognition: A Novel Adversarial Approach based on Coupled Attention Residual UNets [PDF] Back to Contents
  In Seop Na, Chung Tran, Dung Nguyen, Sang Dinh
Abstract: Pose-invariant face recognition refers to the problem of identifying or verifying a person by analyzing face images captured from different poses. This problem is challenging due to the large variation of pose, illumination and facial expression. A promising approach to deal with pose variation is to complete the incomplete UV maps extracted from in-the-wild faces, then attach the completed UV map to a fitted 3D mesh and finally generate different 2D faces of arbitrary poses. The synthesized faces increase the pose variation for training deep face recognition models and reduce the pose discrepancy during the testing phase. In this paper, we propose a novel generative model called Attention ResCUNet-GAN to improve the UV map completion. We enhance the original UV-GAN by using a couple of U-Nets. Particularly, the skip connections within each U-Net are boosted by attention gates. Meanwhile, the features from the two U-Nets are fused with trainable scalar weights. The experiments on popular benchmarks, including the Multi-PIE, LFW, CPLFW and CFP datasets, show that the proposed method yields superior performance compared to other existing methods.

18. Efficient texture mapping via a non-iterative global texture alignment [PDF] Back to Contents
  Mohammad Rouhani, Matthieu Fradet, Caroline Baillard
Abstract: Texture reconstruction techniques generally suffer from errors in the keyframe poses. We present a non-iterative method for seamless texture reconstruction of a given 3D scene. Our method finds the best texture alignment in a single shot using a global optimisation framework. First, we automatically select the best keyframe to texture each face of the mesh. This leads to a decomposition of the mesh into small groups of connected faces associated with the same keyframe. We call such groups fragments. Then, we propose a geometry-aware matching technique between the 3D keypoints extracted around the fragment borders, where the matching zone is controlled by the margin size. These constraints lead to a least squares (LS) model for finding the optimal alignment. Finally, visual seams are further reduced by applying a fast colour correction. In contrast to pixel-wise methods, we find the optimal alignment by solving a sparse system of linear equations, which is very fast and non-iterative. Experimental results demonstrate low computational complexity and superior performance compared to other alignment methods.
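Under a strong simplifying assumption (one unknown 2D shift per fragment, rather than the paper's full alignment model), the one-shot global solve can be sketched as follows: each cross-border keypoint match contributes two linear equations, and a single sparse least-squares solve aligns all fragments simultaneously.

```python
import numpy as np

def align_fragments(matches, n_frag):
    """matches: list of (i, j, pi, pj) where keypoints pi (fragment i) and
    pj (fragment j) should coincide after shifting: pi + t_i == pj + t_j.
    Returns one 2D shift per fragment (fragment 0 fixed as gauge)."""
    A = np.zeros((2 * len(matches) + 2, 2 * n_frag))
    b = np.zeros(2 * len(matches) + 2)
    for r, (i, j, pi, pj) in enumerate(matches):
        for c in range(2):                 # t_i - t_j = pj - pi, per coordinate
            A[2 * r + c, 2 * i + c] = 1.0
            A[2 * r + c, 2 * j + c] = -1.0
            b[2 * r + c] = pj[c] - pi[c]
    A[-2, 0] = A[-1, 1] = 1.0              # gauge: fragment 0 stays in place
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t.reshape(n_frag, 2)
```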

19. Receptive Field Size Optimization with Continuous Time Pooling [PDF] Back to Contents
  Dóra Babicz, Soma Kontár, Márk Pető, András Fülöp, Gergely Szabó, András Horváth
Abstract: The pooling operation is a cornerstone element of convolutional neural networks. These elements generate receptive fields for neurons, in which local perturbations should have minimal effect on the output activations, increasing the robustness and invariance of the network. In this paper we present an altered version of the most commonly applied method, maximum pooling, in which pooling is substituted by a continuous-time differential equation that generates a location-sensitive pooling operation, more similar to biological receptive fields. We show how this continuous method can be approximated numerically using discrete operations which fit ideally on a GPU. In our approach the kernel size is substituted by a diffusion strength, which is a continuous-valued parameter, so that it can be optimized by gradient descent algorithms. We evaluate the effect of continuous pooling on accuracy and computational need using commonly applied network architectures and datasets.
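The substitution can be approximated as explicit-Euler heat diffusion, with a continuous-valued strength taking the role of the kernel size while remaining differentiable; the 4-neighbour stencil and step count below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def diffusion_pool(x, strength, steps=10):
    """x: (N, C, H, W). Repeated application of a discrete Laplacian;
    a larger `strength` spreads activations further, mimicking a larger
    pooling window. Explicit Euler is stable for strength / steps < 0.25."""
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
    lap = lap.view(1, 1, 3, 3).repeat(x.size(1), 1, 1, 1).to(x)
    dt = strength / steps
    for _ in range(steps):
        x = x + dt * F.conv2d(x, lap, padding=1, groups=x.size(1))
    return x
```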

20. Do 2D GANs Know 3D Shape? Unsupervised 3D shape reconstruction from 2D Image GANs [PDF] Back to Contents
  Xingang Pan, Bo Dai, Ziwei Liu, Chen Change Loy, Ping Luo
Abstract: Natural images are projections of 3D objects on a 2D image plane. While state-of-the-art 2D generative models like GANs show unprecedented quality in modeling the natural image manifold, it is unclear whether they implicitly capture the underlying 3D object structures. And if so, how could we exploit such knowledge to recover the 3D shapes of objects in the images? To answer these questions, in this work, we present the first attempt to directly mine 3D geometric clues from an off-the-shelf 2D GAN that is trained on RGB images only. Through our investigation, we found that such a pre-trained GAN indeed contains rich 3D knowledge and thus can be used to recover 3D shape from a single 2D image in an unsupervised manner. The core of our framework is an iterative strategy that explores and exploits diverse viewpoint and lighting variations in the GAN image manifold. The framework does not require 2D keypoint or 3D annotations, or strong assumptions on object shapes (e.g. shapes are symmetric), yet it successfully recovers 3D shapes with high precision for human faces, cats, cars, and buildings. The recovered 3D shapes immediately allow high-quality image editing like relighting and object rotation. We quantitatively demonstrate the effectiveness of our approach compared to previous methods in both 3D shape reconstruction and face rotation. Our code and models will be released at this https URL.

21. Predicting Brain Degeneration with a Multimodal Siamese Neural Network [PDF] Back to Contents
  Cecilia Ostertag, Marie Beurton-Aimar, Muriel Visani, Thierry Urruty, Karell Bertet
Abstract: To study neurodegenerative diseases, longitudinal studies are carried out on volunteer patients. During a time span of several months to several years, they go through regular medical visits to acquire data from different modalities, such as biological samples, cognitive tests, structural and functional imaging. These variables are heterogeneous but they all depend on the patient's health condition, meaning that there are possibly unknown relationships between all modalities. Some information may be specific to some modalities, others may be complementary, and others may be redundant. Some data may also be missing. In this work we present a neural network architecture for multimodal learning, able to use imaging and clinical data from two time points to predict the evolution of a neurodegenerative disease, and robust to missing values. Our multimodal network achieves 92.5% accuracy and an AUC score of 0.978 over a test set of 57 subjects. We also show the superiority of the multimodal architecture, for up to 37.5% of missing values in test set subjects' clinical measurements, compared to a model using only the clinical modality.

22. PV-NAS: Practical Neural Architecture Search for Video Recognition [PDF] Back to Contents
  Zihao Wang, Chen Lin, Lu Sheng, Junjie Yan, Jing Shao
Abstract: Recently, deep learning has been utilized to solve the video recognition problem due to its prominent representation ability. Deep neural networks for video tasks are highly customized and the design of such networks requires domain experts and costly trial and error tests. Recent advances in network architecture search have boosted image recognition performance by a large margin. However, the automatic design of video recognition networks is less explored. In this study, we propose a practical solution, namely Practical Video Neural Architecture Search (PV-NAS). Our PV-NAS can efficiently search across a tremendously large scale of architectures in a novel spatial-temporal network search space using gradient-based search methods. To avoid getting stuck in sub-optimal solutions, we propose a novel learning rate scheduler to encourage sufficient network diversity of the searched models. Extensive empirical evaluations show that the proposed PV-NAS achieves state-of-the-art performance with much fewer computational resources. 1) Within light-weight models, our PV-NAS-L achieves 78.7% and 62.5% Top-1 accuracy on Kinetics-400 and Something-Something V2, which is better than previous state-of-the-art methods (i.e., TSM) by a large margin (4.6% and 3.4% on each dataset, respectively), and 2) among median-weight models, our PV-NAS-M achieves the best performance (also a new record) on the Something-Something V2 dataset.

23. Data-free Knowledge Distillation for Segmentation using Data-Enriching GAN [PDF] Back to Contents
  Kaushal Bhogale
Abstract: Distilling knowledge from huge pre-trained networks to improve the performance of tiny networks has enabled deep learning models to be used in many real-time and mobile applications. Several approaches that demonstrate success in this field have made use of the true training dataset to extract relevant knowledge. In the absence of the true dataset, however, extracting knowledge from deep networks is still a challenge. Recent works on data-free knowledge distillation demonstrate such techniques on classification tasks. To this end, we explore the task of data-free knowledge distillation for segmentation tasks. First, we identify several challenges specific to segmentation. We make use of the DeGAN training framework to propose a novel loss function for enforcing diversity in a setting where a few classes are underrepresented. Further, we explore a new training framework for performing knowledge distillation in a data-free setting. We get an improvement of 6.93% in Mean IoU over previous approaches.
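The abstract does not spell out the diversity loss, so the sketch below uses one common stand-in: pushing the mean predicted class distribution of a generated batch toward uniform (i.e., maximizing its entropy) so that underrepresented classes keep getting sampled.

```python
import torch

def diversity_loss(logits):
    """logits: (B, K) classifier outputs on a generated batch. Minimizing
    the negative entropy of the mean class distribution drives the batch
    to cover all K classes."""
    p = torch.softmax(logits, dim=1).mean(dim=0)   # mean class distribution
    return (p * torch.log(p + 1e-8)).sum()         # negative entropy
```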

24. CaCL: Class-aware Codebook Learning for Weakly Supervised Segmentation on Diffuse Image Patterns [PDF] Back to Contents
  Ruining Deng, Quan Liu, Shunxing Bao, Aadarsh Jha, Catie Chang, Bryan A. Millis, Matthew J. Tyska, Yuankai Huo
Abstract: Weakly supervised learning has been rapidly advanced in biomedical image analysis to achieve pixel-wise labels (segmentation) from image-wise annotations (classification), as biomedical images naturally contain image-wise labels in many scenarios. The current weakly supervised learning algorithms from the computer vision community are largely designed for focal objects (e.g., dogs and cats). However, such algorithms are not optimized for diffuse patterns in biomedical imaging (e.g., stains and fluorescent in microscopy imaging). In this paper, we propose a novel class-aware codebook learning (CaCL) algorithm to perform weakly supervised learning for diffuse image patterns. Specifically, the CaCL algorithm is deployed to segment protein expressed brush border regions from histological images of human duodenum. This paper makes the following contributions: (1) we approach the weakly supervised segmentation from a novel codebook learning perspective; (2) the CaCL algorithm segments diffuse image patterns rather than focal objects; and (3) The proposed algorithm is implemented in a multi-task framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) to perform image reconstruction, classification, feature embedding, and segmentation. The experimental results show that our method achieved superior performance compared with baseline weakly supervised algorithms.

25. A topological approach to exploring convolutional neural networks [PDF] Back to Contents
  Yang Zhao, Hao Zhang
Abstract: Motivated by the elusive understanding concerning convolutional neural networks (CNNs) in view of topology, we present two theoretical frameworks to interpret two topics by using topological data analysis. The first one reveals the topological essence of CNN filters. Our theory first abstracts a topological representation of how the features locate for a CNN filter, named the feature topology, and characterises it by defining the starting edge density. We reveal a principle of CNN filters: tending to organize the feature topologies for the same category, and thus propose the SED Distribution to statistically describe such an organization. We demonstrate that the effectiveness of CNN filters is reflected in the compactness of the SED Distribution, and introduce the filter entropy to measure it. Remarkably, the variation of filter entropy during training reveals the essence of CNN training: a filter-entropy-decrease process. Also, based on this principle, we give a metric to assess the filter performance. The second one investigates the inter-class distinguishability in a model-agnostic way. For each class, we propose the MBC Distribution, a distribution that could differentiate categories by characterising the intrinsic organization of the given category. As for multiple classes, we introduce the category distance, which metricizes the distance between two categories, and moreover propose the CD Matrix, which comprehensively evaluates not just the distinguishability between each category pair but also the distinguishable degree of each category. Finally, our experiment results confirm our theories.

26. Deep Representation Decomposition for Feature Disentanglement [PDF] Back to Contents
  Shang-Fu Chen, Jia-Wei Yan, Ya-Fan Su, Yu-Chiang Frank Wang
Abstract: Representation disentanglement aims at learning interpretable features, so that the output can be recovered or manipulated accordingly. While works like infoGAN and AC-GAN exist, they choose to derive a disjoint attribute code for feature disentanglement, which is not applicable to existing/trained generative models. In this paper, we propose a decomposition-GAN (dec-GAN), which is able to achieve the decomposition of an existing latent representation into content and attribute features. Guided by a classifier pre-trained on the attributes of interest, our dec-GAN decomposes the attributes of interest from the latent representation, while data recovery and feature consistency objectives enforce the learning of our proposed method. Our experiments on multiple image datasets confirm the effectiveness and robustness of our dec-GAN over recent representation disentanglement models.

27. Actor and Action Modular Network for Text-based Video Segmentation [PDF] Back to Contents
  Jianhua Yang, Yan Huang, Kai Niu, Zhanyu Ma, Liang Wang
Abstract: Actor and action semantic segmentation is a challenging problem that requires joint actor and action understanding, and learns to segment from pre-defined actor and action label pairs. However, existing methods for this task fail to distinguish those actors that have the same super-category and to identify actor-action pairs outside of the fixed actor and action vocabulary. Recent studies have extended this task using textual queries, instead of word-level actor-action pairs, so that the actor and action can be flexibly specified. In this paper, we focus on the text-based actor and action segmentation problem, which performs fine-grained actor and action understanding in the video. Previous works predicted segmentation masks from the merged heterogeneous features of a given video and textual query, but ignored the linguistic variation of the textual query and the visual semantic discrepancy of the video, which led to asymmetric matching between the convolved volumes of the video and the global query representation. To alleviate the aforementioned problem, we propose a novel actor and action modular network that individually localizes the actor and action in two separate modules. We first learn the actor-/action-related content for the video and textual query, and then match them in a symmetrical manner to localize the target region. The target region includes the desired actor and action, and is then fed into a fully convolutional network to predict the segmentation mask. The whole model enables joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance on the A2D Sentences and J-HMDB Sentences datasets.

28. Context-based Image Segment Labeling (CBISL) [PDF] Back to Contents
  Tobias Schlagenhauf, Yefeng Xia, Jürgen Fleischer
Abstract: Working with images, one often faces problems with incomplete or unclear information. Image inpainting can be used to restore missing image regions, but it focuses on low-level image features such as pixel intensity, pixel gradient orientation, and color. This paper aims to recover semantic image features (objects and positions) in images. Based on published gated PixelCNNs, we demonstrate a new approach referred to as quadro-directional PixelCNN to recover missing objects and return probable positions for objects based on the context. We call this approach context-based image segment labeling (CBISL). The results suggest that our four-directional model outperforms one-directional models (gated PixelCNN) and achieves human-comparable performance.

29. Set Augmented Triplet Loss for Video Person Re-Identification [PDF] 返回目录
  Pengfei Fang, Pan Ji, Lars Petersson, Mehrtash Harandi
Abstract: Modern video person re-identification (re-ID) machines are often trained using a metric learning approach, supervised by a triplet loss. The triplet loss used in video re-ID is usually based on so-called clip features, each aggregated from a few frame features. In this paper, we propose to model the video clip as a set and instead study the distance between sets in the corresponding triplet loss. In contrast to the distance between clip representations, the distance between clip sets considers the pair-wise similarity of each element (i.e., frame representation) between two sets. This allows the network to directly optimize the feature representation at a frame level. Apart from the commonly-used set distance metrics (e.g., ordinary distance and Hausdorff distance), we further propose a hybrid distance metric, tailored for the set-aware triplet loss. Also, we propose a hard positive set construction strategy using the learned class prototypes in a batch. Our proposed method achieves state-of-the-art results across several standard benchmarks, demonstrating the advantages of the proposed method.
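To make the set-distance idea concrete, below is a minimal NumPy sketch of a triplet loss over clip sets using the Hausdorff distance named in the abstract; the function names, margin value, and toy sizes are illustrative assumptions, and the paper's hybrid distance and hard positive set construction are not reproduced here.

```python
import numpy as np

def pairwise_dists(A, B):
    # Euclidean distance between every frame feature in set A (m, d) and set B (n, d)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def hausdorff(A, B):
    # symmetric Hausdorff distance between two clip sets
    D = pairwise_dists(A, B)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def set_triplet_loss(anchor, positive, negative, margin=0.3, set_dist=hausdorff):
    # triplet loss where each clip is a set of frame features rather than a pooled vector
    return max(0.0, set_dist(anchor, positive) - set_dist(anchor, negative) + margin)

# toy usage: clips of 8 frames with 128-d frame features
rng = np.random.default_rng(0)
a, p, n = (rng.normal(size=(8, 128)) for _ in range(3))
print(set_triplet_loss(a, p, n))
```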

30. Deep Feature Augmentation for Occluded Image Classification [PDF] 返回目录
  Feng Cen, Xiaoyu Zhao, Wuzhuang Li, Guanghui Wang
Abstract: Due to the difficulty in acquiring massive task-specific occluded images, the classification of occluded images with deep convolutional neural networks (CNNs) remains highly challenging. To alleviate the dependency on large-scale occluded image datasets, we propose a novel approach to improve the classification accuracy of occluded images by fine-tuning the pre-trained models with a set of augmented deep feature vectors (DFVs). The set of augmented DFVs is composed of original DFVs and pseudo-DFVs. The pseudo-DFVs are generated by randomly adding difference vectors (DVs), extracted from a small set of clean and occluded image pairs, to the real DFVs. In the fine-tuning, the back-propagation is conducted on the DFV data flow to update the network parameters. The experiments on various datasets and network structures show that the deep feature augmentation significantly improves the classification accuracy of occluded images without a noticeable influence on the performance of clean images. Specifically, on the ILSVRC2012 dataset with synthetic occluded images, the proposed approach achieves 11.21% and 9.14% average increases in classification accuracy for the ResNet50 networks fine-tuned on the occlusion-exclusive and occlusion-inclusive training sets, respectively.
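The augmentation step described above is simple enough to sketch. Assuming DFVs are plain feature vectors, a hypothetical helper might look like the following (array names and the uniform sampling scheme are illustrative, not the authors' code):

```python
import numpy as np

def augment_dfvs(real_dfvs, clean_dfvs, occluded_dfvs, n_pseudo, seed=0):
    """Create pseudo-DFVs by adding randomly chosen difference vectors
    (occluded minus clean, from a small set of image pairs) to random real DFVs."""
    rng = np.random.default_rng(seed)
    dvs = occluded_dfvs - clean_dfvs                      # (k, d) difference vectors
    pseudo = (real_dfvs[rng.integers(0, len(real_dfvs), n_pseudo)]
              + dvs[rng.integers(0, len(dvs), n_pseudo)])
    return np.concatenate([real_dfvs, pseudo], axis=0)    # augmented fine-tuning set
```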

31. Mutual Information-based Disentangled Neural Networks for Classifying Unseen Categories in Different Domains: Application to Fetal Ultrasound Imaging [PDF] 返回目录
  Qingjie Meng, Jacqueline Matthew, Veronika A. Zimmer, Alberto Gomez, David F.A. Lloyd, Daniel Rueckert, Bernhard Kainz
Abstract: Deep neural networks exhibit limited generalizability across images with different entangled domain features and categorical features. Learning generalizable features that can form universal categorical decision boundaries across domains is an interesting and difficult challenge. This problem occurs frequently in medical imaging applications when attempts are made to deploy and improve deep learning models across different image acquisition devices, across acquisition parameters or if some classes are unavailable in new training databases. To address this problem, we propose Mutual Information-based Disentangled Neural Networks (MIDNet), which extract generalizable categorical features to transfer knowledge to unseen categories in a target domain. The proposed MIDNet adopts a semi-supervised learning paradigm to alleviate the dependency on labeled data. This is important for real-world applications where data annotation is time-consuming, costly and requires training and expertise. We extensively evaluate the proposed method on fetal ultrasound datasets for two different image classification tasks where domain features are respectively defined by shadow artifacts and image acquisition devices. Experimental results show that the proposed method outperforms the state-of-the-art on the classification of unseen categories in a target domain with sparsely labeled training data.

32. CNN-Driven Quasiconformal Model for Large Deformation Image Registration [PDF] 返回目录
  Ka Chun Lam, Ho Law, Lok Ming Lui
Abstract: We present a novel way to perform image registration between image pairs with very large deformation, not limited to a specific kind, while preserving the quasiconformal property without the tedious manual landmark labeling that conventional mathematical registration methods require. Alongside the practical function of our algorithm, one just-as-important underlying message is that the integration of a typical CNN with an existing mathematical model is successful, as our paper will point out, meaning that machine learning and mathematical models can coexist, cover for each other, and significantly improve registration results. This paper demonstrates the unprecedented idea of making use of both the robustness of CNNs and the rigorousness of mathematical models to obtain meaningful registration maps between 2D images under the aforementioned strict constraints for the sake of well-posedness.

33. Road Damage Detection using Deep Ensemble Learning [PDF] 返回目录
  Keval Doshi, Yasin Yilmaz
Abstract: Road damage detection is critical for the maintenance of a road, which traditionally has been performed using expensive high-performance sensors. With the recent advances in technology, especially in computer vision, it is now possible to detect and categorize different types of road damages, which can facilitate efficient maintenance and resource management. In this work, we present an ensemble model for efficient detection and classification of road damages, which we have submitted to the IEEE BigData Cup Challenge 2020. Our solution utilizes a state-of-the-art object detector known as You Only Look Once (YOLO-v4), which is trained on images of various types of road damages from Czech, Japan and India. Our ensemble approach was extensively tested with several different model versions and it was able to achieve an F1 score of 0.628 on the test 1 dataset and 0.6358 on the test 2 dataset.

34. Multi-Modal Active Learning for Automatic Liver Fibrosis Diagnosis based on Ultrasound Shear Wave Elastography [PDF] 返回目录
  Lufei Gao, Ruisong Zhou, Changfeng Dong, Cheng Feng, Zhen Li, Xiang Wan, Li Liu
Abstract: With the development of radiomics, noninvasive diagnosis such as ultrasound (US) imaging plays a very important role in automatic liver fibrosis diagnosis (ALFD). Due to noisy data and the expensive annotation of US images, the application of Artificial Intelligence (AI) assisted approaches encounters a bottleneck. Besides, the use of mono-modal US data limits further improvement of the classification results. In this work, we innovatively propose a multi-modal fusion network with active learning (MMFN-AL) for ALFD to exploit the information of multiple modalities, eliminate noisy data and reduce the annotation cost. Four image modalities, including US and three types of shear wave elastography (SWEs), are exploited. A new dataset containing these modalities from 214 candidates is well-collected and pre-processed, with labels obtained from liver biopsy results. Experimental results show that our proposed method outperforms the state-of-the-art performance using less than 30% of the data, and by using only around 80% of the data, the proposed fusion network achieves a high AUC of 89.27% and accuracy of 70.59%.

35. Highway Driving Dataset for Semantic Video Segmentation [PDF] 返回目录
  Byungju Kim, Junho Yim, Junmo Kim
Abstract: Scene understanding is an essential technique in semantic segmentation. Although there exist several datasets that can be used for semantic segmentation, they are mainly focused on semantic image segmentation with large deep neural networks. Therefore, these networks are not useful for real time applications, especially in autonomous driving systems. In order to solve this problem, we make two contributions to semantic segmentation task. The first contribution is that we introduce the semantic video dataset, the Highway Driving dataset, which is a densely annotated benchmark for a semantic video segmentation task. The Highway Driving dataset consists of 20 video sequences having a 30Hz frame rate, and every frame is densely annotated. Secondly, we propose a baseline algorithm that utilizes a temporal correlation. Together with our attempt to analyze the temporal correlation, we expect the Highway Driving dataset to encourage research on semantic video segmentation.

36. Multi-View Adaptive Fusion Network for 3D Object Detection [PDF] 返回目录
  Guojun Wang, Bin Tian, Yachen Zhang, Long Chen, Dongpu Cao, Jian Wu
Abstract: 3D object detection based on LiDAR-camera fusion is becoming an emerging research theme for autonomous driving. However, it has been surprisingly difficult to effectively fuse both modalities without information loss and interference. To solve this issue, we propose a single-stage multi-view fusion framework that takes LiDAR Birds-Eye View, LiDAR Range View and Camera View images as inputs for 3D object detection. To effectively fuse multi-view features, we propose an Attentive Pointwise Fusion (APF) module to estimate the importance of the three sources with attention mechanisms which can achieve adaptive fusion of multi-view features in a pointwise manner. Besides, an Attentive Pointwise Weighting (APW) module is designed to help the network learn structure information and point feature importance with two extra tasks: foreground classification and center regression, and the predicted foreground probability will be used to reweight the point features. We design an end-to-end learnable network named MVAF-Net to integrate these two components. Our evaluations conducted on the KITTI 3D object detection datasets demonstrate that the proposed APF and APW module offer significant performance gain and that the proposed MVAF-Net achieves state-of-the-art performance in the KITTI benchmark.
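As a rough sketch of what an attentive pointwise fusion could compute (the shapes, the single linear projection, and the softmax choice are assumptions of this illustration; the actual APF module is more elaborate):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pointwise_fusion(f_bev, f_rv, f_cam, W_attn):
    # f_*: (N, d) per-point features from bird's-eye, range, and camera views;
    # W_attn: (3*d, 3) learned projection scoring the importance of each source
    stacked = np.stack([f_bev, f_rv, f_cam], axis=1)                # (N, 3, d)
    scores = np.concatenate([f_bev, f_rv, f_cam], axis=1) @ W_attn  # (N, 3)
    attn = softmax(scores, axis=1)                                  # per-point source weights
    return (attn[..., None] * stacked).sum(axis=1)                  # (N, d) fused features
```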

37. Unsupervised Metric Relocalization Using Transform Consistency Loss [PDF] 返回目录
  Mike Kasper, Fernando Nobre, Christoffer Heckman, Nima Keivan
Abstract: Training networks to perform metric relocalization traditionally requires accurate image correspondences. In practice, these are obtained by restricting domain coverage, employing additional sensors, or capturing large multi-view datasets. We instead propose a self-supervised solution, which exploits a key insight: localizing a query image within a map should yield the same absolute pose, regardless of the reference image used for registration. Guided by this intuition, we derive a novel transform consistency loss. Using this loss function, we train a deep neural network to infer dense feature and saliency maps to perform robust metric relocalization in dynamic environments. We evaluate our framework on synthetic and real-world data, showing our approach outperforms other supervised methods when a limited amount of ground-truth information is available.
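The key insight lends itself to a small sketch: registering the query against two different reference images should yield the same absolute pose, so any discrepancy is a training signal. Below, 4x4 homogeneous matrices and the T_a_b naming convention (mapping frame b into frame a) are assumptions of this illustration, not the paper's notation:

```python
import numpy as np

def pose_discrepancy(T_a, T_b):
    # rotation angle plus translation norm of the relative transform between two poses
    delta = np.linalg.inv(T_a) @ T_b
    cos = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos) + np.linalg.norm(delta[:3, 3])

def transform_consistency_loss(T_world_ref1, T_ref1_query, T_world_ref2, T_ref2_query):
    # the query's absolute pose should not depend on which reference image was used
    T1 = T_world_ref1 @ T_ref1_query   # absolute pose via reference image 1
    T2 = T_world_ref2 @ T_ref2_query   # absolute pose via reference image 2
    return pose_discrepancy(T1, T2)
```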

38. COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning [PDF] 返回目录
  Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox
Abstract: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip), a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g. clip-video, sentence-paragraph), and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at this https URL

39. FusiformNet: Extracting Discriminative Facial Features on Different Levels [PDF] 返回目录
  Kyo Takano
Abstract: Over the last several years, research on facial recognition based on Deep Neural Networks has evolved with approaches like task-specific loss functions, image normalization and augmentation, network architectures, etc. However, there have been few approaches that pay attention to how human faces differ from person to person. Premising that inter-personal differences are found both generally and locally on the human face, I propose FusiformNet, a novel framework for feature extraction that leverages the nature of person-identifying facial features. Tested on the Image-Unrestricted setting of the Labeled Faces in the Wild benchmark, this method achieved a state-of-the-art accuracy of 96.67% without labeled outside data, image augmentation, normalization, or special loss functions. Likewise, the method also performed on par with previous state-of-the-art methods when pre-trained on the CASIA-WebFace dataset. Considering its ability to extract both general and local facial features, the utility of FusiformNet may not be limited to facial recognition but may also extend to other DNN-based tasks.

40. Human Leg Motion Tracking by Fusing IMUs and RGB Camera Data Using Extended Kalman Filter [PDF] 返回目录
  Omid Taheri, Hassan Salarieh, Aria Alasti
Abstract: Human motion capture is frequently used to study rehabilitation and clinical problems, as well as to provide realistic animation for the entertainment industry. IMU-based systems, as well as marker-based motion tracking systems, are the most popular methods to track movement due to their low implementation cost and light weight. This paper proposes a quaternion-based Extended Kalman filter approach to recover the motion of human leg segments from a set of IMU sensor data fused with camera-marker system data. In this paper, an Extended Kalman Filter approach is developed to fuse the data of two IMUs and one RGB camera for human leg motion tracking. Based on the complementary properties of the inertial sensors and the camera-marker system, in the introduced new measurement model, the orientation data of the upper leg and the lower leg are updated through three measurement equations. The positioning of the human body is made possible by the position of the pelvis joint tracked by the camera-marker system. A mathematical model is utilized to estimate the joints' depth in 2D images. The efficiency of the proposed algorithm is evaluated by an optical motion tracker system.

41. DeepOpht: Medical Report Generation for Retinal Images via Deep Models and Visual Explanation [PDF] 返回目录
  Jia-Hong Huang, Chao-Han Huck Yang, Fangyu Liu, Meng Tian, Yi-Chieh Liu, Ting-Wei Wu, I-Hung Lin, Kang Wang, Hiromasa Morikawa, Hernghua Chang, Jesper Tegner, Marcel Worring
Abstract: In this work, we propose an AI-based method that intends to improve the conventional retinal disease treatment procedure and help ophthalmologists increase diagnosis efficiency and accuracy. The proposed method is composed of a deep neural network-based (DNN-based) module, including a retinal disease identifier and clinical description generator, and a DNN visual explanation module. To train and validate the effectiveness of our DNN-based module, we propose a large-scale retinal disease image dataset. Also, as ground truth, we provide a retinal image dataset manually labeled by ophthalmologists to qualitatively show that the proposed AI-based method is effective. With our experimental results, we show that the proposed method is quantitatively and qualitatively effective. Our method is capable of creating meaningful retinal image descriptions and visual explanations that are clinically relevant.

42. LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud-based Deep Networks [PDF] 返回目录
  Hang Zhou, Dongdong Chen, Jing Liao, Weiming Zhang, Kejiang Chen, Xiaoyi Dong, Kunlin Liu, Gang Hua, Nenghai Yu
Abstract: Deep neural networks have made tremendous progress in 3D point-cloud recognition. Recent works have shown that these 3D recognition networks are also vulnerable to adversarial samples produced by various attack methods, including the optimization-based 3D Carlini-Wagner attack, the gradient-based iterative fast gradient method, and skeleton-detach based point-dropping. However, upon careful analysis, these methods are either extremely slow because of the optimization/iterative scheme, or not flexible enough to support targeted attacks on a specific category. To overcome these shortcomings, this paper proposes a novel label-guided adversarial network (LG-GAN) for real-time flexible targeted point cloud attack. To the best of our knowledge, this is the first generation-based 3D point cloud attack method. By feeding the original point clouds and target attack label into LG-GAN, it can learn how to deform the point clouds to mislead the recognition network into the specific label with only a single forward pass. In detail, LG-GAN first leverages one multi-branch adversarial network to extract hierarchical features of the input point clouds, then incorporates the specified label information into multiple intermediate features using the label encoder. Finally, the encoded features are fed into the coordinate reconstruction decoder to generate the target adversarial sample. By evaluating different point-cloud recognition models (e.g., PointNet, PointNet++ and DGCNN), we demonstrate that the proposed LG-GAN can support flexible targeted attack on the fly while guaranteeing good attack performance and higher efficiency simultaneously.

43. Memory Group Sampling Based Online Action Recognition Using Kinetic Skeleton Features [PDF] 返回目录
  Guoliang Liu, Qinghui Zhang, Yichao Cao, Junwei Li, Hao Wu, Guohui Tian
Abstract: Online action recognition is an important task for human-centered intelligent services, which is still difficult to achieve due to the variety and uncertainty of the spatial and temporal scales of human actions. In this paper, we propose two core ideas to handle the online action recognition problem. First, we combine spatial and temporal skeleton features to depict the actions, which include not only the geometrical features but also multi-scale motion features, so that both the spatial and temporal information of the action is covered. Second, we propose a memory group sampling method to combine previous action frames and current action frames, which is based on the observation that neighbouring frames are largely redundant, and the sampling mechanism ensures that long-term contextual information is also considered. Finally, an improved 1D CNN network is employed for training and testing using the features from sampled frames. Comparisons with state-of-the-art methods on public datasets show that the proposed method is fast, efficient, and has competitive performance.
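A minimal sketch of the sampling idea, assuming frames are stored in a list: the memory is split into temporal groups, a few frames are drawn from each (exploiting the redundancy of neighbouring frames), and the current frames are appended. The group count and per-group budget below are illustrative assumptions:

```python
import numpy as np

def memory_group_sampling(memory, current, n_groups=4, per_group=2, seed=0):
    # sample a few frames per temporal group so long-term context fits a fixed budget
    rng = np.random.default_rng(seed)
    sampled = []
    for group in np.array_split(np.arange(len(memory)), n_groups):
        pick = rng.choice(group, size=min(per_group, len(group)), replace=False)
        sampled.extend(memory[i] for i in sorted(pick))
    return sampled + list(current)   # fixed-size input for the 1D CNN
```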

44. Adversarial Self-Supervised Scene Flow Estimation [PDF] 返回目录
  Victor Zuanazzi, Joris van Vugt, Olaf Booij, Pascal Mettes
Abstract: This work proposes a metric learning approach for self-supervised scene flow estimation. Scene flow estimation is the task of estimating 3D flow vectors for consecutive 3D point clouds. Such flow vectors are fruitful, e.g., for recognizing actions or avoiding collisions. Training a neural network via supervised learning for scene flow is impractical, as this requires manual annotations for each 3D point at each new timestamp for each scene. To that end, we seek a self-supervised approach, where a network learns a latent metric to distinguish between points translated by flow estimations and the target point cloud. Our adversarial metric learning includes a multi-scale triplet loss on sequences of two point clouds as well as a cycle consistency loss. Furthermore, we outline a benchmark for self-supervised scene flow estimation: the Scene Flow Sandbox. The benchmark consists of five datasets designed to study individual aspects of flow estimation in progressive order of complexity, from a moving object to real-world scenes. Experimental evaluation on the benchmark shows that our approach obtains state-of-the-art self-supervised scene flow results, outperforming recent neighbor-based approaches. We use our proposed benchmark to expose shortcomings and draw insights on various training setups. We find that our setup captures motion coherence and preserves local geometries. Dealing with occlusions, on the other hand, is still an open challenge.

45. Autonomous Extraction of Gleason Patterns for Grading Prostate Cancer using Multi-Gigapixel Whole Slide Images [PDF] 返回目录
  Taimur Hassan, Ayman El-Baz, Naoufel Werghi
Abstract: Prostate cancer (PCa) is the second deadliest form of cancer in males. The severity of PCa can be clinically graded through the Gleason scores obtained by examining the structural representation of Gleason cellular patterns. This paper presents an asymmetric encoder-decoder model that integrates a novel hierarchical decomposition block to exploit the feature representations pooled across various scales and then fuses them together to generate the Gleason cellular patterns using the whole slide images. Furthermore, the proposed network is penalized through a novel three-tiered hybrid loss function which ensures that the proposed model accurately recognizes the cluttered regions of the cancerous tissues despite having similar contextual and textural characteristics. We have rigorously tested the proposed network on 10,516 whole slide scans (containing around 71.7M patches), where the proposed model achieved 3.59% improvement over state-of-the-art scene parsing, encoder-decoder, and fully convolutional networks in terms of intersection-over-union.

46. HM4: Hidden Markov Model with Memory Management for Visual Place Recognition [PDF] 返回目录
  Anh-Dzung Doan, Yasir Latif, Tat-Jun Chin, Ian Reid
Abstract: Visual place recognition needs to be robust against appearance variability due to natural and man-made causes. Training data collection should thus be an ongoing process to allow continuous appearance changes to be recorded. However, this creates an unboundedly-growing database that poses time and memory scalability challenges for place recognition methods. To tackle the scalability issue for visual place recognition in autonomous driving, we develop a Hidden Markov Model approach with a two-tiered memory management. Our algorithm, dubbed HM^4, exploits temporal look-ahead to transfer promising candidate images between passive storage and active memory when needed. The inference process takes into account both promising images and a coarse representation of the full database. We show that this allows constant time and space inference for a fixed coverage area. The coarse representations can also be updated incrementally to absorb new data. To further reduce the memory requirements, we derive a compact image representation inspired by Locality Sensitive Hashing (LSH). Through experiments on real world data, we demonstrate the excellent scalability and accuracy of the approach under appearance changes and provide comparisons against state-of-the-art techniques.

47. A Parallel Approach for Real-Time Face Recognition from a Large Database [PDF] 返回目录
  Ashish Ranjan, Varun Nagesh Jolly Behera, Motahar Reza
Abstract: We present a new facial recognition system capable of identifying a person in real time, provided their likeness has been previously stored in the system. The system is based on storing and comparing facial embeddings of the subject, and identifying them later within a live video feed. This system is highly accurate, and is able to tag people with their ID in real time. It is able to do so even when using a database containing thousands of facial embeddings, by using a parallelized searching technique. This makes the system quite fast and allows it to be highly scalable.
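A parallelized embedding search of this kind can be sketched with chunked nearest-neighbour queries. The thread pool, the chunking strategy, and the plain L2 distance are assumptions of this sketch (the abstract does not spell out the scheme), and the database is assumed to hold more embeddings than there are workers:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def _nearest_in_chunk(query, chunk, offset):
    d = np.linalg.norm(chunk - query, axis=1)   # distances to all embeddings in the chunk
    i = int(d.argmin())
    return float(d[i]), int(offset) + i

def parallel_face_search(query, db_embeddings, n_workers=4):
    # split the database into chunks searched concurrently;
    # returns (distance, index) of the closest stored face embedding
    chunks = np.array_split(db_embeddings, n_workers)
    offsets = np.cumsum([0] + [len(c) for c in chunks[:-1]])
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        results = ex.map(_nearest_in_chunk, [query] * n_workers, chunks, offsets)
    return min(results)
```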

48. Efficient Pipelines for Vision-Based Context Sensing [PDF] 返回目录
  Xiaochen Liu
Abstract: Context awareness is an essential part of mobile and ubiquitous computing. Its goal is to unveil situational information about mobile users, such as locations and activities. The sensed context can enable many services like navigation, AR, and smart shopping. Such context can be sensed in different ways, including visual sensors. There is an emergence of vision sources deployed worldwide. The cameras could be installed on roadsides, in houses, and on mobile platforms. This trend provides a huge amount of vision data that could be used for context sensing. However, vision data collection and analytics are still highly manual today. It is hard to deploy cameras at large scale for data collection. Organizing and labeling context from the data is also labor intensive. In recent years, advanced vision algorithms and deep neural networks have been used to help analyze vision data. But this approach is limited by data quality, labeling effort, and dependency on hardware resources. In summary, there are three major challenges for today's vision-based context sensing systems: collecting and labeling data at large scale, processing large data volumes efficiently with limited hardware resources, and extracting accurate context from vision data. This thesis explores a design space that consists of three dimensions: sensing task, sensor types, and task locations. Our prior work explores several points in this design space. We make contributions by (1) developing efficient and scalable solutions for different points in the design space of vision-based sensing tasks; (2) achieving state-of-the-art accuracy in those applications; and (3) developing guidelines for designing such sensing systems.

49. Dark Reciprocal-Rank: Boosting Graph-Convolutional Self-Localization Network via Teacher-to-student Knowledge Transfer [PDF] 返回目录
  Koji Takeda, Kanji Tanaka
Abstract: In visual robot self-localization, graph-based scene representation and matching have recently attracted research interest as robust and discriminative methods for self-localization. Although effective, their computational and storage costs do not scale well to large-size environments. To alleviate this problem, we formulate self-localization as a graph classification problem and attempt to use the graph convolutional neural network (GCN) as a graph classification engine. A straightforward approach is to use visual feature descriptors that are employed by state-of-the-art self-localization systems directly as graph node features. However, their superior performance in the original self-localization system may not necessarily be replicated in GCN-based self-localization. To address this issue, we introduce a novel teacher-to-student knowledge-transfer scheme based on rank matching, in which the reciprocal-rank vector output by an off-the-shelf state-of-the-art teacher self-localization model is used as the dark knowledge to transfer. Experiments indicate that the proposed graph-convolutional self-localization network can significantly outperform state-of-the-art self-localization systems, as well as the teacher classifier.
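The "dark knowledge" here is a reciprocal-rank encoding of the teacher's retrieval scores; a small sketch (the final normalization is an assumption of this illustration):

```python
import numpy as np

def reciprocal_rank_vector(teacher_scores):
    # top-ranked place gets weight 1, second 1/2, third 1/3, ...
    order = np.argsort(-np.asarray(teacher_scores, dtype=float))
    rr = np.empty(len(order))
    rr[order] = 1.0 / np.arange(1, len(order) + 1)
    return rr / rr.sum()   # normalized soft target for the GCN student

print(reciprocal_rank_vector([0.2, 0.9, 0.1, 0.5]))  # highest score -> largest weight
```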

50. Temporally-Continuous Probabilistic Prediction using Polynomial Trajectory Parameterization [PDF] 返回目录
  Zhaoen Su, Chao Wang, Henggang Cui, Nemanja Djuric, Carlos Vallespi-Gonzalez, David Bradley
Abstract: A commonly-used representation for motion prediction of actors is a sequence of waypoints (comprising positions and orientations) for each actor at discrete future time-points. While this approach is simple and flexible, it can exhibit unrealistic higher-order derivatives (such as acceleration) and approximation errors at intermediate time steps. To address this issue we propose a simple and general representation for temporally continuous probabilistic trajectory prediction that is based on polynomial trajectory parameterization. We evaluate the proposed representation on supervised trajectory prediction tasks using two large self-driving data sets. The results show realistic higher-order derivatives and better accuracy at interpolated time-points, as well as the benefits of the inferred noise distributions over the trajectories. Extensive experimental studies based on existing state-of-the-art models demonstrate the effectiveness of the proposed approach relative to other representations in predicting the future motions of vehicle, bicyclist, and pedestrian traffic actors.
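The core representational idea is easy to demonstrate: fit one polynomial per coordinate to discrete waypoints, then read off temporally continuous position, velocity, and acceleration at any query time. The degree and the least-squares fit below are illustrative stand-ins for the network's predicted coefficients:

```python
import numpy as np
from numpy.polynomial import Polynomial

def fit_trajectory(times, waypoints, degree=4):
    # one polynomial per coordinate -> a trajectory continuous in time
    return [Polynomial.fit(times, waypoints[:, k], degree) for k in range(waypoints.shape[1])]

def query(polys, t):
    # position, velocity, acceleration at an arbitrary (possibly off-grid) time t
    pos = np.array([p(t) for p in polys])
    vel = np.array([p.deriv(1)(t) for p in polys])
    acc = np.array([p.deriv(2)(t) for p in polys])
    return pos, vel, acc

ts = np.arange(5) * 0.5                        # waypoints every 0.5 s
wps = np.stack([ts ** 2, np.sin(ts)], axis=1)  # toy 2D actor positions
print(query(fit_trajectory(ts, wps, degree=3), t=0.75))
```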

51. IndRNN Based Long-term Temporal Recognition in the Spatial and Frequency Domain [PDF] 返回目录
  Beidi Zhao, Shuai Li, Yanbo Gao, Chuankun Li, Wanqing Li, Jianjun Lei
Abstract: Smartphone-sensor-based human activity recognition is attracting increasing interest nowadays with the popularization of smartphones. With the high sampling rates of smartphone sensors, it is a highly long-range temporal recognition problem, especially with the large intra-class distances, such as smartphones carried at different locations (e.g., in a bag or on the body), and the small inter-class distances, such as taking a train or a subway. To address this problem, we propose a new approach, an Independently Recurrent Neural Network (IndRNN) based long-term temporal activity recognition method with spatial and frequency domain features. Considering the periodic characteristics of the sensor data, short-term temporal features are first extracted in the spatial and frequency domains. Then the IndRNN, which is able to capture long-term patterns, is used to further obtain the long-term features for classification. In view of the large differences when the smartphone is carried at different locations, a group-based location recognition is first developed to pinpoint the location of the smartphone. The Sussex-Huawei Locomotion (SHL) dataset from the SHL Challenge is used for evaluation. An earlier version of the proposed method won the second place award in the SHL Challenge 2020 (first place if multiple-model fusion approaches are not considered). The proposed method is further improved in this paper and achieves 80.72% accuracy, better than the existing methods using a single model.
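For reference, the IndRNN recurrence that makes very long sequences tractable is h_t = relu(W x_t + u * h_{t-1} + b), with an independent scalar recurrent weight per neuron. A minimal forward pass follows; the dimensions and initialization are toy assumptions:

```python
import numpy as np

def indrnn_forward(X, W, u, b):
    # X: (T, d_in) sensor sequence; u: (d_h,) per-neuron recurrent weights
    h = np.zeros_like(u)
    for x in X:
        h = np.maximum(0.0, W @ x + u * h + b)   # neurons recur independently
    return h                                      # long-term feature for classification

rng = np.random.default_rng(0)
h = indrnn_forward(rng.normal(size=(500, 6)),       # 500 steps of 6-channel IMU data
                   rng.normal(size=(32, 6)) * 0.1,  # input weights
                   rng.uniform(0.5, 1.0, 32),       # bounded recurrent weights
                   np.zeros(32))
```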

52. Real-Time Text Detection and Recognition [PDF] 返回目录
  Shuonan Pei, Mingzhi Zhu
Abstract: In recent years, the Convolutional Neural Network (CNN) has been quite a popular topic, as it is a powerful and intelligent technique that can be applied in various fields. YOLO is a technique that uses such algorithms for real-time text detection tasks. However, issues like photometric distortion and geometric distortion could affect the accuracy of the YOLO system and cause system failure. Therefore, there are improvements that can make the system work better. In this paper, we are going to present our solution: a potential solution for a fast and accurate real-time text detection and recognition system. The paper covers the topic of real-time text detection and recognition in three major areas: 1. video and image preprocessing, 2. text detection, 3. text recognition. As a mature technique, there are many existing methods that can potentially improve the solution. We will go through some of those existing methods in the literature review section. In this way, we are presenting an industrial-strength, high-accuracy, real-time text detection and recognition tool.

53. A Survey on Contrastive Self-supervised Learning [PDF] 返回目录
  Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, Fillia Makedon
Abstract: Self-supervised learning has gained popularity because of its ability to avoid the cost of annotating large-scale datasets. It is capable of adopting self-defined pseudo labels as supervision and use the learned representations for several downstream tasks. Specifically, contrastive learning has recently become a dominant component in self-supervised learning methods for computer vision, natural language processing (NLP), and other domains. It aims at embedding augmented versions of the same sample close to each other while trying to push away embeddings from different samples. This paper provides an extensive review of self-supervised methods that follow the contrastive approach. The work explains commonly used pretext tasks in a contrastive learning setup, followed by different architectures that have been proposed so far. Next, we have a performance comparison of different methods for multiple downstream tasks such as image classification, object detection, and action recognition. Finally, we conclude with the limitations of the current methods and the need for further techniques and future directions to make substantial progress.
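The dominant loss in this family is InfoNCE; a minimal NumPy version for a single anchor (the temperature value and cosine similarity are the usual, but here assumed, choices):

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    # pull the augmented positive toward the anchor, push other samples away
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()                        # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```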

54. TartanVO: A Generalizable Learning-based VO [PDF] 返回目录
  Wenshan Wang, Yaoyu Hu, Sebastian Scherer
Abstract: We present the first learning-based visual odometry (VO) model, which generalizes to multiple datasets and real-world scenarios and outperforms geometry-based methods in challenging scenes. We achieve this by leveraging the SLAM dataset TartanAir, which provides a large amount of diverse synthetic data in challenging environments. Furthermore, to make our VO model generalize across datasets, we propose an up-to-scale loss function and incorporate the camera intrinsic parameters into the model. Experiments show that a single model, TartanVO, trained only on synthetic data, without any finetuning, can be generalized to real-world datasets such as KITTI and EuRoC, demonstrating significant advantages over the geometry-based methods on challenging trajectories. Our code is available at this https URL.
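One plausible reading of the up-to-scale loss is to compare translation directions only, since monocular VO cannot recover absolute scale; a hedged sketch (the paper's exact formulation may differ):

```python
import numpy as np

def up_to_scale_loss(t_pred, t_true, eps=1e-8):
    # normalize both translations so only the direction is penalized
    d_pred = t_pred / max(np.linalg.norm(t_pred), eps)
    d_true = t_true / max(np.linalg.norm(t_true), eps)
    return np.linalg.norm(d_pred - d_true)
```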

55. Unsupervised Deep Persistent Monocular Visual Odometry and Depth Estimation in Extreme Environments [PDF] 返回目录
  Yasin Almalioglu, Angel Santamaria-Navarro, Benjamin Morrell, Ali-akbar Agha-mohammadi
Abstract: In recent years, unsupervised deep learning approaches have received significant attention for estimating the depth and visual odometry (VO) from unlabelled monocular image sequences. However, their performance is limited in challenging environments due to perceptual degradation, occlusions and rapid motions. Moreover, existing unsupervised methods suffer from the lack of scale-consistency constraints across frames, which causes the VO estimators to fail to provide persistent trajectories over long sequences. In this study, we propose an unsupervised monocular deep VO framework that predicts the six-degrees-of-freedom camera pose and the depth map of the scene from unlabelled RGB image sequences. We provide detailed quantitative and qualitative evaluations of the proposed framework on a) a challenging dataset collected during the DARPA Subterranean challenge; and b) the benchmark KITTI and Cityscapes datasets. The proposed approach outperforms both traditional and state-of-the-art unsupervised deep VO methods, providing better results for both pose estimation and depth recovery. The presented approach is part of the solution used by the COSTAR team participating in the DARPA Subterranean Challenge.

56. Self-paced and self-consistent co-training for semi-supervised image segmentation [PDF] 返回目录
  Ping Wang, Jizong Peng, Marco Pedersoli, Yuanfeng Zhou, Caiming Zhang, Christian Desrosiers
Abstract: Deep co-training has recently been proposed as an effective approach for image segmentation when annotated data is scarce. In this paper, we improve existing approaches for semi-supervised segmentation with a self-paced and self-consistent co-training method. To help distill information from unlabeled images, we first design a self-paced learning strategy for co-training that lets jointly-trained neural networks focus on easier-to-segment regions first, and then gradually consider harder ones. This is achieved via an end-to-end differentiable loss in the form of a generalized Jensen-Shannon Divergence (JSD). Moreover, to encourage predictions from different networks to be both consistent and confident, we enhance this generalized JSD loss with an uncertainty regularizer based on entropy. The robustness of individual models is further improved using a self-ensembling loss that enforces their prediction to be consistent across different training iterations. We demonstrate the potential of our method on three challenging image segmentation problems with different image modalities, using a small fraction of labeled data. Results show clear advantages in terms of performance compared to standard co-training baselines and recently proposed state-of-the-art approaches for semi-supervised segmentation.
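The generalized JSD at the heart of the consistency term can be written as the entropy of the mixture minus the weighted entropies of the individual predictions; a small sketch over per-pixel class distributions (uniform model weights are an assumption of this illustration):

```python
import numpy as np

def generalized_jsd(probs, weights=None, eps=1e-12):
    # probs: (n_models, n_classes) predictions from the co-trained networks;
    # JSD = H(sum_i w_i p_i) - sum_i w_i H(p_i), low when models agree
    P = np.asarray(probs, dtype=float)
    w = np.full(len(P), 1.0 / len(P)) if weights is None else np.asarray(weights)
    H = lambda p: -(p * np.log(p + eps)).sum()
    mixture = (w[:, None] * P).sum(axis=0)
    return H(mixture) - sum(wi * H(p) for wi, p in zip(w, P))
```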

57. Scene Flow from Point Clouds with or without Learning [PDF] 返回目录
  Jhony Kaesemodel Pontes, James Hays, Simon Lucey
Abstract: Scene flow is the three-dimensional (3D) motion field of a scene. It provides information about the spatial arrangement and rate of change of objects in dynamic environments. Current learning-based approaches seek to estimate the scene flow directly from point clouds and have achieved state-of-the-art performance. However, supervised learning methods are inherently domain specific and require a large amount of labeled data. Annotation of scene flow on real-world point clouds is expensive and challenging, and the lack of such datasets has recently sparked interest in self-supervised learning methods. How to accurately and robustly learn scene flow representations without labeled real-world data is still an open problem. Here we present a simple and interpretable objective function to recover the scene flow from point clouds. We use the graph Laplacian of a point cloud to regularize the scene flow to be "as-rigid-as-possible". Our proposed objective function can be used with or without learning: as a self-supervisory signal to learn scene flow representations, or as a non-learning-based method in which the scene flow is optimized at runtime. Our approach outperforms related works on many datasets. We also show two immediate applications of our proposed method: motion segmentation and point cloud densification.
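The "as-rigid-as-possible" regularizer can be sketched directly: build a kNN graph over the point cloud, form its Laplacian L, and penalize trace(F^T L F) so that neighbouring points receive similar flow. The dense O(N^2) construction below is an illustration for small clouds only:

```python
import numpy as np

def laplacian_rigidity_energy(points, flow, k=8):
    # points, flow: (N, 3); penalize flow differences along kNN-graph edges
    D = np.linalg.norm(points[:, None] - points[None], axis=-1)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]                # k nearest neighbours (skip self)
    n = len(points)
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    W = np.maximum(W, W.T)                                 # symmetric adjacency
    L = np.diag(W.sum(axis=1)) - W                         # graph Laplacian
    return float(np.trace(flow.T @ L @ flow))              # sum of squared edge differences
```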

58. General Data Analytics With Applications To Visual Information Analysis: A Provable Backward-Compatible Semisimple Paradigm Over T-Algebra [PDF] 返回目录
  Liang Liao, Stephen John Maybank
Abstract: We consider a novel backward-compatible paradigm of general data analytics over a recently-reported semisimple algebra (called t-algebra). We study the abstract algebraic framework over the t-algebra by representing the elements of the t-algebra by fixed-size multi-way arrays of complex numbers, and the algebraic structure over the t-algebra by a collection of direct-product constituents. Over the t-algebra, many algorithms, if not all, can be generalized in a straightforward manner using this new semisimple paradigm. To demonstrate the new paradigm's performance and its backward compatibility, we generalize some canonical algorithms for visual pattern analysis. Experiments on public datasets show that the generalized algorithms compare favorably with their canonical counterparts.

59. Pose Randomization for Weakly Paired Image Style Translation [PDF] 返回目录
  Zexi Chen, Jiaxin Guo, Xuecheng Xu, Yunkai Wang, Yue Wang, Rong Xiong
Abstract: Utilizing a trained model under different conditions without data annotation is attractive for robot applications. Towards this goal, one class of methods is to translate the image style from the training environment to the current one. Conventional studies on image style translation mainly focus on two settings: paired data on images from two domains with exactly aligned content, and unpaired data with independent content. In this paper, we propose a new setting, where the content in the two images is aligned with error in poses. We consider this setting more practical, since robots with various sensors are able to align the data up to some error level, even with different styles. To solve this problem, we propose PRoGAN to learn a style translator by intentionally transforming the original domain images with a noisy pose, then matching the distribution of translated transformed images and the distribution of the target domain images. The adversarial training enforces the network to learn the style translation, avoiding being entangled with other variations. In addition, we propose two pose estimation based self-supervised tasks to further improve the performance. Finally, PRoGAN is validated on both simulated and real-world collected data to show its effectiveness. Results on downstream tasks, such as classification, road segmentation, object detection, and feature matching, show its potential for real applications. Our code is available at this https URL.

60. LandmarkGAN: Synthesizing Faces from Landmarks [PDF] 返回目录
  Pu Sun, Yuezun Li, Honggang Qi, Siwei Lyu
Abstract: Face synthesis is an important problem in computer vision with many applications. In this work, we describe a new method, namely LandmarkGAN, to synthesize faces based on facial landmarks as input. Facial landmarks are a natural, intuitive, and effective representation for facial expressions and orientations, which are independent from the target's texture or color and background scene. Our method is able to transform a set of facial landmarks into new faces of different subjects, while retains the same facial expression and orientation. Experimental results on face synthesis and reenactments demonstrate the effectiveness of our method.

61. ProxylessKD: Direct Knowledge Distillation with Inherited Classifier for Face Recognition [PDF] 返回目录
  Weidong Shi, Guanghui Ren, Yunpeng Chen, Shuicheng Yan
Abstract: Knowledge Distillation (KD) refers to transferring knowledge from a large model to a smaller one, which is widely used to enhance model performance in machine learning. It tries to align embedding spaces generated from the teacher and the student model (i.e. to make images corresponding to the same semantics share the same embedding across different models). In this work, we focus on its application in face recognition. We observe that existing knowledge distillation models optimize the proxy tasks that force the student to mimic the teacher's behavior, instead of directly optimizing the face recognition accuracy. Consequently, the obtained student models are not guaranteed to be optimal on the target task or able to benefit from advanced constraints, such as large margin constraints (e.g. margin-based softmax). We then propose a novel method named ProxylessKD that directly optimizes face recognition accuracy by inheriting the teacher's classifier as the student's classifier to guide the student to learn discriminative embeddings in the teacher's embedding space. The proposed ProxylessKD is very easy to implement and sufficiently generic to be extended to other tasks beyond face recognition. We conduct extensive experiments on standard face recognition benchmarks, and the results demonstrate that ProxylessKD achieves superior performance over existing knowledge distillation methods.

62. Temporal Smoothing for 3D Human Pose Estimation and Localization for Occluded People [PDF] 返回目录
  Marton Veges, Andras Lorincz
Abstract: In multi-person pose estimation actors can be heavily occluded, even become fully invisible behind another person. While temporal methods can still predict a reasonable estimation for a temporarily disappeared pose using past and future frames, they exhibit large errors nevertheless. We present an energy minimization approach to generate smooth, valid trajectories in time, bridging gaps in visibility. We show that it is better than other interpolation based approaches and achieves state of the art results. In addition, we present the synthetic MuCo-Temp dataset, a temporal extension of the MuCo-3DHP dataset. Our code is made publicly available.
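A minimal sketch of such a smoothing energy, assuming per-frame 3D joint observations and a quadratic acceleration penalty (the paper's exact terms are not reproduced); minimizing it over the missing frames bridges occluded gaps:

```python
import numpy as np

def occlusion_energy(traj, observed, visible, lam=10.0):
    # traj, observed: (T, 3); visible: boolean mask of frames with detections
    data = np.sum((traj[visible] - observed[visible]) ** 2)   # fit visible frames
    accel = traj[2:] - 2.0 * traj[1:-1] + traj[:-2]           # second differences
    return data + lam * np.sum(accel ** 2)                    # smoothness bridges gaps
```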

63. Enhanced Balancing GAN: Minority-class Image Generation [PDF] 返回目录
  Gaofeng Huang, Amir H. Jafari
Abstract: Generative adversarial networks (GANs) are one of the most powerful generative models, but always require a large and balanced dataset to train. Traditional GANs are not applicable to generate minority-class images in a highly imbalanced dataset. Balancing GAN (BAGAN) is proposed to mitigate this problem, but it is unstable when images in different classes look similar, e.g. flowers and cells. In this work, we propose a supervised autoencoder with an intermediate embedding model to disperse the labeled latent vectors. With the improved autoencoder initialization, we also build an architecture of BAGAN with gradient penalty (BAGAN-GP). Our proposed model overcomes the unstable issue in original BAGAN and converges faster to high quality generations. Our model achieves high performance on the imbalanced scale-down version of MNIST Fashion, CIFAR-10, and one small-scale medical image dataset.
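For reference, the gradient penalty in GAN architectures of this kind is usually the WGAN-GP term evaluated on interpolates between real and generated samples; assuming BAGAN-GP follows this standard form:

```latex
L_{GP} = \lambda \,\mathbb{E}_{\hat{x}}\!\left[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\right],
\qquad \hat{x} = \epsilon\, x_{\mathrm{real}} + (1-\epsilon)\, x_{\mathrm{fake}},\quad \epsilon \sim U[0,1].
```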

64. Self-supervised Representation Learning for Evolutionary Neural Architecture Search [PDF] 返回目录
  Chen Wei, Yiping Tang, Chuang Niu, Haihong Hu, Yue Wang, Jimin Liang
Abstract: Recently proposed neural architecture search (NAS) algorithms adopt neural predictors to accelerate the architecture search. The capability of neural predictors to accurately predict the performance metrics of neural architectures is critical to NAS, yet acquiring training datasets for neural predictors is time-consuming. How to obtain a neural predictor with high prediction accuracy using a small amount of training data is a central problem for neural-predictor-based NAS. Here, we firstly design a new architecture encoding scheme that overcomes the drawbacks of existing vector-based architecture encoding schemes to calculate the graph edit distance of neural architectures. To enhance the predictive performance of neural predictors, we devise two self-supervised learning methods from different perspectives to pre-train the architecture embedding part of neural predictors to generate a meaningful representation of neural architectures. The first one is to train a carefully designed two-branch graph neural network model to predict the graph edit distance of two input neural architectures. The second method is inspired by the prevalent contrastive learning, and we present a new contrastive learning algorithm that utilizes a central feature vector as a proxy to contrast positive pairs against negative pairs. Experimental results illustrate that the pre-trained neural predictors can achieve comparable or superior performance compared with their supervised counterparts using several times fewer training samples. We achieve state-of-the-art performance on the NASBench-101 and NASBench-201 benchmarks when integrating the pre-trained neural predictors with an evolutionary NAS algorithm.
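A speculative sketch of the center-as-proxy contrastive loss described above: the mean of the positive group's embeddings serves as the central feature vector, and positives are contrasted against negatives through it. The temperature and batch layout are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of an InfoNCE-style loss with a central feature vector as the proxy.
import torch
import torch.nn.functional as F

def center_proxy_contrastive(pos_embeds, neg_embeds, temperature=0.1):
    center = F.normalize(pos_embeds.mean(dim=0, keepdim=True), dim=1)  # the proxy
    pos_sim = F.normalize(pos_embeds, dim=1) @ center.t() / temperature  # (P, 1)
    neg_sim = F.normalize(neg_embeds, dim=1) @ center.t() / temperature  # (N, 1)
    logits = torch.cat([pos_sim, neg_sim.t().expand(pos_sim.size(0), -1)], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)   # index 0 = the positive-center pair
```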

65. Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution [PDF]
  Renshu Gu, Gaoang Wang, Jenq-Neng Hwang
Abstract: 3D human pose estimation (HPE) is crucial in many fields, such as human behavior analysis, augmented reality/virtual reality (AR/VR) applications, and the self-driving industry. Videos that contain multiple potentially occluded people captured from freely moving monocular cameras are very common in real-world scenarios, while 3D HPE for such scenarios is quite challenging, partially because there is a lack of such data with accurate 3D ground truth labels in existing datasets. In this paper, we propose a temporal regression network with a gated convolution module to transform 2D joints to 3D and, at the same time, recover the missing occluded joints. A simple yet effective localization approach is further conducted to transform the normalized pose to the global trajectory. To verify the effectiveness of our approach, we also collect a new moving-camera multi-human (MMHuman) dataset that includes multiple people with heavy occlusion captured by moving cameras. The 3D ground truth joints are provided by an accurate motion capture (MoCap) system. From the experiments on the static-camera based Human3.6M data and our own collected moving-camera based data, we show that our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods, especially for scenarios with heavy occlusions.
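A minimal sketch of a gated temporal convolution block of the kind the abstract describes, where a feature convolution is modulated by a learned sigmoid gate; the channel sizes and 1-D temporal layout are illustrative assumptions.

```python
# Sketch of a gated 1-D convolution over a temporal sequence of joints.
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.feature = nn.Conv1d(in_ch, out_ch, k, padding=k // 2)
        self.gate = nn.Conv1d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):          # x: (B, in_ch, T) sequence of 2D joints
        return self.feature(x) * torch.sigmoid(self.gate(x))
```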

66. Learning Open Set Network with Discriminative Reciprocal Points [PDF]
  Guangyao Chen, Limeng Qiao, Yemin Shi, Peixi Peng, Jia Li, Tiejun Huang, Shiliang Pu, Yonghong Tian
Abstract: Open set recognition is an emerging research area that aims to simultaneously classify samples from predefined classes and identify the rest as 'unknown'. In this process, one of the key challenges is to reduce the risk of generalizing the inherent characteristics of numerous unknown samples learned from a small amount of known data. In this paper, we propose a new concept, the Reciprocal Point, which is the potential representation of the extra-class space corresponding to each known category. A sample can then be classified as known or unknown according to its otherness with respect to the reciprocal points. To tackle the open set problem, we offer a novel open space risk regularization term. Based on the bounded space constructed by reciprocal points, the risk of the unknown is reduced through multi-category interaction. The resulting learning framework, called Reciprocal Point Learning (RPL), can indirectly introduce unknown information into a learner trained with only known classes, so as to learn more compact and discriminative representations. Moreover, we further construct a new large-scale challenging aircraft dataset for open set recognition: Aircraft 300 (Air-300). Extensive experiments on multiple benchmark datasets indicate that our framework is significantly superior to other existing approaches and achieves state-of-the-art performance on standard open set benchmarks.
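A minimal sketch of classifying by otherness with reciprocal points: each known class keeps a learnable point representing its extra-class space, and a sample's logit for a class grows with its distance from that class's reciprocal point. The names and the rejection rule are illustrative assumptions.

```python
# Sketch of distance-based logits against learnable reciprocal points.
import torch
import torch.nn as nn

class ReciprocalPointClassifier(nn.Module):
    def __init__(self, embed_dim, num_known_classes):
        super().__init__()
        self.points = nn.Parameter(torch.randn(num_known_classes, embed_dim))

    def forward(self, z):                          # z: (B, embed_dim) embeddings
        return torch.cdist(z, self.points) ** 2    # larger distance => more class-like

# At test time, a sample whose maximum logit falls below a threshold can be
# rejected as 'unknown'.
```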

67. Multimodal and self-supervised representation learning for automatic gesture recognition in surgical robotics [PDF]
  Aniruddha Tamhane, Jie Ying Wu, Mathias Unberath
Abstract: Self-supervised, multi-modal learning has been successful in holistic representation of complex scenarios. This can be useful to consolidate information from multiple modalities which have multiple, versatile uses. Its application in surgical robotics can lead to simultaneously developing a generalised machine understanding of the surgical process and reducing the dependency on high-quality expert annotations, which are generally difficult to obtain. We develop a self-supervised, multi-modal representation learning paradigm that learns representations for surgical gestures from video and kinematics. We use an encoder-decoder network configuration that encodes representations from surgical videos and decodes them to yield kinematics. We quantitatively demonstrate the efficacy of our learnt representations for gesture recognition (with accuracy between 69.6% and 77.8%), transfer learning across multiple tasks (with accuracy between 44.6% and 64.8%), and surgeon skill classification (with accuracy between 76.8% and 81.2%). Further, we qualitatively demonstrate that our self-supervised representations cluster according to semantically meaningful properties (surgeon skill and gestures).

68. Automatic Chronic Degenerative Diseases Identification Using Enteric Nervous System Images [PDF]
  Gustavo Z. Felipe, Jacqueline N. Zanoni, Camila C. Sehaber-Sierakowski, Gleison D. P. Bossolani, Sara R. G. Souza, Franklin C. Flores, Luiz E. S. Oliveira, Rodolfo M. Pereira, Yandre M. G. Costa
Abstract: Recent studies on the Enteric Nervous System have shown that chronic degenerative diseases affect the Enteric Glial Cells (EGC); thus, the development of recognition methods able to identify whether or not the EGC are affected by these types of diseases may be helpful in their diagnosis. In this work, we propose the use of pattern recognition and machine learning techniques to evaluate whether a given animal EGC image was obtained from a healthy individual or one affected by a chronic degenerative disease. In the proposed approach, we have performed the classification task with handcrafted features and deep learning based techniques, also known as non-handcrafted features. The handcrafted features were obtained from the textural content of the EGC images using texture descriptors, such as the Local Binary Pattern (LBP). Moreover, the representation learning techniques employed in the approach are based on different Convolutional Neural Network (CNN) architectures, such as AlexNet and VGG16, with and without transfer learning. The complementarity between the handcrafted and non-handcrafted features was also evaluated with late fusion techniques. The datasets of EGC images used in the experiments, which are also contributions of this paper, are composed of three different chronic degenerative diseases: Cancer, Diabetes Mellitus, and Rheumatoid Arthritis. The experimental results, supported by statistical analysis, show that the proposed approach can distinguish healthy cells from sick ones with recognition rates of 89.30% (Rheumatoid Arthritis), 98.45% (Cancer), and 95.13% (Diabetes Mellitus), achieved by combining classifiers obtained from both feature scenarios.
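As a concrete example of the handcrafted branch, the following sketch extracts a uniform LBP histogram from a grayscale EGC image using scikit-image; the parameter values are illustrative.

```python
# Sketch of LBP texture features for a classic classifier (SVM, RF, etc.).
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_image, P=8, R=1.0):
    # gray_image: 2-D array; "uniform" LBP yields P + 2 distinct pattern codes
    codes = local_binary_pattern(gray_image, P, R, method="uniform")
    n_bins = P + 2
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist          # normalized histogram used as the feature vector
```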

69. Weakly Supervised 3D Classification of Chest CT using Aggregated Multi-Resolution Deep Segmentation Features [PDF]
  Anindo Saha, Fakrul I. Tushar, Khrystyna Faryna, Vincent M. D'Anniballe, Rui Hou, Maciej A. Mazurowski, Geoffrey D. Rubin, Joseph Y. Lo
Abstract: Weakly supervised disease classification of CT imaging suffers from poor localization owing to case-level annotations, where even a positive scan can hold hundreds to thousands of negative slices along multiple planes. Furthermore, although deep learning segmentation and classification models extract distinctly unique combinations of anatomical features from the same target class(es), they are typically seen as two independent processes in a computer-aided diagnosis (CAD) pipeline, with little to no feature reuse. In this research, we propose a medical classifier that leverages the semantic structural concepts learned via multi-resolution segmentation feature maps, to guide weakly supervised 3D classification of chest CT volumes. Additionally, a comparative analysis is drawn across two different types of feature aggregation to explore the vast possibilities surrounding feature fusion. Using a dataset of 1593 scans labeled on a case-level basis via a rule-based model, we train a dual-stage convolutional neural network (CNN) to perform organ segmentation and binary classification of four representative diseases (emphysema, pneumonia/atelectasis, mass and nodules) in lungs. The baseline model, with separate stages for segmentation and classification, results in an AUC of 0.791. Using identical hyperparameters, the connected architecture using static and dynamic feature aggregation improves performance to AUCs of 0.832 and 0.851, respectively. This study advances the field in two key ways. First, case-level report data is used to weakly supervise a 3D CT classifier of multiple, simultaneous diseases for an organ. Second, segmentation and classification models are connected with two different feature aggregation strategies to enhance the classification performance.

70. Leveraging Adaptive Color Augmentation in Convolutional Neural Networks for Deep Skin Lesion Segmentation [PDF]
  Anindo Saha, Prem Prasad, Abdullah Thabit
Abstract: Fully automatic detection of skin lesions in dermatoscopic images can facilitate early diagnosis and repression of malignant melanoma and non-melanoma skin cancer. Although convolutional neural networks are a powerful solution, they are limited by the illumination spectrum of annotated dermatoscopic screening images, where color is an important discriminative feature. In this paper, we propose an adaptive color augmentation technique to amplify data expression and model performance, while regulating color difference and saturation to minimize the risks of using synthetic data. Through deep visualization, we qualitatively identify and verify the semantic structural features learned by the network for discriminating skin lesions against normal skin tissue. The overall system achieves a Dice Ratio of 0.891 with 0.943 sensitivity and 0.932 specificity on the ISIC 2018 Testing Set for segmentation.
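A sketch of one way to realize such regulated color augmentation: jitter hue and saturation in HSV space while clamping the shift to keep synthetic colors plausible. The bounds are illustrative assumptions, not the paper's exact regulation scheme; rgb is assumed to be a float image in [0, 1].

```python
# Sketch of a bounded hue/saturation jitter using scikit-image.
import numpy as np
from skimage import color

def adaptive_color_augment(rgb, max_hue_shift=0.05, max_sat_shift=0.10):
    hsv = color.rgb2hsv(rgb)
    hsv[..., 0] = (hsv[..., 0] + np.random.uniform(-max_hue_shift, max_hue_shift)) % 1.0
    hsv[..., 1] = np.clip(hsv[..., 1] + np.random.uniform(-max_sat_shift, max_sat_shift), 0, 1)
    return color.hsv2rgb(hsv)
```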

71. Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation [PDF]
  Guoliang Kang, Yunchao Wei, Yi Yang, Yueting Zhuang, Alexander G. Hauptmann
Abstract: Domain adaptive semantic segmentation aims to train a model performing satisfactory pixel-level predictions on the target with only out-of-domain (source) annotations. The conventional solution to this task is to minimize the discrepancy between source and target to enable effective knowledge transfer. Previous domain discrepancy minimization methods are mainly based on adversarial training. They tend to consider the domain discrepancy globally, which ignores pixel-wise relationships and is less discriminative. In this paper, we propose to build the pixel-level cycle association between source and target pixel pairs and contrastively strengthen their connections to diminish the domain gap and make the features more discriminative. To the best of our knowledge, this is a new perspective for tackling such a challenging task. Experiment results on two representative domain adaptation benchmarks, i.e. GTAV → Cityscapes and SYNTHIA → Cityscapes, verify the effectiveness of our proposed method and demonstrate that our method performs favorably against the previous state of the art. Our method can be trained end-to-end in one stage and introduces no additional parameters, which is expected to serve as a general framework and help ease future research in domain adaptive semantic segmentation. Code is available at this https URL Level-Cycle-Association.

72. (Un)Masked COVID-19 Trends from Social Media [PDF]
  Asmit Kumar Singh, Paras Mehan, Divyanshu Sharma, Rohan Pandey, Tavpritesh Sethi, Ponnurangam Kumaraguru
Abstract: COVID-19 has affected the entire world. One useful protection method for people against COVID-19 is to wear masks in public areas. Across the globe, many public service providers have mandated correctly wearing masks to use their services. This paper proposes two new datasets, VAriety MAsks - Classification (VAMA-C) and VAriety MAsks - Segmentation (VAMA-S), for mask detection and mask fit analysis tasks, respectively. We propose a framework for classifying masked and unmasked faces and a segmentation-based model to calculate the mask-fit score. Both models trained in this study achieved an accuracy of 98%. Using the two trained deep learning models, 2.04 million social media images for six major US cities were analyzed. By comparing the regulations, an increase in masks worn in images as the COVID-19 cases rose in these cities was observed, particularly when their respective states imposed strict regulations. Furthermore, mask compliance in the Black Lives Matter protest was analyzed, revealing that 40% of the people in group photos wore masks, and 45% of them wore the masks with a fit score of greater than 80%.

73. Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation [PDF]
  Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo
Abstract: Inspired by the human ability to infer emotions from body language, we propose an automated framework for body language based emotion recognition starting from regular RGB videos. In collaboration with psychologists, we further extend the framework for psychiatric symptom prediction. Because a specific application domain of the proposed framework may only supply a limited amount of data, the framework is designed to work on a small training set and possess a good transferability. The proposed system in the first stage generates sequences of body language predictions based on human poses estimated from input videos. In the second stage, the predicted sequences are fed into a temporal network for emotion interpretation and psychiatric symptom prediction. We first validate the accuracy and transferability of the proposed body language recognition method on several public action recognition datasets. We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms. The proposed framework outperforms other methods on the URMC dataset.

74. A Deep Learning Study on Osteosarcoma Detection from Histological Images [PDF]
  D M Anisuzzaman, Hosein Barzekar, Ling Tong, Jake Luo, Zeyun Yu
Abstract: In the U.S., 5-10% of new pediatric cases of cancer are primary bone tumors. The most common type of primary malignant bone tumor is osteosarcoma. The intention of the present work is to improve the detection and diagnosis of osteosarcoma using computer-aided detection (CAD) and diagnosis (CADx). Tools such as convolutional neural networks (CNNs) can significantly decrease the surgeon's workload and enable a better prognosis of patient conditions. CNNs need to be trained on a large amount of data in order to achieve a more trustworthy performance. In this study, transfer learning techniques, i.e., pre-trained CNNs, are adapted to a public dataset of osteosarcoma histological images to detect necrotic images from non-necrotic and healthy tissues. First, the dataset was preprocessed, and different classifications were applied. Then, transfer learning models including VGG19 and Inception V3 were used and trained on Whole Slide Images (WSI) with no patches, to improve the accuracy of the outputs. Finally, the models were applied to different classification problems, including binary and multi-class classifiers. Experimental results show that VGG19 achieves the highest accuracy, 96%, across all binary and multi-class classification settings. Our fine-tuned model demonstrates state-of-the-art performance on detecting malignancy of osteosarcoma based on histologic images.
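A minimal sketch of the transfer-learning setup described above, using torchvision's pre-trained VGG19; the number of classes and the choice to freeze the backbone are illustrative assumptions.

```python
# Sketch of fine-tuning an ImageNet-pretrained VGG19 for tumor classification.
import torch.nn as nn
from torchvision import models

def build_vgg19(num_classes=2, freeze_backbone=True):
    model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
    if freeze_backbone:
        for p in model.features.parameters():   # keep convolutional features fixed
            p.requires_grad = False
    model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_classes)
    return model
```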

75. Perceive, Attend, and Drive: Learning Spatial Attention for Safe Self-Driving [PDF]
  Bob Wei, Mengye Ren, Wenyuan Zeng, Ming Liang, Bin Yang, Raquel Urtasun
Abstract: In this paper, we propose an end-to-end self-driving network featuring a sparse attention module that learns to automatically attend to important regions of the input. The attention module specifically targets motion planning, whereas prior literature only applied attention in perception tasks. Learning an attention mask directly targeted for motion planning significantly improves the planner safety by performing more focused computation. Furthermore, visualizing the attention improves interpretability of end-to-end self-driving.

76. Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds [PDF]
  Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R. Hershey
Abstract: Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work assumed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the number of sources, and required strong sound separation or visual segmentation labels. AudioScope overcomes these limitations, operating on an open domain of sounds, with variable numbers of sources, and without labels or prior visual segmentation. The training procedure for AudioScope uses mixture invariant training (MixIT) to separate synthetic mixtures of mixtures (MoMs) into individual sources, where noisy labels for mixtures are provided by an unsupervised audio-visual coincidence model. Using the noisy labels, along with attention between video and audio features, AudioScope learns to identify audio-visual similarity and to suppress off-screen sounds. We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, making the application of previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips.
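A minimal sketch of mixture invariant training (MixIT): the model separates a mixture of two mixtures, and the loss searches over all binary assignments of estimated sources back to the two reference mixtures. MSE stands in here for the paper's SNR-based loss, and all names are illustrative.

```python
# Sketch of the MixIT loss over a mixture of mixtures.
import itertools
import torch

def mixit_loss(est_sources, mix1, mix2):
    # est_sources: (S, T) estimated waveforms; mix1, mix2: (T,) reference mixtures
    best = None
    for bits in itertools.product([0.0, 1.0], repeat=est_sources.size(0)):
        a = torch.tensor(bits, dtype=est_sources.dtype).unsqueeze(1)
        loss = (((est_sources * (1 - a)).sum(0) - mix1) ** 2).mean() \
             + (((est_sources * a).sum(0) - mix2) ** 2).mean()
        best = loss if best is None else torch.minimum(best, loss)
    return best   # best assignment of sources to the two mixtures
```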

77. U-Net and its variants for medical image segmentation: theory and applications [PDF]
  Nahian Siddique, Paheding Sidike, Colin Elkin, Vijay Devabhaktuni
Abstract: U-net is an image segmentation technique developed primarily for medical image analysis that can precisely segment images using a scarce amount of training data. These traits provide U-net with a very high utility within the medical imaging community and have resulted in extensive adoption of U-net as the primary tool for segmentation tasks in medical imaging. The success of U-net is evident in its widespread use in all major image modalities from CT scans and MRI to X-rays and microscopy. Furthermore, while U-net is largely a segmentation tool, there have been instances of the use of U-net in other applications. As the potential of U-net is still increasing, in this review we look at the various developments that have been made in the U-net architecture and provide observations on recent trends. We examine the various innovations that have been made in deep learning and discuss how these tools facilitate U-net. Furthermore, we look at image modalities and application areas where U-net has been applied.

78. Top 10 BraTS 2020 challenge solution: Brain tumor segmentation with self-ensembled, deeply-supervised 3D-Unet like neural networks [PDF]
  Theophraste Henry, Alexandre Carre, Marvin Lerousseau, Theo Estienne, Charlotte Robert, Nikos Paragios, Eric Deutsch
Abstract: Brain tumor segmentation is a critical task for patient disease management. To this end, we trained multiple U-Net-like neural networks, mainly with deep supervision and stochastic weight averaging, on the Multimodal Brain Tumor Segmentation Challenge (BraTS) 2020 training dataset, in a cross-validated fashion. Final brain tumor segmentations were produced by first independently averaging two sets of models, and then custom-merging the labelmaps to account for the individual performance of each set. Our performance on the online validation dataset with test time augmentation was as follows: Dice of 0.81, 0.91 and 0.85; Hausdorff (95%) of 20.6, 4.3 and 5.7 mm for the enhancing tumor, whole tumor and tumor core, respectively. Similarly, our ensemble achieved a Dice of 0.79, 0.89 and 0.84, as well as Hausdorff (95%) of 20.4, 6.7 and 19.5 mm on the final test dataset. More complicated training schemes and neural network architectures were investigated without significant performance gain, at the cost of greatly increased training time. While relatively straightforward, our approach yielded good and balanced performance for each tumor subregion. Our solution is open sourced at this https URL.

79. Depth Ranging Performance Evaluation and Improvement for RGB-D Cameras on Field-Based High-Throughput Phenotyping Robots [PDF]
  Zhengqiang Fan, Na Sun, Quan Qiu, Chunjiang Zhao
Abstract: RGB-D cameras have been successfully used for indoor High-ThroughpuT Phenotyping (HTTP). However, their capability and feasibility for in-field HTTP still need to be evaluated, due to the noise and disturbances generated by unstable illumination, specular reflection, diffuse reflection, etc. To solve these problems, we evaluated the depth-ranging performance of two consumer-level RGB-D cameras (RealSense D435i and Kinect V2) under in-field HTTP scenarios, and proposed a strategy to compensate for the depth measurement error. For performance evaluation, we focused on determining their optimal ranging areas for different crop organs. Based on the evaluation results, we proposed a brightness-and-distance-based Support Vector Regression strategy to compensate for the ranging error. Furthermore, we analyzed the depth filling rate of the two RGB-D cameras under different lighting intensities. Experimental results showed that: 1) For RealSense D435i, its effective ranging area is [0.160, 1.400] m, and its in-field filling rate is approximately 90%. 2) For Kinect V2, it has a high ranging accuracy in [0.497, 1.200] m, but its in-field filling rate is less than 24.9%. 3) Our error compensation model can effectively reduce the influences of lighting intensity and target distance. The maximum MSE and minimum R2 of this model are 0.029 and 0.867, respectively. To sum up, RealSense D435i has better ranging performance than Kinect V2 for in-field HTTP.
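A sketch of the brightness-and-distance-based error compensation, assuming a scikit-learn SVR fitted on (brightness, measured distance) pairs against the observed ranging error; the training arrays and hyperparameters are illustrative placeholders.

```python
# Sketch of SVR-based depth error compensation.
import numpy as np
from sklearn.svm import SVR

# X columns: [pixel brightness, measured distance (m)]; y: ranging error (m)
X = np.array([[120, 0.5], [200, 0.9], [80, 1.2], [150, 0.7]], dtype=float)
y = np.array([0.004, 0.012, 0.020, 0.008])

model = SVR(kernel="rbf", C=10.0, epsilon=0.001).fit(X, y)

# Corrected depth = measured depth minus the predicted error.
corrected = 0.9 - model.predict([[200, 0.9]])[0]
```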

80. AVECL-UMONS database for audio-visual event classification and localization [PDF]
  Mathilde Brousmiche, Stéphane Dupont, Jean Rouat
Abstract: We introduce the AVECL-UMons dataset for audio-visual event classification and localization in the context of office environments. The audio-visual dataset is composed of 11 event classes recorded at several realistic positions in two different rooms. Two types of sequences are recorded according to the number of events in the sequence. The dataset comprises 2662 unilabel sequences and 2724 multilabel sequences, corresponding to a total of 5.24 hours. The dataset is publicly accessible online: this https URL.

81. ASIST: Annotation-free synthetic instance segmentation and tracking for microscope video analysis [PDF]
  Quan Liu, Isabella M. Gaeta, Mengyang Zhao, Ruining Deng, Aadarsh Jha, Bryan A. Millis, Anita Mahadevan-Jansen, Matthew J. Tyska, Yuankai Huo
Abstract: Instance object segmentation and tracking provide comprehensive quantification of objects across microscope videos. The recent single-stage pixel-embedding based deep learning approach has shown superior performance compared with "segment-then-associate" two-stage solutions. However, one major limitation of applying a supervised pixel-embedding based method to microscope videos is the resource-intensive manual labeling, which involves tracing hundreds of overlapped objects with their temporal associations across video frames. Inspired by the recent generative adversarial network (GAN) based annotation-free image segmentation, we propose a novel annotation-free synthetic instance segmentation and tracking (ASIST) algorithm for analyzing microscope videos of sub-cellular microvilli. The contributions of this paper are three-fold: (1) a new annotation-free video analysis paradigm is proposed; (2) embedding-based instance segmentation and tracking are aggregated with annotation-free synthetic learning into a holistic framework; and (3) to the best of our knowledge, this is the first study to investigate microvilli instance segmentation and tracking using embedding-based deep learning. In the experimental results, the proposed annotation-free method achieved superior performance compared with supervised learning.

82. Deep Learning in Computer-Aided Diagnosis and Treatment of Tumors: A Survey [PDF]
  Dan Zhao, Guizhi Xu, Zhenghua XU, Thomas Lukasiewicz, Minmin Xue, Zhigang Fu
Abstract: Computer-Aided Diagnosis and Treatment of Tumors has been a hot topic of deep learning in recent years; it comprises a series of medical tasks, such as detection of tumor markers, outlining of tumor lesions, subtyping and staging of tumors, prediction of therapeutic effect, and drug development. Meanwhile, some deep learning models with precise positioning and excellent performance have been produced in mainstream task scenarios. We therefore introduce deep learning methods from a task-oriented perspective, focusing mainly on their improvements for medical tasks. We then summarize recent progress in the four stages of tumor diagnosis and treatment, namely In-Vitro Diagnosis (IVD), Imaging Diagnosis (ID), Pathological Diagnosis (PD), and Treatment Planning (TP). According to the specific data types and medical tasks of each stage, we present the applications of deep learning in the Computer-Aided Diagnosis and Treatment of Tumors and analyze the excellent works therein. This survey concludes by discussing research issues and suggesting challenges for future improvement.

83. nnU-Net for Brain Tumor Segmentation [PDF]
  Fabian Isensee, Paul F. Jaeger, Peter M. Full, Philipp Vollmuth, Klaus H. Maier-Hein
Abstract: We apply nnU-Net to the segmentation task of the BraTS 2020 challenge. The unmodified nnU-Net baseline configuration already achieves a respectable result. By incorporating BraTS-specific modifications regarding postprocessing, region-based training, a more aggressive data augmentation, as well as several minor modifications to the nnU-Net pipeline, we are able to improve its segmentation performance substantially. We furthermore re-implement the BraTS ranking scheme to determine which of our nnU-Net variants best fits the requirements imposed by it. Our final ensemble took first place in the BraTS 2020 competition, with Dice scores of 88.95, 85.06 and 82.03 and HD95 values of 8.498, 17.337 and 17.805 for whole tumor, tumor core and enhancing tumor, respectively.

84. Bifurcated Autoencoder for Segmentation of COVID-19 Infected Regions in CT Images [PDF]
  Parham Yazdekhasty, Ali Zindar, Zahra Nabizadeh-ShahreBabak, Roshank Roshandel, Pejman Khadivi, Nader Karimi, Shadrokh Samavi
Abstract: The new coronavirus infection has shocked the world since early 2020 with its aggressive outbreak. Rapid detection of the disease saves lives, and relying on medical imaging (Computed Tomography and X-ray) to detect infected lungs has shown to be effective. Deep learning and convolutional neural networks have been used for image analysis in this context. However, accurate identification of infected regions has proven challenging for two main reasons. Firstly, the characteristics of infected areas differ in different images. Secondly, insufficient training data makes it challenging to train various machine learning algorithms, including deep-learning models. This paper proposes an approach to segment lung regions infected by COVID-19 to help cardiologists diagnose the disease more accurately, quickly, and manageably. We propose a bifurcated 2-D model for two types of segmentation. This model uses a shared encoder and a bifurcated connection to two separate decoders. One decoder is for segmentation of the healthy region of the lungs, while the other is for the segmentation of the infected regions. Experiments on publicly available images show that the bifurcated structure segments infected regions of the lungs better than the state of the art.
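A minimal sketch of the bifurcated design: one shared encoder feeding two separate decoders, one per segmentation target. The layer sizes are illustrative, not the paper's architecture.

```python
# Sketch of a shared-encoder, two-decoder segmentation network.
import torch
import torch.nn as nn

class BifurcatedSegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        def decoder():
            return nn.Sequential(
                nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, 1), nn.Sigmoid())
        self.healthy_head = decoder()    # healthy lung region mask
        self.infected_head = decoder()   # COVID-19 infected region mask

    def forward(self, x):
        z = self.encoder(x)              # shared features
        return self.healthy_head(z), self.infected_head(z)
```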

85. Brain Tumor Classification Using Medial Residual Encoder Layers [PDF]
  Zahra SobhaniNia, Nader Karimi, Pejman Khadivi, Roshank Roshandel, Shadrokh Samavi
Abstract: According to the World Health Organization, cancer is the second leading cause of death worldwide, responsible for over 9.5 million deaths in 2018 alone. Brain tumors account for one out of every four cancer deaths. Accurate and timely diagnosis of brain tumors will lead to more effective treatments. To date, several image classification approaches have been proposed to aid diagnosis and treatment. We propose an encoder layer that uses post-max-pooling features for residual learning. Our approach shows promising results by improving the tumor classification accuracy in MR images using a limited medical image dataset. Experimental evaluations of this model on a dataset consisting of 3064 MR images show 95-98% accuracy, which is better than previous studies on this database.

86. Tracking Partially-Occluded Deformable Objects while Enforcing Geometric Constraints [PDF]
  Yixuan Wang, Dale McConachie, Dmitry Berenson
Abstract: In order to manipulate a deformable object, such as rope or cloth, in unstructured environments, robots need a way to estimate its current shape. However, tracking the shape of a deformable object can be challenging because of the object's high flexibility, (self-)occlusion, and interaction with obstacles. Building a high-fidelity physics simulation to aid in tracking is difficult for novel environments. Instead we focus on tracking the object based on RGBD images and geometric motion estimates and obstacles. Our key contributions over previous work in this vein are: 1) A better way to handle severe occlusion by using a motion model to regularize the tracking estimate; and 2) The formulation of convex geometric constraints, which allow us to prevent self-intersection and penetration into known obstacles via a post-processing step. These contributions allow us to outperform previous methods by a large margin in terms of accuracy in scenarios with severe occlusion and obstacles.

87. Triage of Potential COVID-19 Patients from Chest X-ray Images using Hierarchical Convolutional Networks [PDF]
  Kapal Dev, Sunder Ali Khowaja, Aman Jaiswal, Ankur Singh Bist, Vaibhav Saini, Surbhi Bhatia
Abstract: The current COVID-19 pandemic has motivated researchers to use artificial intelligence techniques as potential alternatives to reverse transcription polymerase chain reaction (RT-PCR), due to the limited scale of testing. The chest X-ray (CXR) is one of the alternatives for achieving fast diagnosis, but the unavailability of large-scale annotated data makes the clinical implementation of machine-learning-based COVID detection methods difficult. Another important issue is the usage of ImageNet pre-trained networks, which is not guaranteed to extract reliable feature representations. In this paper, we propose the use of a hierarchical convolutional network (HCN) architecture to naturally augment the data along with diversified features. The HCN uses the first convolution layer from COVIDNet followed by the convolutional layers from well-known pre-trained networks to extract the features. The use of the convolution layer from COVIDNet ensures the extraction of representations relevant to the CXR modality. We also propose the use of ECOC for encoding multiclass problems to binary classification for improving the recognition performance. Experimental results show that the HCN architecture is capable of achieving better results in comparison to the existing studies. The proposed method can accurately triage potential COVID-19 patients through CXR images for sharing the testing load and increasing the testing capacity.
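As a concrete illustration of ECOC-style multiclass-to-binary encoding, the following sketch uses scikit-learn's OutputCodeClassifier over placeholder features (e.g., CNN embeddings); the feature arrays, base estimator, and code size are illustrative assumptions.

```python
# Sketch of error-correcting output codes (ECOC) for multiclass classification.
import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 64)             # placeholder feature vectors
y = np.random.randint(0, 3, size=100)   # e.g., normal / pneumonia / COVID-19

# Each class gets a binary codeword; one binary classifier is trained per bit.
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=2.0, random_state=0).fit(X, y)
pred = ecoc.predict(X[:5])
```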

88. Learning Euler's Elastica Model for Medical Image Segmentation [PDF]
  Xu Chen, Xiangde Luo, Yitian Zhao, Shaoting Zhang, Guotai Wang, Yalin Zheng
Abstract: Image segmentation is a fundamental topic in image processing and has been studied for many decades. Deep learning-based supervised segmentation models have achieved state-of-the-art performance, but most of them are limited by using pixel-wise loss functions for training without geometrical constraints. Inspired by Euler's Elastica model and recent active contour models introduced into the field of deep learning, we propose a novel active contour with elastica (ACE) loss function incorporating Elastica (curvature and length) and region information as geometrically natural constraints for image segmentation tasks. We introduce the mean curvature, i.e., the average of all principal curvatures, as a more effective image prior for representing curvature in our ACE loss function. Furthermore, based on the definition of the mean curvature, we propose a fast solution to approximate the ACE loss in three dimensions (3D) by using Laplace operators for 3D image segmentation. We evaluate our ACE loss function on four 2D and 3D natural and biomedical image datasets. Our results show that the proposed loss function outperforms other mainstream loss functions on different segmentation networks. Our source code is available at this https URL.
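A sketch of a curvature-and-length regularizer on a soft segmentation map, approximating curvature with a Laplace operator as the abstract describes; the weighting factors and the finite-difference details are illustrative assumptions, not the paper's exact ACE loss.

```python
# Sketch of an Elastica-style regularizer on a soft mask u in [0, 1].
import torch
import torch.nn.functional as F

def elastica_regularizer(u, alpha=1.0, beta=1.0):
    # u: (B, 1, H, W) sigmoid output of the segmentation network
    lap_kernel = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                              device=u.device).view(1, 1, 3, 3)
    curvature = F.conv2d(u, lap_kernel, padding=1)   # Laplacian ~ mean curvature
    gx = u[:, :, :, 1:] - u[:, :, :, :-1]            # horizontal gradient
    gy = u[:, :, 1:, :] - u[:, :, :-1, :]            # vertical gradient
    length = gx.abs().mean() + gy.abs().mean()       # contour-length proxy
    return alpha * length + beta * (curvature ** 2).mean()
```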

89. Generating Correct Answers for Progressive Matrices Intelligence Tests [PDF]
  Niv Pekar, Yaniv Benny, Lior Wolf
Abstract: Raven's Progressive Matrices are multiple-choice intelligence tests, where one tries to complete the missing location in a 3×3 grid of abstract images. Previous attempts to address this test have focused solely on selecting the right answer out of the multiple choices. In this work, we focus, instead, on generating a correct answer given the grid, without seeing the choices, which is a harder task, by definition. The proposed neural model combines multiple advances in generative models, including employing multiple pathways through the same network, using the reparameterization trick along two pathways to make their encoding compatible, a dynamic application of variational losses, and a complex perceptual loss that is coupled with a selective backpropagation procedure. Our algorithm is able not only to generate a set of plausible answers, but also to be competitive with state-of-the-art methods in multiple-choice tests.

90. Dynamic radiomics: a new methodology to extract quantitative time-related features from tomographic images [PDF]
  Fengying Che, Ruichuan Shi, Zhi Li, Jian Wu, Shuqin Li, Weixing Chen, Hao Zhang, Xiaoyu Cui
Abstract: The feature extraction methods of radiomics are mainly based on static tomographic images at a certain moment, while the occurrence and development of disease is a dynamic process that cannot be fully reflected by only static characteristics. This study proposes a new dynamic radiomics feature extraction workflow that uses time-dependent tomographic images of the same patient, focuses on the changes in image features over time, and then quantifies them as new dynamic features for diagnostic or prognostic evaluation. We first define the mathematical paradigm of dynamic radiomics and introduce three specific methods that can describe the transformation process of features over time. Three different clinical problems are used to validate the performance of the proposed dynamic feature with conventional 2D and 3D static features.
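A sketch of one simple way to quantify feature change over time: fit a per-feature linear trend across time points and use the slopes as new dynamic features. The input layout is an illustrative assumption; the paper defines three specific transformation methods.

```python
# Sketch of deriving dynamic features as per-feature temporal slopes.
import numpy as np

def dynamic_features(static_feats, times):
    # static_feats: (T, F) radiomics features at T time points; times: (T,)
    slopes, _ = np.polyfit(times, static_feats, deg=1)   # (F,) slope per feature
    return slopes   # one dynamic feature per static feature
```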

91. Two-layer clustering-based sparsifying transform learning for low-dose CT reconstruction [PDF]
  Xikai Yang, Yong Long, Saiprasad Ravishankar
Abstract: Achieving high-quality reconstructions from low-dose computed tomography (LDCT) measurements is of much importance in clinical settings. Model-based image reconstruction methods have been proven to be effective in removing artifacts in LDCT. In this work, we propose an approach to learn a rich two-layer clustering-based sparsifying transform model (MCST2), where image patches and their subsequent feature maps (filter residuals) are clustered into groups with different learned sparsifying filters per group. We investigate a penalized weighted least squares (PWLS) approach for LDCT reconstruction incorporating learned MCST2 priors. Experimental results show the superior performance of the proposed PWLS-MCST2 approach compared to other related recent schemes.

92. Using Monte Carlo dropout and bootstrap aggregation for uncertainty estimation in radiation therapy dose prediction with deep learning neural networks [PDF]
  Dan Nguyen, Azar Sadeghnejad Barkousaraie, Gyanendra Bohara, Anjali Balagopal, Rafe McBeth, Mu-Han Lin, Steve Jiang
Abstract: Recently, artificial intelligence technologies and algorithms have become a major focus for advancements in treatment planning for radiation therapy. As these are starting to become incorporated into the clinical workflow, a major concern from clinicians is not whether the model is accurate, but whether the model can express to a human operator when it does not know if its answer is correct. We propose to use Monte Carlo dropout (MCDO) and the bootstrap aggregation (bagging) technique on deep learning models to produce uncertainty estimations for radiation therapy dose prediction. We show that both models are capable of generating a reasonable uncertainty map, and, with our proposed scaling technique, creating interpretable uncertainties and bounds on the prediction and any relevant metrics. Performance-wise, bagging provides a statistically significant reduction in loss value and errors in most of the metrics investigated in this study. The addition of bagging was able to further reduce errors by another 0.34% for Dmean and 0.19% for Dmax, on average, when compared to the baseline framework. Overall, the bagging framework provided a significantly lower MAE of 2.62, as opposed to the baseline framework's MAE of 2.87. The usefulness of bagging, from solely a performance standpoint, does highly depend on the problem and the acceptable predictive error, and its high upfront computational cost during training should be factored into deciding whether it is advantageous to use it. In terms of deployment with uncertainty estimations turned on, both frameworks offer the same performance time of about 12 seconds. As an ensemble-based metaheuristic, bagging can be used with existing machine learning architectures to improve stability and performance, and MCDO can be applied to any deep learning models that have dropout as part of their architecture.
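For concreteness, a minimal sketch of Monte Carlo dropout at inference time, assuming a PyTorch model containing nn.Dropout layers; the number of samples is illustrative.

```python
# Sketch of MC dropout: keep dropout active at test time and sample forward passes.
import torch

def mc_dropout_predict(model, x, n_samples=30):
    model.eval()
    for m in model.modules():                 # re-enable dropout layers only
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)   # dose prediction and uncertainty
```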

93. Segmentation of Infrared Breast Images Using MultiResUnet Neural Network [PDF]
  Ange Lou, Shuyue Guan, Nada Kamona, Murray Loew
Abstract: Breast cancer is the second leading cause of death for women in the U.S. Early detection of breast cancer is key to higher survival rates of breast cancer patients. We are investigating infrared (IR) thermography as a noninvasive adjunct to mammography for breast cancer screening. IR imaging is radiation-free, pain-free, and non-contact. Automatic segmentation of the breast area from the acquired full-size breast IR images will help limit the area for tumor search, as well as reduce the time and effort costs of manual segmentation. Autoencoder-like convolutional and deconvolutional neural networks (C-DCNN) have been applied to automatically segment the breast area in IR images in previous studies. In this study, we applied a state-of-the-art deep-learning segmentation model, MultiResUnet, which consists of an encoder part to capture features and a decoder part for precise localization. It was used to segment the breast area using a set of breast IR images, collected in our pilot study by imaging breast cancer patients and normal volunteers with a thermal infrared camera (N2 Imager). The database we used has 450 images, acquired from 14 patients and 16 volunteers. We used a thresholding method to remove interference in the raw images and remapped them from the original 16-bit to 8-bit, and then cropped and segmented the 8-bit images manually. Experiments using leave-one-out cross-validation (LOOCV) and comparison with the ground-truth images using Tanimoto similarity show that the average accuracy of MultiResUnet is 91.47%, which is about 2% higher than that of the autoencoder. MultiResUnet offers a better approach to segmenting breast IR images than our previous model.

94. Pose Estimation of Specular and Symmetrical Objects [PDF]
  Jiaming Hu, Hongyi Ling, Priyam Parashar, Aayush Naik, Henrik Christensen
Abstract: In the robotic industry, specular and textureless metallic components are ubiquitous. The 6D pose estimation of such objects with only a monocular RGB camera is difficult because of the absence of rich texture features. Furthermore, the appearance of specularity heavily depends on the camera viewpoint and environmental light conditions, making traditional methods, like template matching, fail. In the last 30 years, pose estimation of specular objects has been a consistent challenge, and most related works require massive knowledge modeling effort for light setups, environment, or the object surface. On the other hand, recent works exhibit the feasibility of 6D pose estimation on a monocular camera with convolutional neural networks (CNNs); however, they mostly use opaque objects for evaluation. This paper provides a data-driven solution to estimate the 6D pose of specular objects for grasping them, proposes a cost function for handling symmetry, and demonstrates experimental results showing the system's feasibility.

95. DL-Reg: A Deep Learning Regularization Technique using Linear Regression [PDF]
  Maryam Dialameh, Ali Hamzeh, Hossein Rahmani
Abstract: Regularization plays a vital role in the context of deep learning by preventing deep neural networks from the danger of overfitting. This paper proposes a novel deep learning regularization method named DL-Reg, which carefully reduces the nonlinearity of deep networks to a certain extent by explicitly enforcing the network to behave as linearly as possible. The key idea is to add a linear constraint to the objective function of the deep neural networks, which is simply the error of a linear mapping from the inputs to the outputs of the model. More precisely, the proposed DL-Reg carefully forces the network to behave in a linear manner. This linear constraint, which is further adjusted by a regularization factor, prevents the network from the risk of overfitting. The performance of DL-Reg is evaluated by training state-of-the-art deep network models on several benchmark datasets. The experimental results show that the proposed regularization method: 1) gives major improvements over the existing regularization techniques, and 2) significantly improves the performance of deep neural networks, especially in the case of small-sized training datasets.
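A sketch of the stated idea: fit a least-squares linear map from the (flattened) inputs to the network outputs on each batch and penalize the network's deviation from that map. The batch-wise closed-form fit and the weighting are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of a DL-Reg-style linear-mapping penalty.
import torch

def dl_reg_penalty(x, y_pred):
    # x: (B, d_in) flattened inputs; y_pred: (B, d_out) network outputs
    xb = torch.cat([x, torch.ones(x.size(0), 1, device=x.device)], dim=1)
    w = torch.linalg.lstsq(xb, y_pred.detach()).solution  # linear fit with bias
    return ((xb @ w - y_pred) ** 2).mean()                # error of the linear map

# total_loss = task_loss + lambda_reg * dl_reg_penalty(x.flatten(1), y_pred)
```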

96. Deep learning in the ultrasound evaluation of neonatal respiratory status [PDF]
  Michela Gravina, Diego Gragnaniello, Luisa Verdoliva, Giovanni Poggi, Iuri Corsini, Carlo Dani, Fabio Meneghin, Gianluca Lista, Salvatore Aversa, Francesco Raimondi, Fiorella Migliaro, Carlo Sansone
Abstract: Lung ultrasound imaging is attracting growing interest from the scientific community. On the one hand, thanks to its harmlessness and high descriptive power, this kind of diagnostic imaging has been largely adopted in sensitive applications, like the diagnosis and follow-up of preterm newborns in neonatal intensive care units. On the other hand, state-of-the-art image analysis and pattern recognition approaches have recently proven their ability to fully exploit the rich information contained in these data, making them attractive for the research community. In this work, we present a thorough analysis of recent deep learning networks and training strategies carried out on a vast and challenging multicenter dataset comprising 87 patients with different diseases and gestational ages. These approaches are employed to assess the lung respiratory status from ultrasound images and are evaluated against a reference marker. The conducted analysis sheds some light on this problem by showing the critical points that can mislead the training procedure and proposes some adaptations to the specific data and task. The achieved results considerably outperform those obtained by a previous work, which is based on textural features, and narrow the gap with the visual score predicted by the human experts.

97. Encoding Clinical Priori in 3D Convolutional Neural Networks for Prostate Cancer Detection in bpMRI [PDF] [Back to contents]
  Anindo Saha, Matin Hosseinzadeh, Henkjan Huisman
Abstract: We hypothesize that anatomical priors can be viable mediums to infuse domain-specific clinical knowledge into state-of-the-art convolutional neural networks (CNNs) based on the U-Net architecture. We introduce a probabilistic population prior that captures the spatial prevalence and zonal distinction of clinically significant prostate cancer (csPCa), in order to improve its computer-aided detection (CAD) in bi-parametric MR imaging (bpMRI). To evaluate performance, we train 3D adaptations of the U-Net, U-SEResNet, UNet++ and Attention U-Net using 800 institutional training-validation scans, paired with radiologically-estimated annotations and our computed prior. On 200 independent testing bpMRI scans with histologically-confirmed delineations of csPCa, our proposed method of encoding clinical priori demonstrates a strong ability to improve patient-based diagnosis (up to an 8.70% increase in AUROC) and lesion-level detection (average increase of 1.08 in pAUC between 0.1 and 1.0 false positives per patient) across all four architectures.
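The abstract does not specify how the prior is fed to the networks; a common choice, assumed in the sketch below, is to concatenate the population prior as an extra input channel:

```python
import torch

def with_anatomical_prior(bpmri, prior_map):
    """Append a csPCa population prior as an extra channel (an assumed
    encoding, not necessarily the paper's). bpmri: (B, C, D, H, W) volume;
    prior_map: (D, H, W) spatial probability map shared across the batch."""
    prior = prior_map.expand(bpmri.shape[0], 1, *prior_map.shape)
    return torch.cat([bpmri, prior], dim=1)        # network sees C+1 channels
```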

98. Meta-Learning with Adaptive Hyperparameters [PDF] [Back to contents]
  Sungyong Baik, Myungsub Choi, Janghoon Choi, Heewon Kim, Kyoung Mu Lee
Abstract: Despite the popularity of MAML, several recent works question its effectiveness when test tasks differ from training tasks, and thus suggest various task-conditioned methodologies to improve the initialization. Instead of searching for a better task-aware initialization, we focus on a complementary factor in the MAML framework: inner-loop optimization (or fast adaptation). Consequently, we propose a new weight update rule that greatly enhances the fast adaptation process. Specifically, we introduce a small meta-network that adaptively generates per-step hyperparameters: the learning rate and the weight decay coefficients. The experimental results validate that Adaptive Learning of hyperparameters for Fast Adaptation (ALFA) is an equally important ingredient that has often been neglected in recent few-shot learning approaches. Surprisingly, fast adaptation from random initialization with ALFA can already outperform MAML.
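A toy sketch of the idea follows: a tiny meta-network maps a summary of the current gradients and weights to a per-step learning rate and weight decay, which drive the inner-loop update (the real ALFA conditions on richer, layer-wise statistics):

```python
import torch
import torch.nn as nn

class HyperparamGenerator(nn.Module):
    """Tiny meta-network emitting a per-step (learning rate, weight decay)."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, grad_mean, weight_mean):
        lr, wd = self.net(torch.stack([grad_mean, weight_mean])).unbind()
        return lr, wd

def inner_step(params, grads, gen):
    """One fast-adaptation step with generated hyperparameters."""
    g = torch.stack([t.mean() for t in grads]).mean()
    w = torch.stack([t.mean() for t in params]).mean()
    lr, wd = gen(g, w)
    return [(1 - wd) * p - lr * dp for p, dp in zip(params, grads)]
```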

99. Combining Domain-Specific Meta-Learners in the Parameter Space for Cross-Domain Few-Shot Classification [PDF] [Back to contents]
  Shuman Peng, Weilian Song, Martin Ester
Abstract: The goal of few-shot classification is to learn a model that can classify novel classes using only a few training examples. Despite the promising results shown by existing meta-learning algorithms in solving the few-shot classification problem, an important challenge remains: how to generalize to unseen domains while meta-learning on multiple seen domains? In this paper, we propose an optimization-based meta-learning method, called Combining Domain-Specific Meta-Learners (CosML), that addresses the cross-domain few-shot classification problem. CosML first trains a set of meta-learners, one for each training domain, to learn prior knowledge (i.e., meta-parameters) specific to each domain. The domain-specific meta-learners are then combined in the parameter space by taking a weighted average of their meta-parameters, which is used as the initialization of a task network that is quickly adapted to novel few-shot classification tasks in an unseen domain. Our experiments show that CosML outperforms a range of state-of-the-art methods and achieves strong cross-domain generalization ability.
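The parameter-space combination itself is straightforward; a minimal sketch (assuming the per-domain mixing weights are already computed, as the paper derives them from the novel task):

```python
def combine_meta_learners(domain_params, weights):
    """Weighted average of per-domain meta-parameters in parameter space.
    domain_params: list of state dicts, one per training domain;
    weights: per-domain mixing weights summing to 1."""
    return {name: sum(w * p[name] for w, p in zip(weights, domain_params))
            for name in domain_params[0]}
```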

100. Evaluation of Inference Attack Models for Deep Learning on Medical Data [PDF] [Back to contents]
  Maoqiang Wu, Xinyue Zhang, Jiahao Ding, Hien Nguyen, Rong Yu, Miao Pan, Stephen T. Wong
Abstract: Deep learning has attracted broad interest in the healthcare and medical communities. However, there has been little research into the privacy issues created by deep networks trained for medical applications. Recently developed inference attack algorithms indicate that images and text records can be reconstructed by malicious parties that have the ability to query deep networks. This gives rise to the concern that medical images and electronic health records containing sensitive patient information are vulnerable to these attacks. This paper aims to draw the attention of researchers in the medical deep learning community to this important problem. We evaluate two prominent inference attack models, namely, the attribute inference attack and the model inversion attack. We show that they can reconstruct real-world medical images and clinical reports with high fidelity. We then investigate how to protect patients' privacy using defense mechanisms, such as label perturbation and model perturbation. We provide a comparison of attack results between the original and the defended medical deep learning models. The experimental evaluations show that our proposed defense approaches can effectively reduce the potential privacy leakage of medical deep learning under inference attacks.
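As a rough illustration of a label-perturbation defense (the paper's exact mechanism is not given in the abstract), one can add bounded noise to the probabilities returned to a querier so that inversion attacks see a degraded signal:

```python
import torch

def perturbed_prediction(logits, eps=0.1):
    """Return noisy class probabilities; `eps` trades privacy for utility
    (a hypothetical defense sketch, not the paper's exact method)."""
    probs = torch.softmax(logits, dim=-1)
    noisy = probs + eps * torch.rand_like(probs)
    return noisy / noisy.sum(dim=-1, keepdim=True)  # renormalize
```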

101. Dense Pixel-wise Micro-motion Estimation of Object Surface by using Low Dimensional Embedding of Laser Speckle Pattern [PDF] [Back to contents]
  Ryusuke Sagawa, Yusuke Higuchi, Hiroshi Kawasaki, Ryo Furukawa, Takahiro Ito
Abstract: This paper proposes a method for estimating, at each pixel, micro-motion of an object that is too small to detect under a common camera-and-illumination setup. The method introduces an active-lighting approach to make the motion visually detectable. The approach is based on the speckle pattern, which is produced by the mutual interference of laser light on the object's surface and continuously changes its appearance according to the out-of-plane motion of the surface. In addition, the speckle pattern becomes uncorrelated under large motion. To handle both micro- and large motion, the method estimates the motion parameters, up to scale, at each pixel by nonlinearly embedding the speckle pattern into a low-dimensional space. The out-of-plane motion is then calculated by making the motion parameters spatially consistent across the image. In the experiments, the proposed method is compared with other measuring devices to prove its effectiveness.
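A minimal sketch of the per-pixel embedding step, using scikit-learn's spectral embedding as a stand-in for the paper's nonlinear embedding (the window size and the embedding method are assumptions):

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

def embed_speckle_sequence(frames, pixel, win=7, dim=2):
    """Embed the temporal sequence of speckle patches around one pixel into
    a low-dimensional space; trajectories there track out-of-plane
    micro-motion up to scale."""
    y, x = pixel
    h = win // 2
    patches = np.stack([f[y - h:y + h + 1, x - h:x + h + 1].ravel()
                        for f in frames])                 # (T, win*win)
    return SpectralEmbedding(n_components=dim).fit_transform(patches)
```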

102. Integer Programming-based Error-Correcting Output Code Design for Robust Classification [PDF] [Back to contents]
  Samarth Gupta, Saurabh Amin
Abstract: Error-Correcting Output Codes (ECOCs) offer a principled approach for combining simple binary classifiers into multiclass classifiers. In this paper, we investigate the problem of designing optimal ECOCs to achieve both nominal and adversarial accuracy using Support Vector Machines (SVMs) and binary deep learning models. In contrast to previous literature, we present an Integer Programming (IP) formulation to design minimal codebooks with desirable error correcting properties. Our work leverages the advances in IP solvers to generate codebooks with optimality guarantees. To achieve tractability, we exploit the underlying graph-theoretic structure of the constraint set in our IP formulation. This enables us to use edge clique covers to substantially reduce the constraint set. Our codebooks achieve a high nominal accuracy relative to standard codebooks (e.g., one-vs-all, one-vs-one, and dense/sparse codes). We also estimate the adversarial accuracy of our ECOC-based classifiers in a white-box setting. Our IP-generated codebooks provide non-trivial robustness to adversarial perturbations even without any adversarial training.
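For context, ECOC decoding at inference time is simply nearest-codeword lookup; the paper's IP contribution lies in designing the codebook itself (e.g., with good error-correcting properties), not in this step:

```python
import numpy as np

def ecoc_predict(codebook, binary_preds):
    """Decode ECOC outputs: pick the class whose codeword is nearest in
    Hamming distance to the binary classifiers' outputs.
    codebook: (n_classes, n_bits) matrix of {0, 1}; binary_preds: (n_bits,)."""
    dists = (codebook != binary_preds).sum(axis=1)
    return int(np.argmin(dists))
```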

103. EDCNN: Edge enhancement-based Densely Connected Network with Compound Loss for Low-Dose CT Denoising [PDF] [Back to contents]
  Tengfei Liang, Yi Jin, Yidong Li, Tao Wang, Songhe Feng, Congyan Lang
Abstract: In the past few decades, to reduce the risk of X-ray exposure in computed tomography (CT), low-dose CT image denoising has attracted extensive attention from researchers and has become an important research issue in the field of medical imaging. In recent years, with the rapid development of deep learning, many algorithms applying convolutional neural networks to this task have emerged, achieving promising results. However, some problems remain, such as low denoising efficiency and over-smoothed results. In this paper, we propose the Edge enhancement based Densely connected Convolutional Neural Network (EDCNN). In our network, we design an edge enhancement module using the proposed novel trainable Sobel convolution. Based on this module, we construct a model with dense connections to fuse the extracted edge information and realize end-to-end image denoising. Besides, when training the model, we introduce a compound loss that combines MSE loss and multi-scale perceptual loss to solve the over-smoothing problem and attain a marked improvement in image quality after denoising. Compared with existing low-dose CT image denoising algorithms, our proposed model performs better at preserving details and suppressing noise.
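One plausible reading of a "trainable Sobel convolution", sketched below, initializes a convolution with Sobel kernels and learns a scaling factor (the paper's exact parameterization may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableSobel(nn.Module):
    """Edge-enhancement sketch: Sobel-initialized kernels with a learnable
    strength, whose responses are concatenated to the input."""
    def __init__(self):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.weight = nn.Parameter(torch.stack([gx, gx.t()]).unsqueeze(1))
        self.alpha = nn.Parameter(torch.ones(1))   # learnable edge strength

    def forward(self, x):                          # x: (B, 1, H, W)
        edges = F.conv2d(x, self.alpha * self.weight, padding=1)
        return torch.cat([x, edges], dim=1)        # image + 2 edge maps
```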

104. Multi-stage transfer learning for lung segmentation using portable X-ray devices for patients with COVID-19 [PDF] [Back to contents]
  Plácido L Vidal, Joaquim de Moura, Jorge Novo, Marcos Ortega
Abstract: In 2020, the SARS-CoV-2 virus caused a global pandemic of the new human coronavirus disease, COVID-19. This pathogen primarily infects the respiratory system of the afflicted, usually resulting in pneumonia and, in severe cases, in acute respiratory distress syndrome. These disease developments result in the formation of different pathological structures in the lungs, similar to those observed in other viral pneumonias, that can be detected with chest X-rays. For this reason, the detection and analysis of the pulmonary regions, the main focus of COVID-19 involvement, becomes a crucial part of both clinical and automatic diagnosis processes. Due to the overload of the health services, portable X-ray devices are widely used, representing an alternative to fixed devices that reduces the risk of cross-contamination. However, these devices entail complications, such as lower image quality, that, together with the subjectivity of the clinician, make the diagnostic process more difficult. In this work, we developed a novel fully automatic methodology specially designed to identify these lung regions in low-quality X-ray images such as those from portable devices. To do so, we took advantage of a large dataset from magnetic resonance imaging of a similar pathology and performed two stages of transfer learning to obtain a robust methodology with a small number of images from portable X-ray devices. This way, our methodology obtained a satisfactory accuracy of 0.9761 ± 0.0100 for patients with COVID-19, 0.9801 ± 0.0104 for normal patients, and 0.9769 ± 0.0111 for patients with pulmonary diseases that share characteristics with COVID-19 (such as pneumonia) but are not genuine COVID-19.
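In outline, the two-stage protocol reduces to pre-training on the large MRI-derived dataset and then fine-tuning on the few portable X-ray images; a schematic sketch (where `train_fn` is a hypothetical training loop, and the learning rates are illustrative):

```python
def two_stage_transfer(model, mri_derived_loader, portable_xray_loader, train_fn):
    """Schematic two-stage transfer learning. Stage 1 learns a lung
    representation from the related modality; stage 2 adapts it to the
    target domain with a lower learning rate."""
    train_fn(model, mri_derived_loader, lr=1e-3)     # stage 1: pre-train
    train_fn(model, portable_xray_loader, lr=1e-4)   # stage 2: fine-tune
    return model
```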

105. C-Net: A Reliable Convolutional Neural Network for Biomedical Image Classification [PDF] [Back to contents]
  Hosein Barzekar, Zeyun Yu
Abstract: Cancers are the leading cause of death in many developed countries. Early diagnosis plays a crucial role in the proper treatment of this debilitating disease. The automated classification of cancer type is a challenging task, since pathologists must examine a huge number of histopathological images to detect minute abnormalities. In this study, we propose a novel convolutional neural network (CNN) architecture composed of a Concatenation of multiple Networks, called C-Net, to classify biomedical images. In contrast to conventional deep learning models for biomedical image classification, which use transfer learning to solve the problem, no prior knowledge is employed. The model incorporates multiple CNNs, grouped into Outer, Middle, and Inner parts. The first two parts of the architecture contain six networks that serve as feature extractors feeding into the Inner network, which classifies the images as malignant or benign. The C-Net is applied to histopathological image classification on two public datasets, BreakHis and Osteosarcoma. To evaluate reliability, the model is tested using several evaluation metrics. The C-Net model outperforms all other models on the individual metrics for both datasets and achieves zero misclassification.
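The topology can be sketched as follows, with placeholder extractor networks and head sizes (the paper uses six extractors across the Outer and Middle parts):

```python
import torch
import torch.nn as nn

class CNet(nn.Module):
    """Sketch of the C-Net idea: concatenate features from several
    extractor networks and classify with an Inner network."""
    def __init__(self, extractors, feat_dim):
        super().__init__()
        self.extractors = nn.ModuleList(extractors)
        self.inner = nn.Sequential(
            nn.Linear(feat_dim * len(extractors), 128), nn.ReLU(),
            nn.Linear(128, 2))                     # malignant vs. benign

    def forward(self, x):
        feats = [e(x).flatten(1) for e in self.extractors]
        return self.inner(torch.cat(feats, dim=1))
```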

106. 83% ImageNet Accuracy in One Hour [PDF] [Back to contents]
  Arissa Wongpanich, Hieu Pham, James Demmel, Mingxing Tan, Quoc Le, Yang You, Sameer Kumar
Abstract: EfficientNets are a family of state-of-the-art image classification models based on efficiently scaled convolutional neural networks. Currently, EfficientNets can take on the order of days to train; for example, training an EfficientNet-B0 model takes 23 hours on a Cloud TPU v2-8 node. In this paper, we explore techniques to scale up the training of EfficientNets on TPU-v3 Pods with 2048 cores, motivated by speedups that can be achieved when training at such scales. We discuss optimizations required to scale training to a batch size of 65536 on 1024 TPU-v3 cores, such as selecting large batch optimizers and learning rate schedules as well as utilizing distributed evaluation and batch normalization techniques. Additionally, we present timing and performance benchmarks for EfficientNet models trained on the ImageNet dataset in order to analyze the behavior of EfficientNets at scale. With our optimizations, we are able to train EfficientNet on ImageNet to an accuracy of 83% in 1 hour and 4 minutes.
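The abstract does not give the exact schedule; a common heuristic in large-batch training, shown here only for orientation, scales the learning rate linearly with batch size:

```python
def scaled_learning_rate(base_lr, batch_size, base_batch=256):
    """Linear learning-rate scaling heuristic for large-batch training
    (not necessarily the paper's exact recipe)."""
    return base_lr * batch_size / base_batch
```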

107. Adversarial Robust Training in MRI Reconstruction [PDF] [Back to contents]
  Francesco Calivá, Kaiyang Cheng, Rutwik Shah, Valentina Pedoia
Abstract: Deep learning has shown potential in accelerating magnetic resonance image acquisition and reconstruction. Nevertheless, there is a dearth of tailored methods to guarantee that small features are reconstructed with high fidelity. In this work, we employ adversarial attacks to generate small synthetic perturbations that, when added to the input MRI, are not reconstructed by a trained DL reconstruction network. Then, we use robust training to increase the network's sensitivity to small features and encourage their reconstruction. Next, we investigate how this approach generalizes to real-world features. For this, a musculoskeletal radiologist annotated a set of cartilage and meniscal lesions from the knee fastMRI dataset, and a classification network was devised to assess the feature reconstruction. Experimental results show that by introducing robust training to a reconstruction network, the rate (4.8%) of false-negative features in image reconstruction can be reduced. The results are encouraging and highlight the need for attention to this problem from the image reconstruction community, as a milestone towards the introduction of DL reconstruction in clinical practice. To support further research, we make our annotations publicly available at this https URL.
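A rough sketch of one robust-training step follows, with an FGSM-style inner attack standing in for the paper's perturbation generator and a hypothetical differentiable undersampling operator `measure`:

```python
import torch

def robust_recon_step(net, x, measure, loss_fn, opt, eps=1e-3):
    """Craft a small image-space feature the network currently fails to
    reconstruct, then train the network to recover it (a sketch)."""
    delta = torch.zeros_like(x, requires_grad=True)
    err = loss_fn(net(measure(x + delta)), x + delta)
    grad, = torch.autograd.grad(err, delta)
    delta = (eps * grad.sign()).detach()           # worst-case small feature

    opt.zero_grad()
    loss = loss_fn(net(measure(x + delta)), x + delta)
    loss.backward()
    opt.step()
    return loss.item()
```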
