Contents
19. A fully pipelined FPGA accelerator for scale invariant feature transform keypoint descriptor matching [PDF] Abstract
20. Firearm Detection via Convolutional Neural Networks: Comparing a Semantic Segmentation Model Against End-to-End Solutions [PDF] Abstract
21. Detection and Prediction of Nutrient Deficiency Stress using Longitudinal Aerial Imagery [PDF] Abstract
22. Trajectory saliency detection using consistency-oriented latent codes from a recurrent auto-encoder [PDF] Abstract
24. Weakly-Supervised Action Localization and Action Recognition using Global-Local Attention of 3D CNN [PDF] Abstract
28. CT Film Recovery via Disentangling Geometric Deformation and Illumination Variation: Simulated Datasets and Deep Models [PDF] Abstract
29. Learning to Share: A Multitasking Genetic Programming Approach to Image Feature Learning [PDF] Abstract
30. FG-Net: Fast Large-Scale LiDAR Point Clouds Understanding Network Leveraging Correlated Feature Mining and Geometric-Aware Modelling [PDF] Abstract
32. PanoNet3D: Combining Semantic and Geometric Understanding for LiDAR Point Cloud Detection [PDF] Abstract
35. LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos [PDF] Abstract
38. Efficient Golf Ball Detection and Tracking Based on Convolutional Neural Networks and Kalman Filter [PDF] Abstract
58. Describing the Structural Phenotype of the Glaucomatous Optic Nerve Head Using Artificial Intelligence [PDF] Abstract
60. Combating Mode Collapse in GAN training: An Empirical Analysis using Hessian Eigenvalues [PDF] Abstract
65. A Contrast Synthesized Thalamic Nuclei Segmentation Scheme using Convolutional Neural Networks [PDF] Abstract
71. Reduction in the complexity of 1D 1H-NMR spectra by the use of Frequency to Information Transformation [PDF] Abstract
72. Transfer Learning Through Weighted Loss Function and Group Normalization for Vessel Segmentation from Retinal Images [PDF] Abstract
Abstracts
1. Reconstructing Hand-Object Interactions in the Wild [PDF] back to contents
Zhe Cao, Ilija Radosavovic, Angjoo Kanazawa, Jitendra Malik
Abstract: In this work we explore reconstructing hand-object interactions in the wild. The core challenge of this problem is the lack of appropriate 3D labeled data. To overcome this issue, we propose an optimization-based procedure which does not require direct 3D supervision. The general strategy we adopt is to exploit all available related data (2D bounding boxes, 2D hand keypoints, 2D instance masks, 3D object models, 3D in-the-lab MoCap) to provide constraints for the 3D reconstruction. Rather than optimizing the hand and object individually, we optimize them jointly which allows us to impose additional constraints based on hand-object contact, collision, and occlusion. Our method produces compelling reconstructions on the challenging in-the-wild data from the EPIC Kitchens and the 100 Days of Hands datasets, across a range of object categories. Quantitatively, we demonstrate that our approach compares favorably to existing approaches in the lab settings where ground truth 3D annotations are available.
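A rough sense of what such a joint hand-object objective can look like is sketched below in PyTorch. This is not the authors' procedure: the sphere-proxy collision term, the fingertip-contact term, the weights, and the tensor shapes are all illustrative assumptions, and the occlusion constraint mentioned in the abstract is omitted.

```python
# Hypothetical sketch (not the paper's code): combining 2D evidence with
# hand-object contact and collision terms in one objective.
import torch

def joint_loss(hand_joints_3d, obj_points_3d, joints_2d_obs, cam_K,
               w_reproj=1.0, w_contact=0.1, w_collision=1.0):
    # Reprojection: project 3D hand joints with a pinhole camera and
    # compare against detected 2D keypoints.
    proj = (cam_K @ hand_joints_3d.T).T            # (J, 3)
    proj = proj[:, :2] / proj[:, 2:3]
    l_reproj = ((proj - joints_2d_obs) ** 2).sum(-1).mean()

    # Contact: encourage fingertips (assumed to be the last 5 joints)
    # to lie close to the sampled object surface points.
    d = torch.cdist(hand_joints_3d[-5:], obj_points_3d)   # (5, P)
    l_contact = d.min(dim=1).values.mean()

    # Collision: penalize object points that penetrate a crude hand proxy,
    # modeled here as spheres of radius r around every joint.
    r = 0.01
    pen = (r - torch.cdist(obj_points_3d, hand_joints_3d)).clamp(min=0)
    l_collision = pen.sum()

    return w_reproj * l_reproj + w_contact * l_contact + w_collision * l_collision

# Toy usage with random geometry placed in front of the camera.
K = torch.tensor([[500., 0., 128.], [0., 500., 128.], [0., 0., 1.]])
loss = joint_loss(torch.rand(21, 3) + torch.tensor([0., 0., 1.]),
                  torch.rand(500, 3) + torch.tensor([0., 0., 1.]),
                  torch.rand(21, 2) * 256, K)
print(loss)
```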
2. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [PDF] back to contents
Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, Angjoo Kanazawa
Abstract: We introduce the problem of perpetual view generation -- long-range generation of novel views corresponding to an arbitrarily long camera trajectory given a single image. This is a challenging problem that goes far beyond the capabilities of current view synthesis methods, which work for a limited range of viewpoints and quickly degenerate when presented with a large camera motion. Methods designed for video generation also have limited ability to produce long video sequences and are often agnostic to scene geometry. We take a hybrid approach that integrates both geometry and image synthesis in an iterative render, refine, and repeat framework, allowing for long-range generation that covers large distances after hundreds of frames. Our approach can be trained from a set of monocular video sequences without any manual annotation. We propose a dataset of aerial footage of natural coastal scenes, and compare our method with recent view synthesis and conditional video generation baselines, showing that it can generate plausible scenes for much longer time horizons over large camera trajectories compared to existing methods.
3. Worldsheet: Wrapping the World in a 3D Sheet for View Synthesis from a Single Image [PDF] back to contents
Ronghang Hu, Deepak Pathak
Abstract: We present Worldsheet, a method for novel view synthesis using just a single RGB image as input. This is a challenging problem as it requires an understanding of the 3D geometry of the scene as well as texture mapping to generate both visible and occluded regions from new view-points. Our main insight is that simply shrink-wrapping a planar mesh sheet onto the input image, consistent with the learned intermediate depth, captures underlying geometry sufficient to generate photorealistic unseen views with arbitrarily large view-point changes. To operationalize this, we propose a novel differentiable texture sampler that allows our wrapped mesh sheet to be textured, which is then transformed into a target image via differentiable rendering. Our approach is category-agnostic, end-to-end trainable without using any 3D supervision and requires a single image at test time. Worldsheet consistently outperforms prior state-of-the-art methods on single-image view synthesis across several datasets. Furthermore, this simple idea captures novel views surprisingly well on a wide range of high resolution in-the-wild images in converting them into a navigable 3D pop-up. Video results and code at this https URL
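The differentiable texture sampling step can be illustrated with PyTorch's built-in grid_sample. The sketch below is only a stand-in for the paper's sampler: it uses a random flow field where Worldsheet would use coordinates derived from the projected mesh sheet.

```python
# Minimal differentiable-sampling sketch: gradients reach the source texture
# through the sampling operation, which is what "differentiable texture
# sampler" requires. Sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

image = torch.rand(1, 3, 256, 256, requires_grad=True)   # source texture
# Sampling grid in normalized [-1, 1] coordinates (random here; in the paper
# it would come from projecting the wrapped mesh sheet into the target view).
grid = torch.rand(1, 128, 128, 2) * 2 - 1

target = F.grid_sample(image, grid, mode='bilinear', align_corners=True)
target.mean().backward()          # gradients flow back to the texture
print(image.grad.shape)           # torch.Size([1, 3, 256, 256])
```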
4. Human Mesh Recovery from Multiple Shots [PDF] back to contents
Georgios Pavlakos, Jitendra Malik, Angjoo Kanazawa
Abstract: Videos from edited media like movies are a useful, yet under-explored source of information. The rich variety of appearance and interactions between humans depicted over a large temporal context in these films could be a valuable source of data. However, the richness of data comes at the expense of fundamental challenges such as abrupt shot changes and close up shots of actors with heavy truncation, which limits the applicability of existing human 3D understanding methods. In this paper, we address these limitations with an insight that while shot changes of the same scene incur a discontinuity between frames, the 3D structure of the scene still changes smoothly. This allows us to handle frames before and after the shot change as multi-view signal that provide strong cues to recover the 3D state of the actors. We propose a multi-shot optimization framework, which leads to improved 3D reconstruction and mining of long sequences with pseudo ground truth 3D human mesh. We show that the resulting data is beneficial in the training of various human mesh recovery models: for single image, we achieve improved robustness; for video we propose a pure transformer-based temporal encoder, which can naturally handle missing observations due to shot changes in the input frames. We demonstrate the importance of the insight and proposed models through extensive experiments. The tools we develop open the door to processing and analyzing in 3D content from a large library of edited media, which could be helpful for many downstream applications. Project page: this https URL
5. $\mathbb{X}$Resolution Correspondence Networks [PDF] back to contents
Georgi Tinchev, Shuda Li, Kai Han, David Mitchell, Rigas Kouskouridas
Abstract: In this paper, we aim at establishing accurate dense correspondences between a pair of images with overlapping field of view under challenging illumination variation, viewpoint changes, and style differences. Through an extensive ablation study of the state-of-the-art correspondence networks, we surprisingly discovered that the widely adopted 4D correlation tensor and its related learning and processing modules could be de-parameterised and removed from training with merely a minor impact over the final matching accuracy. Disabling some of the most memory consuming and computational expensive modules dramatically speeds up the training procedure and allows to use 4x bigger batch size, which in turn compensates for the accuracy drop. Together with a multi-GPU inference stage, our method facilitates the systematic investigation of the relationship between matching accuracy and up-sampling resolution of the native testing images from 720p to 4K. This leads to finding an optimal resolution $\mathbb X$ that produces accurate matching performance surpassing the state-of-the-art methods particularly over the lower error band for the proposed network and evaluation datasets.
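The 4D correlation tensor that the ablation study targets is simply the pairwise inner product between all spatial locations of two feature maps. The short sketch below, with illustrative shapes, shows why it dominates memory and compute.

```python
# corr[b, i, j, k, l] = <fa[b, :, i, j], fb[b, :, k, l]> for L2-normalised
# features; the 32x32x32x32 output per image is the memory-hungry part.
import torch
import torch.nn.functional as F

fa = F.normalize(torch.randn(1, 256, 32, 32), dim=1)   # features of image A
fb = F.normalize(torch.randn(1, 256, 32, 32), dim=1)   # features of image B

corr = torch.einsum('bcij,bckl->bijkl', fa, fb)
print(corr.shape)   # torch.Size([1, 32, 32, 32, 32])
```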
6. Taming Transformers for High-Resolution Image Synthesis [PDF] back to contents
Patrick Esser, Robin Rombach, Björn Ommer
Abstract: Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers. Project page at this https URL .
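Step (i), turning CNN features into a discrete vocabulary, amounts to nearest-neighbour lookup against a learned codebook; the transformer in step (ii) then models sequences of the resulting token indices. The sketch below uses made-up sizes and omits the codebook and adversarial training losses.

```python
# Vector-quantisation lookup: each CNN feature vector is replaced by the
# index (and embedding) of its nearest codebook entry. Sizes are assumptions.
import torch

codebook = torch.randn(1024, 256)            # K entries of dimension 256
features = torch.randn(16 * 16, 256)         # one 16x16 grid of CNN features

dists = torch.cdist(features, codebook)      # (256, 1024) pairwise distances
tokens = dists.argmin(dim=1)                 # discrete token per grid cell
quantised = codebook[tokens]                 # what the decoder would receive
print(tokens.shape, quantised.shape)
```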
7. Transformer Interpretability Beyond Attention Visualization [PDF] back to contents
Hila Chefer, Shir Gur, Lior Wolf
Abstract: Self-attention techniques, and specifically Transformers, are dominating the field of text processing and are becoming increasingly popular in computer vision classification tasks. In order to visualize the parts of the image that led to a certain classification, existing methods either rely on the obtained attention maps, or employ heuristic propagation along the attention graph. In this work, we propose a novel way to compute relevancy for Transformer networks. The method assigns local relevance based on the deep Taylor decomposition principle and then propagates these relevancy scores through the layers. This propagation involves attention layers and skip connections, which challenge existing methods. Our solution is based on a specific formulation that is shown to maintain the total relevancy across layers. We benchmark our method on very recent visual Transformer networks, as well as on a text classification problem, and demonstrate a clear advantage over the existing explainability methods.
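A heavily simplified version of relevancy propagation through attention layers and skip connections might look like the following. The exact update rules are those of the paper, so treat this recurrence (gradient-weighted attention plus an identity term for the skip connection) as an assumption for illustration.

```python
# Hedged sketch: accumulate token-to-token relevance across attention layers.
import torch

def propagate_relevance(attn_maps, attn_grads):
    """attn_maps / attn_grads: lists of (heads, tokens, tokens) tensors."""
    n = attn_maps[0].shape[-1]
    R = torch.eye(n)                              # start from self-relevance
    for A, G in zip(attn_maps, attn_grads):
        A_bar = (G * A).clamp(min=0).mean(dim=0)  # gradient-weighted heads
        R = R + A_bar @ R                         # identity models the skip path
    return R

# Toy usage with random attention maps and gradients.
A = [torch.softmax(torch.randn(8, 10, 10), dim=-1) for _ in range(4)]
G = [torch.randn(8, 10, 10) for _ in range(4)]
print(propagate_relevance(A, G).shape)            # torch.Size([10, 10])
```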
8. SceneFormer: Indoor Scene Generation with Transformers [PDF] back to contents
Xinpeng Wang, Chandan Yeshwanth, Matthias Nießner
Abstract: The task of indoor scene generation is to generate a sequence of objects, their locations and orientations conditioned on the shape and size of a room. Large scale indoor scene datasets allow us to extract patterns from user-designed indoor scenes and then generate new scenes based on these patterns. Existing methods rely on the 2D or 3D appearance of these scenes in addition to object positions, and make assumptions about the possible relations between objects. In contrast, we do not use any appearance information, and learn relations between objects using the self attention mechanism of transformers. We show that this leads to faster scene generation compared to existing methods with the same or better levels of realism. We build simple and effective generative models conditioned on the room shape, and on text descriptions of the room using only the cross-attention mechanism of transformers. We carried out a user study showing that our generated scenes are preferred over DeepSynth scenes 57.7% of the time for bedroom scenes, and 63.3% for living room scenes. In addition, we generate a scene in 1.48 seconds on average, 20% faster than the state of the art method Fast & Flexible, allowing interactive scene generation.
9. Neural Radiance Flow for 4D View Synthesis and Video Processing [PDF] back to contents
Yilun Du, Yinan Zhang, Hong-Xing Yu, Joshua B. Tenenbaum, Jiajun Wu
Abstract: We present a method, Neural Radiance Flow (NeRFlow),to learn a 4D spatial-temporal representation of a dynamic scene from a set of RGB images. Key to our approach is the use of a neural implicit representation that learns to capture the 3D occupancy, radiance, and dynamics of the scene. By enforcing consistency across different modalities, our representation enables multi-view rendering in diverse dynamic scenes, including water pouring, robotic interaction, and real images, outperforming state-of-the-art methods for spatial-temporal view synthesis. Our approach works even when inputs images are captured with only one camera. We further demonstrate that the learned representation can serve as an implicit scene prior, enabling video processing tasks such as image super-resolution and de-noising without any additional supervision.
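The neural implicit representation can be pictured as a single MLP queried at space-time points. The sketch below uses assumed layer sizes and output heads (density, radiance, and a 3D scene-flow vector) and ignores positional encoding and volume rendering, so it is an illustration rather than the paper's architecture.

```python
# Minimal space-time implicit field: (x, y, z, t) -> (density, rgb, flow).
import torch
import torch.nn as nn

class RadianceFlowField(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3 + 3),   # density, rgb, scene flow
        )

    def forward(self, xyzt):                # (N, 4) space-time queries
        out = self.net(xyzt)
        density = out[:, :1].relu()
        rgb = out[:, 1:4].sigmoid()
        flow = out[:, 4:]
        return density, rgb, flow

field = RadianceFlowField()
density, rgb, flow = field(torch.rand(1024, 4))
```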
10. End-to-end Deep Object Tracking with Circular Loss Function for Rotated Bounding Box [PDF] back to contents
Vladislav Belyaev, Aleksandra Malysheva, Aleksei Shpilman
Abstract: The task of object tracking is vital in numerous applications such as autonomous driving, intelligent surveillance, robotics, etc. This task entails the assigning of a bounding box to an object in a video stream, given only the bounding box for that object on the first frame. In 2015, a new type of video object tracking (VOT) dataset was created that introduced rotated bounding boxes as an extension of axis-aligned ones. In this work, we introduce a novel end-to-end deep learning method based on the Transformer Multi-Head Attention architecture. We also present a new type of loss function, which takes into account the bounding box overlap and orientation. Our Deep Object Tracking model with Circular Loss Function (DOTCL) shows a considerable improvement in terms of robustness over current state-of-the-art end-to-end deep learning models. It also outperforms state-of-the-art object tracking methods on the VOT2018 dataset in terms of the expected average overlap (EAO) metric.
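One way to make an angle term periodic, in the spirit of a circular loss, is 1 - cos(Δθ), which is smooth and treats θ and θ + 2π identically. The combination with a box-regression term below is an illustrative assumption, not the exact DOTCL formulation.

```python
# Sketch of a periodic angle loss for rotated boxes (cx, cy, w, h, theta).
import torch
import torch.nn.functional as F

def circular_angle_loss(theta_pred, theta_gt):
    return (1.0 - torch.cos(theta_pred - theta_gt)).mean()

def rotated_box_loss(pred, gt, w_angle=1.0):
    l_geom = F.smooth_l1_loss(pred[:, :4], gt[:, :4])   # centre and size
    l_ang = circular_angle_loss(pred[:, 4], gt[:, 4])   # orientation
    return l_geom + w_angle * l_ang

pred, gt = torch.rand(8, 5), torch.rand(8, 5)
print(rotated_box_loss(pred, gt))
```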
11. End-to-End Human Pose and Mesh Reconstruction with Transformers [PDF] back to contents
Kevin Lin, Lijuan Wang, Zicheng Liu
Abstract: We present a new method, called MEsh TRansfOrmer (METRO), to reconstruct 3D human pose and mesh vertices from a single image. Our method uses a transformer encoder to jointly model vertex-vertex and vertex-joint interactions, and outputs 3D joint coordinates and mesh vertices simultaneously. Compared to existing techniques that regress pose and shape parameters, METRO does not rely on any parametric mesh models like SMPL, thus it can be easily extended to other objects such as hands. We further relax the mesh topology and allow the transformer self-attention mechanism to freely attend between any two vertices, making it possible to learn non-local relationships among mesh vertices and joints. With the proposed masked vertex modeling, our method is more robust and effective in handling challenging situations like partial occlusions. METRO generates new state-of-the-art results for human mesh reconstruction on the public Human3.6M and 3DPW datasets. Moreover, we demonstrate the generalizability of METRO to 3D hand reconstruction in the wild, outperforming existing state-of-the-art methods on FreiHAND dataset.
12. Interpretable Image Clustering via Diffeomorphism-Aware K-Means [PDF] back to contents
Romain Cosentino, Randall Balestriero, Yanis Bahroun, Anirvan Sengupta, Richard Baraniuk, Behnaam Aazhang
Abstract: We design an interpretable clustering algorithm aware of the nonlinear structure of image manifolds. Our approach leverages the interpretability of $K$-means applied in the image space while addressing its clustering performance issues. Specifically, we develop a measure of similarity between images and centroids that encompasses a general class of deformations: diffeomorphisms, rendering the clustering invariant to them. Our work leverages the Thin-Plate Spline interpolation technique to efficiently learn diffeomorphisms best characterizing the image manifolds. Extensive numerical simulations show that our approach competes with state-of-the-art methods on various datasets.
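The assignment step of a deformation-aware K-means can be pictured as taking, for each image, the minimum distance to each centroid over a family of warps. The toy sketch below uses small pixel shifts as a crude stand-in for the learned thin-plate-spline diffeomorphisms, so it only conveys the shape of the idea.

```python
# Toy deformation-aware assignment: distance = min over a warp family.
import numpy as np

def warp_family(img, max_shift=2):
    # Small integer translations stand in for the paper's diffeomorphisms.
    for dx in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            yield np.roll(np.roll(img, dx, axis=0), dy, axis=1)

def assign(images, centroids):
    labels = []
    for img in images:
        d = [min(np.sum((w - c) ** 2) for w in warp_family(img))
             for c in centroids]
        labels.append(int(np.argmin(d)))
    return np.array(labels)

images = np.random.rand(10, 28, 28)
centroids = images[:3].copy()      # e.g. initialised from random samples
print(assign(images, centroids))
```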
13. AutoCaption: Image Captioning with Neural Architecture Search [PDF] back to contents
Xinxin Zhu, Weining Wang, Longteng Guo, Jing Liu
Abstract: Image captioning transforms complex visual information into abstract natural language for representation, which can help computers understand the world quickly. However, due to the complexity of the real environment, it needs to identify key objects and realize their connections, and further generate natural language. The whole process involves a visual understanding module and a language generation module, which brings more challenges to the design of deep neural networks than other tasks. Neural Architecture Search (NAS) has shown its important role in a variety of image recognition tasks. Besides, RNN plays an essential role in the image captioning task. We introduce an AutoCaption method to better design the decoder module for image captioning, where we use NAS to automatically design a decoder module called AutoRNN. We use a reinforcement learning method based on shared parameters to design the AutoRNN automatically and efficiently. The search space of AutoCaption includes both the connections between layers and the operations within layers, which allows AutoRNN to express more architectures. In particular, RNN is equivalent to a subset of our search space. Experiments on the MSCOCO datasets show that our AutoCaption model can achieve better performance than traditional hand-design methods. Our AutoCaption obtains the best published CIDEr performance of 135.8% on the COCO Karpathy test split. When further using ensemble technology, CIDEr is boosted up to 139.5%.
14. Robust Image Captioning [PDF] back to contents
Daniel Yarnell, Xian Wang
Abstract: Automated captioning of photos is a task that incorporates the difficulties of photo analysis and text generation. One essential feature of captioning is the concept of attention: how to determine what to specify and in which sequence. In this study, we leverage the Object Relation using an adversarial robust cut algorithm, which builds upon this method by specifically embedding knowledge about the spatial association between input data through graph representation. Our experimental study demonstrates the promising performance of our proposed method for image captioning.
15. Efficient CNN-LSTM based Image Captioning using Neural Network Compression [PDF] back to contents
Harshit Rampal, Aman Mohanty
Abstract: Modern Neural Networks are eminent in achieving state of the art performance on tasks under Computer Vision, Natural Language Processing and related verticals. However, they are notorious for their voracious memory and compute appetite which further obstructs their deployment on resource limited edge devices. In order to achieve edge deployment, researchers have developed pruning and quantization algorithms to compress such networks without compromising their efficacy. Such compression algorithms are broadly experimented on standalone CNN and RNN architectures while in this work, we present an unconventional end to end compression pipeline of a CNN-LSTM based Image Captioning model. The model is trained using VGG16 or ResNet50 as an encoder and an LSTM decoder on the flickr8k dataset. We then examine the effects of different compression architectures on the model and design a compression architecture that achieves a 73.1% reduction in model size, 71.3% reduction in inference time and a 7.7% increase in BLEU score as compared to its uncompressed counterpart.
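A generic pruning-plus-quantization pipeline of the kind described might be sketched as follows with standard PyTorch utilities. The stand-in encoder/decoder, the 50% sparsity level, and the dynamic int8 quantization are assumptions rather than the paper's exact recipe (which searches over compression architectures).

```python
# Hedged sketch: magnitude pruning of a CNN encoder plus dynamic quantisation
# of an LSTM decoder, assuming torchvision provides the backbone.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision import models

encoder = models.resnet50(weights=None)        # VGG16 is the other option
decoder = nn.LSTM(input_size=256, hidden_size=512, batch_first=True)

# L1 (magnitude) pruning: zero out 50% of the smallest weights per conv layer.
for module in encoder.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")          # make the sparsity permanent

# Dynamic quantisation of the LSTM decoder to int8 for inference.
decoder_q = torch.quantization.quantize_dynamic(
    decoder, {nn.LSTM}, dtype=torch.qint8)
```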
16. RainNet: A Large-Scale Dataset for Spatial Precipitation Downscaling [PDF] back to contents
Xuanhong Chen, Kairui Feng, Naiyuan Liu, Naiyuan Liu, Zhengyan Tong, Bingbing Ni, Ziang Liu, Ning Lin
Abstract: Spatial Precipitation Downscaling is one of the most important problems in the geo-science community. However, it still remains an unaddressed issue. Deep learning is a promising potential solution for downscaling. In order to facilitate the research on precipitation downscaling for deep learning, we present the first \textbf{REAL} (non-simulated) Large-Scale Spatial Precipitation Downscaling Dataset, \textbf{RainNet}, which contains \textbf{62,424} pairs of low-resolution and high-resolution precipitation maps for 17 years. Contrary to simulated data, this real dataset covers various types of real meteorological phenomena (e.g., Hurricane, Squall, etc.), and shows the physical characters - \textbf{Temporal Misalignment}, \textbf{Temporal Sparse} and \textbf{Fluid Properties} - that challenge the downscaling algorithms. In order to fully explore potential downscaling solutions, we propose an implicit physical estimation framework to learn the above characteristics. Eight metrics specifically considering the physical property of the data set are raised, while fourteen models are evaluated on the proposed dataset. Finally, we analyze the effectiveness and feasibility of these models on precipitation downscaling task. The Dataset and Code will be available at \url{this https URL}.
17. PCT: Point Cloud Transformer [PDF] back to contents
Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R. Martin, Shi-Min Hu
Abstract: The irregular domain and lack of ordering make it challenging to design deep neural networks for point cloud processing. This paper presents a novel framework named Point Cloud Transformer(PCT) for point cloud learning. PCT is based on Transformer, which achieves huge success in natural language processing and displays great potential in image processing. It is inherently permutation invariant for processing a sequence of points, making it well-suited for point cloud learning. To better capture local context within the point cloud, we enhance input embedding with the support of farthest point sampling and nearest neighbor search. Extensive experiments demonstrate that the PCT achieves the state-of-the-art performance on shape classification, part segmentation and normal estimation tasks.
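Farthest point sampling, which the input embedding relies on together with nearest-neighbour grouping, is straightforward to sketch in NumPy; the snippet below is a plain reference implementation, not the paper's code.

```python
# Farthest point sampling: iteratively pick the point farthest from all
# points selected so far, giving a well-spread subset of the cloud.
import numpy as np

def farthest_point_sampling(points, n_samples):
    """points: (N, 3) array; returns indices of n_samples spread-out points."""
    n = points.shape[0]
    selected = [np.random.randint(n)]
    dist = np.full(n, np.inf)
    for _ in range(n_samples - 1):
        dist = np.minimum(dist, np.sum((points - points[selected[-1]]) ** 2, axis=1))
        selected.append(int(np.argmax(dist)))
    return np.array(selected)

cloud = np.random.rand(2048, 3)
centers = farthest_point_sampling(cloud, 256)
# Nearest-neighbour grouping around each sampled centre would follow here.
print(centers.shape)
```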
18. Multi-Modal Depth Estimation Using Convolutional Neural Networks [PDF] back to contents
Sadique Adnan Siddiqui, Axel Vierling, Karsten Berns
Abstract: This paper addresses the problem of dense depth predictions from sparse distance sensor data and a single camera image under challenging weather conditions. This work explores the significance of different sensor modalities such as camera, Radar, and Lidar for estimating depth by applying Deep Learning approaches. Although Lidar has higher depth-sensing abilities than Radar and has been integrated with camera images in lots of previous works, depth estimation using CNNs on the fusion of robust Radar distance data and camera images has not been explored much. In this work, a deep regression network is proposed utilizing a transfer learning approach consisting of an encoder, where a high-performing pre-trained model has been used to initialize it for extracting dense features, and a decoder for upsampling and predicting desired depth. The results are demonstrated on Nuscenes, KITTI, and a Synthetic dataset which was created using the CARLA simulator. Also, top-view zoom-camera images captured from the crane on a construction site are evaluated to estimate the distance of the crane boom carrying heavy loads from the ground to show the usability in safety-critical applications.
19. A fully pipelined FPGA accelerator for scale invariant feature transform keypoint descriptor matching, [PDF] 返回目录
Luka Daoud, Muhammad Kamran Latif, H S. Jacinto, Nader Rafla
Abstract: The scale invariant feature transform (SIFT) algorithm is considered a classical feature extraction algorithm within the field of computer vision. SIFT keypoint descriptor matching is a computationally intensive process due to the amount of data consumed. In this work, we designed a novel fully pipelined hardware accelerator architecture for SIFT keypoint descriptor matching. The accelerator core was implemented and tested on a field programmable gate array (FPGA). The proposed hardware architecture is able to properly handle the memory bandwidth necessary for a fully-pipelined implementation and hits the roofline performance model, achieving the potential maximum throughput. The fully pipelined matching architecture was designed based on the cosine angle distance method. Our architecture was optimized for 16-bit fixed-point operations and implemented on hardware using a Xilinx Zynq-based FPGA development board. Our proposed architecture shows a noticeable reduction of area resources compared with its counterparts in literature, while maintaining high throughput by alleviating memory bandwidth restrictions. The results show a reduction in consumed device resources of up to 91 percent in LUTs and 79 percent in BRAMs. Our hardware implementation is 15.7 times faster than the comparable software approach.
摘要:尺度不变特征变换(SIFT)算法被认为是计算机视觉领域的经典特征提取算法。由于需要处理的数据量大,SIFT关键点描述符匹配是一个计算密集型过程。在这项工作中,我们为SIFT关键点描述符匹配设计了一种新颖的全流水线硬件加速器体系结构。加速器核心在现场可编程门阵列(FPGA)上实现并测试。所提出的硬件体系结构能够正确处理全流水线实现所需的内存带宽,并达到roofline性能模型,从而实现潜在的最大吞吐量。全流水线匹配架构基于余弦角距离法设计。我们的架构针对16位定点运算进行了优化,并使用基于Xilinx Zynq的FPGA开发板在硬件上实现。与文献中的同类结构相比,我们提出的体系结构显着减少了面积资源,同时通过缓解内存带宽限制保持了高吞吐量。结果表明,所消耗的器件资源中,LUT最多减少91%,BRAM最多减少79%。我们的硬件实现比同类软件方法快15.7倍。
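As a point of reference, the cosine angle distance matching that the accelerator pipelines can be written as a brute-force software baseline in a few lines of NumPy. This is a floating-point sketch of the general matching scheme, with a Lowe-style ratio test as an assumed acceptance rule; it is not the paper's 16-bit fixed-point hardware design.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Brute-force SIFT descriptor matching with cosine angle distance.

    desc_a: (Na, 128), desc_b: (Nb, 128). Returns (i, j) index pairs whose
    nearest angle passes a ratio test against the second-nearest angle.
    """
    a = desc_a / (np.linalg.norm(desc_a, axis=1, keepdims=True) + 1e-12)
    b = desc_b / (np.linalg.norm(desc_b, axis=1, keepdims=True) + 1e-12)
    cos_sim = a @ b.T                              # (Na, Nb) cosine similarities
    angles = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    order = np.argsort(angles, axis=1)             # smallest angle first
    matches = []
    for i in range(angles.shape[0]):
        best, second = order[i, 0], order[i, 1]
        if angles[i, best] < ratio * angles[i, second]:
            matches.append((i, int(best)))
    return matches
```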
20. Firearm Detection via Convolutional Neural Networks: Comparing a Semantic Segmentation Model Against End-to-End Solutions [PDF] 返回目录
Alexander Egiazarov, Fabio Massimo Zennaro, Vasileios Mavroeidis
Abstract: Threat detection of weapons and aggressive behavior from live video can be used for rapid detection and prevention of potentially deadly incidents such as terrorism, general criminal offences, or even domestic violence. One way for achieving this is through the use of artificial intelligence and, in particular, machine learning for image analysis. In this paper we conduct a comparison between a traditional monolithic end-to-end deep learning model and a previously proposed model based on an ensemble of simpler neural networks detecting fire-weapons via semantic segmentation. We evaluated both models from different points of view, including accuracy, computational and data complexity, flexibility and reliability. Our results show that a semantic segmentation model provides a considerable amount of flexibility and resilience in the low data environment compared to classical deep models, although its configuration and tuning presents a challenge in achieving the same levels of accuracy as an end-to-end model.
摘要:对实时视频中的武器和攻击行为进行威胁检测,可用于快速发现和预防潜在的致命事件,例如恐怖主义,一般刑事犯罪甚至家庭暴力。实现这一目标的一种方法是使用人工智能,尤其是使用机器学习进行图像分析。在本文中,我们将传统的整体式端到端深度学习模型与先前提出的模型进行了比较,后者基于由更简单的神经网络组成的集成,通过语义分割来检测枪械。我们从不同的角度评估了这两个模型,包括准确性,计算和数据复杂性,灵活性和可靠性。我们的结果表明,与传统的深度模型相比,语义分割模型在数据量少的环境中提供了相当大的灵活性和弹性,尽管其配置和调整对达到与端到端模型相同的准确性水平提出了挑战。
21. Detection and Prediction of Nutrient Deficiency Stress using Longitudinal Aerial Imagery [PDF] 返回目录
Saba Dadsetan, Gisele Rose, Naira Hovakimyan, Jennifer Hobbs
Abstract: Early, precise detection of nutrient deficiency stress (NDS) has key economic as well as environmental impact; precision application of chemicals in place of blanket application reduces operational costs for the growers while reducing the amount of chemicals which may enter the environment unnecessarily. Furthermore, earlier treatment reduces the amount of loss and therefore boosts crop production during a given season. With this in mind, we collect sequences of high-resolution aerial imagery and construct semantic segmentation models to detect and predict NDS across the field. Our work sits at the intersection of agriculture, remote sensing, and modern computer vision and deep learning. First, we establish a baseline for full-field detection of NDS and quantify the impact of pretraining, backbone architecture, input representation, and sampling strategy. We then quantify the amount of information available at different points in the season by building a single-timestamp model based on a UNet. Next, we construct our proposed spatiotemporal architecture, which combines a UNet with a convolutional LSTM layer, to accurately detect regions of the field showing NDS; this approach has an impressive IOU score of 0.53. Finally, we show that this architecture can be trained to predict regions of the field which are expected to show NDS in a later flight -- potentially more than three weeks in the future -- maintaining an IOU score of 0.47-0.51 depending on how far in advance the prediction is made. We will also release a dataset which we believe will benefit the computer vision, remote sensing, as well as agriculture fields. This work contributes to the recent developments in deep learning for remote sensing and agriculture, while addressing a key social challenge with implications for economics and sustainability.
摘要:早期,精确地检测营养缺乏胁迫(NDS)具有重要的经济和环境影响;精准施用化学药品以代替全田整片施用,可以降低种植者的运营成本,同时减少不必要进入环境的化学药品数量。此外,较早的处理可减少损失,因此在给定季节内提高作物产量。考虑到这一点,我们收集高分辨率航空影像序列,并构建语义分割模型以检测和预测整个田块的NDS。我们的工作位于农业,遥感与现代计算机视觉和深度学习的交汇处。首先,我们为NDS的全田检测建立基线,并量化预训练,骨干架构,输入表示和采样策略的影响。然后,我们通过基于UNet构建单时间戳模型来量化季节中不同时间点可用的信息量。接下来,我们构建所提出的时空架构,将UNet与卷积LSTM层相结合,以准确检测田间出现NDS的区域;这种方法的IOU得分达到0.53。最后,我们证明了可以训练这种体系结构来预测预计在之后的航拍中(可能是三周以上之后)出现NDS的区域,其IOU得分保持在0.47-0.51之间,具体取决于提前预测的时间。我们还将发布一个数据集,我们相信它将有助于计算机视觉,遥感以及农业领域。这项工作为遥感和农业深度学习的最新发展做出了贡献,同时解决了对经济和可持续性具有影响的关键社会挑战。
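Since the abstract reports IoU scores of 0.47-0.53, a quick reminder of how binary-mask IoU is computed may help. This is the standard metric definition, not code from the paper.

```python
import numpy as np

def binary_iou(pred, target):
    """Intersection-over-union between two boolean masks of the same shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union > 0 else 1.0   # two empty masks count as a perfect match

pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), dtype=bool);   gt[20:50, 20:50] = True
print(round(binary_iou(pred, gt), 3))            # 400 overlap / 1400 union -> ~0.286
```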
22. Trajectory saliency detection using consistency-oriented latent codes from a recurrent auto-encoder [PDF] 返回目录
L. Maczyta, P. Bouthemy, O. Le Meur
Abstract: In this paper, we are concerned with the detection of progressive dynamic saliency from video sequences. More precisely, we are interested in saliency related to motion and likely to appear progressively over time. It can be relevant to trigger alarms, to dedicate additional processing or to detect specific events. Trajectories represent the best way to support progressive dynamic saliency detection. Accordingly, we will talk about trajectory saliency. A trajectory will be qualified as salient if it deviates from normal trajectories that share a common motion pattern related to a given context. First, we need a compact yet discriminative representation of trajectories. We adopt a (nearly) unsupervised learning-based approach. The latent code estimated by a recurrent auto-encoder provides the desired representation. In addition, we enforce consistency for normal (similar) trajectories through the auto-encoder loss function. The distance of the trajectory code to a prototype code accounting for normality is the means to detect salient trajectories. We validate our trajectory saliency detection method on synthetic and real trajectory datasets, and highlight the contributions of its different components. We show that our method outperforms existing methods on several scenarios drawn from the publicly available dataset of pedestrian trajectories acquired in a railway station (Alahi 2014).
摘要:本文关注视频序列中渐进动态显着性的检测。更确切地说,我们感兴趣的是与运动相关,且可能随时间逐渐显现的显着性。这可用于触发警报,投入额外的处理或检测特定事件。轨迹是支持渐进式动态显着性检测的最佳载体。因此,我们讨论轨迹显着性。如果一条轨迹偏离了在给定场景下共享共同运动模式的正常轨迹,则该轨迹被视为显着轨迹。首先,我们需要一种紧凑而有区分度的轨迹表示。我们采用(近乎)无监督的基于学习的方法。由循环自动编码器估计的潜在代码提供了所需的表示。此外,我们通过自动编码器的损失函数增强正常(相似)轨迹之间的一致性。轨迹代码与表示正常性的原型代码之间的距离即是检测显着轨迹的手段。我们在合成和真实轨迹数据集上验证了轨迹显着性检测方法,并强调了其不同组成部分的贡献。我们表明,在从火车站获取的行人轨迹公开数据集(Alahi 2014)中得出的几种情景下,我们的方法优于现有方法。
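To make the last step concrete, here is a minimal NumPy sketch of scoring trajectories by the distance of their latent codes to a prototype code. It assumes the codes have already been produced by the recurrent auto-encoder; taking the prototype as the mean of normal codes and using a 3-sigma threshold are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def saliency_scores(codes, normal_codes):
    """codes: (N, d) latent codes of test trajectories;
    normal_codes: (M, d) codes of trajectories known to be normal.
    The prototype is taken here as the mean normal code (an assumption)."""
    prototype = normal_codes.mean(axis=0)
    return np.linalg.norm(codes - prototype, axis=1)

def detect_salient(codes, normal_codes, k=3.0):
    """Flag trajectories whose score exceeds a hypothetical mean + k*std threshold."""
    scores = saliency_scores(codes, normal_codes)
    ref = saliency_scores(normal_codes, normal_codes)
    return scores > ref.mean() + k * ref.std()
```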
23. Incremental Learning from Low-labelled Stream Data in Open-Set Video Face Recognition [PDF] 返回目录
Eric Lopez-Lopez, Carlos V. Regueiro, Xose M. Pardo
Abstract: Deep Learning approaches have brought solutions, with impressive performance, to general classification problems where a wealth of annotated data is provided for training. In contrast, less progress has been made in continual learning of a set of non-stationary classes, mainly when applied to unsupervised problems with streaming data. Here, we propose a novel incremental learning approach which combines a deep features encoder with an Open-Set Dynamic Ensembles of SVM, to tackle the problem of identifying individuals of interest (IoI) from streaming face data. From a simple weak classifier trained on a few video-frames, our method can use unsupervised operational data to enhance recognition. Our approach adapts to new patterns avoiding catastrophic forgetting and partially heals itself from mis-adaptation. Besides, to better comply with real-world conditions, the system was designed to operate in an open-set setting. Results show a benefit of up to a 15% F1-score increase with respect to non-adaptive state-of-the-art methods.
摘要:深度学习方法为提供了大量标注数据用于训练的一般分类问题带来了性能出色的解决方案。相比之下,在对一组非平稳类别进行持续学习方面进展较少,尤其是在应用于流数据的无监督问题时。在这里,我们提出了一种新颖的增量学习方法,该方法将深度特征编码器与SVM的开放集动态集成相结合,以解决从流式人脸数据中识别感兴趣个体(IoI)的问题。从在少量视频帧上训练的简单弱分类器出发,我们的方法可以利用无监督的运行数据来增强识别能力。我们的方法能够适应新的模式,避免灾难性遗忘,并能从错误适应中部分自我修复。此外,为了更好地符合真实条件,该系统被设计为在开放集环境下运行。结果显示,相对于非自适应的最新方法,F1分数最多可提高15%。
24. Weakly-Supervised Action Localization and Action Recognition using Global-Local Attention of 3D CNN [PDF] 返回目录
Novanto Yudistira, Muthu Subash Kavitha, Takio Kurita
Abstract: 3D Convolutional Neural Network (3D CNN) captures spatial and temporal information on 3D data such as video sequences. However, due to the convolution and pooling mechanism, the information loss seems unavoidable. To improve the visual explanations and classification in 3D CNN, we propose two approaches; i) aggregate layer-wise global-to-local (global-local) discrete gradients using a trained 3DResNext network, and ii) implement an attention gating network to improve the accuracy of the action recognition. The proposed approach intends to show the usefulness of every layer termed as global-local attention in 3D CNN via visual attribution, weakly-supervised action localization, and action recognition. Firstly, the 3DResNext is trained and applied for action classification using backpropagation concerning the maximum predicted class. The gradients and activations of every layer are then up-sampled. Later, aggregation is used to produce more nuanced attention, which points out the most critical part of the predicted class's input videos. We use contour thresholding of final attention for final localization. We evaluate spatial and temporal action localization in trimmed videos using fine-grained visual explanation via 3DCam. Experimental results show that the proposed approach produces informative visual explanations and discriminative attention. Furthermore, the action recognition via attention gating on each layer produces better classification results than the baseline model.
摘要:3D卷积神经网络(3D CNN)捕获3D数据(例如视频序列)的时空信息。但是,由于卷积和池化机制,信息丢失似乎不可避免。为了改善3D CNN的视觉解释和分类,我们提出了两种方法:i)使用训练好的3DResNext网络,将逐层的全局到局部(global-local)离散梯度进行聚合;ii)实现注意力门控网络以提高动作识别的准确性。所提出的方法旨在通过视觉归因,弱监督动作定位和动作识别,展示3D CNN中每一层(称为全局-局部注意)的有用性。首先,对3DResNext进行训练,并针对最大预测类别使用反向传播将其应用于动作分类。然后对每一层的梯度和激活进行上采样。随后,通过聚合产生更细致的注意力图,以指出预测类别的输入视频中最关键的部分。我们对最终注意力图使用轮廓阈值化进行最终定位。我们通过3DCam使用细粒度的视觉解释来评估修剪后视频中的空间和时间动作定位。实验结果表明,所提方法产生了信息丰富的视觉解释和有辨别力的注意力。此外,通过对每一层进行注意力门控的动作识别比基线模型产生更好的分类结果。
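A simplified NumPy sketch of the layer-wise aggregation idea (gradient-weighted activations, upsampled to a common resolution and averaged, Grad-CAM style) is given below. It works on precomputed 2D activation and gradient maps and ignores the temporal dimension; it illustrates the general mechanism, not the authors' 3D implementation.

```python
import numpy as np

def upsample_nearest(m, out_hw):
    """Nearest-neighbour upsampling of a (h, w) map to out_hw."""
    h, w = m.shape
    H, W = out_hw
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return m[rows][:, cols]

def global_local_attention(activations, gradients, out_hw=(112, 112)):
    """activations/gradients: lists of (C, h, w) arrays, one per layer,
    taken w.r.t. the maximum predicted class. Returns an aggregated (H, W) map."""
    agg = np.zeros(out_hw)
    for act, grad in zip(activations, gradients):
        weights = grad.mean(axis=(1, 2))                              # channel importance
        cam = np.maximum((weights[:, None, None] * act).sum(axis=0), 0.0)
        cam = upsample_nearest(cam, out_hw)
        if cam.max() > 0:
            cam = cam / cam.max()                                     # per-layer normalisation
        agg += cam
    return agg / max(len(activations), 1)
```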
25. Embodied Visual Active Learning for Semantic Segmentation [PDF] 返回目录
David Nilsson, Aleksis Pirinen, Erik Gärtner, Cristian Sminchisescu
Abstract: We study the task of embodied visual active learning, where an agent is set to explore a 3d environment with the goal to acquire visual scene understanding by actively selecting views for which to request annotation. While accurate on some benchmarks, today's deep visual recognition pipelines tend to not generalize well in certain real-world scenarios, or for unusual viewpoints. Robotic perception, in turn, requires the capability to refine the recognition capabilities for the conditions where the mobile system operates, including cluttered indoor environments or poor illumination. This motivates the proposed task, where an agent is placed in a novel environment with the objective of improving its visual recognition capability. To study embodied visual active learning, we develop a battery of agents - both learnt and pre-specified - and with different levels of knowledge of the environment. The agents are equipped with a semantic segmentation network and seek to acquire informative views, move and explore in order to propagate annotations in the neighbourhood of those views, then refine the underlying segmentation network by online retraining. The trainable method uses deep reinforcement learning with a reward function that balances two competing objectives: task performance, represented as visual recognition accuracy, which requires exploring the environment, and the necessary amount of annotated data requested during active exploration. We extensively evaluate the proposed models using the photorealistic Matterport3D simulator and show that a fully learnt method outperforms comparable pre-specified counterparts, even when requesting fewer annotations.
摘要:我们研究具身视觉主动学习的任务:一个代理被设定去探索3D环境,目标是通过主动选择需要请求标注的视图来获得视觉场景理解。尽管在某些基准上是准确的,但当今的深度视觉识别管道在某些真实场景中或面对不寻常的视角时往往无法很好地泛化。而机器人感知则需要能够针对移动系统的运行条件(包括杂乱的室内环境或照明不佳)改进识别能力。这正是所提出任务的动机:将代理置于新环境中,目标是提高其视觉识别能力。为了研究具身视觉主动学习,我们开发了一组代理(既有通过学习得到的,也有预先指定的),它们具有不同程度的环境知识。这些代理配备了语义分割网络,并试图获取信息丰富的视图,进行移动和探索,以便在这些视图的邻域中传播标注,然后通过在线重新训练来完善底层的分割网络。可训练的方法使用深度强化学习,其奖励函数平衡两个相互竞争的目标:任务性能(表示为视觉识别准确度,这需要探索环境)以及主动探索期间所请求的标注数据量。我们使用逼真的Matterport3D模拟器对所提模型进行了广泛评估,结果表明,即使请求的标注更少,完全通过学习得到的方法也优于可比的预先指定方法。
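The reward described above can be pictured as a simple scalar trade-off. The following one-liner is a hypothetical form for illustration only; the weight lam and the use of pixel counts are assumptions, not the paper's exact reward.

```python
def active_learning_reward(miou_after, miou_before, pixels_annotated, lam=1e-6):
    """Hypothetical reward: gain in segmentation quality from online retraining,
    minus a penalty proportional to the amount of annotation requested."""
    return (miou_after - miou_before) - lam * pixels_annotated

# e.g. a step that improves mIoU by 0.02 but requests 10,000 annotated pixels
print(active_learning_reward(0.52, 0.50, 10_000))   # ~0.01
```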
26. A Hierarchical Feature Constraint to Camouflage Medical Adversarial Attacks [PDF] 返回目录
Qingsong Yao, Zecheng He, Yi Lin, Kai Ma, Yefeng Zheng, S. Kevin Zhou
Abstract: Deep neural networks (DNNs) for medical images are extremely vulnerable to adversarial examples (AEs), which poses security concerns on clinical decision making. Luckily, medical AEs are also easy to detect in hierarchical feature space per our study herein. To better understand this phenomenon, we thoroughly investigate the intrinsic characteristic of medical AEs in feature space, providing both empirical evidence and theoretical explanations for the question: why are medical adversarial attacks easy to detect? We first perform a stress test to reveal the vulnerability of deep representations of medical images, in contrast to natural images. We then theoretically prove that typical adversarial attacks to binary disease diagnosis network manipulate the prediction by continuously optimizing the vulnerable representations in a fixed direction, resulting in outlier features that make medical AEs easy to detect. However, this vulnerability can also be exploited to hide the AEs in the feature space. We propose a novel hierarchical feature constraint (HFC) as an add-on to existing adversarial attacks, which encourages the hiding of the adversarial representation within the normal feature distribution. We evaluate the proposed method on two public medical image datasets, namely {Fundoscopy} and {Chest X-Ray}. Experimental results demonstrate the superiority of our adversarial attack method as it bypasses an array of state-of-the-art adversarial detectors more easily than competing attack methods, supporting that the great vulnerability of medical features allows an attacker more room to manipulate the adversarial representations.
摘要:用于医学图像的深度神经网络(DNN)极易受到对抗样本(AE)的攻击,这给临床决策带来了安全隐患。幸运的是,根据我们在本文中的研究,医学对抗样本在分层特征空间中也很容易被检测到。为了更好地理解这一现象,我们深入研究了医学对抗样本在特征空间中的内在特性,为以下问题提供了经验证据和理论解释:为什么医学对抗攻击容易被检测?我们首先进行压力测试,以揭示医学图像深层表示相对于自然图像的脆弱性。然后,我们从理论上证明,针对二分类疾病诊断网络的典型对抗攻击通过沿固定方向持续优化脆弱的表示来操纵预测,从而产生使医学对抗样本易于检测的离群特征。但是,也可以利用这一脆弱性将对抗样本隐藏在特征空间中。我们提出了一种新颖的分层特征约束(HFC),作为现有对抗攻击的附加项,它鼓励将对抗表示隐藏在正常特征分布之内。我们在两个公共医学图像数据集{Fundoscopy}和{Chest X-Ray}上评估了所提方法。实验结果证明了我们的对抗攻击方法的优越性:它比竞争攻击方法更容易绕过一系列最新的对抗检测器,这支持了以下观点:医学特征的巨大脆弱性使攻击者有更多空间来操纵对抗表示。
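A minimal sketch of adding such a feature-space constraint to an attack objective is shown below. It penalizes the distance of the adversarial example's intermediate features to per-layer means of clean features; the exact form of the constraint in the paper may differ, and the layer weights are assumptions.

```python
import numpy as np

def hfc_penalty(adv_feats, normal_means, layer_weights=None):
    """adv_feats: list of 1-D feature vectors of the adversarial example,
    one per chosen layer; normal_means: matching list of per-layer mean
    features computed on clean data. Returns a scalar penalty that would be
    added to the usual attack objective."""
    if layer_weights is None:
        layer_weights = [1.0] * len(adv_feats)
    return sum(w * float(np.sum((f - m) ** 2))
               for w, f, m in zip(layer_weights, adv_feats, normal_means))
```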
27. Exploiting Learnable Joint Groups for Hand Pose Estimation [PDF] 返回目录
Moran Li, Yuan Gao, Nong Sang
Abstract: In this paper, we propose to estimate 3D hand pose by recovering the 3D coordinates of joints in a group-wise manner, where less-related joints are automatically categorized into different groups and exhibit different features. This is different from the previous methods where all the joints are considered holistically and share the same feature. The benefits of our method are illustrated by the principle of multi-task learning (MTL), i.e., by separating less-related joints into different groups (as different tasks), our method learns different features for each of them, therefore efficiently avoids the negative transfer (among less related tasks/groups of joints). The key of our method is a novel binary selector that automatically selects related joints into the same group. We implement such a selector with binary values stochastically sampled from a Concrete distribution, which is constructed using Gumbel softmax on trainable parameters. This enables us to preserve the differentiable property of the whole network. We further exploit features from those less-related groups by carrying out an additional feature fusing scheme among them, to learn more discriminative features. This is realized by implementing multiple 1x1 convolutions on the concatenated features, where each joint group contains a unique 1x1 convolution for feature fusion. The detailed ablation analysis and the extensive experiments on several benchmark datasets demonstrate the promising performance of the proposed method over the state-of-the-art (SOTA) methods. Besides, our method achieves top-1 among all the methods that do not exploit the dense 3D shape labels on the most recently released FreiHAND competition at the submission date. The source code and models are available at this https URL moranli-aca/LearnableGroups-Hand.
摘要:在本文中,我们提出以分组的方式恢复关节的3D坐标来估计3D手部姿态,其中关联性较低的关节被自动归入不同的组,并表现出不同的特征。这与以前整体考虑所有关节并共享同一特征的方法不同。我们方法的好处可以用多任务学习(MTL)的原理来说明:通过将关联性较低的关节分成不同的组(作为不同的任务),我们的方法为每组学习不同的特征,从而有效地避免了负迁移(在关联较少的任务/关节组之间)。我们方法的关键是一个新颖的二值选择器,它能自动将相关关节选入同一组。我们使用从Concrete分布中随机采样的二值来实现该选择器,而该分布是通过在可训练参数上使用Gumbel softmax构造的。这使我们能够保留整个网络的可微性。我们还通过在关联较少的组之间执行额外的特征融合方案来进一步利用它们的特征,以学习更具判别力的特征。这是通过在级联特征上执行多个1x1卷积来实现的,其中每个关节组包含一个用于特征融合的独立1x1卷积。详细的消融分析和在多个基准数据集上的大量实验表明,所提方法相对于最新(SOTA)方法具有可观的性能。此外,在提交日期,在最新发布的FreiHAND竞赛中,我们的方法在所有不利用密集3D形状标签的方法中排名第一。源代码和模型可从此https URL moranli-aca/LearnableGroups-Hand获得。
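The binary Concrete / Gumbel-softmax selector can be sketched as follows. Only the forward sampling step is shown, with a fixed temperature and a hard threshold of the straight-through kind; this is a generic sketch of the technique, not the authors' code, and the joint count of 21 is an assumption.

```python
import numpy as np

def gumbel_softmax_binary(logits, temperature=0.5, rng=None):
    """Sample relaxed binary gates from a binary Concrete distribution.
    logits: (n_joints,) trainable log-odds that a joint belongs to a group."""
    rng = np.random.default_rng() if rng is None else rng
    # the difference of two Gumbel variables is Logistic noise
    g1 = -np.log(-np.log(rng.uniform(size=logits.shape)))
    g2 = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return 1.0 / (1.0 + np.exp(-(logits + g1 - g2) / temperature))

gates = gumbel_softmax_binary(np.zeros(21))   # e.g. 21 hand joints, neutral logits
hard = (gates > 0.5).astype(int)              # hard 0/1 selection (straight-through in training)
```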
28. CT Film Recovery via Disentangling Geometric Deformation and Illumination Variation: Simulated Datasets and Deep Models [PDF] 返回目录
Quan Quan, Qiyuan Wang, Liu Li, Yuanqi Du, S. Kevin Zhou
Abstract: While medical images such as computed tomography (CT) are stored in DICOM format in hospital PACS, it is still quite routine in many countries to print a film as a transferable medium for the purposes of self-storage and secondary consultation. Also, with the ubiquitousness of mobile phone cameras, it is quite common to take pictures of the CT films, which unfortunately suffer from geometric deformation and illumination variation. In this work, we study the problem of recovering a CT film, which marks the first attempt in the literature, to the best of our knowledge. We start with building a large-scale head CT film database CTFilm20K, consisting of approximately 20,000 pictures, using the widely used computer graphics software Blender. We also record all accompanying information related to the geometric deformation (such as 3D coordinate, depth, normal, and UV maps) and illumination variation (such as albedo map). Then we propose a deep framework to disentangle geometric deformation and illumination variation using the multiple maps extracted from the CT films to collaboratively guide the recovery process. Extensive experiments on simulated and real images demonstrate the superiority of our approach over the previous approaches. We plan to open source the simulated images and deep models for promoting the research on CT film recovery (https://anonymous.4open.science/r/e6b1f6e3-9b36-423f-a225-55b7d0b55523/).
摘要:尽管计算机断层扫描(CT)等医学图像以DICOM格式存储在医院PACS中,但在许多国家,打印胶片作为可转移介质以便自我保存和二次会诊仍然相当常见。此外,随着手机相机的普及,拍摄CT胶片的照片也非常普遍,而这些照片不幸会受到几何变形和光照变化的影响。在这项工作中,我们研究了恢复CT胶片的问题;据我们所知,这是文献中的首次尝试。我们首先使用广泛使用的计算机图形软件Blender构建了一个大规模头部CT胶片数据库CTFilm20K,其中包含大约20,000张图片。我们还记录了与几何变形(例如3D坐标,深度,法线和UV贴图)和光照变化(例如反照率贴图)有关的所有附带信息。然后,我们提出了一个深度框架,利用从CT胶片中提取的多种贴图协同指导恢复过程,从而解耦几何变形和光照变化。在模拟和真实图像上进行的大量实验证明了我们的方法优于以前的方法。我们计划开源模拟图像和深度模型,以促进CT胶片恢复的研究(https://anonymous.4open.science/r/e6b1f6e3-9b36-423f-a225-55b7d0b55523/)。
29. Learning to Share: A Multitasking Genetic Programming Approach to Image Feature Learning [PDF] 返回目录
Ying Bi, Bing Xue, Mengjie Zhang
Abstract: Evolutionary multitasking is a promising approach to simultaneously solving multiple tasks with knowledge sharing. Image feature learning can be solved as a multitasking problem because different tasks may have a similar feature space. Genetic programming (GP) has been successfully applied to image feature learning for classification. However, most of the existing GP methods solve one task, independently, using sufficient training data. No multitasking GP method has been developed for image feature learning. Therefore, this paper develops a multitasking GP approach to image feature learning for classification with limited training data. Owing to the flexible representation of GP, a new knowledge sharing mechanism based on a new individual representation is developed to allow GP to automatically learn what to share across two tasks. The shared knowledge is encoded as a common tree, which can represent the common/general features of two tasks. With the new individual representation, each task is solved using the features extracted from a common tree and a task-specific tree representing task-specific features. To learn the best common and task-specific trees, a new evolutionary process and new fitness functions are developed. The performance of the proposed approach is examined on six multitasking problems of 12 image classification datasets with limited training data and compared with three GP and 14 non-GP-based competitive methods. Experimental results show that the new approach outperforms these compared methods in almost all the comparisons. Further analysis reveals that the new approach learns simple yet effective common trees with high effectiveness and transferability.
摘要:进化多任务处理是一种有前途的方法,可以通过知识共享同时解决多个任务。图像特征学习可以作为多任务问题来求解,因为不同的任务可能具有相似的特征空间。遗传编程(GP)已成功应用于图像特征学习以进行分类。但是,大多数现有的GP方法都使用足够的训练数据来独立解决一项任务。尚未开发用于图像特征学习的多任务GP方法。因此,本文开发了一种用于图像特征学习的多任务GP方法,用于训练数据有限的分类。由于GP的灵活表示,因此开发了基于新的个体表示的新知识共享机制,以使GP可以自动学习要在两个任务之间共享的内容。共享知识被编码为公共树,可以表示两个任务的公共/一般特征。使用新的个体表示,可以使用从公共树和代表任务特定特征的任务特定树中提取的特征来解决每个任务。为了学习最佳的公共树和特定于任务的树,开发了新的进化过程和新的适应度函数。该方法的性能在由12个图像分类数据集构成,训练数据有限的6个多任务问题上进行了检验,并与3种GP和14种基于非GP的竞争方法进行了比较。实验结果表明,新方法几乎在所有比较中均优于这些比较方法。进一步的分析表明,该新方法可以学习简单而有效的公共树,并且具有很高的有效性和可迁移性。
30. FG-Net: Fast Large-Scale LiDAR Point Clouds Understanding Network Leveraging Correlated Feature Mining and Geometric-Aware Modelling [PDF] 返回目录
Kangcheng Liu, Zhi Gao, Feng Lin, Ben M. Chen
Abstract: This work presents FG-Net, a general deep learning framework for large-scale point clouds understanding without voxelizations, which achieves accurate and real-time performance with a single NVIDIA GTX 1080 GPU. First, a novel noise and outlier filtering method is designed to facilitate subsequent high-level tasks. For effective understanding purposes, we propose a deep convolutional neural network leveraging correlated feature mining and deformable convolution based geometric-aware modelling, in which the local feature relationships and geometric patterns can be fully exploited. For the efficiency issue, we put forward an inverse density sampling operation and a feature pyramid based residual learning strategy to reduce the computational cost and memory consumption, respectively. Extensive experiments on real-world challenging datasets demonstrated that our approaches outperform state-of-the-art approaches in terms of accuracy and efficiency. Moreover, weakly supervised transfer learning is also conducted to demonstrate the generalization capacity of our method.
摘要:这项工作提出了FG-Net,这是一个通用的深度学习框架,可以在不进行体素化的情况下理解大规模点云,并且仅用单块NVIDIA GTX 1080 GPU即可实现准确且实时的性能。首先,设计了一种新颖的噪声和离群点滤除方法,以便于后续的高级任务。为了实现有效的理解,我们提出了一种深度卷积神经网络,利用相关特征挖掘和基于可变形卷积的几何感知建模,从而可以充分利用局部特征关系和几何模式。针对效率问题,我们提出了一种逆密度采样操作和一种基于特征金字塔的残差学习策略,以分别降低计算成本和内存消耗。在现实世界中具有挑战性的数据集上进行的大量实验表明,我们的方法在准确性和效率方面都优于最新方法。此外,还进行了弱监督的迁移学习,以证明我们方法的泛化能力。
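The inverse density sampling step can be sketched with brute-force neighbour distances: points in dense regions are kept with lower probability. This is a generic NumPy illustration with assumed k and sample counts, not FG-Net's optimized implementation.

```python
import numpy as np

def inverse_density_sample(points, n_samples, k=16, rng=None):
    """points: (N, 3). Draw n_samples indices, favouring sparse regions
    (brute-force O(N^2) neighbour search, fine for a sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)   # (N, N)
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)                # skip self-distance
    weights = knn_mean / knn_mean.sum()    # large neighbour distance = low density = high weight
    return rng.choice(points.shape[0], size=n_samples, replace=False, p=weights)

pts = np.random.rand(2048, 3)
idx = inverse_density_sample(pts, 512)     # 512 indices biased toward sparse regions
```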
31. Multi-shot Temporal Event Localization: a Benchmark [PDF] 返回目录
Xiaolong Liu, Yao Hu, Song Bai, Fei Ding, Xiang Bai, Philip H.S. Torr
Abstract: Current developments in temporal event or action localization usually target actions captured by a single camera. However, extensive events or actions in the wild may be captured as a sequence of shots by multiple cameras at different positions. In this paper, we propose a new and challenging task called multi-shot temporal event localization, and accordingly, collect a large scale dataset called MUlti-Shot EventS (MUSES). MUSES has 31,477 event instances for a total of 716 video hours. The core nature of MUSES is the frequent shot cuts, for an average of 19 shots per instance and 176 shots per video, which induces large intra-instance variations. Our comprehensive evaluations show that the state-of-the-art method in temporal action localization only achieves an mAP of 13.1% at IoU=0.5. As a minor contribution, we present a simple baseline approach for handling the intra-instance variations, which reports an mAP of 18.9% on MUSES and 56.9% on THUMOS14 at IoU=0.5. To facilitate research in this direction, we release the dataset and the project code at https://songbai.site/muses.
摘要:时间事件或动作定位的最新进展通常针对由单个摄像机捕获的动作。但是,野外发生的大范围事件或动作可能会被处于不同位置的多个摄像机捕获为一系列镜头。在本文中,我们提出了一个新的具有挑战性的任务,称为多镜头时间事件定位,并相应地收集了一个称为MUlti-Shot EventS(MUSES)的大规模数据集。MUSES具有31,477个事件实例,总计716个视频小时。MUSES的核心特性是频繁的镜头切换,平均每个实例19个镜头,每个视频176个镜头,这会引起很大的实例内变化。我们的综合评估表明,在IoU=0.5时,最新的时间动作定位方法只能达到13.1%的mAP。作为次要贡献,我们提出了一个处理实例内变化的简单基线方法,在IoU=0.5时,它在MUSES上的mAP为18.9%,在THUMOS14上为56.9%。为了促进这一方向的研究,我们在https://songbai.site/muses上发布了数据集和项目代码。
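For reference, the temporal IoU used in the mAP@IoU=0.5 evaluation is just the 1-D analogue of box IoU; a short sketch of the standard definition:

```python
def temporal_iou(seg_a, seg_b):
    """seg_a, seg_b: (start, end) times in seconds. Returns IoU in [0, 1]."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# a prediction counts as correct at IoU=0.5 if temporal_iou(pred, gt) >= 0.5
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))   # 5 / 15 -> ~0.333
```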
32. PanoNet3D: Combining Semantic and Geometric Understanding for LiDAR Point Cloud Detection [PDF] 返回目录
Xia Chen, Jianren Wang, David Held, Martial Hebert
Abstract: Visual data in autonomous driving perception, such as camera image and LiDAR point cloud, can be interpreted as a mixture of two aspects: semantic feature and geometric structure. Semantics come from the appearance and context of objects to the sensor, while geometric structure is the actual 3D shape of point clouds. Most detectors on LiDAR point clouds focus only on analyzing the geometric structure of objects in real 3D space. Unlike previous works, we propose to learn both semantic feature and geometric structure via a unified multi-view framework. Our method exploits the nature of LiDAR scans -- 2D range images, and applies well-studied 2D convolutions to extract semantic features. By fusing semantic and geometric features, our method outperforms state-of-the-art approaches in all categories by a large margin. The methodology of combining semantic and geometric features provides a unique perspective of looking at the problems in real-world 3D point cloud detection.
摘要:自主驾驶感知中的视觉数据,例如相机图像和LiDAR点云,可以解释为语义特征和几何结构两个方面的混合。语义来自对象到传感器的外观和上下文,而几何结构是点云的实际3D形状。LiDAR点云上的大多数检测器仅专注于分析真实3D空间中对象的几何结构。与以前的作品不同,我们建议通过统一的多视图框架学习语义特征和几何结构。我们的方法利用了LiDAR扫描的本质(2D距离图像),并应用经过深入研究的2D卷积来提取语义特征。通过融合语义和几何特征,我们的方法在所有类别上都大大优于最新方法。结合语义和几何特征的方法为观察现实世界中3D点云检测中的问题提供了独特的视角。
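Treating the LiDAR sweep as a 2D range image is typically done by spherical projection; a minimal NumPy sketch follows. The vertical field-of-view values are assumptions (roughly a 64-beam sensor), not numbers from the paper.

```python
import numpy as np

def lidar_to_range_image(points, h=64, w=1024, fov_up=3.0, fov_down=-25.0):
    """points: (N, 3) xyz in the sensor frame. Returns an (h, w) range image.
    The vertical field of view (degrees) is an assumed sensor spec."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-9
    yaw = np.arctan2(y, x)                                   # [-pi, pi]
    pitch = np.arcsin(z / r)
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (1.0 - yaw / np.pi) * w                        # column index
    v = (fov_up_r - pitch) / (fov_up_r - fov_down_r) * h     # row index
    u = np.clip(np.floor(u), 0, w - 1).astype(int)
    v = np.clip(np.floor(v), 0, h - 1).astype(int)
    img = np.zeros((h, w))
    img[v, u] = r        # points projecting to the same pixel: the last one wins
    return img
```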
33. Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup [PDF] 返回目录
Guodong Xu, Ziwei Liu, Chen Change Loy
Abstract: Knowledge distillation, which involves extracting the "dark knowledge" from a teacher network to guide the learning of a student network, has emerged as an essential technique for model compression and transfer learning. Unlike previous works that focus on the accuracy of student network, here we study a little-explored but important question, i.e., knowledge distillation efficiency. Our goal is to achieve a performance comparable to conventional knowledge distillation with a lower computation cost during training. We show that the UNcertainty-aware mIXup (UNIX) can serve as a clean yet effective solution. The uncertainty sampling strategy is used to evaluate the informativeness of each training sample. Adaptive mixup is applied to uncertain samples to compact knowledge. We further show that the redundancy of conventional knowledge distillation lies in the excessive learning of easy samples. By combining uncertainty and mixup, our approach reduces the redundancy and makes better use of each query to the teacher network. We validate our approach on CIFAR100 and ImageNet. Notably, with only 79% computation cost, we outperform conventional knowledge distillation on CIFAR100 and achieve a comparable result on ImageNet.
摘要:知识蒸馏是指从教师网络中提取"暗知识"以指导学生网络的学习,它已成为模型压缩和迁移学习的一项基本技术。与以前着重于学生网络准确性的工作不同,我们在这里研究一个很少被探索但很重要的问题,即知识蒸馏的效率。我们的目标是在训练期间以更低的计算成本达到与传统知识蒸馏相当的性能。我们证明了不确定性感知混合(UNcertainty-aware mIXup, UNIX)可以作为一种简洁而有效的解决方案。不确定性采样策略用于评估每个训练样本的信息量。对不确定的样本应用自适应混合以使知识更紧凑。我们进一步表明,传统知识蒸馏的冗余在于对简单样本的过度学习。通过结合不确定性与混合(mixup),我们的方法减少了冗余,并更好地利用了对教师网络的每次查询。我们在CIFAR100和ImageNet上验证了我们的方法。值得注意的是,仅以79%的计算成本,我们在CIFAR100上的表现就优于传统知识蒸馏,并在ImageNet上取得了可比的结果。
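A rough NumPy sketch of the two ingredients, entropy-based uncertainty scoring and mixup, is shown below. How the uncertainty is mapped to the mixing strength, and which samples get mixed, are assumptions made for illustration; the paper's exact rule may differ.

```python
import numpy as np

def entropy(probs):
    """Per-sample entropy of softmax outputs; probs has shape (B, C)."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def mixup(x, lam, rng=None):
    """Standard mixup: blend each sample with a randomly permuted partner."""
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(x.shape[0])
    lam = lam.reshape((x.shape[0],) + (1,) * (x.ndim - 1))
    return lam * x + (1.0 - lam) * x[perm], perm

def uncertainty_aware_batch(x, student_probs, rng=None):
    """Score informativeness by the entropy of the student's predictions, then
    mix uncertain samples more strongly before querying the teacher (an assumed rule)."""
    u = entropy(student_probs)
    u = (u - u.min()) / (u.max() - u.min() + 1e-12)      # normalise to [0, 1]
    lam = 1.0 - 0.5 * u                                  # more uncertain -> stronger mix
    return mixup(x, lam, rng=rng)
```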
34. Temporal LiDAR Frame Prediction for Autonomous Driving [PDF] 返回目录
David Deng, Avideh Zakhor
Abstract: Anticipating the future in a dynamic scene is critical for many fields such as autonomous driving and robotics. In this paper we propose a class of novel neural network architectures to predict future LiDAR frames given previous ones. Since the ground truth in this application is simply the next frame in the sequence, we can train our models in a self-supervised fashion. Our proposed architectures are based on FlowNet3D and Dynamic Graph CNN. We use Chamfer Distance (CD) and Earth Mover's Distance (EMD) as loss functions and evaluation metrics. We train and evaluate our models using the newly released nuScenes dataset, and characterize their performance and complexity with several baselines. Compared to directly using FlowNet3D, our proposed architectures achieve CD and EMD nearly an order of magnitude lower. In addition, we show that our predictions generate reasonable scene flow approximations without using any labelled supervision.
摘要:在动态场景中预测未来对于自动驾驶和机器人技术等许多领域至关重要。在本文中,我们提出了一类新颖的神经网络架构,以根据先前的LiDAR帧预测未来的帧。由于此应用中的真值就是序列中的下一帧,因此我们可以以自监督的方式训练模型。我们提出的架构基于FlowNet3D和动态图CNN。我们将倒角距离(CD)和推土机距离(EMD)用作损失函数和评估指标。我们使用新发布的nuScenes数据集训练和评估模型,并与若干基线比较其性能和复杂度。与直接使用FlowNet3D相比,我们提出的架构将CD和EMD降低了近一个数量级。此外,我们证明了我们的预测无需任何标注监督即可生成合理的场景流近似。
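Chamfer Distance, one of the two losses named in the abstract, has a compact batched form. The symmetric squared-distance variant below is a common choice; the exact normalisation used by the authors is not assumed here.

```python
import torch

def chamfer_distance(pred, target):
    """Symmetric Chamfer Distance between two point clouds.

    pred:   (B, N, 3) predicted LiDAR frame
    target: (B, M, 3) ground-truth next frame
    Returns the batch mean of
        mean_i min_j ||p_i - q_j||^2 + mean_j min_i ||p_i - q_j||^2.
    """
    d = torch.cdist(pred, target, p=2) ** 2              # (B, N, M) pairwise sq. distances
    loss_pred_to_tgt = d.min(dim=2).values.mean(dim=1)   # nearest target per predicted point
    loss_tgt_to_pred = d.min(dim=1).values.mean(dim=1)   # nearest prediction per target point
    return (loss_pred_to_tgt + loss_tgt_to_pred).mean()

if __name__ == "__main__":
    a = torch.randn(2, 1024, 3)
    b = torch.randn(2, 1024, 3)
    print(chamfer_distance(a, b).item())
```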
35. LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos [PDF] 返回目录
Sai Praneeth Reddy Sunkesula, Rishabh Dabral, Ganesh Ramakrishnan
Abstract: Analyzing the interactions between humans and objects from a video includes identification of the relationships between humans and the objects present in the video. It can be thought of as a specialized version of Visual Relationship Detection, wherein one of the objects must be a human. While traditional methods formulate the problem as inference on a sequence of video segments, we present a hierarchical approach, LIGHTEN, to learn visual features to effectively capture spatio-temporal cues at multiple granularities in a video. Unlike current approaches, LIGHTEN avoids using ground truth data like depth maps or 3D human pose, thus increasing generalization across non-RGBD datasets as well. Furthermore, we achieve the same using only the visual features, instead of the commonly used hand-crafted spatial features. We achieve state-of-the-art results in human-object interaction detection (88.9% and 92.6%) and anticipation tasks of CAD-120 and competitive results on image based HOI detection in V-COCO dataset, setting a new benchmark for visual features based approaches. Code for LIGHTEN is available at this https URL
摘要:分析视频中人与物体之间的交互,包括识别人与视频中物体之间的关系。它可以视为视觉关系检测的一个特化版本,其中参与对象之一必须是人。传统方法将问题表述为对一系列视频片段的推理,而我们提出了一种分层方法LIGHTEN,学习视觉特征以有效捕获视频中多个粒度的时空线索。与现有方法不同,LIGHTEN避免使用深度图或3D人体姿态等真值数据,因此也提高了在非RGBD数据集上的泛化能力。此外,我们仅使用视觉特征而非常用的手工设计空间特征即可达到同样的效果。我们在CAD-120的人-物交互检测与预测任务上取得了最新结果(88.9%和92.6%),并在V-COCO数据集的基于图像的HOI检测上取得了有竞争力的结果,为基于视觉特征的方法树立了新的标杆。LIGHTEN的代码可从以下https URL获得。
36. Zoom-to-Inpaint: Image Inpainting with High Frequency Details [PDF] 返回目录
Soo Ye Kim, Kfir Aberman, Nori Kanazawa, Rahul Garg, Neal Wadhwa, Huiwen Chang, Nikhil Karnad, Munchurl Kim, Orly Liba
Abstract: Although deep learning has enabled a huge leap forward in image inpainting, current methods are often unable to synthesize realistic high-frequency details. In this paper, we propose applying super resolution to coarsely reconstructed outputs, refining them at high resolution, and then downscaling the output to the original resolution. By introducing high-resolution images to the refinement network, our framework is able to reconstruct finer details that are usually smoothed out due to spectral bias - the tendency of neural networks to reconstruct low frequencies better than high frequencies. To assist training the refinement network on large upscaled holes, we propose a progressive learning technique in which the size of the missing regions increases as training progresses. Our zoom-in, refine and zoom-out strategy, combined with high-resolution supervision and progressive learning, constitutes a framework-agnostic approach for enhancing high-frequency details that can be applied to other inpainting methods. We provide qualitative and quantitative evaluations along with an ablation analysis to show the effectiveness of our approach, which outperforms state-of-the-art inpainting methods.
摘要:尽管深度学习使图像修复取得了巨大飞跃,但当前的方法通常无法合成逼真的高频细节。在本文中,我们提出先对粗略重建的输出应用超分辨率,在高分辨率下进行细化,再将输出降采样回原始分辨率。通过向细化网络提供高分辨率图像,我们的框架能够重建通常因频谱偏差(即神经网络重建低频优于高频的倾向)而被平滑掉的更精细细节。为了帮助细化网络在放大后的大面积缺失区域上训练,我们提出了一种渐进式学习技术,使缺失区域的尺寸随训练进行而逐渐增大。我们的放大、细化、缩小策略,结合高分辨率监督和渐进式学习,构成了一种与框架无关的增强高频细节的方法,可应用于其他修复方法。我们提供了定性和定量评估以及消融分析,以证明我们方法的有效性,其性能优于最新的图像修复方法。
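A rough sketch of the zoom-in, refine, zoom-out pipeline described in the abstract is given below. Here coarse_net and refine_net are hypothetical placeholder modules, and bicubic resizing stands in for the paper's super-resolution and downscaling steps.

```python
import torch
import torch.nn.functional as F

def zoom_refine_zoom_out(image, mask, coarse_net, refine_net, scale=2):
    """Coarse inpainting -> upscale -> high-resolution refinement -> downscale.

    image: (B, 3, H, W) input with holes, mask: (B, 1, H, W) with 1 inside holes.
    coarse_net and refine_net are assumed modules; bicubic resizing is a
    stand-in for the super-resolution / downscaling components in the abstract.
    """
    coarse = coarse_net(torch.cat([image * (1 - mask), mask], dim=1))

    # Zoom in: refine at a higher resolution, where high-frequency detail lives.
    up = F.interpolate(coarse, scale_factor=scale, mode="bicubic", align_corners=False)
    up_mask = F.interpolate(mask, scale_factor=scale, mode="nearest")
    refined_hr = refine_net(torch.cat([up, up_mask], dim=1))

    # Zoom out: return to the original resolution and composite with known pixels.
    refined = F.interpolate(refined_hr, size=image.shape[-2:], mode="bicubic",
                            align_corners=False)
    return image * (1 - mask) + refined * mask
```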
37. Invariant Teacher and Equivariant Student for Unsupervised 3D Human Pose Estimation [PDF] 返回目录
Chenxin Xu, Siheng Chen, Maosen Li, Ya Zhang
Abstract: We propose a novel method based on teacher-student learning framework for 3D human pose estimation without any 3D annotation or side information. To solve this unsupervised-learning problem, the teacher network adopts pose-dictionary-based modeling for regularization to estimate a physically plausible 3D pose. To handle the decomposition ambiguity in the teacher network, we propose a cycle-consistent architecture promoting a 3D rotation-invariant property to train the teacher network. To further improve the estimation accuracy, the student network adopts a novel graph convolution network for flexibility to directly estimate the 3D coordinates. Another cycle-consistent architecture promoting 3D rotation-equivariant property is adopted to exploit geometry consistency, together with knowledge distillation from the teacher network to improve the pose estimation performance. We conduct extensive experiments on Human3.6M and MPI-INF-3DHP. Our method reduces the 3D joint prediction error by 11.4% compared to state-of-the-art unsupervised methods and also outperforms many weakly-supervised methods that use side information on Human3.6M. Code will be available at this https URL.
摘要:我们提出了一种基于师生学习框架的新方法,无需任何3D标注或辅助信息即可进行3D人体姿态估计。为了解决这一无监督学习问题,教师网络采用基于姿态字典的建模进行正则化,以估计物理上合理的3D姿态。为了处理教师网络中的分解歧义,我们提出了一种促进3D旋转不变性的循环一致架构来训练教师网络。为了进一步提高估计精度,学生网络采用新颖的图卷积网络,以便灵活地直接估计3D坐标。我们还采用另一种促进3D旋转等变性的循环一致架构来利用几何一致性,并结合来自教师网络的知识蒸馏,以提高姿态估计性能。我们在Human3.6M和MPI-INF-3DHP上进行了大量实验。与最新的无监督方法相比,我们的方法将3D关节预测误差降低了11.4%,并且优于许多在Human3.6M上使用辅助信息的弱监督方法。代码将在此https URL上提供。
38. Efficient Golf Ball Detection and Tracking Based on Convolutional Neural Networks and Kalman Filter [PDF] 返回目录
Tianxiao Zhang, Xiaohan Zhang, Yiju Yang, Zongbo Wang, Guanghui Wang
Abstract: This paper focuses on the problem of online golf ball detection and tracking from image sequences. An efficient real-time approach is proposed by exploiting convolutional neural networks (CNN) based object detection and a Kalman filter based prediction. Five classical deep learning-based object detection networks are implemented and evaluated for ball detection, including YOLO v3 and its tiny version, YOLO v4, Faster R-CNN, SSD, and RefineDet. The detection is performed on small image patches instead of the entire image to increase the performance of small ball detection. At the tracking stage, a discrete Kalman filter is employed to predict the location of the ball and a small image patch is cropped based on the prediction. Then, the object detector is utilized to refine the location of the ball and update the parameters of Kalman filter. In order to train the detection models and test the tracking algorithm, a collection of golf ball dataset is created and annotated. Extensive comparative experiments are performed to demonstrate the effectiveness and superior tracking performance of the proposed scheme.
摘要:本文研究从图像序列中在线检测与跟踪高尔夫球的问题。通过结合基于卷积神经网络(CNN)的目标检测和基于卡尔曼滤波器的预测,提出了一种高效的实时方法。我们实现并评估了五种经典的基于深度学习的目标检测网络用于球检测,包括YOLO v3及其tiny版本、YOLO v4、Faster R-CNN、SSD和RefineDet。检测在较小的图像块而非整幅图像上执行,以提高小球检测的性能。在跟踪阶段,采用离散卡尔曼滤波器预测球的位置,并根据预测裁剪出一个小的图像块。然后,利用目标检测器细化球的位置并更新卡尔曼滤波器的参数。为了训练检测模型并测试跟踪算法,我们创建并标注了一个高尔夫球数据集。大量的对比实验证明了所提方案的有效性和出色的跟踪性能。
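The tracking stage described in the abstract alternates a Kalman prediction with a detector-based refinement on a cropped patch. A minimal constant-velocity Kalman filter over the ball centre could look as follows; the noise magnitudes and the detector interface are illustrative assumptions.

```python
import numpy as np

class BallKalmanFilter:
    """Discrete Kalman filter with state [x, y, vx, vy] (constant velocity)."""

    def __init__(self, dt=1.0, process_var=1.0, meas_var=4.0):
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # motion model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # we only observe (x, y)
        self.Q = process_var * np.eye(4)                  # process noise (assumed)
        self.R = meas_var * np.eye(2)                     # measurement noise (assumed)
        self.x = np.zeros(4)
        self.P = np.eye(4) * 100.0                        # large initial uncertainty

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                                 # predicted ball centre

    def update(self, z):
        z = np.asarray(z, dtype=float)                    # detector output (x, y)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P

# Per-frame loop (the detector is a hypothetical CNN run on a crop around the prediction):
# center = kf.predict(); patch = crop(frame, center); z = detector(patch); kf.update(z)
```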
39. Event Camera Calibration of Per-pixel Biased Contrast Threshold [PDF] 返回目录
Ziwei Wang, Yonhon Ng, Pieter van Goor, Robert Mahony
Abstract: Event cameras output asynchronous events to represent intensity changes with a high temporal resolution, even under extreme lighting conditions. Currently, most of the existing works use a single contrast threshold to estimate the intensity change of all pixels. However, complex circuit bias and manufacturing imperfections cause biased pixels and mismatch contrast threshold among pixels, which may lead to undesirable outputs. In this paper, we propose a new event camera model and two calibration approaches which cover event-only cameras and hybrid image-event cameras. When intensity images are simultaneously provided along with events, we also propose an efficient online method to calibrate event cameras that adapts to time-varying event rates. We demonstrate the advantages of our proposed methods compared to the state-of-the-art on several different event camera datasets.
摘要:即使在极端光照条件下,事件摄像机也会输出异步事件,以高时间分辨率来表示强度变化。当前,大多数现有作品使用单个对比度阈值来估计所有像素的强度变化。但是,复杂的电路偏置和制造缺陷会导致像素偏置以及像素之间的对比度阈值不匹配,这可能会导致不良输出。在本文中,我们提出了一种新的事件摄像机模型和两种校准方法,它们涵盖了仅事件摄像机和混合图像事件摄像机。当同时提供强度图像和事件时,我们还提出了一种有效的在线方法来校准事件摄像机,以适应随时间变化的事件发生率。与几个不同的事件摄像机数据集上的最新技术相比,我们证明了我们提出的方法的优势。
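The abstract's premise is that each pixel fires an event once its log-intensity change crosses a contrast threshold, and that in practice this threshold is biased per pixel. The toy simulation below illustrates that generative model; the nominal threshold and the synthetic per-pixel bias are assumptions, not measured sensor values.

```python
import numpy as np

def simulate_events(log_I_prev, log_I_curr, c_on, c_off):
    """Per-pixel event generation with (possibly biased) contrast thresholds.

    log_I_prev, log_I_curr: log-intensity images at two time steps.
    c_on, c_off: per-pixel positive thresholds for ON and OFF events.
    Returns +1 where an ON event fires, -1 for OFF, 0 otherwise (a single
    step of the model; real sensors keep firing until the residual change
    drops below threshold).
    """
    delta = log_I_curr - log_I_prev
    events = np.zeros_like(delta, dtype=np.int8)
    events[delta >= c_on] = 1
    events[delta <= -c_off] = -1
    return events

if __name__ == "__main__":
    h, w = 4, 4
    prev = np.log(np.random.uniform(10, 200, (h, w)))
    curr = prev + np.random.uniform(-0.5, 0.5, (h, w))
    # Nominal threshold 0.2 plus a synthetic per-pixel bias the calibration must recover.
    c_on = np.abs(0.2 + 0.05 * np.random.randn(h, w))
    c_off = np.abs(0.2 + 0.05 * np.random.randn(h, w))
    print(simulate_events(prev, curr, c_on, c_off))
```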
40. Unlabeled Data Guided Semi-supervised Histopathology Image Segmentation [PDF] 返回目录
Hongxiao Wang, Hao Zheng, Jianxu Chen, Lin Yang, Yizhe Zhang, Danny Z. Chen
Abstract: Automatic histopathology image segmentation is crucial to disease analysis. Limited available labeled data hinders the generalizability of trained models under the fully supervised setting. Semi-supervised learning (SSL) based on generative methods has been proven to be effective in utilizing diverse image characteristics. However, it has not been well explored what kinds of generated images would be more useful for model training and how to use such images. In this paper, we propose a new data guided generative method for histopathology image segmentation by leveraging the unlabeled data distributions. First, we design an image generation module. Image content and style are disentangled and embedded in a clustering-friendly space to utilize their distributions. New images are synthesized by sampling and cross-combining contents and styles. Second, we devise an effective data selection policy for judiciously sampling the generated images: (1) to make the generated training set better cover the dataset, the clusters that are underrepresented in the original training set are covered more; (2) to make the training process more effective, we identify and oversample the images of "hard cases" in the data for which annotated training data may be scarce. Our method is evaluated on glands and nuclei datasets. We show that under both the inductive and transductive settings, our SSL method consistently boosts the performance of common segmentation models and attains state-of-the-art results.
摘要:自动组织病理学图像分割对疾病分析至关重要。可用的标注数据有限,阻碍了完全监督设置下训练模型的泛化能力。基于生成方法的半监督学习(SSL)已被证明能有效利用多样的图像特征。然而,哪些类型的生成图像对模型训练更有用、以及应如何使用这些图像,尚未得到充分探索。在本文中,我们提出了一种利用未标注数据分布的、数据引导的组织病理学图像分割生成方法。首先,我们设计了一个图像生成模块:将图像内容与风格解耦并嵌入到便于聚类的空间中以利用其分布,再通过采样和交叉组合内容与风格来合成新图像。其次,我们设计了一种有效的数据选择策略来审慎地采样生成的图像:(1)为使生成的训练集更好地覆盖数据集,更多地覆盖原始训练集中代表性不足的簇;(2)为使训练更有效,识别并过采样数据中标注样本可能稀缺的“困难案例”图像。我们的方法在腺体和细胞核数据集上进行了评估。结果表明,在归纳和转导两种设置下,我们的SSL方法都能持续提升常用分割模型的性能,并取得最新的结果。
41. Semi-Global Shape-aware Network [PDF] 返回目录
Pengju Zhang, Yihong Wu, Jiagang Zhu
Abstract: Non-local operations are usually used to capture long-range dependencies via aggregating global context to each position recently. However, most of the methods cannot preserve object shapes since they only focus on feature similarity but ignore proximity between central and other positions for capturing long-range dependencies, while shape-awareness is beneficial to many computer vision tasks. In this paper, we propose a Semi-Global Shape-aware Network (SGSNet) considering both feature similarity and proximity for preserving object shapes when modeling long-range dependencies. A hierarchical way is taken to aggregate global context. In the first level, each position in the whole feature map only aggregates contextual information in vertical and horizontal directions according to both similarity and proximity. And then the result is input into the second level to do the same operations. By this hierarchical way, each central position gains supports from all other positions, and the combination of similarity and proximity makes each position gain supports mostly from the same semantic object. Moreover, we also propose a linear time algorithm for the aggregation of contextual information, where each of rows and columns in the feature map is treated as a binary tree to reduce similarity computation cost. Experiments on semantic segmentation and image retrieval show that adding SGSNet to existing networks gains solid improvements on both accuracy and efficiency.
摘要:非局部操作通常通过将全局上下文聚合到每个位置来捕获长程依赖关系。然而,大多数方法在捕获长程依赖时只关注特征相似性、而忽略中心位置与其他位置之间的邻近性,因此无法保留物体形状,而形状感知对许多计算机视觉任务都是有益的。在本文中,我们提出了一种半全局形状感知网络(SGSNet),在建模长程依赖时同时考虑特征相似性和邻近性以保留物体形状。我们采用分层方式聚合全局上下文:在第一层,整个特征图中的每个位置仅根据相似性和邻近性在垂直和水平方向上聚合上下文信息;随后将结果输入第二层执行相同的操作。通过这种分层方式,每个中心位置都获得来自所有其他位置的支持,而相似性与邻近性的结合使每个位置获得的支持大多来自同一语义对象。此外,我们还提出了一种用于聚合上下文信息的线性时间算法,其中特征图中的每一行和每一列都被视为一棵二叉树,以降低相似度计算成本。语义分割和图像检索实验表明,将SGSNet加入现有网络可在准确性和效率上带来切实的提升。
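One level of the row-and-column aggregation described in the abstract could be sketched as follows, with other positions weighted by feature similarity multiplied by a Gaussian proximity prior. The exact weighting form, bandwidth, and residual fusion are assumptions; only the similarity-times-proximity idea comes from the abstract.

```python
import torch
import torch.nn.functional as F

def semi_global_aggregate(feat, sigma=8.0):
    """One level of row-and-column context aggregation (a simplified sketch).

    feat: (B, C, H, W). Every position aggregates features along its row and
    its column, weighting the other positions by dot-product similarity
    multiplied by a Gaussian proximity prior with bandwidth sigma.
    """
    b, c, h, w = feat.shape

    def aggregate_1d(x):                      # x: (B*, L, C) sequences
        sim = torch.bmm(x, x.transpose(1, 2)) / c ** 0.5           # feature similarity
        idx = torch.arange(x.size(1), device=x.device, dtype=x.dtype)
        prox = torch.exp(-((idx[None, :] - idx[:, None]) ** 2) / (2 * sigma ** 2))
        attn = F.softmax(sim + prox.log()[None], dim=-1)           # similarity x proximity
        return torch.bmm(attn, x)

    rows = feat.permute(0, 2, 3, 1).reshape(b * h, w, c)           # each row as a sequence
    row_ctx = aggregate_1d(rows).reshape(b, h, w, c)
    cols = feat.permute(0, 3, 2, 1).reshape(b * w, h, c)           # each column as a sequence
    col_ctx = aggregate_1d(cols).reshape(b, w, h, c).permute(0, 2, 1, 3)
    out = feat + 0.5 * (row_ctx + col_ctx).permute(0, 3, 1, 2)     # residual fusion
    # A second level would repeat the same operation on `out`, as in the abstract.
    return out
```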
42. Learning to Recover 3D Scene Shape from a Single Image [PDF] 返回目录
Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, Chunhua Shen
Abstract: Despite significant progress in monocular depth estimation in the wild, recent state-of-the-art methods cannot be used to recover accurate 3D scene shape due to an unknown depth shift induced by shift-invariant reconstruction losses used in mixed-data depth prediction training, and possible unknown camera focal length. We investigate this problem in detail, and propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then use 3D point cloud encoders to predict the missing depth shift and focal length that allow us to recover a realistic 3D scene shape. In addition, we propose an image-level normalized regression loss and a normal-based geometry loss to enhance depth prediction models trained on mixed datasets. We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot dataset generalization. Code is available at: this https URL
摘要:尽管野外单目深度估计已取得显著进展,但由于混合数据深度预测训练中使用的平移不变重建损失会引入未知的深度偏移,加之相机焦距可能未知,最新方法仍无法恢复准确的3D场景形状。我们详细研究了这一问题,并提出一个两阶段框架:首先从单张单目图像预测深度(仅确定到未知的尺度和偏移),然后使用3D点云编码器预测缺失的深度偏移和焦距,从而恢复逼真的3D场景形状。此外,我们提出了图像级归一化回归损失和基于法线的几何损失,以增强在混合数据集上训练的深度预测模型。我们在9个未见过的数据集上测试了深度模型,并在零样本数据集泛化上取得了最新的性能。代码可在以下https URL获得。
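The second stage of the abstract hinges on two unknowns, a depth shift and the focal length, that a point-cloud network must recover before unprojection. The standard pinhole back-projection below shows exactly where these two parameters enter; the centred principal point is an assumption.

```python
import numpy as np

def unproject_depth(depth, focal, depth_shift=0.0):
    """Back-project an H x W depth map into a 3D point cloud.

    A scale/shift-invariant prediction only fixes depth up to d* = d + shift,
    and the focal length controls how x and y spread for a given depth, which
    is why both must be recovered to obtain a plausible scene shape.
    """
    h, w = depth.shape
    d = depth + depth_shift
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0          # principal point assumed centred
    x = (u - cx) / focal * d
    y = (v - cy) / focal * d
    return np.stack([x, y, d], axis=-1).reshape(-1, 3)

if __name__ == "__main__":
    depth = np.random.uniform(1.0, 5.0, (240, 320))
    # The same depth map yields very different shapes under different focal lengths.
    print(unproject_depth(depth, focal=300.0).shape, unproject_depth(depth, focal=600.0)[:1])
```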
43. Roof-GAN: Learning to Generate Roof Geometry and Relations for Residential Houses [PDF] 返回目录
Yiming Qian, Hao Zhang, Yasutaka Furukawa
Abstract: This paper presents Roof-GAN, a novel generative adversarial network that generates structured geometry of residential roof structures as a set of roof primitives and their relationships. Given the number of primitives, the generator produces a structured roof model as a graph, which consists of 1) primitive geometry as raster images at each node, encoding facet segmentation and angles; 2) inter-primitive colinear/coplanar relationships at each edge; and 3) primitive geometry in a vector format at each node, generated by a novel differentiable vectorizer while enforcing the relationships. The discriminator is trained to assess the primitive raster geometry, the primitive relationships, and the primitive vector geometry in a fully end-to-end architecture. Qualitative and quantitative evaluations demonstrate the effectiveness of our approach in generating diverse and realistic roof models over the competing methods with a novel metric proposed in this paper for the task of structured geometry generation. We will share our code and data.
摘要:本文介绍了Roof-GAN,这是一种新型的生成对抗网络,可生成住宅屋顶结构的结构化几何图形,作为一组屋顶图元及其相互关系。给定图元的数量,生成器将生成一个结构化的屋顶模型作为图形,该图由以下组成:1)在每个节点处作为光栅图像的图元几何形状,对小平面分割和角度进行编码; 2)每个边上的本原共线性/共平面关系; 3)在每个节点上以矢量格式的原始几何图形,由新颖的可微分矢量化器在执行关系时生成。鉴别器经过训练,可以评估完整端到端架构中的原始栅格几何,原始关系和原始矢量几何。定性和定量评估证明了我们的方法在竞争方法上生成多样化和逼真的屋顶模型的有效性,本文提出了针对结构化几何生成任务的新颖指标。我们将共享我们的代码和数据。
44. Unsupervised Learning of Local Discriminative Representation for Medical Images [PDF] 返回目录
Huai Chen, Jieyu Li, Renzhen Wang, Yijie Huang, Fanrui Meng, Deyu Meng, Qing Peng, Lisheng Wang
Abstract: Local discriminative representation is needed in many medical image analysis tasks such as identifying sub-types of lesion or segmenting detailed components of anatomical structures by measuring similarity of local image regions. However, the commonly applied supervised representation learning methods require a large amount of annotated data, and unsupervised discriminative representation learning distinguishes different images by learning a global feature. In order to avoid the limitations of these two methods and be suitable for localized medical image analysis tasks, we introduce local discrimination into unsupervised representation learning in this work. The model contains two branches: one is an embedding branch which learns an embedding function to disperse dissimilar pixels over a low-dimensional hypersphere; and the other is a clustering branch which learns a clustering function to classify similar pixels into the same cluster. These two branches are trained simultaneously in a mutually beneficial pattern, and the learnt local discriminative representations are able to well measure the similarity of local image regions. These representations can be transferred to enhance various downstream tasks. Meanwhile, they can also be applied to cluster anatomical structures from unlabeled medical images under the guidance of topological priors from simulation or other structures with similar topological characteristics. The effectiveness and usefulness of the proposed method are demonstrated by enhancing various downstream tasks and clustering anatomical structures in retinal images and chest X-ray images. The corresponding code is available at this https URL.
摘要:在许多医学图像分析任务中都需要局部区分表示,例如通过测量局部图像区域的相似性来识别病变的亚型或分割解剖结构的详细组成部分。但是,通常采用的监督式表示学习方法需要大量的注释数据,无监督的判别式表示学习通过学习全局特征来区分不同的图像。为了避免这两种方法的局限性并适合于局部医学图像分析任务,我们在这项工作中将局部区分引入无监督的表示学习中。该模型包含两个分支:一个是嵌入分支,其学习嵌入功能以在低维超球体上分散相异的像素;另一个是嵌入分支。另一个是聚类分支,其学习聚类功能以将相似像素分类到同一聚类中。这两个分支以互利的模式同时进行训练,并且学习到的局部判别表示能够很好地测量局部图像区域的相似性。可以传输这些表示以增强各种下游任务。同时,它们也可以在模拟的拓扑先验或具有类似拓扑特征的其他结构的指导下,将未标记医学图像的解剖结构聚类。通过增强各种下游任务并在视网膜图像和胸部X射线图像中聚集解剖结构,证明了该方法的有效性和实用性。相应的代码可从此https URL获得。
45. Polyblur: Removing mild blur by polynomial reblurring [PDF] 返回目录
Mauricio Delbracio, Ignacio Garcia-Dorado, Sungjoon Choi, Damien Kelly, Peyman Milanfar
Abstract: We present a highly efficient blind restoration method to remove mild blur in natural images. Contrary to the mainstream, we focus on removing slight blur that is often present, damaging image quality and commonly generated by small out-of-focus, lens blur, or slight camera motion. The proposed algorithm first estimates image blur and then compensates for it by combining multiple applications of the estimated blur in a principled way. To estimate blur we introduce a simple yet robust algorithm based on empirical observations about the distribution of the gradient in sharp natural images. Our experiments show that, in the context of mild blur, the proposed method outperforms traditional and modern blind deblurring methods and runs in a fraction of the time. Our method can be used to blindly correct blur before applying off-the-shelf deep super-resolution methods leading to superior results than other highly complex and computationally demanding techniques. The proposed method estimates and removes mild blur from a 12MP image on a modern mobile phone in a fraction of a second.
摘要:我们提出了一种高效的盲目复原方法,以消除自然图像中的轻微模糊。与主流相反,我们专注于消除经常出现的轻微模糊,从而损害图像质量,并且通常由较小的失焦,镜头模糊或轻微的相机运动产生。所提出的算法首先估计图像模糊,然后以有原则的方式通过组合估计模糊的多种应用来对其进行补偿。为了估计模糊,我们基于经验观察引入了一种简单而鲁棒的算法,该经验是关于清晰自然图像中梯度的分布。我们的实验表明,在温和模糊的情况下,所提出的方法优于传统和现代的盲去模糊方法,并且运行时间仅占一小部分。在应用现成的深层超分辨率方法之前,我们的方法可用于盲目校正模糊,从而获得比其他高度复杂且计算要求高的技术更好的结果。提出的方法可以在一秒钟内估算并消除现代手机上的12MP图像中的轻微模糊。
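The "combining multiple applications of the estimated blur in a principled way" can be illustrated with a polynomial in the blur operator. The Neumann-series truncation below is one such polynomial; the paper estimates its own (possibly anisotropic) blur and derives its own coefficients, so this is only a sketch of the mechanism.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def polynomial_deblur(image, sigma, order=3):
    """Approximate deconvolution by a polynomial in the (estimated) blur operator B.

    Using the Neumann series (I - (I - B))^{-1} ~ sum_{k<order} (I - B)^k, the
    correction only needs repeated applications of the blur B itself, which is
    what makes this kind of method fast. Here B is an isotropic Gaussian with
    the estimated sigma (an illustrative assumption).
    """
    blur = lambda x: gaussian_filter(x, sigma)
    residual = image.copy()                      # (I - B)^0 applied to the image
    estimate = image.copy()
    for _ in range(1, order):
        residual = residual - blur(residual)     # next power of (I - B)
        estimate = estimate + residual
    return np.clip(estimate, 0.0, 1.0)

if __name__ == "__main__":
    sharp = np.random.rand(64, 64)
    blurred = gaussian_filter(sharp, 1.0)
    restored = polynomial_deblur(blurred, sigma=1.0)
    print(np.abs(sharp - blurred).mean(), np.abs(sharp - restored).mean())
```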
46. Learning to Recognize Patch-Wise Consistency for Deepfake Detection [PDF] 返回目录
Tianchen Zhao, Xiang Xu, Mingze Xu, Hui Ding, Yuanjun Xiong, Wei Xia
Abstract: We propose to detect Deepfake generated by face manipulation based on one of their fundamental features: images are blended by patches from multiple sources, carrying distinct and persistent source features. In particular, we propose a novel representation learning approach for this task, called patch-wise consistency learning (PCL). It learns by measuring the consistency of image source features, resulting to representation with good interpretability and robustness to multiple forgery methods. We develop an inconsistency image generator (I2G) to generate training data for PCL and boost its robustness. We evaluate our approach on seven popular Deepfake detection datasets. Our model achieves superior detection accuracy and generalizes well to unseen generation methods. On average, our model outperforms the state-of-the-art in terms of AUC by 2% and 8% in the in- and cross-dataset evaluation, respectively.
摘要:我们建议根据其基本特征之一来检测由面部操作生成的Deepfake:图像是由来自多个来源的补丁混合而成的,具有不同且持久的来源特征。 特别是,我们针对此任务提出了一种新颖的表示学习方法,称为逐块一致性学习(PCL)。 它通过测量图像源特征的一致性来学习,从而获得对多种伪造方法具有良好解释性和鲁棒性的表示形式。 我们开发了一个不一致的图像生成器(I2G),以生成PCL的训练数据并增强其鲁棒性。 我们在七个流行的Deepfake检测数据集上评估了我们的方法。 我们的模型实现了卓越的检测精度,并很好地推广到了看不见的生成方法。 平均而言,在数据集内和交叉数据集评估中,我们的模型在AUC方面分别比最新技术高2%和8%。
47. Self-Supervised Sketch-to-Image Synthesis [PDF] 返回目录
Bingchen Liu, Yizhe Zhu, Kunpeng Song, Ahmed Elgammal
Abstract: Imagining a colored realistic image from an arbitrarily drawn sketch is one of the human capabilities that we eager machines to mimic. Unlike previous methods that either requires the sketch-image pairs or utilize low-quantity detected edges as sketches, we study the exemplar-based sketch-to-image (s2i) synthesis task in a self-supervised learning manner, eliminating the necessity of the paired sketch data. To this end, we first propose an unsupervised method to efficiently synthesize line-sketches for general RGB-only datasets. With the synthetic paired-data, we then present a self-supervised Auto-Encoder (AE) to decouple the content/style features from sketches and RGB-images, and synthesize images that are both content-faithful to the sketches and style-consistent to the RGB-images. While prior works employ either the cycle-consistence loss or dedicated attentional modules to enforce the content/style fidelity, we show AE's superior performance with pure self-supervisions. To further improve the synthesis quality in high resolution, we also leverage an adversarial network to refine the details of synthetic images. Extensive experiments on 1024*1024 resolution demonstrate a new state-of-art-art performance of the proposed model on CelebA-HQ and Wiki-Art datasets. Moreover, with the proposed sketch generator, the model shows a promising performance on style mixing and style transfer, which require synthesized images to be both style-consistent and semantically meaningful. Our code is available on this https URL, and please visit this https URL for an online demo of our model.
摘要:从任意绘制的素描中想象出彩色逼真的图像是我们渴望机器模仿的人类能力之一。与以前的需要素描图像对或利用低数量检测到的边缘作为素描的方法不同,我们以自我监督的学习方式研究基于示例的素描到图像(s2i)合成任务,从而消除了配对的草图数据。为此,我们首先提出了一种无监督方法,可以有效地合成仅用于常规RGB的数据集的线草图。利用合成的配对数据,然后我们提供一个自我监督的自动编码器(AE),以将内容/样式特征与草图和RGB图像分离,并合成内容忠实于草图且样式一致的图像到RGB图像。尽管先前的作品采用了周期一致性损失或专用的注意模块来增强内容/样式的逼真度,但我们通过纯自我监督展示了AE的卓越性能。为了进一步提高高分辨率的合成质量,我们还利用对抗网络来细化合成图像的细节。在1024 * 1024分辨率上的大量实验证明了CelebA-HQ和Wiki-Art数据集上所提出模型的最新技术性能。此外,使用提出的草图生成器,该模型在样式混合和样式转移方面显示出令人鼓舞的性能,这要求合成图像既具有样式一致性又具有语义上的意义。我们的代码可在此https URL上找到,请访问此https URL以获取我们模型的在线演示。
48. Projected Distribution Loss for Image Enhancement [PDF] 返回目录
Mauricio Delbracio, Hossein Talebi, Peyman Milanfar
Abstract: Features obtained from object recognition CNNs have been widely used for measuring perceptual similarities between images. Such differentiable metrics can be used as perceptual learning losses to train image enhancement models. However, the choice of the distance function between input and target features may have a consequential impact on the performance of the trained model. While using the norm of the difference between extracted features leads to limited hallucination of details, measuring the distance between distributions of features may generate more textures; yet also more unrealistic details and artifacts. In this paper, we demonstrate that aggregating 1D-Wasserstein distances between CNN activations is more reliable than the existing approaches, and it can significantly improve the perceptual performance of enhancement models. More explicitly, we show that in imaging applications such as denoising, super-resolution, demosaicing, deblurring and JPEG artifact removal, the proposed learning loss outperforms the current state-of-the-art on reference-based perceptual losses. This means that the proposed learning loss can be plugged into different imaging frameworks and produce perceptually realistic results.
摘要:从目标识别CNN获得的特征已被广泛用于测量图像之间的感知相似度。这种可区分的度量可以用作感知学习损失来训练图像增强模型。但是,输入特征和目标特征之间距离函数的选择可能会对训练模型的性能产生相应影响。虽然使用提取的特征之间的差异的范数导致细节的幻觉有限,但测量特征分布之间的距离可能会生成更多纹理;还有更多不切实际的细节和工件。在本文中,我们证明了聚合CNN激活之间的1D-Wasserstein距离比现有方法更可靠,并且可以显着提高增强模型的感知性能。更明确地,我们表明,在诸如去噪,超分辨率,去马赛克,去模糊和JPEG伪影去除等成像应用中,建议的学习损失优于基于参考的感知损失的最新技术。这意味着可以将建议的学习损失插入不同的成像框架中,并产生可感知的逼真的结果。
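For equal-size 1D samples, the Wasserstein-1 distance has a closed form as the mean absolute difference of sorted values, which is what makes aggregating 1D-Wasserstein distances over CNN activations cheap. The sketch below follows that idea; the random channel projections and the choice of a single layer are assumptions, not necessarily the paper's configuration.

```python
import torch

def sliced_wasserstein_1d(feat_a, feat_b, n_proj=32):
    """Aggregate 1D Wasserstein distances between two activation tensors.

    feat_a, feat_b: (B, C, H, W) activations from the same CNN layer for the
    enhanced image and the reference. Each projected channel's spatial
    activations are treated as an empirical 1D distribution; sorting gives the
    closed-form 1D optimal transport plan.
    """
    b, c, h, w = feat_a.shape
    xa = feat_a.reshape(b, c, h * w)
    xb = feat_b.reshape(b, c, h * w)
    # Random projections over channels (an assumption; the raw channels could
    # also be used directly as the 1D distributions).
    proj = torch.randn(c, n_proj, device=feat_a.device)
    proj = proj / proj.norm(dim=0, keepdim=True)
    pa = torch.einsum("bcn,cp->bpn", xa, proj)
    pb = torch.einsum("bcn,cp->bpn", xb, proj)
    # 1D Wasserstein-1 between equal-size samples = mean |sorted - sorted|.
    pa_sorted, _ = pa.sort(dim=-1)
    pb_sorted, _ = pb.sort(dim=-1)
    return (pa_sorted - pb_sorted).abs().mean()

if __name__ == "__main__":
    a, b = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
    print(sliced_wasserstein_1d(a, b).item())
```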
49. Sparse Signal Models for Data Augmentation in Deep Learning ATR [PDF] 返回目录
Tushar Agarwal, Nithin Sugavanam, Emre Ertin
Abstract: Automatic Target Recognition (ATR) algorithms classify a given Synthetic Aperture Radar (SAR) image into one of the known target classes using a set of training images available for each class. Recently, learning methods have shown to achieve state-of-the-art classification accuracy if abundant training data is available, sampled uniformly over the classes, and their poses. In this paper, we consider the task of ATR with a limited set of training images. We propose a data augmentation approach to incorporate domain knowledge and improve the generalization power of a data-intensive learning algorithm, such as a Convolutional neural network (CNN). The proposed data augmentation method employs a limited persistence sparse modeling approach, capitalizing on commonly observed characteristics of wide-angle synthetic aperture radar (SAR) imagery. Specifically, we exploit the sparsity of the scattering centers in the spatial domain and the smoothly-varying structure of the scattering coefficients in the azimuthal domain to solve the ill-posed problem of over-parametrized model fitting. Using this estimated model, we synthesize new images at poses and sub-pixel translations not available in the given data to augment CNN's training data. The experimental results show that for the training data starved region, the proposed method provides a significant gain in the resulting ATR algorithm's generalization performance.
摘要:自动目标识别(ATR)算法利用每个类别可用的一组训练图像,将给定的合成孔径雷达(SAR)图像分类为某个已知的目标类别。最近的研究表明,如果有充足的训练数据,且这些数据在各类别及其姿态上均匀采样,学习方法可以达到最新的分类精度。在本文中,我们考虑仅有少量训练图像时的ATR任务。我们提出一种数据增强方法,以融入领域知识并提高数据密集型学习算法(例如卷积神经网络,CNN)的泛化能力。所提出的数据增强方法采用有限持续性稀疏建模方法,充分利用广角合成孔径雷达(SAR)图像中常见的特性。具体来说,我们利用散射中心在空间域中的稀疏性以及散射系数在方位角域中的平滑变化结构,来解决过参数化模型拟合这一不适定问题。利用估计出的模型,我们在给定数据中不存在的姿态和亚像素平移处合成新图像,以扩充CNN的训练数据。实验结果表明,在训练数据匮乏的情况下,该方法能显著提升所得ATR算法的泛化性能。
50. ISD: Self-Supervised Learning by Iterative Similarity Distillation [PDF] 返回目录
Ajinkya Tejankar, Soroush Abbasi Koohpayegani, Vipin Pillai, Paolo Favaro, Hamed Pirsiavash
Abstract: Recently, contrastive learning has achieved great results in self-supervised learning, where the main idea is to push two augmentations of an image (positive pairs) closer compared to other random images (negative pairs). We argue that not all random images are equal. Hence, we introduce a self supervised learning algorithm where we use a soft similarity for the negative images rather than a binary distinction between positive and negative pairs. We iteratively distill a slowly evolving teacher model to the student model by capturing the similarity of a query image to some random images and transferring that knowledge to the student. We argue that our method is less constrained compared to recent contrastive learning methods, so it can learn better features. Specifically, our method should handle unbalanced and unlabeled data better than existing contrastive learning methods, because the randomly chosen negative set might include many samples that are semantically similar to the query image. In this case, our method labels them as highly similar while standard contrastive methods label them as negative pairs. Our method achieves better results compared to state-of-the-art models like BYOL and MoCo on transfer learning settings. We also show that our method performs better in the settings where the unlabeled data is unbalanced. Our code is available here: this https URL.
摘要:最近,对比学习在自我监督学习中取得了出色的成绩,其主要思想是与其他随机图像(负对)相比,将图像的两个增强(正对)推近。我们认为并非所有随机图像都是相等的。因此,我们引入了一种自我监督的学习算法,该算法对负图像使用软相似性,而不是对正负对之间使用二进制区分。通过捕获查询图像与一些随机图像的相似性并将该知识传递给学生,我们迭代地将缓慢发展的教师模型提炼为学生模型。我们认为,与最近的对比学习方法相比,我们的方法的约束较少,因此可以学习更好的功能。具体来说,我们的方法应比现有的对比学习方法更好地处理不平衡和未标记的数据,因为随机选择的负数集可能包含语义上与查询图像相似的许多样本。在这种情况下,我们的方法将它们标记为高度相似,而标准对比法将它们标记为负对。与最先进的模型(例如BYOL和MoCo)相比,我们的方法在转移学习设置上可获得更好的结果。我们还表明,在未标记数据不平衡的设置中,我们的方法性能更好。我们的代码在这里可用:此https URL。
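The soft-similarity idea in the abstract can be sketched as distilling the teacher's similarity distribution over a bank of random anchor images into the student via a KL objective. The temperatures, the anchor bank, and the EMA teacher update in the trailing comment are assumptions of a typical setup, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def isd_loss(student_q, teacher_q, anchor_bank, t_student=0.1, t_teacher=0.04):
    """Iterative similarity distillation on one batch.

    student_q: (B, D) student embedding of one augmented view.
    teacher_q: (B, D) teacher embedding of another view of the same images.
    anchor_bank: (K, D) embeddings of random images (e.g. a momentum queue).
    The student is trained so that its similarity distribution over the anchors
    matches the teacher's, instead of pushing every anchor away as a negative.
    """
    student_q = F.normalize(student_q, dim=1)
    teacher_q = F.normalize(teacher_q, dim=1)
    anchors = F.normalize(anchor_bank, dim=1)

    sim_s = student_q @ anchors.t() / t_student          # (B, K) student similarities
    with torch.no_grad():
        sim_t = teacher_q @ anchors.t() / t_teacher      # soft targets from the teacher
        target = F.softmax(sim_t, dim=1)
    return F.kl_div(F.log_softmax(sim_s, dim=1), target, reduction="batchmean")

# The slowly evolving teacher is typically an exponential moving average of the student:
# for p_t, p_s in zip(teacher.parameters(), student.parameters()):
#     p_t.data.mul_(m).add_(p_s.data, alpha=1 - m)
```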
51. Neural Pruning via Growing Regularization [PDF] 返回目录
Huan Wang, Can Qin, Yulun Zhang, Yun Fu
Abstract: Regularization has long been utilized to learn sparsity in deep neural network pruning. However, its role is mainly explored in the small penalty strength regime. In this work, we extend its application to a new scenario where the regularization grows large gradually to tackle two central problems of pruning: pruning schedule and weight importance scoring. (1) The former topic is newly brought up in this work, which we find critical to the pruning performance while receives little research attention. Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains compared with its one-shot counterpart, even when the same weights are removed. (2) The growing penalty scheme also brings us an approach to exploit the Hessian information for more accurate pruning without knowing their specific values, thus not bothered by the common Hessian approximation problems. Empirically, the proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning. Their effectiveness is demonstrated with modern deep neural networks on the CIFAR and ImageNet datasets, achieving competitive results compared to many state-of-the-art algorithms. Our code and trained models are publicly available at this https URL.
摘要:长期以来,正则化一直被用来在深度神经网络剪枝中学习稀疏性。但是,其作用主要是在惩罚强度较小的情形下被研究。在这项工作中,我们将其应用扩展到正则化逐渐增大的新场景,以解决剪枝的两个核心问题:剪枝进度安排和权重重要性评分。(1)前一个问题是本工作新提出的,我们发现它对剪枝性能至关重要,但此前很少受到研究关注。具体来说,我们提出了一种惩罚系数不断上升的L2正则化变体,并表明即使移除的是相同的权重,与一次性(one-shot)剪枝相比,它也能带来显著的精度提升。(2)不断增大的惩罚方案还为我们提供了一种在不知道Hessian具体取值的情况下利用Hessian信息进行更精确剪枝的途径,从而避免了常见的Hessian近似问题。经验上,所提出的算法易于实现,并且在结构化和非结构化剪枝中均可扩展到大型数据集和网络。我们用现代深度神经网络在CIFAR和ImageNet数据集上验证了其有效性,与许多最新算法相比取得了有竞争力的结果。我们的代码和训练好的模型可在此https URL公开获取。
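A hedged sketch of the growing-penalty idea: an L2 term whose coefficient rises with the training step, applied only to the weights currently flagged for pruning. The linear schedule, mask dictionary, and variable names are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def growing_l2_penalty(model, penalized_masks, step, delta=1e-4):
    """L2 regularization with a penalty factor that grows linearly with the
    training step, applied only to weights flagged for pruning (sketch)."""
    lam = delta * step                 # rising penalty factor
    reg = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        mask = penalized_masks.get(name)     # boolean mask per parameter tensor
        if mask is not None:
            reg = reg + lam * (param[mask] ** 2).sum()
    return reg

# inside the training loop (assumed names):
# loss = task_loss + growing_l2_penalty(model, penalized_masks, global_step)
```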
52. S3CNet: A Sparse Semantic Scene Completion Network for LiDAR Point Clouds [PDF] 返回目录
Ran Cheng, Christopher Agia, Yuan Ren, Xinhai Li, Liu Bingbing
Abstract: With the increasing reliance of self-driving and similar robotic systems on robust 3D vision, the processing of LiDAR scans with deep convolutional neural networks has become a trend in academia and industry alike. Prior attempts on the challenging Semantic Scene Completion task - which entails the inference of dense 3D structure and associated semantic labels from "sparse" representations - have been, to a degree, successful in small indoor scenes when provided with dense point clouds or dense depth maps often fused with semantic segmentation maps from RGB images. However, the performance of these systems drop drastically when applied to large outdoor scenes characterized by dynamic and exponentially sparser conditions. Likewise, processing of the entire sparse volume becomes infeasible due to memory limitations and workarounds introduce computational inefficiency as practitioners are forced to divide the overall volume into multiple equal segments and infer on each individually, rendering real-time performance impossible. In this work, we formulate a method that subsumes the sparsity of large-scale environments and present S3CNet, a sparse convolution based neural network that predicts the semantically completed scene from a single, unified LiDAR point cloud. We show that our proposed method outperforms all counterparts on the 3D task, achieving state-of-the art results on the SemanticKITTI benchmark. Furthermore, we propose a 2D variant of S3CNet with a multi-view fusion strategy to complement our 3D network, providing robustness to occlusions and extreme sparsity in distant regions. We conduct experiments for the 2D semantic scene completion task and compare the results of our sparse 2D network against several leading LiDAR segmentation models adapted for bird's eye view segmentation on two open-source datasets.
摘要:随着自动驾驶及类似机器人系统对鲁棒3D视觉的依赖日益增强,利用深度卷积神经网络处理LiDAR扫描已成为学术界和工业界的趋势。语义场景补全任务要求从“稀疏”表示中推断出稠密的3D结构及相应的语义标签,极具挑战性;此前的尝试在提供稠密点云或稠密深度图(通常与RGB图像的语义分割图融合)时,在小型室内场景中取得了一定的成功。但是,当应用于以动态和指数级更稀疏为特征的大型室外场景时,这些系统的性能会急剧下降。同样,由于内存限制,处理整个稀疏体素体变得不可行;而变通方法会带来计算效率低下的问题,因为从业者不得不将整个体积划分为多个相等的分块并逐个推断,从而无法实现实时性能。在这项工作中,我们提出了一种顾及大规模环境稀疏性的方法,并给出了S3CNet——一个基于稀疏卷积的神经网络,可从单个统一的LiDAR点云中预测语义补全后的场景。我们表明,所提出的方法在3D任务上优于所有同类方法,并在SemanticKITTI基准上取得了最先进的结果。此外,我们提出了S3CNet的2D变体,并采用多视图融合策略来补充我们的3D网络,从而对遮挡和远距离区域的极端稀疏具有鲁棒性。我们针对2D语义场景补全任务进行了实验,并在两个开源数据集上将我们的稀疏2D网络与几种适配于鸟瞰图分割的领先LiDAR分割模型进行了比较。
53. uBAM: Unsupervised Behavior Analysis and Magnification using Deep Learning [PDF] 返回目录
Biagio Brattoli, Uta Buechler, Michael Dorkenwald, Philipp Reiser, Linard Filli, Fritjof Helmchen, Anna-Sophia Wahl, Bjoern Ommer
Abstract: Motor behavior analysis is essential to biomedical research and clinical diagnostics as it provides a non-invasive strategy for identifying motor impairment and its change caused by interventions. State-of-the-art instrumented movement analysis is time- and cost-intensive, since it requires placing physical or virtual markers. Besides the effort required for marking keypoints or annotations necessary for training or finetuning a detector, users need to know the interesting behavior beforehand to provide meaningful keypoints. We introduce uBAM, a novel, automatic deep learning algorithm for behavior analysis by discovering and magnifying deviations. We propose an unsupervised learning of posture and behavior representations that enable an objective behavior comparison across subjects. A generative model with novel disentanglement of appearance and behavior magnifies subtle behavior differences across subjects directly in a video without requiring a detour via keypoints or annotations. Evaluations on rodents and human patients with neurological diseases demonstrate the wide applicability of our approach.
摘要:运动行为分析对生物医学研究和临床诊断至关重要,因为它提供了一种非侵入性策略来识别运动障碍及其因干预而产生的变化。最先进的仪器化运动分析耗时且成本高昂,因为它需要放置物理或虚拟标记。除了标记训练或微调检测器所需的关键点或注释的工作之外,用户还需要事先了解感兴趣的行为,才能提供有意义的关键点。我们提出了uBAM,一种通过发现并放大行为偏差来进行行为分析的新颖的自动深度学习算法。我们提出了一种姿态与行为表示的无监督学习方法,可以在不同个体之间进行客观的行为比较。一个对外观和行为进行新颖解耦的生成模型可以直接在视频中放大不同个体之间的细微行为差异,而无需借助关键点或注释。对啮齿动物和患有神经系统疾病的人类患者的评估证明了我们方法的广泛适用性。
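A toy sketch of the magnification idea: amplify how far a sample's behavior code deviates from a reference (e.g. healthy) behavior code before decoding it together with the appearance code. The decoder, latent sizes, and amplification factor below are stand-ins, not the paper's model.

```python
import torch
import torch.nn as nn

# stand-in decoder; the real model is a conditional generator (assumed)
decoder = nn.Sequential(nn.Linear(64 + 64, 128), nn.ReLU(), nn.Linear(128, 64 * 64))

def magnify_behavior(z_appearance, z_behavior, z_reference, alpha=2.0):
    """Amplify the deviation of a behavior code from a reference (e.g. healthy)
    behavior code, then decode appearance + magnified behavior back to a frame."""
    z_mag = z_reference + alpha * (z_behavior - z_reference)
    return decoder(torch.cat([z_appearance, z_mag], dim=1))

frame = magnify_behavior(torch.randn(1, 64), torch.randn(1, 64), torch.randn(1, 64))
```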
54. Shape My Face: Registering 3D Face Scans by Surface-to-Surface Translation [PDF] 返回目录
Mehdi Bahri, Eimear O' Sullivan, Shunwang Gong, Feng Liu, Xiaoming Liu, Michael M. Bronstein, Stefanos Zafeiriou
Abstract: Existing surface registration methods focus on fitting in-sample data with little to no generalization ability and require both heavy pre-processing and careful hand-tuning. In this paper, we cast the registration task as a surface-to-surface translation problem, and design a model to reliably capture the latent geometric information directly from raw 3D face scans. We introduce Shape-My-Face (SMF), a powerful encoder-decoder architecture based on an improved point cloud encoder, a novel visual attention mechanism, graph convolutional decoders with skip connections, and a specialized mouth model that we smoothly integrate with the mesh convolutions. Compared to the previous state-of-the-art learning algorithms for non-rigid registration of face scans, SMF only requires the raw data to be rigidly aligned (with scaling) with a pre-defined face template. Additionally, our model provides topologically-sound meshes with minimal supervision, offers faster training time, has orders of magnitude fewer trainable parameters, is more robust to noise, and can generalize to previously unseen datasets. We extensively evaluate the quality of our registrations on diverse data. We demonstrate the robustness and generalizability of our model with in-the-wild face scans across different modalities, sensor types, and resolutions. Finally, we show that, by learning to register scans, SMF produces a hybrid linear and non-linear morphable model that can be used for generation, shape morphing, and expression transfer through manipulation of the latent space, including in-the-wild. We train SMF on a dataset of human faces comprising 9 large-scale databases on commodity hardware.
摘要:现有的表面配准方法专注于拟合样本内数据,几乎没有泛化能力,并且需要大量的预处理和细致的手工调参。在本文中,我们将配准任务表述为表面到表面的翻译问题,并设计了一个模型,直接从原始3D人脸扫描中可靠地捕获潜在的几何信息。我们提出了Shape-My-Face(SMF),这是一个强大的编码器-解码器架构,基于改进的点云编码器、新颖的视觉注意力机制、带跳跃连接的图卷积解码器,以及一个与网格卷积平滑集成的专用嘴部模型。与以往用于人脸扫描非刚性配准的最先进学习算法相比,SMF仅要求将原始数据与预定义的人脸模板进行刚性对齐(含缩放)。此外,我们的模型能在极少监督下生成拓扑合理的网格,训练时间更短,可训练参数少了几个数量级,对噪声更加鲁棒,并且可以泛化到此前未见过的数据集。我们在多样化的数据上广泛评估了配准质量。我们通过跨不同模态、传感器类型和分辨率的真实场景人脸扫描,展示了模型的鲁棒性和泛化能力。最后,我们表明,通过学习配准扫描,SMF得到了一个线性与非线性混合的可变形模型,可通过操纵潜在空间(包括在真实场景数据上)用于生成、形状变形和表情迁移。我们在由9个大规模数据库组成的人脸数据集上、使用普通商用硬件训练SMF。
55. On Episodes, Prototypical Networks, and Few-shot Learning [PDF] 返回目录
Steinar Laenen, Luca Bertinetto
Abstract: Episodic learning is a popular practice among researchers and practitioners interested in few-shot learning. It consists of organising training in a series of learning problems, each relying on small "support" and "query" sets to mimic the few-shot circumstances encountered during evaluation. In this paper, we investigate the usefulness of episodic learning in Prototypical Networks and Matching Networks, two of the most popular algorithms making use of this practice. Surprisingly, in our experiments we found that, for Prototypical and Matching Networks, it is detrimental to use the episodic learning strategy of separating training samples between support and query set, as it is a data-inefficient way to exploit training batches. These "non-episodic" variants, which are closely related to the classic Neighbourhood Component Analysis, reliably improve over their episodic counterparts in multiple datasets, achieving an accuracy that (in the case of Prototypical Networks) is competitive with the state-of-the-art, despite being extremely simple.
摘要:情景式(episodic)学习是对少样本学习感兴趣的研究人员和从业者中的一种流行做法。它把训练组织成一系列学习问题,每个问题都依靠较小的“支持”集和“查询”集,以模拟评估时遇到的少样本情形。在本文中,我们研究情景式学习在原型网络(Prototypical Networks)和匹配网络(Matching Networks)这两种最常使用该做法的算法中的有效性。出乎意料的是,在实验中我们发现,对于原型网络和匹配网络,使用将训练样本划分为支持集和查询集的情景式学习策略反而有害,因为这是一种数据效率低下的利用训练批次的方式。这些与经典的邻域成分分析(Neighbourhood Component Analysis)密切相关的“非情景式”变体,在多个数据集上稳定地优于其情景式对应版本;尽管极其简单,(在原型网络的情形下)其准确率可与最先进方法相竞争。
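For context, a minimal sketch of the classic Prototypical Networks step that the episodic setup serves: class prototypes are mean support embeddings, and queries are scored by negative squared distance. The paper's non-episodic variant essentially drops the support/query split; the shapes and toy episode below are illustrative.

```python
import torch
import torch.nn.functional as F

def prototypical_logits(support_feats, support_labels, query_feats, n_classes):
    """Classic Prototypical Networks step: class prototypes are the mean support
    embedding per class; queries are scored by negative squared distance."""
    prototypes = torch.stack(
        [support_feats[support_labels == c].mean(dim=0) for c in range(n_classes)]
    )
    return -torch.cdist(query_feats, prototypes) ** 2   # [n_query, n_classes]

support = torch.randn(25, 64)                           # toy 5-way, 5-shot episode
labels = torch.arange(5).repeat_interleave(5)
queries = torch.randn(10, 64)
loss = F.cross_entropy(prototypical_logits(support, labels, queries, 5),
                       torch.randint(0, 5, (10,)))
```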
56. Learning Cross-Domain Correspondence for Control with Dynamics Cycle-Consistency [PDF] 返回目录
Qiang Zhang, Tete Xiao, Alexei A. Efros, Lerrel Pinto, Xiaolong Wang
Abstract: At the heart of many robotics problems is the challenge of learning correspondences across domains. For instance, imitation learning requires obtaining correspondence between humans and robots; sim-to-real requires correspondence between physics simulators and the real world; transfer learning requires correspondences between different robotics environments. This paper aims to learn correspondence across domains differing in representation (vision vs. internal state), physics parameters (mass and friction), and morphology (number of limbs). Importantly, correspondences are learned using unpaired and randomly collected data from the two domains. We propose \textit{dynamics cycles} that align dynamic robot behavior across two domains using a cycle-consistency constraint. Once this correspondence is found, we can directly transfer the policy trained on one domain to the other, without needing any additional fine-tuning on the second domain. We perform experiments across a variety of problem domains, both in simulation and on real robot. Our framework is able to align uncalibrated monocular video of a real robot arm to dynamic state-action trajectories of a simulated arm without paired data. Video demonstrations of our results are available at: this https URL .
摘要:许多机器人问题的核心挑战是学习跨领域的对应关系。例如,模仿学习需要获得人与机器人之间的对应关系;模拟到现实(sim-to-real)需要物理模拟器与现实世界之间的对应关系;迁移学习需要不同机器人环境之间的对应关系。本文旨在学习在表示形式(视觉与内部状态)、物理参数(质量和摩擦力)和形态(肢体数量)上均不相同的领域之间的对应关系。重要的是,这些对应关系是利用来自两个领域的未配对、随机收集的数据学习得到的。我们提出\textit{dynamics cycles},利用循环一致性约束来对齐两个领域中的动态机器人行为。一旦找到这种对应关系,我们就可以将在一个领域上训练的策略直接迁移到另一个领域,而无需在第二个领域上进行任何额外的微调。我们在仿真和真实机器人上的多种问题领域中进行了实验。我们的框架能够在没有配对数据的情况下,将真实机械臂的未标定单目视频与仿真机械臂的动态状态-动作轨迹对齐。我们的结果的视频演示可在以下网址获得:此https URL。
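A toy sketch of the dynamics cycle-consistency constraint: stepping forward in domain A and then translating the state should agree with translating first and then stepping forward in domain B. All modules below are small linear stand-ins with assumed dimensions, not the paper's networks.

```python
import torch
import torch.nn as nn

# toy stand-ins for the paper's learned components (assumed shapes)
phi_state = nn.Linear(8, 8)     # domain-A state  -> domain-B state
g_action = nn.Linear(4, 4)      # domain-A action -> domain-B action
f_a = nn.Linear(8 + 4, 8)       # forward dynamics in domain A
f_b = nn.Linear(8 + 4, 8)       # forward dynamics in domain B

def dynamics_cycle_loss(s_a, a_a):
    """Translate-then-step should match step-then-translate (cycle consistency)."""
    next_a = f_a(torch.cat([s_a, a_a], dim=1))                      # step in A
    path1 = phi_state(next_a)                                       # then translate
    path2 = f_b(torch.cat([phi_state(s_a), g_action(a_a)], dim=1))  # translate, then step in B
    return ((path1 - path2) ** 2).mean()

loss = dynamics_cycle_loss(torch.randn(16, 8), torch.randn(16, 4))
```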
57. Deep Learning Techniques for Super-Resolution in Video Games [PDF] 返回目录
Alexander Watson
Abstract: The computational cost of video game graphics is increasing and hardware for processing graphics is struggling to keep up. This means that computer scientists need to develop creative new ways to improve the performance of graphical processing hardware. Deep learning techniques for video super-resolution can enable video games to have high quality graphics whilst offsetting much of the computational cost. These emerging technologies allow consumers to have improved performance and enjoyment from video games and have the potential to become standard within the game development industry.
摘要:视频游戏图形的计算成本正在增加,并且用于处理图形的硬件也难以跟上。 这意味着计算机科学家需要开发创新的方法来提高图形处理硬件的性能。 用于视频超分辨率的深度学习技术可以使视频游戏具有高质量的图形,同时抵消很多计算成本。 这些新兴技术使消费者能够提高视频游戏的性能和娱乐性,并有可能成为游戏开发行业的标准配置。
58. Describing the Structural Phenotype of the Glaucomatous Optic Nerve Head Using Artificial Intelligence [PDF] 返回目录
Satish K. Panda, Haris Cheong, Tin A. Tun, Sripad K. Devella, Ramaswami Krishnadas, Martin L. Buist, Shamira Perera, Ching-Yu Cheng, Tin Aung, Alexandre H. Thiéry, Michaël J. A. Girard
Abstract: The optic nerve head (ONH) typically experiences complex neural- and connective-tissue structural changes with the development and progression of glaucoma, and monitoring these changes could be critical for improved diagnosis and prognosis in the glaucoma clinic. The gold-standard technique to assess structural changes of the ONH clinically is optical coherence tomography (OCT). However, OCT is limited to the measurement of a few hand-engineered parameters, such as the thickness of the retinal nerve fiber layer (RNFL), and has not yet been qualified as a stand-alone device for glaucoma diagnosis and prognosis applications. We argue this is because the vast amount of information available in a 3D OCT scan of the ONH has not been fully exploited. In this study we propose a deep learning approach that can: \textbf{(1)} fully exploit information from an OCT scan of the ONH; \textbf{(2)} describe the structural phenotype of the glaucomatous ONH; and that can \textbf{(3)} be used as a robust glaucoma diagnosis tool. Specifically, the structural features identified by our algorithm were found to be related to clinical observations of glaucoma. The diagnostic accuracy from these structural features was $92.0 \pm 2.3 \%$ with a sensitivity of $90.0 \pm 2.4 \% $ (at $95 \%$ specificity). By changing their magnitudes in steps, we were able to reveal how the morphology of the ONH changes as one transitions from a `non-glaucoma' to a `glaucoma' condition. We believe our work may have strong clinical implication for our understanding of glaucoma pathogenesis, and could be improved in the future to also predict future loss of vision.
摘要:视神经头(ONH)通常会随着青光眼的发生和发展而经历复杂的神经和结缔组织结构变化,因此监测这些变化对于改善青光眼临床诊断和预后至关重要。临床上评估ONH结构变化的金标准技术是光学相干断层扫描(OCT)。但是,OCT仅限于测量一些手工设计的参数,例如视网膜神经纤维层(RNFL)的厚度,并且尚未被认可为用于青光眼诊断和预后应用的独立设备。我们认为这是因为尚未充分利用ONH的3D OCT扫描中蕴含的大量信息。在这项研究中,我们提出了一种深度学习方法,该方法可以:\textbf{(1)}充分利用ONH的OCT扫描信息;\textbf{(2)}描述青光眼ONH的结构表型;并且\textbf{(3)}可以用作可靠的青光眼诊断工具。具体来说,我们的算法识别出的结构特征被发现与青光眼的临床观察相关。基于这些结构特征的诊断准确率为$92.0 \pm 2.3\%$,灵敏度为$90.0 \pm 2.4\%$(特异性为$95\%$)。通过逐步改变这些特征的幅值,我们能够揭示ONH的形态在从“非青光眼”状态向“青光眼”状态转变时是如何变化的。我们认为我们的工作可能对理解青光眼的发病机制具有重要的临床意义,并且在将来可以进一步改进以预测未来的视力丧失。
59. Image-Based Jet Analysis [PDF] 返回目录
Michael Kagan
Abstract: Image-based jet analysis is built upon the jet image representation of jets that enables a direct connection between high energy physics and the fields of computer vision and deep learning. Through this connection, a wide array of new jet analysis techniques have emerged. In this text, we survey jet image based classification models, built primarily on the use of convolutional neural networks, examine the methods to understand what these models have learned and what is their sensitivity to uncertainties, and review the recent successes in moving these models from phenomenological studies to real world application on experiments at the LHC. Beyond jet classification, several other applications of jet image based techniques, including energy estimation, pileup noise reduction, data generation, and anomaly detection, are discussed.
摘要:基于图像的喷注分析建立在喷注的喷注图像表示之上,它在高能物理与计算机视觉、深度学习领域之间建立了直接联系。通过这种联系,涌现出了大量新的喷注分析技术。在本文中,我们综述了主要基于卷积神经网络构建的基于喷注图像的分类模型,考察了用于理解这些模型学到了什么、以及它们对不确定性有多敏感的方法,并回顾了近期将这些模型从唯象研究推进到LHC实验中实际应用的成功进展。除了喷注分类之外,还讨论了基于喷注图像技术的其他几种应用,包括能量估计、堆积(pileup)噪声抑制、数据生成和异常检测。
60. Combating Mode Collapse in GAN training: An Empirical Analysis using Hessian Eigenvalues [PDF] 返回目录
Ricard Durall, Avraam Chatzimichailidis, Peter Labus, Janis Keuper
Abstract: Generative adversarial networks (GANs) provide state-of-the-art results in image generation. However, despite being so powerful, they still remain very challenging to train. This is in particular caused by their highly non-convex optimization space leading to a number of instabilities. Among them, mode collapse stands out as one of the most daunting ones. This undesirable event occurs when the model can only fit a few modes of the data distribution, while ignoring the majority of them. In this work, we combat mode collapse using second-order gradient information. To do so, we analyse the loss surface through its Hessian eigenvalues, and show that mode collapse is related to the convergence towards sharp minima. In particular, we observe how the eigenvalues of the $G$ are directly correlated with the occurrence of mode collapse. Finally, motivated by these findings, we design a new optimization algorithm called nudged-Adam (NuGAN) that uses spectral information to overcome mode collapse, leading to empirically more stable convergence properties.
摘要:生成对抗网络(GAN)在图像生成方面提供了最先进的结果。然而,尽管功能强大,其训练仍然非常困难。这尤其是由其高度非凸的优化空间所导致的诸多不稳定性造成的。其中,模式崩溃是最令人生畏的问题之一。当模型只能拟合数据分布的少数几个模式而忽略其余大部分模式时,就会发生这一不良现象。在这项工作中,我们利用二阶梯度信息来对抗模式崩溃。为此,我们通过Hessian特征值分析损失曲面,并表明模式崩溃与向尖锐极小值的收敛有关。特别地,我们观察到$G$的特征值与模式崩溃的发生直接相关。最后,受这些发现的启发,我们设计了一种名为nudged-Adam(NuGAN)的新优化算法,它利用谱信息来克服模式崩溃,从而在经验上获得更稳定的收敛性质。
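A hedged sketch of how top Hessian eigenvalues can be estimated in practice, via power iteration on Hessian-vector products computed with autograd; this is a generic analysis tool in the spirit of the study, not the NuGAN optimizer itself.

```python
import torch

def top_hessian_eigenvalue(loss, params, n_iters=20):
    """Power iteration with Hessian-vector products (autograd) to estimate the
    largest Hessian eigenvalue of `loss` w.r.t. `params` (sketch)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = torch.tensor(0.0)
    for _ in range(n_iters):
        norm = torch.sqrt(sum((x ** 2).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: differentiate (grads . v) once more
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((h * x).sum() for h, x in zip(hv, v))   # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig.item()
```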
61. Kernelized Classification in Deep Networks [PDF] 返回目录
Sadeep Jayasumana, Srikumar Ramalingam, Sanjiv Kumar
Abstract: In this paper, we propose a kernelized classification layer for deep networks. Although conventional deep networks introduce an abundance of nonlinearity for representation (feature) learning, they almost universally use a linear classifier on the learned feature vectors. We introduce a nonlinear classification layer by using the kernel trick on the softmax cross-entropy loss function during training and the scorer function during testing. Furthermore, we study the choice of kernel functions one could use with this framework and show that the optimal kernel function for a given problem can be learned automatically within the deep network itself using the usual backpropagation and gradient descent methods. To this end, we exploit a classic mathematical result on the positive definite kernels on the unit n-sphere embedded in the (n+1)-dimensional Euclidean space. We show the usefulness of the proposed nonlinear classification layer on several vision datasets and tasks.
摘要:在本文中,我们为深度网络提出了一种核化分类层。尽管常规深度网络为表示(特征)学习引入了大量非线性,但它们几乎无一例外地在学到的特征向量上使用线性分类器。我们通过在训练时对softmax交叉熵损失函数、在测试时对评分函数(scorer function)使用核技巧,引入了一个非线性分类层。此外,我们研究了该框架中可以使用的核函数的选择,并表明针对给定问题的最优核函数可以使用常规的反向传播和梯度下降方法在深度网络内部自动学习。为此,我们利用了关于嵌入在(n+1)维欧几里得空间中的单位n维球面上的正定核的一个经典数学结果。我们在多个视觉数据集和任务上展示了所提出的非线性分类层的有用性。
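A minimal sketch of a kernelized classification layer: logits are kernel evaluations between the feature vector and per-class weight vectors rather than dot products. An RBF kernel is used purely for illustration; in the paper the kernel is learned within the network.

```python
import torch
import torch.nn as nn

class KernelizedClassifier(nn.Module):
    """Logits are kernel similarities k(x, w_c) instead of dot products <x, w_c>.
    An RBF kernel is an illustrative assumption, not the paper's learned kernel;
    training still uses softmax cross-entropy on these logits."""
    def __init__(self, in_dim, n_classes, gamma=1.0):
        super().__init__()
        self.class_weights = nn.Parameter(torch.randn(n_classes, in_dim))
        self.gamma = gamma

    def forward(self, features):
        d2 = torch.cdist(features, self.class_weights) ** 2   # [B, n_classes]
        return torch.exp(-self.gamma * d2)

logits = KernelizedClassifier(64, 10)(torch.randn(8, 64))
```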
62. Learned Block-based Hybrid Image Compression [PDF] 返回目录
Yaojun Wu, Xin Li, Zhizheng Zhang, Xin Jin, Zhibo Chen
Abstract: Learned image compression based on neural networks have made huge progress thanks to its superiority in learning better representation through non-linear transformation. Different from traditional hybrid coding frameworks, that are commonly block-based, existing learned image codecs usually process the images in a full-resolution manner thus not supporting acceleration via parallelism and explicit prediction. Compared to learned image codecs, traditional hybrid coding frameworks are in general hand-crafted and lack the adaptability of being optimized according to heterogeneous metrics. Therefore, in order to collect their good qualities and offset their weakness, we explore a learned block-based hybrid image compression (LBHIC) framework, which achieves a win-win between coding performance and efficiency. Specifically, we introduce block partition and explicit learned predictive coding into learned image compression framework. Compared to prediction through linear weighting of neighbor pixels in traditional codecs, our contextual prediction module (CPM) is designed to better capture long-range correlations by utilizing the strip pooling to extract the most relevant information in neighboring latent space. Moreover, to alleviate blocking artifacts, we further propose a boundary-aware post-processing module (BPM) with the importance of edge taken into account. Extensive experiments demonstrate that the proposed LBHIC codec outperforms state-of-the-art image compression methods in terms of both PSNR and MS-SSIM metrics and promises obvious time-saving.
摘要:基于神经网络的学习型图像压缩,由于其通过非线性变换学习更好表示的优势,已经取得了长足的进步。与通常基于块的传统混合编码框架不同,现有的学习型图像编解码器通常以全分辨率方式处理图像,因此不支持通过并行和显式预测进行加速。与学习型图像编解码器相比,传统混合编码框架通常是手工设计的,缺乏按异构指标进行优化的适应性。因此,为了兼收两者之长、弥补各自的不足,我们探索了一种基于块的学习型混合图像压缩(LBHIC)框架,在编码性能和效率之间实现了双赢。具体来说,我们将块划分和显式的学习型预测编码引入学习型图像压缩框架。与传统编解码器中通过相邻像素线性加权进行预测相比,我们的上下文预测模块(CPM)利用条带池化提取相邻潜在空间中最相关的信息,从而更好地捕获长距离相关性。此外,为减轻块效应伪影,我们进一步提出了一个考虑边缘重要性的边界感知后处理模块(BPM)。大量实验表明,所提出的LBHIC编解码器在PSNR和MS-SSIM指标上均优于最先进的图像压缩方法,并能显著节省时间。
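A short sketch of strip pooling, the operation the contextual prediction module relies on: average over one spatial axis at a time so that each position aggregates its whole row and column of context. The module layout below is an assumption for illustration, not the exact CPM.

```python
import torch
import torch.nn as nn

class StripPooling(nn.Module):
    """Average-pool along H and along W separately, project, and broadcast back,
    so every position sees its full row and column of context (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                  # x: [B, C, H, W]
        row = self.conv_h(x.mean(dim=3, keepdim=True))     # [B, C, H, 1]
        col = self.conv_w(x.mean(dim=2, keepdim=True))     # [B, C, 1, W]
        return x + row + col                               # broadcast strips back

out = StripPooling(32)(torch.randn(2, 32, 16, 16))
```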
63. A new semi-supervised self-training method for lung cancer prediction [PDF] 返回目录
Kelvin Shak, Mundher Al-Shabi, Andrea Liew, Boon Leong Lan, Wai Yee Chan, Kwan Hoong Ng, Maxine Tan
Abstract: Background and Objective: Early detection of lung cancer is crucial as it has high mortality rate with patients commonly present with the disease at stage 3 and above. There are only relatively few methods that simultaneously detect and classify nodules from computed tomography (CT) scans. Furthermore, very few studies have used semi-supervised learning for lung cancer prediction. This study presents a complete end-to-end scheme to detect and classify lung nodules using the state-of-the-art Self-training with Noisy Student method on a comprehensive CT lung screening dataset of around 4,000 CT scans. Methods: We used three datasets, namely LUNA16, LIDC and NLST, for this study. We first utilise a three-dimensional deep convolutional neural network model to detect lung nodules in the detection stage. The classification model known as Maxout Local-Global Network uses non-local networks to detect global features including shape features, residual blocks to detect local features including nodule texture, and a Maxout layer to detect nodule variations. We trained the first Self-training with Noisy Student model to predict lung cancer on the unlabelled NLST datasets. Then, we performed Mixup regularization to enhance our scheme and provide robustness to erroneous labels. Results and Conclusions: Our new Mixup Maxout Local-Global network achieves an AUC of 0.87 on 2,005 completely independent testing scans from the NLST dataset. Our new scheme significantly outperformed the next highest performing method at the 5% significance level using DeLong's test (p = 0.0001). This study presents a new complete end-to-end scheme to predict lung cancer using Self-training with Noisy Student combined with Mixup regularization. On a completely independent dataset of 2,005 scans, we achieved state-of-the-art performance even with more images as compared to other methods.
摘要:背景与目的:肺癌的早期发现至关重要,因为其死亡率很高,且患者通常在3期及以上才确诊。能够从计算机断层扫描(CT)中同时检测并分类结节的方法相对较少。此外,很少有研究将半监督学习用于肺癌预测。本研究提出了一个完整的端到端方案,在约4,000例CT扫描的综合性肺部CT筛查数据集上,使用最先进的带噪声学生自训练(Self-training with Noisy Student)方法来检测并分类肺结节。方法:本研究使用了三个数据集,即LUNA16、LIDC和NLST。我们首先在检测阶段利用三维深度卷积神经网络模型检测肺结节。名为Maxout局部-全局网络(Maxout Local-Global Network)的分类模型使用非局部网络检测包括形状特征在内的全局特征,使用残差块检测包括结节纹理在内的局部特征,并使用Maxout层检测结节变化。我们在未标注的NLST数据集上训练了第一个带噪声学生自训练模型来预测肺癌。然后,我们采用Mixup正则化来增强该方案,并提高对错误标签的鲁棒性。结果与结论:我们新的Mixup Maxout局部-全局网络在来自NLST数据集的2,005例完全独立的测试扫描上取得了0.87的AUC。采用DeLong检验,我们的新方案在5%显著性水平上显著优于次优方法(p = 0.0001)。本研究提出了一种新的完整端到端方案,利用带噪声学生自训练结合Mixup正则化来预测肺癌。在2,005例扫描的完全独立数据集上,即便评估的图像多于其他方法,我们也取得了最先进的性能。
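A minimal sketch of the Mixup regularization used in the scheme: random pairs of inputs and their one-hot labels are blended with a Beta-distributed coefficient, which also softens the effect of noisy labels. The alpha value and tensor shapes are illustrative.

```python
import numpy as np
import torch

def mixup_batch(x, y_onehot, alpha=0.4):
    """Mixup: blend random pairs of inputs and their (one-hot) labels with a
    Beta-distributed coefficient, which also dampens the impact of noisy labels."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

# toy 3D CT patches (batch of 8, two classes)
x_mix, y_mix = mixup_batch(torch.randn(8, 1, 64, 64, 64),
                           torch.eye(2)[torch.randint(0, 2, (8,))])
```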
64. Joint Search of Data Augmentation Policies and Network Architectures [PDF] 返回目录
Taiga Kashima, Yoshihiro Yamada, Shunta Saito
Abstract: The common pipeline of training deep neural networks consists of several building blocks such as data augmentation and network architecture selection. AutoML is a research field that aims at automatically designing those parts, but most methods explore each part independently because it is more challenging to simultaneously search all the parts. In this paper, we propose a joint optimization method for data augmentation policies and network architectures to bring more automation to the design of training pipeline. The core idea of our approach is to make the whole part differentiable. The proposed method combines differentiable methods for augmentation policy search and network architecture search to jointly optimize them in the end-to-end manner. The experimental results show our method achieves competitive or superior performance to the independently searched results.
摘要:训练深度神经网络的常见流水线由若干组成部分构成,例如数据增强和网络架构选择。AutoML是旨在自动设计这些部分的研究领域,但大多数方法都是独立地探索每个部分,因为同时搜索所有部分更具挑战性。在本文中,我们提出了一种数据增强策略与网络架构的联合优化方法,为训练流水线的设计带来更多自动化。我们方法的核心思想是使整个流程可微。所提出的方法将可微的数据增强策略搜索与可微的网络架构搜索相结合,以端到端的方式对它们进行联合优化。实验结果表明,与独立搜索得到的结果相比,我们的方法取得了相当或更优的性能。
65. A Contrast Synthesized Thalamic Nuclei Segmentation Scheme using Convolutional Neural Networks [PDF] 返回目录
Lavanya Umapathy, Mahesh Bharath Keerthivasan, Natalie M. Zahr, Ali Bilgin, Manojkumar Saranathan
Abstract: Thalamic nuclei have been implicated in several neurological diseases. WMn-MPRAGE images have been shown to provide better intra-thalamic nuclear contrast compared to conventional MPRAGE images but the additional acquisition results in increased examination times. In this work, we investigated 3D Convolutional Neural Network (CNN) based techniques for thalamic nuclei parcellation from conventional MPRAGE images. Two 3D CNNs were developed and compared for thalamic nuclei parcellation using MPRAGE images: a) a native contrast segmentation (NCS) and b) a synthesized contrast segmentation (SCS) using WMn-MPRAGE images synthesized from MPRAGE images. We trained the two segmentation frameworks using MPRAGE images (n=35) and thalamic nuclei labels generated on WMn-MPRAGE images using a multi-atlas based parcellation technique. The segmentation accuracy and clinical utility were evaluated on a cohort comprising of healthy subjects and patients with alcohol use disorder (AUD) (n=45). The SCS network yielded higher Dice scores in the Medial geniculate nucleus (P=.003) and Centromedian nucleus (P=.01) with lower volume differences for Ventral anterior (P=.001) and Ventral posterior lateral (P=.01) nuclei when compared to the NCS network. A Bland-Altman analysis revealed tighter limits of agreement with lower coefficient of variation between true volumes and those predicted by the SCS network. The SCS network demonstrated a significant atrophy in Ventral lateral posterior nucleus in AUD patients compared to healthy age-matched controls (P=0.01), agreeing with previous studies on thalamic atrophy in alcoholism, whereas the NCS network showed spurious atrophy of the Ventral posterior lateral nucleus. CNN-based contrast synthesis prior to segmentation can provide fast and accurate thalamic nuclei segmentation from conventional MPRAGE images.
摘要:丘脑核团与多种神经系统疾病有关。与常规MPRAGE图像相比,WMn-MPRAGE图像已被证明能提供更好的丘脑内核团对比度,但额外的采集会延长检查时间。在这项工作中,我们研究了基于3D卷积神经网络(CNN)、从常规MPRAGE图像进行丘脑核团分割的技术。我们开发并比较了两个用于基于MPRAGE图像进行丘脑核团分割的3D CNN:a)原生对比度分割(NCS);b)合成对比度分割(SCS),其使用由MPRAGE图像合成的WMn-MPRAGE图像。我们使用MPRAGE图像(n=35)以及利用基于多图谱的分割技术在WMn-MPRAGE图像上生成的丘脑核团标签来训练这两个分割框架。在由健康受试者和酒精使用障碍(AUD)患者组成的队列(n=45)上评估了分割精度和临床效用。与NCS网络相比,SCS网络在内侧膝状体核(P=.003)和中央中核(P=.01)上取得了更高的Dice分数,在腹前核(P=.001)和腹后外侧核(P=.01)上的体积差异更小。Bland-Altman分析显示,SCS网络预测的体积与真实体积之间的一致性界限更窄、变异系数更低。与年龄匹配的健康对照相比,SCS网络显示AUD患者的腹外侧后核显著萎缩(P=0.01),这与先前关于酒精中毒中丘脑萎缩的研究一致;而NCS网络则在腹后外侧核上显示出虚假的萎缩。在分割之前进行基于CNN的对比度合成,可以从常规MPRAGE图像中实现快速且准确的丘脑核团分割。
66. On the Limitations of Denoising Strategies as Adversarial Defenses [PDF] 返回目录
Zhonghan Niu, Zhaoxi Chen, Linyi Li, Yubin Yang, Bo Li, Jinfeng Yi
Abstract: As adversarial attacks against machine learning models have raised increasing concerns, many denoising-based defense approaches have been proposed. In this paper, we summarize and analyze the defense strategies in the form of symmetric transformation via data denoising and reconstruction (denoted as $F+$ inverse $F$, $F-IF$ Framework). In particular, we categorize these denoising strategies from three aspects (i.e. denoising in the spatial domain, frequency domain, and latent space, respectively). Typically, defense is performed on the entire adversarial example, both image and perturbation are modified, making it difficult to tell how it defends against the perturbations. To evaluate the robustness of these denoising strategies intuitively, we directly apply them to defend against adversarial noise itself (assuming we have obtained all of it), which saving us from sacrificing benign accuracy. Surprisingly, our experimental results show that even if most of the perturbations in each dimension is eliminated, it is still difficult to obtain satisfactory robustness. Based on the above findings and analyses, we propose the adaptive compression strategy for different frequency bands in the feature domain to improve the robustness. Our experiment results show that the adaptive compression strategies enable the model to better suppress adversarial perturbations, and improve robustness compared with existing denoising strategies.
摘要:随着对机器学习模型的对抗性攻击日益引起人们的关注,提出了许多基于降噪的防御方法。在本文中,我们通过数据去噪和重建(以$ F + $逆$ F $,$ F-IF $框架表示)以对称变换的形式总结和分析防御策略。特别是,我们从三个方面(即分别在空间域,频域和潜在空间中进行降噪)对这些降噪策略进行了分类。通常,防御是在整个对抗示例中进行的,图像和扰动都被修改,这使得很难说出如何防御扰动。为了直观地评估这些去噪策略的鲁棒性,我们直接将它们应用来抵御对抗性噪声本身(假设我们已经获得了所有噪声),这使我们免于牺牲良性准确性。令人惊讶的是,我们的实验结果表明,即使消除了每个维度上的大多数扰动,仍然很难获得令人满意的鲁棒性。基于以上发现和分析,我们提出了针对特征域中不同频段的自适应压缩策略,以提高鲁棒性。我们的实验结果表明,与现有的降噪策略相比,自适应压缩策略可使模型更好地抑制对抗性扰动,并提高鲁棒性。
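As a hedged illustration of the evaluation idea above (denoising the perturbation itself rather than the full adversarial example), the snippet below uses a simple FFT low-pass filter as a stand-in for a frequency-domain denoiser and measures how much perturbation energy survives; the filter and the keep_fraction threshold are assumptions, not the authors' code.

```python
# Apply a symmetric transform (FFT -> filter -> inverse FFT) to a known
# perturbation and report the fraction of its energy that is left afterwards.
import numpy as np

def lowpass_fft(delta, keep_fraction=0.25):
    """Zero out the high-frequency FFT coefficients of a 2D perturbation map."""
    spec = np.fft.fft2(delta)
    h, w = delta.shape
    kh, kw = int(h * keep_fraction), int(w * keep_fraction)
    mask = np.zeros_like(spec, dtype=bool)
    mask[:kh, :kw] = mask[:kh, -kw:] = mask[-kh:, :kw] = mask[-kh:, -kw:] = True
    return np.real(np.fft.ifft2(spec * mask))

rng = np.random.default_rng(0)
delta = rng.normal(scale=8.0 / 255.0, size=(32, 32))   # toy adversarial perturbation
residual = lowpass_fft(delta)                          # what the "denoiser" leaves behind
kept = 100.0 * np.linalg.norm(residual) / np.linalg.norm(delta)
print("perturbation energy kept after filtering: %.1f%%" % kept)
```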
67. Clique: Spatiotemporal Object Re-identification at the City Scale [PDF] 返回目录
Tiantu Xu, Kaiwen Shen, Yang Fu, Humphrey Shi, Felix Xiaozhu Lin
Abstract: Object re-identification (ReID) is a key application of city-scale cameras. While classic ReID tasks are often considered as image retrieval, we treat them as spatiotemporal queries for the locations and times at which the target object appeared. Spatiotemporal ReID is challenged by the accuracy limitations of computer vision algorithms and the colossal volume of video from city cameras. We present Clique, a practical ReID engine that builds upon two new techniques: (1) Clique assesses target occurrences by clustering fuzzy object features extracted by ReID algorithms, with each cluster representing the general impression of a distinct object to be matched against the input; (2) to search in videos, Clique samples cameras to maximize the spatiotemporal coverage and incrementally adds cameras for processing on demand. Through evaluation on 25 hours of videos from 25 cameras, Clique reached a high accuracy of 0.87 (recall at 5) across 70 queries while running at 830x of video realtime.
摘要:对象重新识别(ReID)是城市规模摄像机的关键应用。虽然经典的ReID任务通常被视为图像检索,但我们将它们视为对目标对象出现的位置和时间的时空查询。时空reID受到计算机视觉算法的精度限制和城市摄像机的巨大视频的挑战。我们介绍一种实用的ReID引擎Clique,它基于两种新技术:(1)Clique通过对ReID算法提取的模糊对象特征进行聚类来评估目标出现,每个聚类代表要与输入进行匹配的不同对象的总体印象; (2)在视频中进行搜索时,Clique对摄像机进行采样以最大化时空覆盖范围,并按需增加摄像机以进行处理。通过对来自25个摄像机的25小时视频进行评估,Clique在70个查询中达到了0.87的高精度(召回率为5),并以830倍的视频实时性运行,以实现高精度。
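A toy sketch of the clustering-based matching step in point (1) above, assuming generic 128-D ReID features and k-means clustering; the actual feature extractor, clustering procedure, and thresholds used by Clique are not specified here.

```python
# Cluster detection features so each centroid is the "general impression" of one
# object, then match a query feature to the nearest centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Fake ReID features for detections gathered across cameras/frames (128-D assumed).
detections = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 128))
                        for c in (0.0, 2.0, 4.0)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(detections)
centroids = kmeans.cluster_centers_

query = rng.normal(loc=2.0, scale=0.3, size=128)   # feature of the queried object
dists = np.linalg.norm(centroids - query, axis=1)
best = int(np.argmin(dists))
print("query matched to cluster %d at distance %.2f" % (best, dists[best]))
```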
68. Simultaneous View and Feature Selection for Collaborative Multi-Robot Recognition [PDF] 返回目录
Brian Reily, Hao Zhang
Abstract: Collaborative multi-robot perception provides multiple views of an environment, offering varying perspectives to collaboratively understand the environment even when individual robots have poor points of view or when occlusions are caused by obstacles. These multiple observations must be intelligently fused for accurate recognition, and relevant observations need to be selected in order to allow unnecessary robots to continue on to observe other targets. This research problem has not been well studied in the literature yet. In this paper, we propose a novel approach to collaborative multi-robot perception that simultaneously integrates view selection, feature selection, and object recognition into a unified regularized optimization formulation, which uses sparsity-inducing norms to identify the robots with the most representative views and the modalities with the most discriminative features. As our optimization formulation is hard to solve due to the introduced non-smooth norms, we implement a new iterative optimization algorithm, which is guaranteed to converge to the optimal solution. We evaluate our approach on multi-view benchmark datasets, a case-study in simulation, and on a physical multi-robot system. Experimental results demonstrate that our approach enables accurate object recognition and effective view selection as defined by mutual information.
摘要:协作式多机器人感知提供了环境的多个视图,即使个别机器人的视角不佳或存在障碍物造成的遮挡,也能从不同角度协同理解环境。这些多路观测必须被智能地融合以实现准确识别,并且需要选出相关的观测,以便让不必要的机器人继续去观察其他目标。该研究问题在文献中尚未得到充分研究。在本文中,我们提出了一种新颖的协作式多机器人感知方法,它将视图选择、特征选择和物体识别同时集成到一个统一的正则化优化公式中,该公式利用诱导稀疏性的范数来识别具有最具代表性视图的机器人以及具有最具判别力特征的模态。由于引入的非光滑范数使优化问题难以求解,我们实现了一种新的迭代优化算法,并保证其收敛到最优解。我们在多视图基准数据集、仿真案例研究以及一个真实的多机器人系统上评估了我们的方法。实验结果表明,我们的方法能够实现准确的物体识别以及由互信息定义的有效视图选择。
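As a hedged illustration of the kind of objective the abstract alludes to (the paper's exact formulation is not reproduced here), joint view and feature selection is often written as a regularized problem with group-sparsity norms, e.g. $\min_{W} \sum_{v=1}^{V} \mathcal{L}\big(Y, X^{(v)} W^{(v)}\big) + \lambda_1 \sum_{v=1}^{V} \|W^{(v)}\|_{F} + \lambda_2 \|W\|_{2,1}$, where $X^{(v)}$ denotes the observations from view (robot) $v$. The per-view Frobenius-norm terms can drive entire view blocks $W^{(v)}$ to zero (view selection), while the $\ell_{2,1}$ norm over the rows of the stacked $W$ zeroes out uninformative features (feature/modality selection); the non-smoothness of these norms is what motivates the dedicated iterative solver mentioned above.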
69. StarcNet: Machine Learning for Star Cluster Identification [PDF] 返回目录
Gustavo Perez, Matteo Messa, Daniela Calzetti, Subhransu Maji, Dooseok Jung, Angela Adamo, Mattia Siressi
Abstract: We present a machine learning (ML) pipeline to identify star clusters in the multi-color images of nearby galaxies, from observations obtained with the Hubble Space Telescope as part of the Treasury Project LEGUS (Legacy ExtraGalactic Ultraviolet Survey). StarcNet (STAR Cluster classification NETwork) is a multi-scale convolutional neural network (CNN) which achieves an accuracy of 68.6% (4 classes)/86.0% (2 classes: cluster/non-cluster) for star cluster classification in the images of the LEGUS galaxies, nearly matching human expert performance. We test the performance of StarcNet by applying the pre-trained CNN model to galaxies not included in the training set, finding accuracies similar to the reference one. We test the effect of StarcNet predictions on the inferred cluster properties by comparing multi-color luminosity functions and mass-age plots from catalogs produced by StarcNet and by human labeling; distributions in luminosity, color, and physical characteristics of star clusters are similar for the human- and ML-classified samples. There are two advantages to the ML approach: (1) reproducibility of the classifications: the ML algorithm's biases are fixed and can be measured for subsequent analysis; and (2) speed of classification: the algorithm requires minutes for tasks that humans require weeks to months to perform. By achieving accuracy comparable to human classifiers, StarcNet will enable extending classifications to a larger number of candidate samples than currently available, thus significantly increasing the statistics for cluster studies.
摘要:我们提出了一个机器学习(ML)流程,用于在邻近星系的多色图像中识别星团,这些图像来自哈勃太空望远镜作为Treasury项目LEGUS(Legacy ExtraGalactic Ultraviolet Survey)的一部分所获得的观测。StarcNet(STAR Cluster classification NETwork)是一个多尺度卷积神经网络(CNN),在LEGUS星系图像的星团分类中达到了68.6%(4类)/86.0%(2类:星团/非星团)的准确率,几乎与人类专家的表现相当。我们通过将预训练的CNN模型应用于未包含在训练集中的星系来测试StarcNet的性能,发现其准确率与参考值相近。我们通过比较由StarcNet与人工标注所产生目录的多色光度函数和质量-年龄图,检验了StarcNet预测对推断星团性质的影响;对于人工分类与ML分类的样本,星团在光度、颜色和物理特征上的分布相似。ML方法有两个优点:(1)分类的可重复性:ML算法的偏差是固定的,可以被测量以用于后续分析;(2)分类速度:人类需要数周到数月才能完成的任务,该算法只需数分钟。通过达到与人类分类者相当的准确性,StarcNet能够将分类扩展到比当前更多的候选样本,从而显著提升星团研究的统计量。
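A hedged sketch of the multi-scale classification idea: parallel CNN branches process crops of different sizes around a candidate, and the concatenated features are classified into 4 classes. The branch widths, crop sizes, and the 5 assumed photometric bands are illustrative, not the published StarcNet configuration.

```python
# Multi-scale crops around a star-cluster candidate -> parallel CNN branches ->
# concatenated features -> 4-way classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """One CNN branch operating on a single crop scale (5 bands assumed)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(5, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.conv(x).flatten(1)

class MultiScaleClassifier(nn.Module):
    def __init__(self, n_scales=3, n_classes=4):
        super().__init__()
        self.branches = nn.ModuleList([Branch() for _ in range(n_scales)])
        self.head = nn.Linear(32 * n_scales, n_classes)

    def forward(self, crops):
        feats = [branch(crop) for branch, crop in zip(self.branches, crops)]
        return self.head(torch.cat(feats, dim=1))

if __name__ == "__main__":
    # Crops of 8/16/32 px around the same candidate, resized to a common 32x32.
    crops = [F.interpolate(torch.randn(2, 5, s, s), size=32) for s in (8, 16, 32)]
    print(MultiScaleClassifier()(crops).shape)   # torch.Size([2, 4])
```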
70. Spatial Context-Aware Self-Attention Model For Multi-Organ Segmentation [PDF] 返回目录
Hao Tang, Xingwei Liu, Kun Han, Shanlin Sun, Narisu Bai, Xuming Chen, Huang Qian, Yong Liu, Xiaohui Xie
Abstract: Multi-organ segmentation is one of the most successful applications of deep learning in medical image analysis. Deep convolutional neural nets (CNNs) have shown great promise in achieving clinically applicable image segmentation performance on CT or MRI images. State-of-the-art CNN segmentation models apply either 2D or 3D convolutions on input images, with pros and cons associated with each method: 2D convolution is fast and less memory-intensive but inadequate for extracting 3D contextual information from volumetric images, while the opposite is true for 3D convolution. To fit a 3D CNN model on CT or MRI images on commodity GPUs, one usually has to either downsample input images or use cropped local regions as inputs, which limits the utility of 3D models for multi-organ segmentation. In this work, we propose a new framework for combining 3D and 2D models, in which the segmentation is realized through high-resolution 2D convolutions, but guided by spatial contextual information extracted from a low-resolution 3D model. We implement a self-attention mechanism to control which 3D features should be used to guide the 2D segmentation. Our model is light on memory usage but fully equipped to take 3D contextual information into account. Experiments on multiple organ segmentation datasets demonstrate that, by taking advantage of both 2D and 3D models, our method consistently outperforms existing 2D and 3D models in organ segmentation accuracy, while being able to directly take raw whole-volume image data as inputs.
摘要:多器官分割是深度学习在医学图像分析中最成功的应用之一。深度卷积神经网络(CNN)在CT或MRI图像上实现可临床应用的图像分割性能方面展现出巨大潜力。最先进的CNN分割模型对输入图像使用2D或3D卷积,两种方法各有利弊:2D卷积速度快、内存占用少,但不足以从体数据中提取3D上下文信息;3D卷积则相反。为了在普通GPU上对CT或MRI图像拟合3D CNN模型,通常必须对输入图像进行下采样,或使用裁剪的局部区域作为输入,这限制了3D模型在多器官分割中的效用。在这项工作中,我们提出了一个结合3D与2D模型的新框架:分割通过高分辨率2D卷积实现,但由从低分辨率3D模型中提取的空间上下文信息来引导。我们实现了一种自注意力机制,用于控制应使用哪些3D特征来引导2D分割。我们的模型内存占用低,同时能够充分利用3D上下文信息。在多个器官分割数据集上的实验表明,通过同时利用2D和3D模型,我们的方法在器官分割精度上始终优于现有的2D和3D模型,并且能够直接以原始全体积图像数据作为输入。
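A minimal sketch of the gating idea described above, assuming a single fusion point where low-resolution 3D context features are upsampled, projected, and injected into the high-resolution 2D branch under a learned attention map; channel counts and shapes are assumptions, not the paper's configuration.

```python
# Gated fusion: a learned attention map decides, per position, how much upsampled
# 3D context to add to the high-resolution 2D segmentation features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Context3DGate(nn.Module):
    def __init__(self, c2d=32, c3d=16):
        super().__init__()
        self.proj = nn.Conv2d(c3d, c2d, kernel_size=1)     # align channel counts
        self.gate = nn.Conv2d(c2d * 2, 1, kernel_size=1)   # attention logits

    def forward(self, feat2d, feat3d_slice):
        # feat2d: (B, c2d, H, W) from the 2D branch.
        # feat3d_slice: (B, c3d, h, w), a slice of low-resolution 3D context.
        ctx = F.interpolate(feat3d_slice, size=feat2d.shape[-2:],
                            mode="bilinear", align_corners=False)
        ctx = self.proj(ctx)
        alpha = torch.sigmoid(self.gate(torch.cat([feat2d, ctx], dim=1)))
        return feat2d + alpha * ctx                        # gated fusion

if __name__ == "__main__":
    fuse = Context3DGate()
    out = fuse(torch.randn(1, 32, 256, 256), torch.randn(1, 16, 64, 64))
    print(out.shape)   # torch.Size([1, 32, 256, 256])
```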
71. Reduction in the complexity of 1D 1H-NMR spectra by the use of Frequency to Information Transformation [PDF] 返回目录
Homayoun Valafar, Faramarz Valafar
Abstract: Analysis of 1H-NMR spectra is often hindered by large variations that occur during the collection of these spectra. Large solvent and standard peaks, baseline drift, and negative peaks (due to improper phasing) are among these variations. Furthermore, some instrument-dependent alterations, such as incorrect shimming, are also embedded in the recorded spectrum. The unpredictable nature of these alterations of the signal has rendered automated, instrument-independent computer analysis of these spectra unreliable. In this paper, a novel method of extracting the information content of a signal (in this paper, a frequency-domain 1H-NMR spectrum), called the frequency-information transformation (FIT), is presented and compared to a previously used method (SPUTNIK). FIT can successfully extract the information in a signal that is relevant to a pattern-matching task, while discarding the remainder of the signal, by transforming the Fourier-transformed signal into an information spectrum (IS). This technique exhibits the ability to decrease inter-class correlation coefficients while increasing intra-class correlation coefficients. In other words, different spectra of the same molecule will resemble each other more closely, while spectra of different molecules will differ from each other more. This feature allows easier automated identification and analysis of molecules based on their spectral signatures using computer algorithms.
摘要:1H-NMR光谱的分析通常受到采集这些光谱时出现的巨大变化的阻碍,其中包括较大的溶剂峰和标准峰、基线漂移以及(由于定相不当造成的)负峰。此外,一些与仪器相关的改变(例如不正确的匀场)也会嵌入到所记录的光谱中。这些信号变化的不可预测性使得对这些光谱进行自动化且与仪器无关的计算机分析变得不可靠。本文提出了一种提取信号(此处为频域1H-NMR光谱)信息内容的新方法,称为频率-信息变换(FIT),并将其与先前使用的方法(SPUTNIK)进行了比较。FIT通过将傅立叶变换后的信号转换为信息谱(IS),能够成功提取信号中与模式匹配任务相关的信息,同时丢弃信号的其余部分。该技术能够在降低类间相关系数的同时提高类内相关系数。换句话说,同一分子的不同光谱会彼此更加相似,而不同分子的光谱会彼此差异更大。这一特性使得利用计算机算法根据光谱特征对分子进行自动识别和分析变得更容易。
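A small numpy illustration of the evaluation criterion stated above: after a transform that works as claimed, spectra of the same molecule (intra-class) should correlate more strongly with one another than spectra of different molecules (inter-class). The synthetic "spectra" below are placeholders; the snippet does not implement FIT itself.

```python
# Compare mean intra-class and inter-class Pearson correlation of toy spectra.
import numpy as np

def mean_pairwise_corr(a, b=None):
    """Mean Pearson correlation between rows of a (or between rows of a and b)."""
    if b is None:
        c = np.corrcoef(a)
        upper = np.triu_indices_from(c, k=1)
        return c[upper].mean()
    full = np.corrcoef(a, b)
    return full[:len(a), len(a):].mean()

rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=1024), rng.normal(size=1024)
class_a = base_a + 0.3 * rng.normal(size=(5, 1024))   # 5 noisy spectra of molecule A
class_b = base_b + 0.3 * rng.normal(size=(5, 1024))   # 5 noisy spectra of molecule B

print("intra-class correlation: %.3f" % mean_pairwise_corr(class_a))
print("inter-class correlation: %.3f" % mean_pairwise_corr(class_a, class_b))
```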
72. Transfer Learning Through Weighted Loss Function and Group Normalization for Vessel Segmentation from Retinal Images [PDF] 返回目录
Abdullah Sarhan, Jon Rokne, Reda Alhajj, Andrew Crichton
Abstract: The vascular structure of blood vessels is important in diagnosing retinal conditions such as glaucoma and diabetic retinopathy. Accurate segmentation of these vessels can help in detecting retinal objects such as the optic disc and optic cup and hence determine whether there is damage to these areas. Moreover, the structure of the vessels can help in diagnosing glaucoma. The rapid development of digital imaging and computer-vision techniques has increased the potential for developing approaches for segmenting retinal vessels. In this paper, we propose an approach for segmenting retinal vessels that uses deep learning along with transfer learning. We adapted the U-Net structure to use a customized InceptionV3 as the encoder and used multiple skip connections to form the decoder. Moreover, we used a weighted loss function to handle the issue of class imbalance in retinal images. Furthermore, we contributed a new dataset to this field. We tested our approach on six publicly available datasets and a newly created dataset. We achieved an average accuracy of 95.60% and a Dice coefficient of 80.98%. The results obtained from comprehensive experiments demonstrate the robustness of our approach to the segmentation of blood vessels in retinal images obtained from different sources. Our approach results in greater segmentation accuracy than other approaches.
摘要:血管结构对于诊断青光眼和糖尿病性视网膜病变等视网膜疾病十分重要。对这些血管进行准确分割有助于检测视盘和视杯等视网膜结构,从而判断这些区域是否受损。此外,血管结构也有助于青光眼的诊断。数字成像和计算机视觉技术的快速发展,提升了开发视网膜血管分割方法的潜力。在本文中,我们提出了一种结合深度学习与迁移学习的视网膜血管分割方法。我们调整了U-Net结构,使用定制的InceptionV3作为编码器,并使用多个跳跃连接构成解码器。此外,我们使用加权损失函数来处理视网膜图像中的类别不平衡问题。我们还为该领域贡献了一个新的数据集。我们在六个公开数据集和一个新建数据集上测试了我们的方法,取得了95.60%的平均准确率和80.98%的Dice系数。综合实验结果表明,我们的方法对来自不同来源的视网膜图像血管分割具有鲁棒性,并且比其他方法取得了更高的分割精度。
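A minimal sketch of a weighted loss for the vessel/background imbalance mentioned above, assuming a common generic weighting (up-weighting the rare vessel class by the inverse foreground fraction) rather than the paper's exact loss function.

```python
# Weighted binary cross-entropy: vessel pixels are rare, so their contribution
# to the loss is scaled up by the background/foreground ratio of the batch.
import torch
import torch.nn.functional as F

def weighted_vessel_bce(logits, target):
    """logits, target: (B, 1, H, W); target is 1 on vessel pixels, 0 elsewhere."""
    vessel_frac = target.float().mean().clamp(min=1e-6)
    pos_weight = (1.0 - vessel_frac) / vessel_frac     # up-weight the rare vessel class
    return F.binary_cross_entropy_with_logits(logits, target.float(),
                                               pos_weight=pos_weight)

if __name__ == "__main__":
    logits = torch.randn(2, 1, 64, 64)
    target = (torch.rand(2, 1, 64, 64) > 0.9).float()  # ~10% vessel pixels
    print(weighted_vessel_bce(logits, target).item())
```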
73. MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification [PDF] 返回目录
Te-Lin Wu, Shikhar Singh, Sayan Paul, Gully Burns, Nanyun Peng
Abstract: We introduce a new dataset, MELINDA, for Multimodal biomEdicaL experImeNt methoD clAssification. The dataset is collected in a fully automated distant supervision manner, where the labels are obtained from an existing curated database, and the actual contents are extracted from papers associated with each of the records in the database. We benchmark various state-of-the-art NLP and computer vision models, including unimodal models which only take either caption texts or images as inputs, and multimodal models. Extensive experiments and analysis show that multimodal models, despite outperforming unimodal ones, still need improvements especially on a less-supervised way of grounding visual concepts with languages, and better transferability to low resource domains. We release our dataset and the benchmarks to facilitate future research in multimodal learning, especially to motivate targeted improvements for applications in scientific domains.
摘要:我们提出了一个新的数据集MELINDA,用于多模态生物医学实验方法分类。该数据集以全自动的远程监督方式收集:标签来自现有的人工整理数据库,而实际内容则从与数据库中每条记录相关联的论文中提取。我们对各种最新的NLP和计算机视觉模型进行了基准测试,包括仅以图题文本或图像作为输入的单模态模型,以及多模态模型。大量实验和分析表明,尽管多模态模型优于单模态模型,但仍有改进空间,尤其是在以较少监督的方式将视觉概念与语言对齐,以及更好地迁移到低资源领域方面。我们发布了数据集和基准,以促进未来的多模态学习研究,特别是推动针对科学领域应用的有针对性的改进。
注:中文为机器翻译结果!封面为论文标题词云图!