Contents
9. A Measurement of Transportation Ban inside Wuhan on the COVID-19 Epidemic by Vehicle Detection in Remote Sensing Imagery [PDF] Abstract
12. Forgery Detection in a Questioned Hyperspectral Document Image using K-means Clustering [PDF] Abstract
18. Adversarial Multi-Source Transfer Learning in Healthcare: Application to Glucose Prediction for Diabetic People [PDF] Abstract
21. Abnormal activity capture from passenger flow of elevator based on unsupervised learning and fine-grained multi-label recognition [PDF] Abstract
22. A Benchmark dataset for both underwater image enhancement and underwater object detection [PDF] Abstract
23. Active Ensemble Deep Learning for Polarimetric Synthetic Aperture Radar Image Classification [PDF] Abstract
24. EmotionNet Nano: An Efficient Deep Convolutional Neural Network Design for Real-time Facial Expression Recognition [PDF] Abstract
25. Roweisposes, Including Eigenposes, Supervised Eigenposes, and Fisherposes, for 3D Action Recognition [PDF] Abstract
28. Harvesting, Detecting, and Characterizing Liver Lesions from Large-scale Multi-phase CT Data via Deep Dynamic Texture Learning [PDF] Abstract
37. MvMM-RegNet: A new image registration framework based on multivariate mixture model and neural network estimation [PDF] Abstract
38. Joint Hand-object 3D Reconstruction from a Single Image with Cross-branch Feature Fusion [PDF] Abstract
40. SAR Image Despeckling by Deep Neural Networks: from a pre-trained model to an end-to-end training strategy [PDF] Abstract
42. Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition under Occlusion [PDF] Abstract
43. DeepACC: Automate Chromosome Classification based on Metaphase Images using Deep Learning Framework Fused with Prior Knowledge [PDF] Abstract
48. 1st Place Solution for Waymo Open Dataset Challenge -- 3D Detection and Domain Adaptation [PDF] Abstract
51. Automated Stitching of Coral Reef Images and Extraction of Features for Damselfish Shoaling Behavior Analysis [PDF] Abstract
54. AerialMPTNet: Multi-Pedestrian Tracking in Aerial Imagery Using Temporal and Graphical Features [PDF] Abstract
62. An Evoked Potential-Guided Deep Learning Brain Representation For Visual Classification [PDF] Abstract
63. PCLNet: A Practical Way for Unsupervised Deep PolSAR Representations and Few-Shot Classification [PDF] Abstract
64. MiniNet: An extremely lightweight convolutional neural network for real-time unsupervised monocular depth estimation [PDF] Abstract
72. COVID-19 detection using Residual Attention Network an Artificial Intelligence approach [PDF] Abstract
73. Interpretable Deep Learning for Pattern Recognition in Brain Differences Between Men and Women [PDF] Abstract
80. MIMC-VINS: A Versatile and Resilient Multi-IMU Multi-Camera Visual-Inertial Navigation System [PDF] Abstract
81. Simulation of Brain Resection for Cavity Segmentation Using Self-Supervised and Semi-Supervised Learning [PDF] Abstract
82. A lateral semicircular canal segmentation based geometric calibration for human temporal bone CT Image [PDF] Abstract
83. Fabric Image Representation Encoding Networks for Large-scale 3D Medical Image Analysis [PDF] Abstract
90. A Retinex based GAN Pipeline to Utilize Paired and Unpaired Datasets for Enhancing Low Light Images [PDF] Abstract
92. Attention-Guided Generative Adversarial Network to Address Atypical Anatomy in Modality Transfer [PDF] Abstract
Abstracts
1. The Heterogeneity Hypothesis: Finding Layer-Wise Dissimilated Network Architecture [PDF] Back to Contents
Yawei Li, Wen Li, Martin Danelljan, Kai Zhang, Shuhang Gu, Luc Van Gool, Radu Timofte
Abstract: In this paper, we tackle the problem of convolutional neural network design. Instead of focusing on the overall architecture design, we investigate a design space that is usually overlooked, i.e., adjusting the channel configurations of predefined networks. We find that this adjustment can be achieved by pruning widened baseline networks and leads to superior performance. Based on that, we articulate the "heterogeneity hypothesis": with the same training protocol, there exists a layer-wise dissimilated network architecture (LW-DNA) that can outperform the original network with regular channel configurations at a lower level of model complexity. The LW-DNA models are identified without added computational cost or training time compared with the original network. This constraint leads to a controlled experiment that directs the focus to the importance of layer-wise specific channel configurations. Multiple sources of hints relate the benefits of LW-DNA models to overfitting, i.e., the relative relationship between model complexity and dataset size. Experiments are conducted on various networks and datasets for image classification, visual tracking and image restoration. The resultant LW-DNA models consistently outperform the compared baseline models.
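As a rough illustration of the design space discussed above, the PyTorch sketch below widens a baseline convolution and prunes it back to a layer-specific channel count; the L1-norm ranking criterion and all names are assumptions for illustration, not the authors' exact procedure:

```python
# Hypothetical sketch: derive a layer-wise channel configuration by
# pruning a widened baseline layer. L1-norm ranking is an assumption.
import torch
import torch.nn as nn

def shrink_conv(conv: nn.Conv2d, keep: int) -> nn.Conv2d:
    """Keep the `keep` output channels with the largest L1 weight norm."""
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one score per output channel
    idx = scores.topk(keep).indices
    pruned = nn.Conv2d(conv.in_channels, keep, conv.kernel_size,
                       conv.stride, conv.padding, bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[idx].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[idx].clone()
    return pruned

# Widen a baseline 64-channel layer to 128, then prune back to a
# layer-specific width (e.g. 48) found for this layer.
widened = nn.Conv2d(3, 128, 3, padding=1)
layer = shrink_conv(widened, keep=48)
print(layer.weight.shape)  # torch.Size([48, 3, 3, 3])
```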
2. The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization [PDF] Back to Contents
Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer
Abstract: We introduce three new robustness benchmarks consisting of naturally occurring distribution changes in image style, geographic location, camera operation, and more. Using our benchmarks, we take stock of previously proposed hypotheses for out-of-distribution robustness and put them to the test. We find that using larger models and synthetic data augmentation can improve robustness on real-world distribution shifts, contrary to claims in prior work. Motivated by this, we introduce a new data augmentation method which advances the state-of-the-art and outperforms models pretrained with 1000x more labeled data. We find that some methods consistently help with distribution shifts in texture and local image statistics, but these methods do not help with some other distribution shifts like geographic changes. We conclude that future research must study multiple distribution shifts simultaneously.
3. Self-Supervised MultiModal Versatile Networks [PDF] Back to Contents
Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman
Abstract: Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of audio and vision can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51 and ESC-50 when compared to previous self-supervised work.
4. Unsupervised Learning Consensus Model for Dynamic Texture Videos Segmentation [PDF] Back to Contents
Lazhar Khelifi, Max Mignotte
Abstract: Dynamic texture (DT) segmentation, and video processing in general, is currently widely dominated by methods based on deep neural networks that require the deployment of a large number of layers. Although this parametric approach has shown superior performance for dynamic texture segmentation, all current deep learning methods suffer from a significant main weakness: the lack of a sufficient reference annotation to train models and make them functional. This study explores an unsupervised segmentation approach that can be used in the absence of training data to segment new videos. We present an effective unsupervised learning consensus model for the segmentation of dynamic texture (ULCM). This model is designed to merge different segmentation maps that contain multiple and weak quality regions in order to achieve a more accurate final segmentation result. The diverse labeling fields required for the combination process are obtained by a simplified grouping scheme applied to an input video (on the basis of three orthogonal planes: xy, yt and xt). In the proposed model, the set of values of the requantized local binary pattern (LBP) histogram around the pixel to be classified is used as features that represent both the spatial and temporal information in the video. Experiments conducted on the challenging SynthDB dataset show that, contrary to current dynamic texture segmentation approaches that require either parameter estimation or a training step, ULCM is significantly faster, easier to code, simple and has few parameters. Further qualitative experiments based on the YUP++ dataset demonstrate the efficiency and competitiveness of the ULCM.
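To make the pixel-level feature concrete, here is a minimal NumPy sketch of an 8-neighbour LBP code map and a requantized histogram around a pixel; the window size and bin count are illustrative assumptions, not the paper's exact settings:

```python
# Minimal LBP-histogram feature sketch; parameters are assumptions.
import numpy as np

def lbp_image(gray: np.ndarray) -> np.ndarray:
    """Basic 8-neighbour local binary pattern for the interior region."""
    c = gray[1:-1, 1:-1]
    neighbours = [gray[0:-2, 0:-2], gray[0:-2, 1:-1], gray[0:-2, 2:],
                  gray[1:-1, 2:],   gray[2:, 2:],     gray[2:, 1:-1],
                  gray[2:, 0:-2],   gray[1:-1, 0:-2]]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbours):
        code |= (n >= c).astype(np.uint8) << bit
    return code

def lbp_histogram(gray, y, x, half=8, bins=16):
    """Requantized LBP histogram around pixel (y, x), used as its feature."""
    patch = lbp_image(gray[y - half:y + half + 1, x - half:x + half + 1])
    hist, _ = np.histogram(patch // (256 // bins), bins=bins, range=(0, bins))
    return hist / hist.sum()

frame = np.random.randint(0, 256, (64, 64)).astype(np.int32)
print(lbp_histogram(frame, 32, 32))
```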
5. Automatic Operating Room Surgical Activity Recognition for Robot-Assisted Surgery [PDF] Back to Contents
Aidean Sharghi, Helene Haugerud, Daniel Oh, Omid Mohareri
Abstract: Automatic recognition of surgical activities in the operating room (OR) is a key technology for creating next generation intelligent surgical devices and workflow monitoring/support systems. Such systems can potentially enhance efficiency in the OR, resulting in lower costs and improved care delivery to the patients. In this paper, we investigate automatic surgical activity recognition in robot-assisted operations. We collect the first large-scale dataset including 400 full-length multi-perspective videos from a variety of robotic surgery cases captured using Time-of-Flight cameras. We densely annotate the videos with 10 most recognized and clinically relevant classes of activities. Furthermore, we investigate state-of-the-art computer vision action recognition techniques and adapt them for the OR environment and the dataset. First, we fine-tune the Inflated 3D ConvNet (I3D) for clip-level activity recognition on our dataset and use it to extract features from the videos. These features are then fed to a stack of 3 Temporal Gaussian Mixture layers which extracts context from neighboring clips, and eventually go through a Long Short Term Memory network to learn the order of activities in full-length videos. We extensively assess the model and reach a peak performance of 88% mean Average Precision.
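As a hedged sketch of the final stage (clip-level features ordered by a recurrent model), the PyTorch snippet below tags each clip of a video with one of the 10 activity classes; the 1024-d feature size and hidden width are assumptions, and the Temporal Gaussian Mixture stage is omitted:

```python
# Hedged sketch: an LSTM over clip-level features labels each clip of a
# full-length OR video. Dimensions other than the 10 classes are assumed.
import torch
import torch.nn as nn

class ClipSequenceTagger(nn.Module):
    def __init__(self, feat_dim=1024, hidden=256, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip_feats):          # (batch, num_clips, feat_dim)
        out, _ = self.lstm(clip_feats)      # (batch, num_clips, hidden)
        return self.head(out)               # per-clip activity logits

model = ClipSequenceTagger()
video = torch.randn(1, 120, 1024)           # 120 clip features from one video
print(model(video).shape)                    # torch.Size([1, 120, 10])
```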
6. Human Activity Recognition based on Dynamic Spatio-Temporal Relations [PDF] Back to Contents
Zhenyu Liu, Yaqiang Yao, Yan Liu, Yuening Zhu, Zhenchao Tao, Lei Wang, Yuhong Feng
Abstract: Human activity, which usually consists of several actions, generally covers interactions among persons and/or objects. In particular, human actions involve certain spatial and temporal relationships, are the components of more complicated activities, and evolve dynamically over time. Therefore, the description of a single human action and the modeling of the evolution of successive human actions are two major issues in human activity recognition. In this paper, we develop a method for human activity recognition that tackles these two issues. In the proposed method, an activity is divided into several successive actions represented by spatio-temporal patterns, and the evolution of these actions is captured by a sequential model. A refined comprehensive spatio-temporal graph is utilized to represent a single action, which is a qualitative representation of a human action incorporating both the spatial and temporal relations of the participant objects. Next, a discrete hidden Markov model is applied to model the evolution of action sequences. Moreover, a fully automatic partition method is proposed to divide a long-term human activity video into several human actions based on variational objects and qualitative spatial relations. Finally, a hierarchical decomposition of the human body is introduced to obtain a discriminative representation for a single action. Experimental results on the Cornell Activity Dataset demonstrate the efficiency and effectiveness of the proposed approach, which will enable long videos of human activity to be better recognized.
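To illustrate the discrete-HMM step, the self-contained NumPy Viterbi decoder below recovers the most likely sequence of hidden activities from observed action symbols; the toy matrices are assumptions for demonstration only:

```python
# Self-contained Viterbi decoding for a discrete HMM; toy numbers only.
import numpy as np

def viterbi(obs, start, trans, emit):
    """obs: observed symbol indices; returns the most likely state path."""
    n_states, T = trans.shape[0], len(obs)
    logp = np.log(start * emit[:, obs[0]])
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        cand = logp[:, None] + np.log(trans) + np.log(emit[:, obs[t]])[None, :]
        back[t] = cand.argmax(axis=0)
        logp = cand.max(axis=0)
    path = [int(logp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])   # activity-to-activity transitions
emit  = np.array([[0.9, 0.1], [0.3, 0.7]])   # P(observed action | activity)
print(viterbi([0, 0, 1, 1, 1], start, trans, emit))
```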
7. GramGAN: Deep 3D Texture Synthesis From 2D Exemplars [PDF] Back to Contents
Tiziano Portenier, Siavash Bigdeli, Orçun Göksel
Abstract: We present a novel texture synthesis framework, enabling the generation of infinite, high-quality 3D textures given a 2D exemplar image. Inspired by recent advances in natural texture synthesis, we train deep neural models to generate textures by non-linearly combining learned noise frequencies. To achieve a highly realistic output conditioned on an exemplar patch, we propose a novel loss function that combines ideas from both style transfer and generative adversarial networks. In particular, we train the synthesis network to match the Gram matrices of deep features from a discriminator network. In addition, we propose two architectural concepts and an extrapolation strategy that significantly improve generalization performance. In particular, we inject both model input and condition into hidden network layers by learning to scale and bias hidden activations. Quantitative and qualitative evaluations on a diverse set of exemplars motivate our design decisions and show that our system performs superior to previous state of the art. Finally, we conduct a user study that confirms the benefits of our framework.
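The Gram-matrix matching idea is standard in style transfer; a minimal PyTorch sketch of such a loss between discriminator features of real and synthesized textures (feature shapes are illustrative assumptions) looks like:

```python
# Minimal Gram-matrix matching loss over deep features.
import torch

def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    """feats: (batch, channels, height, width) -> (batch, channels, channels)."""
    b, c, h, w = feats.shape
    f = feats.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

real_feats = torch.randn(4, 64, 32, 32)   # e.g. discriminator features of the exemplar
fake_feats = torch.randn(4, 64, 32, 32)   # features of the synthesized texture
loss = torch.nn.functional.mse_loss(gram_matrix(fake_feats), gram_matrix(real_feats))
print(loss.item())
```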
8. Iris Recognition: Inherent Binomial Degrees of Freedom [PDF] Back to Contents
J. Michael Rozmus
Abstract: The distinctiveness of the human iris has been measured by first extracting a set of features from the iris, an encoding, and then comparing these encoded feature sets to determine how distinct they are from one another. For example, John Daugman measures the distinctiveness of the human iris at 244 degrees of freedom, that is, Daugman's encoding maps irises into the equivalent of 2 ^ 244 distinct possibilities [2]. This paper shows by direct pixel-by-pixel comparison of high-quality iris images that the inherent number of degrees of freedom embodied in the human iris, independent of any encoding, is at least 536. When the resolution of these images is gradually reduced, the number of degrees of freedom decreases smoothly to 123 for the lowest resolution images tested.
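The binomial degrees-of-freedom figure comes from Daugman's estimate N = p(1 - p) / sigma^2, where p is the mean and sigma the standard deviation of the normalized Hamming distance distribution; the sample numbers below are invented to show the arithmetic reaching roughly 536:

```python
# Daugman's binomial degrees-of-freedom estimate.
# The sigma value below is made up to illustrate the arithmetic.
def binomial_dof(p: float, sigma: float) -> float:
    return p * (1.0 - p) / sigma**2

# e.g. p = 0.5, sigma = 0.0216  ->  N = 0.25 / 0.000467 ≈ 536
print(round(binomial_dof(0.5, 0.0216)))  # 536
```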
9. A Measurement of Transportation Ban inside Wuhan on the COVID-19 Epidemic by Vehicle Detection in Remote Sensing Imagery [PDF] Back to Contents
Chen Wu, Jingwen Yuan, Lixiang Ru, Hongruixuan Chen, Bo Du, Liangpei Zhang
Abstract: Wuhan, the biggest city in China's central region with a population of more than 11 million, was shut down to control the COVID-19 epidemic on 23 January 2020. Even though many studies have examined travel restrictions between cities and provinces, few focus on transportation control inside the city, which may be due to the lack of measurements of the transportation ban. Therefore, we evaluate the implementation of the transportation ban policy inside the city by extracting motor vehicles on the road from two high-resolution remote sensing image sets acquired before and after the Wuhan lockdown. To accurately detect vehicles from remote sensing image datasets with a resolution of 0.8 m, we propose a novel method combining anomaly detection, region growing and deep learning. Vehicle numbers in Wuhan dropped by at least 63.31% as a result of COVID-19. Considering that they suffer fewer interferences, the drops on ring roads and high-level roads, 84.81% and 80.22%, should be more representative. Districts located in the city center were more intensively affected by the transportation ban. Since public transportation had also been shut down, the significant reduction in motor vehicles indicates that the lockdown policy in Wuhan was effective in controlling human transmission inside the city.
10. Layered Stereo by Cooperative Grouping with Occlusion [PDF] Back to Contents
Jialiang Wang, Todd Zickler
Abstract: Human stereo vision uses occlusions as a prominent cue, sometimes the only cue, to localize object boundaries and recover depth relationships between surfaces adjacent to these boundaries. However, many modern computer vision systems treat occlusions as a secondary cue or ignore them as outliers, leading to imprecise boundaries, especially when matching cues are weak. In this work, we introduce a layered approach to stereo that explicitly incorporates occlusions. Unlike previous layer-based methods, our model is cooperative, involving local computations among units that have overlapping receptive fields at multiple scales, and sparse lateral and vertical connections between the computational units. Focusing on bi-layer scenes, we demonstrate our model's ability to localize boundaries between figure and ground in a wide variety of cases, including images from Middlebury and Falling Things datasets, as well as perceptual stimuli that lack matching cues and have yet to be well explained by previous computational stereo systems. Our model suggests new directions for creating cooperative stereo systems that incorporate occlusion cues in a human-like manner.
11. Patch SVDD: Patch-level SVDD for Anomaly Detection and Segmentation [PDF] Back to Contents
Jihun Yi, Sungroh Yoon
Abstract: In this paper, we tackle the problem of image anomaly detection and segmentation. Anomaly detection makes a binary decision as to whether an input image contains an anomaly or not, and anomaly segmentation aims to locate the defect at the pixel level. SVDD is a longstanding algorithm for anomaly detection. We extend its deep learning variant to the patch level using self-supervised learning. The extension enables anomaly segmentation and improves the detection performance as well. As a result, we achieve state-of-the-art performance on a standard industrial dataset, MVTec AD. Detailed analysis of the proposed method offers useful insight into its behavior.
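A hedged PyTorch sketch of a patch-level SVDD-style objective, pulling together embeddings of spatially adjacent patches so that normal patches cluster tightly; the encoder and the patch sampling are illustrative assumptions, not the authors' exact architecture or loss:

```python
# Hedged sketch of a patch-level SVDD-style objective; all details assumed.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 64),
)

def svdd_patch_loss(patch, neighbour):
    """Distance between embeddings of two spatially adjacent patches."""
    return ((encoder(patch) - encoder(neighbour)) ** 2).sum(dim=1).mean()

p1 = torch.randn(8, 3, 32, 32)   # patches cropped from training images
p2 = torch.randn(8, 3, 32, 32)   # their immediate spatial neighbours
print(svdd_patch_loss(p1, p2).item())
```

At test time, patches whose embeddings fall far from the normal clusters would be scored as anomalous, which is what enables the pixel-level segmentation described above.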
12. Forgery Detection in a Questioned Hyperspectral Document Image using K-means Clustering [PDF] Back to Contents
Maria Yaseen, Rammal Aftab Ahmed, Rimsha Mahrukh
Abstract: Hyperspectral imaging allows for the analysis of images in several hundred spectral bands, depending on the spectral resolution of the imaging sensor. A hyperspectral document image is one that has been captured by a hyperspectral camera, so that the document can be observed in different bands on the basis of unique spectral signatures. To detect forgery in a document, various ink mismatch detection techniques based on hyperspectral imaging have shown vast potential in differentiating visually similar inks. Inks of different materials exhibit different spectral signatures even if they have the same color. Hyperspectral analysis of document images allows identification and discrimination of visually similar inks. Based on this analysis, forensic experts can verify the authenticity of the document. In this paper, an extensive ink mismatch detection technique is presented which uses K-means clustering to identify different inks on the basis of their unique spectral response and separates them into different clusters.
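A minimal sketch of the clustering step with scikit-learn: flatten the hyperspectral cube into per-pixel spectra and let K-means separate the inks by spectral response. The cube shape and k = 3 (background plus two inks) are assumptions:

```python
# K-means over per-pixel spectra of a hyperspectral document cube.
import numpy as np
from sklearn.cluster import KMeans

cube = np.random.rand(100, 200, 240)            # (rows, cols, spectral bands)
spectra = cube.reshape(-1, cube.shape[-1])      # one spectrum per pixel

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(spectra)
ink_map = labels.reshape(cube.shape[:2])        # cluster id per pixel
print(ink_map.shape)                             # (100, 200)
```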
13. Visual Kinship Recognition: A Decade in the Making [PDF] Back to Contents
Joseph P Robinson, Ming Shao, Yun Fu
Abstract: Kinship recognition is a challenging problem with many practical applications. Ten years after the problem was pioneered, with much progress made and many milestones reached, we are now able to survey this research and set new milestones. We list and review the public resources and data challenges that enabled and inspired many to hone in on one or more views of automatic kinship recognition in the visual domain. The different tasks are described in technical terms, with syntax consistent across the problem domain, and the practical value of each is discussed and measured. State-of-the-art methods for visual kinship recognition problems, whether discriminative or generative, are examined. As part of this, we review systems proposed as part of a recent data challenge held in conjunction with the 2020 IEEE Conference on Automatic Face and Gesture Recognition. We document the state of progress for the different problems in a consistent manner. We intend for this survey to serve as the central resource for the work of the next decade to build upon. For the tenth anniversary, demo code is provided for the various kin-based tasks. Detecting relatives with visual recognition and classifying the relationship is an area with high potential for impact in research and practice.
14. Creating Artificial Modalities to Solve RGB Liveness [PDF] Back to Contents
Aleksandr Parkin, Oleg Grinchuk
Abstract: Special cameras that provide useful features for face anti-spoofing are desirable, but not always an option. In this work we propose a method to utilize the difference in dynamic appearance between bona fide and spoof samples by creating artificial modalities from RGB videos. We introduce two types of artificial transforms, rank pooling and optical flow, combined in an end-to-end pipeline for spoof detection. We demonstrate that using intermediate representations that contain less identity and fine-grained features increases model robustness to unseen attacks as well as to unseen ethnicities. The proposed method achieves state-of-the-art results on the largest cross-ethnicity face anti-spoofing dataset, CASIA-SURF CeFA (RGB).
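To make the two artificial modalities concrete, here is a hedged sketch using the approximate rank pooling weighting (alpha_t = 2t - T - 1, the "dynamic image" formulation) and OpenCV's Farneback dense optical flow; frame shapes are assumptions, and the paper's exact transforms may differ:

```python
# Hedged sketch of two artificial modalities from an RGB frame stack:
# approximate rank pooling and Farneback dense optical flow.
import numpy as np
import cv2

def approximate_rank_pool(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W) grayscale stack -> single rank-pooled image."""
    T = frames.shape[0]
    alpha = 2 * np.arange(1, T + 1) - T - 1        # linear ranking weights
    return np.tensordot(alpha, frames.astype(np.float32), axes=1)

frames = np.random.randint(0, 256, (16, 64, 64), dtype=np.uint8)
dynamic_image = approximate_rank_pool(frames)

flow = cv2.calcOpticalFlowFarneback(frames[0], frames[1], None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(dynamic_image.shape, flow.shape)             # (64, 64) (64, 64, 2)
```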
15. Intrinsic Autoencoders for Joint Neural Rendering and Intrinsic Image Decomposition [PDF] Back to Contents
Hassan Abu Alhaija, Siva Karthik Mustikovela, Justus Thies, Matthias Nießner, Andreas Geiger, Carsten Rother
Abstract: Neural rendering techniques promise efficient photo-realistic image synthesis while at the same time providing rich control over scene parameters by learning the physical image formation process. While several supervised methods have been proposed for this task, acquiring a dataset of images with accurately aligned 3D models is very difficult. The main contribution of this work is to lift this restriction by training a neural rendering algorithm from unpaired data. More specifically, we propose an autoencoder for joint generation of realistic images from synthetic 3D models while simultaneously decomposing real images into their intrinsic shape and appearance properties. In contrast to a traditional graphics pipeline, our approach does not require to specify all scene properties, such as material parameters and lighting by hand. Instead, we learn photo-realistic deferred rendering from a small set of 3D models and a larger set of unaligned real images, both of which are easy to acquire in practice. Simultaneously, we obtain accurate intrinsic decompositions of real images while not requiring paired ground truth. Our experiments confirm that a joint treatment of rendering and decomposition is indeed beneficial and that our approach outperforms state-of-the-art image-to-image translation baselines both qualitatively and quantitatively.
16. MoNet3D: Towards Accurate Monocular 3D Object Localization in Real Time [PDF] Back to Contents
Xichuan Zhou, Yicong Peng, Chunqiao Long, Fengbo Ren, Cong Shi
Abstract: Monocular multi-object detection and localization in 3D space has been proven to be a challenging task. The MoNet3D algorithm is a novel and effective framework that can predict the 3D position of each object in a monocular image and draw a 3D bounding box for each object. The MoNet3D method incorporates prior knowledge of the spatial geometric correlation of neighbouring objects into the deep neural network training process to improve the accuracy of 3D object localization. Experiments on the KITTI dataset show that the accuracy of predicting the depth and horizontal coordinates of objects in 3D space can reach 96.25% and 94.74%, respectively. Moreover, the method can perform real-time image processing at 27.85 FPS, showing promising potential for embedded advanced driving-assistance system applications. Our code is publicly available at this https URL.
17. Explainable 3D Convolutional Neural Networks by Learning Temporal Transformations [PDF] Back to Contents
Gabriëlle Ras, Luca Ambrogioni, Pim Haselager, Marcel A.J. van Gerven, Umut Güçlü
Abstract: In this paper we introduce the temporally factorized 3D convolution (3TConv) as an interpretable alternative to the regular 3D convolution (3DConv). In a 3TConv the 3D convolutional filter is obtained by learning a 2D filter and a set of temporal transformation parameters, resulting in a sparse filter where the 2D slices are sequentially dependent on each other in the temporal dimension. We demonstrate that 3TConv learns temporal transformations that afford a direct interpretation. The temporal parameters can be used in combination with various existing 2D visualization methods. We also show that insight into what the model learns can be achieved by analyzing the transformation parameter statistics at the layer and model level. Finally, we implicitly demonstrate that, in popular ConvNets, the 2DConv can be replaced with a 3TConv and that the weights can be transferred to yield pretrained 3TConvs. Pretrained 3TConvNets leverage more than a decade of work on traditional 2DConvNets by being able to make use of features that have been proven to deliver excellent results on image classification benchmarks.
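A hedged PyTorch sketch of the factorization: a single learned 2D filter plus a few per-frame transformation parameters, simplified here to a scale and a bias per time step (an assumption; the paper's transformations may differ), expanded into a 3D kernel at forward time:

```python
# Hedged sketch of a temporally factorized 3D convolution; the per-frame
# scale/bias transformation is an illustrative simplification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporallyFactorized3DConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, t=3):
        super().__init__()
        self.w2d = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)
        self.scale = nn.Parameter(torch.ones(t))    # temporal transform params
        self.bias_t = nn.Parameter(torch.zeros(t))

    def forward(self, x):                            # x: (B, C, T, H, W)
        # kernel[t] = scale[t] * w2d + bias_t[t]  ->  (out, in, t, k, k)
        k3d = (self.scale.view(1, 1, -1, 1, 1) * self.w2d.unsqueeze(2)
               + self.bias_t.view(1, 1, -1, 1, 1))
        return F.conv3d(x, k3d, padding=(1, 1, 1))

layer = TemporallyFactorized3DConv(3, 8)
print(layer(torch.randn(2, 3, 16, 32, 32)).shape)   # torch.Size([2, 8, 16, 32, 32])
```

Note how the parameter count stays close to a 2D convolution's while the layer still convolves over time, which is what makes the temporal parameters easy to inspect.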
18. Adversarial Multi-Source Transfer Learning in Healthcare: Application to Glucose Prediction for Diabetic People [PDF] 返回目录
Maxime De Bois, Mounîm A. El Yacoubi, Mehdi Ammi
Abstract: Deep learning has yet to revolutionize general practices in healthcare, despite promising results for some specific tasks. This is partly due to data being in insufficient quantities, hurting the training of the models. To address this issue, data from multiple health actors or patients could be combined by capitalizing on their heterogeneity through the use of transfer learning. To improve the quality of the transfer between multiple sources of data, we propose a multi-source adversarial transfer learning framework that enables the learning of a feature representation that is similar across the sources, and thus more general and more easily transferable. We apply this idea to glucose forecasting for diabetic people using a fully convolutional neural network. The evaluation is done by exploring various transfer scenarios with three datasets characterized by their high inter and intra variability. While transferring knowledge is beneficial in general, we show that the statistical and clinical accuracies can be further improved by using the adversarial training methodology, surpassing the current state-of-the-art results. In particular, it shines when using data from different datasets, or when there is too little data in an intra-dataset situation. To understand the behavior of the models, we analyze the learnt feature representations and propose a new metric in this regard. Contrary to a standard transfer, the adversarial transfer does not discriminate between the patients and datasets, helping the learning of a more general feature representation. The adversarial training framework improves the learning of a general feature representation in a multi-source environment, enhancing the knowledge transfer to an unseen target. The proposed method can help improve the efficiency of data shared by different health actors in the training of deep models.
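One standard way to realize the adversarial objective the abstract describes is DANN-style gradient reversal; the sketch below is a hedged illustration, not the authors' architecture (the layer sizes, source count, and equal loss weighting are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None   # reversed gradient flows into the encoder

torch.manual_seed(0)
encoder    = nn.Sequential(nn.Linear(12, 64), nn.ReLU())  # shared feature extractor
forecaster = nn.Linear(64, 1)                             # glucose forecasting head
source_clf = nn.Linear(64, 3)                             # which of 3 source datasets?

x, y = torch.randn(8, 12), torch.randn(8, 1)
src = torch.randint(0, 3, (8,))
z = encoder(x)
task_loss   = F.mse_loss(forecaster(z), y)
source_loss = F.cross_entropy(source_clf(GradReverse.apply(z, 1.0)), src)
(task_loss + source_loss).backward()   # encoder is pushed toward source-invariant features

The reversal makes the encoder maximize the source classifier's loss while the classifier minimizes it, so the shared features stop encoding which patient or dataset a sample came from.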
19. Hybrid Tensor Decomposition in Neural Network Compression [PDF] 返回目录
Bijiao Wu, Dingheng Wang, Guangshe Zhao, Lei Deng, Guoqi Li
Abstract: Deep neural networks (DNNs) have recently enabled impressive breakthroughs in various artificial intelligence (AI) applications due to their capability of learning high-level features from big data. However, the demand of DNNs for computational resources, especially storage, keeps growing because increasingly large models are required for more and more complicated applications. To address this problem, several tensor decomposition methods, including tensor-train (TT) and tensor-ring (TR), have been applied to compress DNNs and have shown considerable compression effectiveness. In this work, we introduce the hierarchical Tucker (HT) decomposition, a classical but rarely-used tensor decomposition method, and investigate its capability in neural network compression. We convert the weight matrices and convolutional kernels to both HT and TT formats for a comparative study, since the latter is the most widely used decomposition method and a variant of HT. We further discover, theoretically and experimentally, that the HT format performs better at compressing weight matrices, while the TT format is better suited for compressing convolutional kernels. Based on this phenomenon, we propose a hybrid tensor decomposition strategy that combines TT and HT to compress the convolutional and fully connected parts separately, attaining better accuracy than using only the TT or HT format on convolutional neural networks (CNNs). Our work illuminates the prospects of hybrid tensor decomposition for neural network compression.
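As a toy illustration of why these formats compress, the snippet below builds a small TT-matrix (the format the paper finds best for convolutional kernels) from two cores; the dimensions and rank are arbitrary assumptions:

import torch

m1, m2, n1, n2, r = 4, 8, 4, 8, 3            # W is (m1*m2) x (n1*n2) = 32 x 32
g1 = torch.randn(m1, n1, r)                  # first TT core
g2 = torch.randn(r, m2, n2)                  # second TT core
W = torch.einsum('abr,rcd->acbd', g1, g2).reshape(m1 * m2, n1 * n2)

full_params = (m1 * m2) * (n1 * n2)          # 1024 entries in the dense matrix
tt_params = g1.numel() + g2.numel()          # 48 + 192 = 240 entries in the cores
print(full_params, tt_params)

The hierarchical Tucker format stores the same tensor with a tree of cores rather than a chain, which is the structural difference the paper exploits when assigning HT to weight matrices and TT to convolutional kernels.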
20. Improving Few-Shot Learning using Composite Rotation based Auxiliary Task [PDF] 返回目录
Pratik Mazumder, Pravendra Singh, Vinay P. Namboodiri
Abstract: In this paper, we propose an approach to improve few-shot classification performance using a composite rotation based auxiliary task. Few-shot classification methods aim to produce neural networks that perform well both for classes with a large number of training samples and for classes with few training samples. They employ techniques to enable the network to produce highly discriminative features that are also very generic. Generally, the better the quality and generic nature of the features produced by the network, the better the network performs on few-shot learning. Our approach trains networks to produce such features by using a self-supervised auxiliary task. Our proposed composite rotation based auxiliary task performs rotation at two levels, i.e., rotation of patches inside the image (inner rotation) and rotation of the whole image (outer rotation), and assigns one of 16 rotation classes to the modified image. We then train simultaneously for the composite rotation prediction task and the original classification task, which forces the network to learn high-quality generic features that help improve few-shot classification performance. We experimentally show that our approach performs better than existing few-shot learning methods on multiple benchmark datasets.
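A hedged sketch of the composite-rotation label construction described in the abstract (4 patch rotations x 4 whole-image rotations = 16 auxiliary classes; the 2x2 patch grid and square, even-sized images are assumptions):

import torch

def composite_rotate(img, inner_k, outer_k):
    # img: (C, H, W), square with even side so each quadrant stays square under rot90;
    # inner_k/outer_k in {0, 1, 2, 3} = multiples of 90 degrees.
    c, h, w = img.shape
    out = img.clone()
    for i in (0, 1):                          # rotate each quadrant in place (inner rotation)
        for j in (0, 1):
            patch = out[:, i*h//2:(i+1)*h//2, j*w//2:(j+1)*w//2]
            out[:, i*h//2:(i+1)*h//2, j*w//2:(j+1)*w//2] = torch.rot90(patch, inner_k, dims=(1, 2))
    return torch.rot90(out, outer_k, dims=(1, 2))   # then rotate the whole image (outer rotation)

img = torch.randn(3, 32, 32)
inner_k, outer_k = 1, 2
rotated = composite_rotate(img, inner_k, outer_k)
label = inner_k * 4 + outer_k                 # one of the 16 auxiliary rotation classes

The auxiliary head is then trained to predict this 16-way label alongside the main classifier, which is what pushes the backbone features to stay generic.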
21. Abnormal activity capture from passenger flow of elevator based on unsupervised learning and fine-grained multi-label recognition [PDF] 返回目录
Chunhua Jia, Wenhai Yi, Yu Wu, Hui Huang, Lei Zhang, Leilei Wu
Abstract: We present a work-flow that aims at capturing residents' abnormal activities through the elevator passenger flow in multi-storey residential buildings. A camera and sensors (hall sensor, photoelectric sensor, gyro, accelerometer, barometer, and thermometer) with an internet connection are mounted in the elevator to collect images and data. Computer vision algorithms such as instance segmentation, multi-label recognition, embedding and clustering are applied to generalize the passenger flow of the elevator, i.e., how many people and what kinds of people get in and out of the elevator on each floor. More specifically, in our implementation we propose GraftNet, a solution for the fine-grained multi-label recognition task, to recognize human attributes, e.g. gender, age, appearance, and occupation. Then anomaly detection with unsupervised learning is hierarchically applied to the passenger flow data to capture abnormal or even illegal activities of the residents that may pose safety hazards, e.g. drug dealing, pyramid-scheme gatherings, prostitution, and overcrowded residences. Experiments show that the approach is effective, and the captured records will be directly reported to our customers (property managers) for further confirmation.
22. A Benchmark dataset for both underwater image enhancement and underwater object detection [PDF] 返回目录
Long Chen, Lei Tong, Feixiang Zhou, Zheheng Jiang, Zhenyang Li, Jialin Lv, Junyu Dong, Huiyu Zhou
Abstract: Underwater image enhancement is an important vision task due to its significance in marine engineering and aquatic robotics. It usually serves as a pre-processing step to improve the performance of high-level vision tasks such as underwater object detection. Even though many previous works show that underwater image enhancement algorithms can boost the detection accuracy of detectors, no work has specifically focused on investigating the relationship between these two tasks. This is mainly because existing underwater datasets lack either bounding box annotations or high quality reference images, based on which detection accuracy or image quality assessment metrics are calculated. To investigate how underwater image enhancement methods influence downstream underwater object detection tasks, in this paper, we provide a large-scale underwater object detection dataset with both bounding box annotations and high quality reference images, namely the OUC dataset. The OUC dataset provides a platform for researchers to comprehensively study the influence of underwater image enhancement algorithms on the underwater object detection task.
23. Active Ensemble Deep Learning for Polarimetric Synthetic Aperture Radar Image Classification [PDF] 返回目录
Sheng-Jie Liu, Haowen Luo, Qian Shi
Abstract: Although deep learning has achieved great success in image classification tasks, its performance is subject to the quantity and quality of training samples. For classification of polarimetric synthetic aperture radar (PolSAR) images, it is nearly impossible to annotate the images by visual interpretation. Therefore, it is urgent for remote sensing scientists to develop new techniques for PolSAR image classification under the condition of very few training samples. In this letter, we take advantage of active learning and propose active ensemble deep learning (AEDL) for PolSAR image classification. We first show that only 35\% of the predicted labels of a deep learning model's snapshots near its convergence were exactly the same. The disagreement between snapshots is non-negligible. From the perspective of multiview learning, the snapshots together serve as a good committee to evaluate the importance of unlabeled instances. Using the snapshot committee to give out the informativeness of unlabeled data, the proposed AEDL achieves better performance on two real PolSAR images compared with standard active learning strategies. On the Flevoland dataset, it achieves the same classification accuracy with only 86% and 55% of the training samples required by breaking-ties active learning and random selection, respectively.
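A minimal sketch of using snapshot disagreement as the acquisition score, in the spirit of the abstract (vote entropy over the snapshot committee; AEDL's exact scoring is not specified here, so this function is an assumption):

import torch

def vote_entropy(snapshot_logits):
    # snapshot_logits: (S, N, C) logits from S snapshots on N unlabeled samples.
    s, n, c = snapshot_logits.shape
    votes = snapshot_logits.argmax(dim=-1)                  # (S, N) hard predictions
    counts = torch.zeros(n, c)
    counts.scatter_add_(1, votes.t(), torch.ones(n, s))     # per-sample class vote counts
    p = (counts / s).clamp_min(1e-12)
    return -(p * p.log()).sum(dim=1)                        # high = committee disagrees

logits = torch.randn(5, 100, 8)                 # 5 snapshots, 100 samples, 8 classes
query = vote_entropy(logits).topk(10).indices   # send the 10 most disputed samples for labeling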
24. EmotionNet Nano: An Efficient Deep Convolutional Neural Network Design for Real-time Facial Expression Recognition [PDF] 返回目录
James Ren Hou Lee, Linda Wang, Alexander Wong
Abstract: While recent advances in deep learning have led to significant improvements in facial expression classification (FEC), a major challenge that remains a bottleneck for the widespread deployment of such systems is their high architectural and computational complexities. This is especially challenging given the operational requirements of various FEC applications, such as safety, marketing, learning, and assistive living, where real-time performance on low-cost embedded devices is desired. Motivated by this need for a compact, low latency, yet accurate system capable of performing FEC in real-time on low-cost embedded devices, this study proposes EmotionNet Nano, an efficient deep convolutional neural network created through a human-machine collaborative design strategy, where human experience is combined with machine meticulousness and speed in order to craft a deep neural network design catered towards real-time embedded usage. Two different variants of EmotionNet Nano are presented, each with a different trade-off between architectural and computational complexity and accuracy. Experimental results using the CK+ facial expression benchmark dataset demonstrate that the proposed EmotionNet Nano networks achieve accuracies comparable to the state of the art in FEC networks, while requiring significantly fewer parameters (e.g., 23$\times$ fewer at a higher accuracy). Furthermore, we demonstrate that the proposed EmotionNet Nano networks achieved real-time inference speeds (e.g. $>25$ FPS and $>70$ FPS at 15W and 30W, respectively) and high energy efficiency (e.g. $>1.7$ images/sec/watt at 15W) on an ARM embedded processor, thus further illustrating the efficacy of EmotionNet Nano for deployment on embedded devices.
25. Roweisposes, Including Eigenposes, Supervised Eigenposes, and Fisherposes, for 3D Action Recognition [PDF] 返回目录
Benyamin Ghojogh, Fakhri Karray, Mark Crowley
Abstract: Human action recognition is one of the important fields of computer vision and machine learning. Although various methods have been proposed for 3D action recognition, some of which are basic and some of which use deep learning, basic methods based on the generalized eigenvalue problem are still needed for action recognition. This need is felt especially because analogous basic methods exist in the field of face recognition, such as eigenfaces and Fisherfaces. In this paper, we propose Roweisposes, which uses Roweis discriminant analysis for generalized subspace learning. This method includes Fisherposes, eigenposes, supervised eigenposes, and double supervised eigenposes as its special cases. Roweisposes is a family of infinitely many action recognition methods which learn a discriminative subspace for embedding the body poses. Experiments on the TST, UTKinect, and UCFKinect datasets verify the effectiveness of the proposed method for action recognition.
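At its core, this family reduces to a generalized eigenvalue problem on two scatter matrices; below is a hedged sketch with Fisher-style scatters (the Roweis criterion generalizes the choice of the two matrices, and the feature dimensions here are assumptions):

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))               # 200 pose vectors (e.g., 20 joints x 3D)
y = rng.integers(0, 5, size=200)             # 5 action classes
mu = X.mean(axis=0)

Sw = sum(np.cov(X[y == c].T) for c in range(5))   # within-class scatter
Sb = sum((X[y == c].mean(0) - mu)[:, None] @ (X[y == c].mean(0) - mu)[None, :]
         for c in range(5))                       # between-class scatter

vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(60))     # solve Sb w = lambda * Sw w
U = vecs[:, np.argsort(vals)[::-1][:10]]          # top-10 discriminative directions
Z = X @ U                                         # embedded poses for recognition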
26. Unsupervised Learning of Video Representations via Dense Trajectory Clustering [PDF] 返回目录
Pavel Tokmakov, Martial Hebert, Cordelia Schmid
Abstract: This paper addresses the task of unsupervised learning of representations for action recognition in videos. Previous works proposed to utilize future prediction, or other domain-specific objectives, to train a network, but achieved only limited success. In contrast, in the relevant field of image representation learning, simpler, discrimination-based methods have recently bridged the gap to fully-supervised performance. We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation - to the video domain. In particular, the latter approach iterates between clustering the videos in the feature space of a network and updating it to respect the clusters with a non-parametric classification loss. We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns, grouping the videos based on appearance. To mitigate this issue, we turn to the heuristic-based IDT descriptors, which were manually designed to encode motion patterns in videos. We form the clusters in the IDT space, using these descriptors as an unsupervised prior in the iterative local aggregation algorithm. Our experiments demonstrate that this approach outperforms prior work on the UCF101 and HMDB51 action recognition benchmarks. We also qualitatively analyze the learned representations and show that they successfully capture video dynamics.
27. Generalizable Cone Beam CT Esophagus Segmentation Using In Silico Data Augmentation [PDF] 返回目录
Sadegh R Alam, Tianfang Li, Si-Yuan Zhang, Pengpeng Zhang, Saad Nadeem
Abstract: Lung cancer radiotherapy entails high quality planning computed tomography (pCT) imaging of the patient with radiation oncologist contouring of the tumor and the organs at risk (OARs) at the start of the treatment. This is followed by weekly low-quality cone beam CT (CBCT) imaging for treatment setup and qualitative visual assessment of tumor and critical OARs. In this work, we aim to make the weekly CBCT assessment quantitative by automatically segmenting the most critical OAR, esophagus, using deep learning and in silico (image-driven simulation) artifact induction to convert pCTs to pseudo-CBCTs (pCTs$+$artifacts). Specifically, for the in silico data augmentation, we make use of the critical insight that CT and CBCT have the same underlying physics and that it is easier to deteriorate the pCT to look more like CBCT (and use the accompanying high quality manual contours for segmentation) than to synthesize CT from CBCT where the critical anatomical information may have already been lost (which leads to anatomical hallucination with the prevalent generative adversarial networks for example). Given these pseudo-CBCTs and the high quality manual contours, we introduce a modified 3D-Unet architecture and a multi-objective loss function specifically designed for segmenting soft-tissue organs such as esophagus on real weekly CBCTs. The model achieved 0.74 dice overlap (against manual contours of an experienced radiation oncologist) on weekly CBCTs and was robust and generalizable enough to also produce state-of-the-art results on pCTs, achieving 0.77 dice overlap against the previous best of 0.72. This shows that our in silico data augmentation spans the realistic noise/artifact spectrum across patient CBCT/pCT data and can generalize well across modalities (without requiring retraining or domain adaptation), eventually improving the accuracy of treatment setup and response analysis.
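A toy sketch of what "in silico artifact induction" can look like, degrading a planning CT slice toward CBCT-like quality (the shading and noise models below are illustrative assumptions, not the paper's artifact pipeline):

import numpy as np

def pct_to_pseudo_cbct(pct, noise_sigma=25.0, shading_amp=40.0):
    # pct: 2D slice in Hounsfield units; contours drawn on it remain valid.
    h, w = pct.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2.0, xx - w / 2.0) / max(h, w)
    shading = shading_amp * r ** 2                      # radial cupping/shading artifact
    noise = np.random.normal(0.0, noise_sigma, pct.shape)
    return pct - shading + noise

pseudo = pct_to_pseudo_cbct(np.zeros((256, 256)))       # placeholder pCT slice

Because the anatomy is untouched, the high-quality manual contours of the pCT can be reused as labels for the degraded image, which is exactly the asymmetry the authors exploit.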
28. Harvesting, Detecting, and Characterizing Liver Lesions from Large-scale Multi-phase CT Data via Deep Dynamic Texture Learning [PDF] 返回目录
Yuankai Huo, Jinzheng Cai, Chi-Tung Cheng, Ashwin Raju, Ke Yan, Bennett A. Landman, Jing Xiao, Le Lu, Chien-Hung Liao, Adam Harrison
Abstract: Effective and non-invasive radiological imaging based tumor/lesion characterization (e.g., subtype classification) has long been a major aim in oncology diagnosis and treatment procedures, with the hope of reducing the need for invasive surgical biopsies. Prior work is generally restricted to a limited patient sample size, especially when using patient studies with confirmed pathological reports as ground truth. In this work, we curate a patient cohort of 1305 dynamic contrast CT studies (i.e., 5220 multi-phase 3D volumes) with pathology-confirmed ground truth. A novel fully-automated and multi-stage liver tumor characterization framework is proposed, comprising four steps of tumor proposal detection, tumor harvesting, primary tumor site selection, and deep texture-based characterization. More specifically, (1) we propose a 3D non-isotropic anchor-free lesion detection method; (2) we present and validate the use of multi-phase deep texture learning for precise liver lesion tissue characterization, named spatially adaptive deep texture (SaDT); (3) we leverage small-sized public datasets to semi-automatically curate our large-scale clinical dataset of 1305 patients, in which four main liver tumor subtypes (primary, secondary, metastasized and benign) are presented. Extensive evaluations demonstrate that our new data curation strategy, combined with the SaDT deep dynamic texture analysis, can effectively improve the mean F1 scores by >8.6% compared with baselines in differentiating the four major liver lesion types. This is a significant step towards the clinical goal.
29. Geometry-Inspired Top-k Adversarial Perturbations [PDF] 返回目录
Nurislam Tursynbek, Aleksandr Petiushko, Ivan Oseledets
Abstract: State-of-the-art deep learning models are untrustworthy due to their vulnerability to adversarial examples. Intriguingly, besides simple adversarial perturbations, there exist Universal Adversarial Perturbations (UAPs), which are input-agnostic perturbations that lead to misclassification of the majority of inputs. The main target of existing adversarial examples (including UAPs) is primarily to replace the correct Top-1 predicted class with an incorrect one, which does not guarantee changing the Top-k prediction. However, in many real-world scenarios dealing with digital data, Top-k predictions are more important. We propose an effective geometry-inspired method of computing Top-k adversarial examples for any k. We evaluate its effectiveness and efficiency by comparing it with other adversarial example crafting techniques. Based on this method, we propose Top-k Universal Adversarial Perturbations, image-agnostic tiny perturbations that cause the true class to be absent from the Top-k prediction. We experimentally show that our approach outperforms baseline methods and even improves existing techniques for generating UAPs.
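A minimal sketch of a top-k adversarial objective (a plain PGD surrogate, not the authors' geometry-inspired method; the toy model, k, and budget are assumptions): the loss stays positive as long as the true class remains among the top k logits, so descending on it evicts the true class from the Top-k prediction.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # toy classifier
x, y = torch.rand(1, 3, 32, 32), torch.tensor([3])
k, eps, step = 5, 8 / 255, 2 / 255

delta = torch.zeros_like(x, requires_grad=True)
for _ in range(10):
    logits = model(x + delta)
    rivals = logits.clone()
    rivals[0, y] = float('-inf')                   # consider competitor classes only
    kth = rivals.topk(k, dim=1).values[:, -1]      # k-th largest rival logit
    loss = (logits[0, y] - kth).clamp_min(0).sum() # > 0 while y is still in the top k
    loss.backward()
    with torch.no_grad():
        delta -= step * delta.grad.sign()          # descend to push y below the k-th rival
        delta.clamp_(-eps, eps)                    # stay inside the L-inf budget
        delta.grad.zero_()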
30. Video Representations of Goals Emerge from Watching Failure [PDF] 返回目录
Dave Epstein, Carl Vondrick
Abstract: We introduce a video representation learning framework that models the latent goals behind observable human action. Motivated by how children learn to reason about goals and intentions by experiencing failure, we leverage unconstrained video of unintentional action to learn without direct supervision. Our approach models videos as contextual trajectories that represent both low-level motion and high-level action features. Experiments and visualizations show the model is able to predict underlying goals, detect when action switches from intentional to unintentional, and automatically correct unintentional action. Although the model is trained with minimal supervision, it is competitive with highly-supervised baselines, underscoring the role of failure examples for learning goal-oriented video representations. The project website is available at this https URL
31. Improving VQA and its Explanations \\ by Comparing Competing Explanations [PDF] 返回目录
Jialin Wu, Liyan Chen, Raymond J. Mooney
Abstract: Most recent state-of-the-art Visual Question Answering (VQA) systems are opaque black boxes that are only trained to fit the answer distribution given the question and visual content. As a result, these systems frequently take shortcuts, focusing on simple visual concepts or question priors. This phenomenon becomes more problematic as questions become more complex, requiring more reasoning and commonsense knowledge. To address this issue, we present a novel framework that uses explanations for competing answers to help VQA systems select the correct answer. By training on human textual explanations, our framework builds better representations for the questions and visual content, and then reweights confidences in the answer candidates using either generated or retrieved explanations from the training set. We evaluate our framework on the VQA-X dataset, which has more difficult questions with human explanations, achieving new state-of-the-art results on both VQA and its explanations.
32. Offline Handwritten Chinese Text Recognition with Convolutional Neural Networks [PDF] 返回目录
Brian Liu, Xianchao Xu, Yu Zhang
Abstract: Deep learning based methods have been dominating text recognition tasks in different and multilingual scenarios. Offline handwritten Chinese text recognition (HCTR) is one of the most challenging tasks because it involves thousands of characters, variant writing styles and a complex data collection process. Recently, recurrence-free architectures for text recognition have appeared competitive thanks to their high parallelism and comparable results. In this paper, we build models using only convolutional neural networks and use CTC as the loss function. To reduce overfitting, we apply dropout after each max-pooling layer, with an extremely high rate on the last one before the linear layer. The CASIA-HWDB database is selected to tune and evaluate the proposed models. With the existing text samples as templates, we randomly choose isolated character samples to synthesize more text samples for training. We finally achieve a 6.81% character error rate (CER) on the ICDAR 2013 competition set, which is the best published result without language model correction.
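A hedged sketch of the recurrence-free CNN + CTC setup the abstract describes (toy depth; the downsampling factor, vocabulary size, and the 0.5 rate on the last pooling layer are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = 7356                                    # character classes; index 0 is the CTC blank
net = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2), nn.Dropout(0.1),           # light dropout after early pooling
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2), nn.Dropout(0.5),           # extreme rate on the last pooling layer
)
proj = nn.Linear(128 * 16, vocab + 1)           # collapse height, +1 for the blank

x = torch.randn(2, 1, 64, 256)                  # a batch of text-line images
f = net(x)                                      # (2, 128, 16, 64)
f = f.permute(3, 0, 1, 2).flatten(2)            # (T=64, N=2, 128*16)
log_probs = F.log_softmax(proj(f), dim=-1)

targets = torch.randint(1, vocab + 1, (2, 20))  # 20 characters per line
loss = F.ctc_loss(log_probs, targets,
                  input_lengths=torch.full((2,), 64, dtype=torch.long),
                  target_lengths=torch.full((2,), 20, dtype=torch.long),
                  blank=0)

With no recurrent layers, every time step of log_probs is produced in parallel, which is the parallelism advantage the abstract refers to.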
33. Analogical Image Translation for Fog Generation [PDF] 返回目录
Rui Gong, Dengxin Dai, Yuhua Chen, Wen Li, Luc Van Gool
Abstract: Image-to-image translation maps images from a given \emph{style} to another given \emph{style}. While exceptionally successful, current methods assume the availability of training images in both source and target domains, which does not always hold in practice. Inspired by humans' reasoning capability of analogy, we propose analogical image translation (AIT). Given images of two styles in the source domain: $\mathcal{A}$ and $\mathcal{A}^\prime$, along with images $\mathcal{B}$ of the first style in the target domain, learn a model to translate $\mathcal{B}$ to $\mathcal{B}^\prime$ in the target domain, such that $\mathcal{A}:\mathcal{A}^\prime ::\mathcal{B}:\mathcal{B}^\prime$. AIT is especially useful for translation scenarios in which training data of one style is hard to obtain but training data of the same two styles in another domain is available. For instance, in the case of going from normal conditions to extreme, rare conditions, obtaining real training images for the latter case is challenging but obtaining synthetic data for both cases is relatively easy. In this work, we are interested in adding adverse weather effects, more specifically fog effects, to images taken in clear weather. To circumvent the challenge of collecting real foggy images, AIT learns with synthetic clear-weather images, synthetic foggy images and real clear-weather images to add fog effects onto real clear-weather images without seeing any real foggy images during training. AIT achieves this zero-shot image translation capability by coupling a supervised training scheme in the synthetic domain, a cycle consistency strategy in the real domain, an adversarial training scheme between the two domains, and a novel network design. Experiments show the effectiveness of our method for zero-shot image translation and its benefit for downstream tasks such as semantic foggy scene understanding.
34. Shadow Removal by a Lightness-Guided Network with Training on Unpaired Data [PDF] 返回目录
Zhihao Liu, Hui Yin, Yang Mi, Mengyang Pu, Song Wang
Abstract: Shadow removal can significantly improve the image visual quality and has many applications in computer vision. Deep learning methods based on CNNs have become the most effective approach for shadow removal, by training on either paired data, where both the shadow and underlying shadow-free versions of an image are known, or unpaired data, where shadow and shadow-free training images are totally different, with no correspondence. In practice, CNN training on unpaired data is preferred, given the ease of training data collection. In this paper, we present a new Lightness-Guided Shadow Removal Network (LG-ShadowNet) for shadow removal by training on unpaired data. In this method, we first train a CNN module to compensate for the lightness and then train a second CNN module, with the guidance of lightness information from the first CNN module, for final shadow removal. We also introduce a loss function to further utilise the colour prior of existing data. Extensive experiments on the widely used ISTD, adjusted ISTD and USR datasets demonstrate that the proposed method outperforms state-of-the-art methods trained on unpaired data.
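A hedged sketch of the two-module wiring described above; channel counts, fusion by feature addition and module names are illustrative assumptions rather than LG-ShadowNet's actual architecture:

```python
import torch
import torch.nn as nn

class LightnessGuidedSketch(nn.Module):
    """Assumed wiring: a first module compensates lightness (stage 1), and its
    features guide a second module that performs the final removal (stage 2)."""
    def __init__(self):
        super().__init__()
        self.lightness_net = nn.Sequential(             # stage 1: works on the L channel
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1))
        self.rgb_stem = nn.Conv2d(3, 32, 3, padding=1)  # stage 2 stem on the full image
        self.head = nn.Sequential(nn.ReLU(), nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, rgb, lightness):
        guide = self.lightness_net(lightness)           # lightness features from stage 1
        return self.head(self.rgb_stem(rgb) + guide)    # guidance injected by addition

out = LightnessGuidedSketch()(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))
```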
35. Localization Uncertainty Estimation for Anchor-Free Object Detection [PDF] 返回目录
Youngwan Lee, Joong-won Hwang, Hyung-Il Kim, Kimin Yun, Joungyoul Park
Abstract: Since many safety-critical systems, such as surgical robots and autonomous driving cars, operate in unstable environments with sensor noise or incomplete data, it is desirable for object detectors to take the confidence of the localization prediction into account. Recent attempts to estimate localization uncertainty for object detection focus only on anchor-based methods, which capture the uncertainty of characteristics such as location (center point) and scale (width, height). Moreover, anchor-based methods need to adjust sensitive anchor-box settings. Therefore, we propose a new object detector called Gaussian-FCOS that estimates localization uncertainty on top of an anchor-free detector, capturing the uncertainty of the four directions of box offsets (left, right, top, bottom) and avoiding anchor tuning. For this purpose, we design a new loss function, the uncertainty loss, to measure how uncertain the estimated object location is by modeling the uncertainty as a Gaussian distribution. The detection score is then calibrated through the estimated uncertainty. Experiments on the challenging COCO dataset demonstrate that the proposed loss function not only enables the network to estimate the uncertainty but also produces a synergy effect with the regression loss. In addition, our Gaussian-FCOS reduces false positives with the estimated localization uncertainty and finds more missing objects, boosting both Average Precision (AP) and Recall (AR). We hope Gaussian-FCOS serves as a baseline for reliability-required tasks.
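One standard way to realize such an uncertainty loss is the Gaussian negative log-likelihood over the four box offsets; the sketch below shows that general form and is an assumption, not necessarily Gaussian-FCOS's exact formulation:

```python
import torch

def gaussian_uncertainty_loss(mu, log_var, target):
    """NLL of the four box offsets (l, r, t, b) under per-offset Gaussians.
    mu, log_var, target: (N, 4) predicted mean / log-variance and ground truth."""
    nll = 0.5 * (log_var + (target - mu) ** 2 / log_var.exp())
    return nll.mean()

# Score calibration could then down-weight uncertain boxes, e.g. (assumed rule):
# calibrated = cls_score * torch.exp(-log_var.mean(dim=1))
```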
36. Multi-Person Pose Regression via Pose Filtering and Scoring [PDF] 返回目录
Huixin Miao, Junqi Lin, Junjie Cao
Abstract: Multi-person pose estimation is one of the mainstream tasks of computer vision. Existing methods include top-down methods, which need an additional human detector, and bottom-up methods, which need to complete heuristic grouping after predicting all human keypoints. Both have to handle the grouping and detection of keypoints separately, resulting in low efficiency. In this work, we propose an end-to-end network framework for multi-person pose regression to predict the instance-aware keypoints directly. This framework uses a cascaded manner: the first stage provides a basic estimation. Then we propose the OKSFilter, which removes low-quality predictions, so that the second stage can focus on better results for further optimization. In addition, in order to quantify the quality of the predicted poses, we also propose the pose scoring module (PSM), so that when using non-maximum suppression (NMS) in inference, the correct type and high-quality poses are preserved. We have verified our method on the COCO keypoint benchmark. The experiments show that our multi-person pose regression network is feasible and effective, and that the two newly proposed modules help improve the performance of the model.
37. MvMM-RegNet: A new image registration framework based on multivariate mixture model and neural network estimation [PDF] 返回目录
Xinzhe Luo, Xiahai Zhuang
Abstract: Current deep-learning-based registration algorithms often exploit intensity-based similarity measures as the loss function, where dense correspondence between a pair of moving and fixed images is optimized through backpropagation during training. However, intensity-based metrics can be misleading when the assumption of intensity class correspondence is violated, especially in cross-modality or contrast-enhanced images. Moreover, existing learning-based registration methods are predominantly applicable to pairwise registration and are rarely extended to groupwise registration or simultaneous registration with multiple images. In this paper, we propose a new image registration framework based on multivariate mixture model (MvMM) and neural network estimation. A generative model consolidating both appearance and anatomical information is established to derive a novel loss function capable of implementing groupwise registration. We highlight the versatility of the proposed framework for various applications on multimodal cardiac images, including single-atlas-based segmentation (SAS) via pairwise registration and multi-atlas segmentation (MAS) unified by groupwise registration. We evaluated performance on two publicly available datasets, i.e. MM-WHS-2017 and MS-CMRSeg-2019. The results show that the proposed framework achieved an average Dice score of $0.871\pm 0.025$ for whole-heart segmentation on MR images and $0.783\pm 0.082$ for myocardium segmentation on LGE MR images.
38. Joint Hand-object 3D Reconstruction from a Single Image with Cross-branch Feature Fusion [PDF] 返回目录
Yujin Chen, Zhigang Tu, Di Kang, Ruizhi Chen, Linchao Bao, Zhengyou Zhang, Junsong Yuan
Abstract: Accurate 3D reconstruction of the hand and object shape from a hand-object image is important for understanding human-object interaction as well as human daily activities. Different from bare hand pose estimation, hand-object interaction poses a strong constraint on both the hand and its manipulated object, which suggests that hand configuration may be crucial contextual information for the object, and vice versa. However, current approaches address this task by training a two-branch network to reconstruct the hand and object separately with little communication between the two branches. In this work, we propose to consider hand and object jointly in feature space and explore the reciprocity of the two branches. We extensively investigate cross-branch feature fusion architectures with MLP or LSTM units. Among the investigated architectures, a variant with LSTM units that enhances object feature with hand feature shows the best performance gain. Moreover, we employ an auxiliary depth estimation module to augment the input RGB image with the estimated depth map, which further improves the reconstruction accuracy. Experiments conducted on public datasets demonstrate that our approach significantly outperforms existing approaches in terms of the reconstruction accuracy of objects.
39. Dynamic Sampling Networks for Efficient Action Recognition in Videos [PDF] 返回目录
Yin-Dong Zheng, Zhaoyang Liu, Tong Lu, Limin Wang
Abstract: The existing action recognition methods are mainly based on clip-level classifiers such as two-stream CNNs or 3D CNNs, which are trained from the randomly selected clips and applied to densely sampled clips during testing. However, this standard setting might be suboptimal for training classifiers and also requires huge computational overhead when deployed in practice. To address these issues, we propose a new framework for action recognition in videos, called {\em Dynamic Sampling Networks} (DSN), by designing a dynamic sampling module to improve the discriminative power of learned clip-level classifiers and as well increase the inference efficiency during testing. Specifically, DSN is composed of a sampling module and a classification module, whose objective is to learn a sampling policy to on-the-fly select which clips to keep and train a clip-level classifier to perform action recognition based on these selected clips, respectively. In particular, given an input video, we train an observation network in an associative reinforcement learning setting to maximize the rewards of the selected clips with a correct prediction. We perform extensive experiments to study different aspects of the DSN framework on four action recognition datasets: UCF101, HMDB51, THUMOS14, and ActivityNet v1.3. The experimental results demonstrate that DSN is able to greatly improve the inference efficiency by only using less than half of the clips, which can still obtain a slightly better or comparable recognition accuracy to the state-of-the-art approaches.
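The associative reinforcement signal for the sampling module can be sketched with a plain REINFORCE estimator; the categorical policy and the 0/1 reward below are assumptions about the general shape of the update, not DSN's exact algorithm:

```python
import torch

def sampling_policy_loss(clip_logits, rewards):
    """Sample which clip to keep from a categorical policy and scale the
    log-probability gradient by the reward (assumed: 1 when the clip-level
    classifier predicts correctly, else 0).
    clip_logits: (N, num_candidate_clips); rewards: (N,)."""
    policy = torch.distributions.Categorical(logits=clip_logits)
    picks = policy.sample()
    return -(policy.log_prob(picks) * rewards).mean()   # REINFORCE estimator

loss = sampling_policy_loss(torch.randn(4, 8), torch.tensor([1., 0., 1., 1.]))
```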
40. SAR Image Despeckling by Deep Neural Networks: from a pre-trained model to an end-to-end training strategy [PDF] 返回目录
Emanuele Dalsasso, Loïc Denis, Florence Tupin
Abstract: Speckle reduction is a longstanding topic in synthetic aperture radar (SAR) images. Many different schemes have been proposed for the restoration of intensity SAR images. Among the different possible approaches, methods based on convolutional neural networks (CNNs) have recently shown to reach state-of-the-art performance for SAR image restoration. CNN training requires good training data: many pairs of speckle-free / speckle-corrupted images. This is an issue in SAR applications, given the inherent scarcity of speckle-free images. To handle this problem, this paper analyzes different strategies one can adopt, depending on the speckle removal task one wishes to perform and the availability of multitemporal stacks of SAR data. The first strategy applies a CNN model, trained to remove additive white Gaussian noise from natural images, to a recently proposed SAR speckle removal framework: MuLoG (MUlti-channel LOgarithm with Gaussian denoising). No training on SAR images is performed, the network is readily applied to speckle reduction tasks. The second strategy considers a novel approach to construct a reliable dataset of speckle-free SAR images necessary to train a CNN model. Finally, a hybrid approach is also analyzed: the CNN used to remove additive white Gaussian noise is trained on speckle-free SAR images. The proposed methods are compared to other state-of-the-art speckle removal filters, to evaluate the quality of denoising and to discuss the pros and cons of the different strategies. Along with the paper, we make available the weights of the trained network to allow its usage by other researchers.
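The core idea behind the first strategy, plugging an AWGN denoiser into the log domain, can be sketched as follows; this deliberately omits MuLoG's multi-channel handling, debiasing and ADMM iterations:

```python
import numpy as np

def despeckle_via_gaussian_denoiser(intensity, gaussian_denoiser):
    """A log transform turns multiplicative speckle into roughly additive noise,
    so an off-the-shelf AWGN denoiser (e.g. a CNN trained on natural images)
    can be applied, then mapped back to the intensity domain."""
    log_img = np.log(np.maximum(intensity, 1e-8))   # speckle becomes ~additive
    return np.exp(gaussian_denoiser(log_img))       # denoise, return to intensity

# identity "denoiser" just to show the call shape:
clean = despeckle_via_gaussian_denoiser(np.random.gamma(1.0, 1.0, (256, 256)), lambda x: x)
```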
41. DHARI Report to EPIC-Kitchens 2020 Object Detection Challenge [PDF] 返回目录
Kaide Li, Bingyan Liao, Laifeng Hu, Yaonong Wang
Abstract: In this report, we describe the technical details of our submission to the EPIC-Kitchens Object Detection Challenge. Duck filling and mix-up techniques are firstly introduced to augment the data and significantly improve the robustness of the proposed method. Then we propose GRE-FPN and Hard IoU-imbalance Sampler methods to extract more representative global object features. To bridge the gap of category imbalance, Class Balance Sampling is utilized and greatly improves the test results. Besides, some training and testing strategies are also exploited, such as Stochastic Weight Averaging and multi-scale testing. Experimental results demonstrate that our approach can significantly improve the mean Average Precision (mAP) of object detection on both the seen and unseen test sets of EPIC-Kitchens.
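For reference, plain mix-up adapted to detection in the usual way; the alpha value and the box-handling convention are assumptions, not necessarily the report's exact recipe (duck filling is not shown):

```python
import numpy as np
import torch

def detection_mixup(images, alpha=1.5):
    """Blend two images and keep the boxes of both, weighting each image's loss
    by its blend coefficient. images: (N, C, H, W)."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    # pair boxes of sample i with boxes of sample perm[i], weights lam / (1 - lam)
    return mixed, perm, lam

mixed, perm, lam = detection_mixup(torch.randn(4, 3, 512, 512))
```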
42. Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition under Occlusion [PDF] 返回目录
Adam Kortylewski, Qing Liu, Angtian Wang, Yihong Sun, Alan Yuille
Abstract: Computer vision systems in real-world applications need to be robust to partial occlusion while also being explainable. In this work, we show that black-box deep convolutional neural networks (DCNNs) have only limited robustness to partial occlusion. We overcome these limitations by unifying DCNNs with part-based models into Compositional Convolutional Neural Networks (CompositionalNets) - an interpretable deep architecture with innate robustness to partial occlusion. Specifically, we propose to replace the fully connected classification head of DCNNs with a differentiable compositional model that can be trained end-to-end. The structure of the compositional model enables CompositionalNets to decompose images into objects and context, as well as to further decompose object representations in terms of individual parts and the objects' pose. The generative nature of our compositional model enables it to localize occluders and to recognize objects based on their non-occluded parts. We conduct extensive experiments in terms of image classification and object detection on images of artificially occluded objects from the PASCAL3D+ and ImageNet dataset, and real images of partially occluded vehicles from the MS-COCO dataset. Our experiments show that CompositionalNets made from several popular DCNN backbones (VGG-16, ResNet50, ResNext) improve by a large margin over their non-compositional counterparts at classifying and detecting partially occluded objects. Furthermore, they can localize occluders accurately despite being trained with class-level supervision only. Finally, we demonstrate that CompositionalNets provide human interpretable predictions as their individual components can be understood as detecting parts and estimating an objects' viewpoint.
43. DeepACC:Automate Chromosome Classification based on Metaphase Images using Deep Learning Framework Fused with Prior Knowledge [PDF] 返回目录
Chunlong Luo, Tianqi Yu, Yufan Luo, Manqing Wang, Fuhai Yu, Yinhao Li, Chan Tian, Jie Qiao, Li Xiao
Abstract: Chromosome classification is an important but difficult and tedious task in karyotyping. Previous methods only classify manually segmented single chromosomes, which is far from clinical practice. In this work, we propose a detection-based method, DeepACC, to locate and finely classify chromosomes simultaneously based on the whole metaphase image. We firstly introduce the Additive Angular Margin Loss to enhance the discriminative power of the model. To alleviate batch effects, we transform the decision boundary of each class case-by-case through a siamese network, which makes full use of the prior knowledge that chromosomes usually appear in pairs. Furthermore, we take the clinical seven-group criterion as prior knowledge and design an additional Group Inner-Adjacency Loss to further reduce inter-class similarities. 3390 metaphase images from a clinical laboratory are collected and labelled to evaluate the performance. Results show that the new design brings encouraging performance gains compared to state-of-the-art baselines.
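The Additive Angular Margin Loss named above is the ArcFace-style objective; a generic sketch follows, with the scale s and margin m as assumed hyper-parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAngularMargin(nn.Module):
    """Generic additive angular margin head: add a margin m to the true class's
    angle before the softmax cross-entropy, on a scaled cosine-similarity logit."""
    def __init__(self, in_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, in_dim))
        self.s, self.m = s, m

    def forward(self, feats, labels):
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))
        cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)
        target = torch.cos(torch.acos(cos) + self.m)   # margin on the true class angle
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (onehot * target + (1 - onehot) * cos)
        return F.cross_entropy(logits, labels)

head = AdditiveAngularMargin(in_dim=256, num_classes=24)   # e.g. 24 chromosome types
loss = head(torch.randn(8, 256), torch.randint(0, 24, (8,)))
```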
44. Few-Shot Class-Incremental Learning via Feature Space Composition [PDF] 返回目录
Hanbin Zhao, Yongjian Fu, Xuewei Li, Songyuan Li, Bourahla Omar, Xi Li
Abstract: As a challenging problem in machine learning, few-shot class-incremental learning asynchronously learns a sequence of tasks, acquiring the new knowledge from new tasks (with limited new samples) while keeping the learned knowledge from previous tasks (with old samples discarded). In general, existing approaches resort to one unified feature space for balancing old-knowledge preserving and new-knowledge adaptation. With a limited embedding capacity of feature representation, the unified feature space often makes the learner suffer from semantic drift or overfitting as the number of tasks increases. With this motivation, we propose a novel few-shot class-incremental learning pipeline based on a composite representation space, which makes old-knowledge preserving and new-knowledge adaptation mutually compatible by feature space composition (enlarging the embedding capacity). The composite representation space is generated by integrating two space components (i.e. stable base knowledge space and dynamic lifelong-learning knowledge space) in terms of distance metric construction. With the composite feature space, our method performs remarkably well on the CUB200 and CIFAR100 datasets, outperforming the state-of-the-art algorithms by 10.58% and 14.65% respectively.
45. Predictive and Generative Neural Networks for Object Functionality [PDF] 返回目录
Ruizhen Hu, Zihao Yan, Jingwen Zhang, Oliver van Kaick, Ariel Shamir, Hao Zhang, Hui Huang
Abstract: Humans can predict the functionality of an object even without any surroundings, since their knowledge and experience would allow them to "hallucinate" the interaction or usage scenarios involving the object. We develop predictive and generative deep convolutional neural networks to replicate this feat. Specifically, our work focuses on functionalities of man-made 3D objects characterized by human-object or object-object interactions. Our networks are trained on a database of scene contexts, called interaction contexts, each consisting of a central object and one or more surrounding objects, that represent object functionalities. Given a 3D object in isolation, our functional similarity network (fSIM-NET), a variation of the triplet network, is trained to predict the functionality of the object by inferring functionality-revealing interaction contexts. fSIM-NET is complemented by a generative network (iGEN-NET) and a segmentation network (iSEG-NET). iGEN-NET takes a single voxelized 3D object with a functionality label and synthesizes a voxelized surround, i.e., the interaction context which visually demonstrates the corresponding functionality. iSEG-NET further separates the interacting objects into different groups according to their interaction types.
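Since fSIM-NET is described as a variation of the triplet network, the underlying objective has the familiar form below; the margin value and the Euclidean distance are assumptions:

```python
import torch.nn.functional as F

def functional_triplet_loss(anchor, positive, negative, margin=0.3):
    """Pull an object embedding toward an interaction context of the same
    functionality and push it away from a different one. All inputs: (N, D)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```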
46. 2nd Place Solution for Waymo Open Dataset Challenge -- 2D Object Detection [PDF] 返回目录
Sijia Chen, Yu Wang, Li Huang, Runzhou Ge, Yihan Hu, Zhuangzhuang Ding, Jie Liao
Abstract: A practical autonomous driving system demands reliable and accurate detection of vehicles and persons. In this report, we introduce a state-of-the-art 2D object detection system for autonomous driving scenarios. Specifically, we integrate both a popular two-stage detector and a one-stage detector in anchor-free fashion to yield robust detection. Furthermore, we train multiple expert models and design a greedy version of the auto ensemble scheme that automatically merges detections from different models. Notably, our overall detection system achieves 70.28 L2 mAP on the Waymo Open Dataset v1.2, ranking 2nd place in the 2D detection track of the Waymo Open Dataset Challenges.
47. 1st Place Solutions for Waymo Open Dataset Challenges -- 2D and 3D Tracking [PDF] 返回目录
Yu Wang, Sijia Chen, Li Huang, Runzhou Ge, Yihan Hu, Zhuangzhuang Ding, Jie Liao
Abstract: This technical report presents the online and real-time 2D and 3D multi-object tracking (MOT) algorithms that reached 1st place on both the Waymo Open Dataset 2D tracking and 3D tracking challenges. An efficient and pragmatic online tracking-by-detection framework named HorizonMOT is proposed for camera-based 2D tracking in the image space and LiDAR-based 3D tracking in the 3D world space. Within the tracking-by-detection paradigm, our trackers leverage our high-performing detectors used in the 2D/3D detection challenges and achieved 45.13% 2D MOTA/L2 and 63.45% 3D MOTA/L2 in the 2D/3D tracking challenges.
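The association step at the heart of any tracking-by-detection pipeline can be sketched with Hungarian matching on IoU; the threshold and cost design below are generic assumptions, not HorizonMOT's specifics:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, iou_fn, iou_thresh=0.3):
    """Assign detections to existing tracks by maximizing total IoU.
    iou_fn(track, det) -> float in [0, 1]; returns matches and leftovers."""
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.array([[1.0 - iou_fn(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - iou_thresh]
    matched_r = {r for r, _ in matches}
    matched_c = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_r]
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_c]
    return matches, unmatched_tracks, unmatched_dets
```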
48. 1st Place Solution for Waymo Open Dataset Challenge -- 3D Detection and Domain Adaptation [PDF] 返回目录
Zhuangzhuang Ding, Yihan Hu, Runzhou Ge, Li Huang, Sijia Chen, Yu Wang, Jie Liao
Abstract: In this technical report, we introduce our winning solution "HorizonLiDAR3D" for the 3D detection track and the domain adaptation track in Waymo Open Dataset Challenge at CVPR 2020. Many existing 3D object detectors include prior-based anchor box design to account for different scales and aspect ratios and classes of objects, which limits its capability of generalization to a different dataset or domain and requires post-processing (e.g. Non-Maximum Suppression (NMS)). We proposed a one-stage, anchor-free and NMS-free 3D point cloud object detector AFDet, using object key-points to encode the 3D attributes, and to learn an end-to-end point cloud object detection without the need of hand-engineering or learning the anchors. AFDet serves as a strong baseline in our winning solution and significant improvements are made over this baseline during the challenges. Specifically, we design stronger networks and enhance the point cloud data using densification and point painting. To leverage camera information, we append/paint additional attributes to each point by projecting them to camera space and gathering image-based perception information. The final detection performance also benefits from model ensemble and Test-Time Augmentation (TTA) in both the 3D detection track and the domain adaptation track. Our solution achieves the 1st place with 77.11% mAPH/L2 and 69.49% mAPH/L2 respectively on the 3D detection track and the domain adaptation track.
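The "point painting" step, projecting LiDAR points into the image and appending camera-derived attributes, can be sketched as follows; this is a minimal assumed form, and the actual features appended by HorizonLiDAR3D may differ:

```python
import numpy as np

def paint_points(points, cam_feats, K, T_cam_lidar):
    """Project each LiDAR point into the image with intrinsics K (3x3) and
    extrinsics T_cam_lidar (4x4), then append the camera feature map value
    sampled at that pixel. points: (N, F) with xyz first; cam_feats: (H, W, C)."""
    xyz1 = np.concatenate([points[:, :3], np.ones((len(points), 1))], axis=1)
    cam = (T_cam_lidar @ xyz1.T).T[:, :3]               # points in camera frame
    uvw = (K @ cam.T).T
    uv = (uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)).astype(int)
    h, w = cam_feats.shape[:2]
    valid = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    painted = np.zeros((len(points), cam_feats.shape[2]), dtype=cam_feats.dtype)
    painted[valid] = cam_feats[uv[valid, 1], uv[valid, 0]]
    return np.concatenate([points, painted], axis=1)    # original attrs + image feats
```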
49. Video Representation Learning with Visual Tempo Consistency [PDF] 返回目录
Ceyuan Yang, Yinghao Xu, Bo Dai, Bolei Zhou
Abstract: Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by sampling the same instance at slow and fast frame rates respectively, we can obtain slow and fast video frames which share the same semantics but contain different visual tempos. Video representations learned from VTHCL achieve the competitive performances under the self-supervision evaluation protocol for action recognition on UCF-101 (82.1\%) and HMDB-51 (49.2\%). Moreover, we show that the learned representations are also generalized well to other downstream tasks including action detection on AVA and action anticipation on Epic-Kitchen. Finally, our empirical analysis suggests that a more thorough evaluation protocol is needed to verify the effectiveness of the self-supervised video representations across network structures and downstream tasks.
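A flat, single-level version of the contrastive objective between slow and fast views of the same clip; VTHCL applies this hierarchically, and the InfoNCE form and temperature here are assumptions:

```python
import torch
import torch.nn.functional as F

def tempo_contrastive_loss(z_slow, z_fast, tau=0.07):
    """InfoNCE between embeddings of the same clips sampled at slow and fast
    frame rates: matching pairs sit on the diagonal. z_*: (N, D)."""
    z_slow = F.normalize(z_slow, dim=1)
    z_fast = F.normalize(z_fast, dim=1)
    logits = z_slow @ z_fast.t() / tau            # (N, N) similarity matrix
    labels = torch.arange(z_slow.size(0), device=z_slow.device)
    return F.cross_entropy(logits, labels)        # positives on the diagonal

loss = tempo_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```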
50. Bottom-Up Human Pose Estimation by Ranking Heatmap-Guided Adaptive Keypoint Estimates [PDF] 返回目录
Ke Sun, Zigang Geng, Depu Meng, Bin Xiao, Dong Liu, Zhaoxiang Zhang, Jingdong Wang
Abstract: The typical bottom-up human pose estimation framework includes two stages: keypoint detection and grouping. Most existing works focus on developing grouping algorithms, e.g., associative embedding or the pixel-wise keypoint regression that we adopt in our approach. We present several schemes that have rarely or only superficially been studied before for improving keypoint detection and grouping (keypoint regression) performance. First, we exploit the keypoint heatmaps for pixel-wise keypoint regression, instead of separating the two tasks, to improve keypoint regression. Second, we adopt a pixel-wise spatial transformer network to learn adaptive representations that handle scale and orientation variance, further improving keypoint regression quality. Last, we present a joint shape and heat-value scoring scheme to promote estimated poses that are more likely to be true poses. Together with a tradeoff heatmap estimation loss that balances background and keypoint pixels and thus improves heatmap estimation quality, we obtain state-of-the-art bottom-up human pose estimation results. Code is available at this https URL.
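The tradeoff heatmap estimation loss is only described at a high level; one plausible reading is a weighted mean-squared error that up-weights the scarce keypoint pixels against the abundant background. The sketch below encodes that assumption (the weight and threshold values are illustrative, not from the paper):

    import torch

    def tradeoff_heatmap_loss(pred, target, keypoint_weight=10.0, thresh=0.01):
        """Weighted MSE that re-balances the few keypoint pixels against the
        many background pixels; a sketch of a tradeoff heatmap loss.

        pred, target : (B, K, H, W) predicted and ground-truth keypoint heatmaps.
        Pixels whose target value exceeds `thresh` are treated as keypoint
        pixels and up-weighted; everything else counts as background.
        """
        weights = torch.where(target > thresh,
                              torch.full_like(target, keypoint_weight),
                              torch.ones_like(target))
        return (weights * (pred - target) ** 2).mean()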
51. Automated Stitching of Coral Reef Images and Extraction of Features for Damselfish Shoaling Behavior Analysis [PDF] 返回目录
Riza Rae Pineda, Kristofer delas Peñas, Dana Manogan
Abstract: Behavior analysis of animals involves observing intraspecific and interspecific interactions among organisms in their environment. Collective behaviors such as herding in farm animals, flocking in birds, and shoaling and schooling in fish provide information on their benefits for collective survival, fitness, reproductive patterns, group decision-making, and animal epidemiology. In marine ethology, investigating the behavioral patterns of schooling species can provide supplemental information for the planning and management of marine resources. Currently, damselfish species, although prevalent in tropical waters, have no adequately established baseline behavior information. This limits reef managers in efficiently planning stress and disaster responses to protect the reef. Visual marine data captured in the wild are scarce and prone to multiple scene variations, primarily caused by motion and changes in the natural environment. The damselfish videos gathered for this research exhibit several scene distortions caused by erratic camera motion during acquisition. To effectively analyze shoaling behavior given the issues posed by capturing data in the wild, we propose a pre-processing system that applies color correction and image stitching techniques and extracts behavior features for manual analysis.
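A minimal version of such a pre-processing pipeline can be assembled from standard OpenCV parts. The sketch below is an assumption rather than the authors' exact system: it applies gray-world color correction (a common first-order fix for underwater color casts) and then OpenCV's built-in stitcher in scan (planar) mode.

    import cv2
    import numpy as np

    def gray_world_correct(img_bgr):
        """Gray-world color correction: rescale each channel so its mean
        matches the global mean across channels."""
        img = img_bgr.astype(np.float32)
        channel_means = img.reshape(-1, 3).mean(axis=0)
        img *= channel_means.mean() / np.clip(channel_means, 1e-6, None)
        return np.clip(img, 0, 255).astype(np.uint8)

    def stitch_reef_frames(frames):
        """Color-correct a list of BGR frames and stitch them into one mosaic."""
        corrected = [gray_world_correct(f) for f in frames]
        stitcher = cv2.Stitcher_create(cv2.Stitcher_SCANS)  # planar-scene mode
        status, mosaic = stitcher.stitch(corrected)
        if status != cv2.Stitcher_OK:
            raise RuntimeError(f"stitching failed with status {status}")
        return mosaic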
52. Frequency learning for image classification [PDF] 返回目录
José Augusto Stuchi, Levy Boccato, Romis Attux
Abstract: Machine learning applied to computer vision and signal processing is achieving results comparable to the human brain on specific tasks, thanks to the great improvements brought by deep neural networks (DNNs). The majority of state-of-the-art architectures nowadays are DNN-related, but only a few explore the frequency domain to extract useful information and improve the results, as is common in the image processing field. In this context, this paper presents a new approach for exploring the Fourier transform of the input images, composed of trainable frequency filters that boost discriminative components in the spectrum. Additionally, we propose a slicing procedure that allows the network to learn both global and local features from the frequency-domain representations of the image blocks. The proposed method proved competitive with well-known DNN architectures in the selected experiments, with the advantage of being a simpler and more lightweight model. This work also raises the discussion of how state-of-the-art DNN architectures can exploit not only spatial features but also frequency information in order to improve performance when solving real-world problems.
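A trainable frequency filter of this kind is straightforward to express in PyTorch: take the 2D Fourier transform of the input, keep the magnitudes, and multiply them element-wise by a learnable gain per spectral bin. The layer below is a minimal sketch under assumed input sizes, not the paper's exact architecture (which also includes the slicing procedure):

    import torch
    import torch.nn as nn

    class FrequencyFilterLayer(nn.Module):
        """Trainable element-wise filter applied to the 2D Fourier magnitude
        of the input image; a sketch of learning in the frequency domain."""

        def __init__(self, height, width):
            super().__init__()
            # One learnable gain per spectral bin (rfft2 halves the last axis).
            self.filter = nn.Parameter(torch.ones(height, width // 2 + 1))

        def forward(self, x):                     # x: (B, C, H, W)
            spectrum = torch.fft.rfft2(x)         # complex spectrum
            magnitude = spectrum.abs()            # keep magnitudes only
            return magnitude * self.filter        # boost/suppress frequency bins

    # A minimal frequency-domain classifier over 32x32 gray images (assumed sizes).
    model = nn.Sequential(
        FrequencyFilterLayer(32, 32),
        nn.Flatten(),
        nn.Linear(32 * 17, 10),
    )
    logits = model(torch.randn(4, 1, 32, 32))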
53. Interpretable Deepfake Detection via Dynamic Prototypes [PDF] 返回目录
Loc Trinh, Michael Tsang, Sirisha Rambhatla, Yan Liu
Abstract: Deepfakes are a notorious application of deep learning research, flooding social media with video content ridden with malicious intent. Detecting deepfake videos has therefore emerged as one of the most pressing challenges in AI research. Most state-of-the-art deepfake solutions are based on black-box models that process videos frame by frame for inference, and they do not consider temporal dynamics, which are key to how humans detect and explain deepfake videos. To this end, we propose the Dynamic Prototype Network (DPNet), a simple, interpretable, yet effective solution that leverages dynamic representations (i.e., prototypes) to explain deepfake visual dynamics. Experimental results show that the explanations of DPNet overlap better with the ground truth than those of state-of-the-art methods, with comparable prediction performance. Furthermore, we formulate temporal logic specifications based on these prototypes to check our model's compliance with desired temporal behaviors.
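The prototype idea can be sketched with a ProtoPNet-style layer: learnable prototype vectors are compared against the temporal features of a clip, and the resulting similarity scores feed a linear classifier. The code below is an assumed simplification of DPNet with illustrative names, using the standard log-ratio similarity rather than the paper's exact formulation:

    import torch
    import torch.nn as nn

    class PrototypeLayer(nn.Module):
        """Scores a clip by its similarity to learnable dynamic prototypes;
        a sketch of prototype-based interpretable classification."""

        def __init__(self, num_prototypes, feat_dim, num_classes):
            super().__init__()
            self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))
            self.classifier = nn.Linear(num_prototypes, num_classes, bias=False)

        def forward(self, feats):                     # feats: (B, T, D)
            protos = self.prototypes.unsqueeze(0).expand(feats.size(0), -1, -1)
            dists = torch.cdist(feats, protos) ** 2   # (B, T, P) squared distances
            min_d = dists.min(dim=1).values           # closest time step per prototype
            sims = torch.log((min_d + 1.0) / (min_d + 1e-4))  # log-ratio similarity
            return self.classifier(sims)              # (B, num_classes)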
54. AerialMPTNet: Multi-Pedestrian Tracking in Aerial Imagery Using Temporal and Graphical Features [PDF] 返回目录
Maximilian Kraus, Seyed Majid Azimi, Emec Ercelik, Reza Bahmanyar, Peter Reinartz, Alois Knoll
Abstract: Multi-pedestrian tracking in aerial imagery has several applications, such as large-scale event monitoring, disaster management, search-and-rescue missions, and input into predictive crowd dynamics models. Current state-of-the-art algorithms, including deep learning-based ones, perform poorly due to several challenges: the large number and tiny size of the pedestrians (e.g., 4 x 4 pixels), their similar appearances, and the varying scales, atmospheric conditions, and extremely low frame rates (e.g., 2 fps) of the images. In this paper, we propose AerialMPTNet, a novel approach for multi-pedestrian tracking in geo-referenced aerial imagery that fuses appearance features from a Siamese Neural Network, movement predictions from a Long Short-Term Memory, and pedestrian interconnections from a GraphCNN. In addition, to address the lack of diverse aerial pedestrian tracking datasets, we introduce the Aerial Multi-Pedestrian Tracking (AerialMPT) dataset, consisting of 307 frames with 44,740 annotated pedestrians. We believe that AerialMPT is the largest and most diverse such dataset to date, and it will be released publicly. We evaluate AerialMPTNet on AerialMPT and KIT AIS, and benchmark it against several state-of-the-art tracking methods. Results indicate that AerialMPTNet significantly outperforms other methods in accuracy and time-efficiency.
55. On the generalization of learning-based 3D reconstruction [PDF] 返回目录
Miguel Angel Bautista, Walter Talbott, Shuangfei Zhai, Nitish Srivastava, Joshua M Susskind
Abstract: State-of-the-art learning-based monocular 3D reconstruction methods learn priors over object categories in the training set and, as a result, struggle to generalize to object categories unseen during training. In this paper, we study the inductive biases encoded in the model architecture that impact the generalization of learning-based 3D reconstruction methods. We find that three inductive biases impact performance: the spatial extent of the encoder, the use of the underlying geometry of the scene to describe point features, and the mechanism used to aggregate information from multiple views. Additionally, we propose mechanisms to enforce those inductive biases: a point representation that is aware of camera position, and a variance cost to aggregate information across views. Our model achieves state-of-the-art results on the standard ShapeNet 3D reconstruction benchmark in various settings.
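The variance cost for cross-view aggregation admits a very small sketch: alongside the mean of a point's per-view features, keep their variance as an explicit signal of cross-view disagreement. The function below is an assumed minimal form of that mechanism, not the paper's implementation:

    import torch

    def aggregate_views(point_feats):
        """Aggregate per-view features of 3D points with a mean and a variance
        term; a sketch of variance-based multi-view aggregation.

        point_feats : (V, N, D) features of N points observed from V views.
        High variance across views signals inconsistent observations, so it is
        kept as an explicit input to downstream shape prediction.
        """
        mean = point_feats.mean(dim=0)                 # (N, D) consensus feature
        var = point_feats.var(dim=0, unbiased=False)   # (N, D) cross-view disagreement
        return torch.cat([mean, var], dim=-1)          # (N, 2D)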
56. Counting Out Time: Class Agnostic Video Repetition Counting in the Wild [PDF] 返回目录
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
Abstract: We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in constraining the period prediction module to use temporal self-similarity as an intermediate representation bottleneck, which allows generalization to unseen repetitions in videos in the wild. We train this model, called RepNet, with a synthetic dataset generated from a large unlabeled video collection by sampling short clips of varying lengths and repeating them with different periods and counts. This combination of synthetic data and a powerful yet constrained model allows us to predict periods in a class-agnostic fashion. Our model substantially exceeds state-of-the-art performance on existing periodicity (PERTUBE) and repetition counting (QUVA) benchmarks. We also collect a new challenging dataset called Countix (~90 times larger than existing datasets) which captures the challenges of repetition counting in real-world videos. Project webpage: this https URL.
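The temporal self-similarity bottleneck is simple to reproduce: embed each frame, compute all pairwise distances, and row-normalize with a softmax so that periodic motion appears as a striped pattern. The sketch below is a minimal version; the temperature is an illustrative hyperparameter, not the paper's value.

    import torch
    import torch.nn.functional as F

    def temporal_self_similarity(frame_embs, temperature=1.0):
        """Build a temporal self-similarity matrix as an information
        bottleneck for repetition counting.

        frame_embs : (T, D) per-frame embeddings of one video.
        Returns a (T, T) matrix where row t is a softmax over the negative
        squared distances from frame t to every other frame, so repeated
        motion shows up as periodic diagonal stripes.
        """
        dists = torch.cdist(frame_embs, frame_embs) ** 2   # (T, T) pairwise distances
        return F.softmax(-dists / temperature, dim=1)

    tsm = temporal_self_similarity(torch.randn(64, 512))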
57. Improving Interpretability of CNN Models Using Non-Negative Concept Activation Vectors [PDF] 返回目录
Ruihan Zhang, Prashan Madumal, Tim Miller, Kris Ehinger, Benjamin Rubinstein
Abstract: Convolutional neural network (CNN) models for computer vision are powerful but lack explainability in their most basic form. This deficiency remains a key challenge when applying CNNs in important domains. Recent work on explanation through the feature importance of approximate linear models has moved from input-level features (pixels or segments) to features from mid-layer feature maps, in the guise of concept activation vectors (CAVs). CAVs contain concept-level information and can be learned via clustering. In this work, we rethink the ACE algorithm of Ghorbani et al., proposing an alternative concept-based explanation framework. Based on the requirements of fidelity (approximating the model) and interpretability (being meaningful to people), we design measurements and evaluate a range of dimensionality reduction methods for alignment with our framework. We find that non-negative concept activation vectors obtained from non-negative matrix factorization provide superior performance in interpretability and fidelity, based on computational and human subject experiments. Our framework provides both local and global concept-level explanations for pre-trained CNN models.
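Extracting non-negative CAVs can be sketched with scikit-learn's NMF applied to flattened mid-layer activations, treating every spatial location as a sample. The function below is an assumed minimal pipeline, not the authors' full framework; the component count is illustrative.

    import numpy as np
    from sklearn.decomposition import NMF

    def concept_activation_vectors(feature_maps, n_concepts=10):
        """Extract non-negative concept activation vectors from mid-layer CNN
        activations via NMF; a sketch of the NMF alternative to clustering.

        feature_maps : (N, H, W, C) post-ReLU activations (non-negative).
        Returns per-location concept weights of shape (N*H*W, n_concepts) and
        the concept vectors in channel space, of shape (n_concepts, C).
        """
        n, h, w, c = feature_maps.shape
        flat = feature_maps.reshape(-1, c)            # every spatial location is a sample
        nmf = NMF(n_components=n_concepts, init="nndsvda", max_iter=500)
        weights = nmf.fit_transform(flat)             # concept presence per location
        concepts = nmf.components_                    # non-negative CAVs
        return weights, concepts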
58. Deep Sea Robotic Imaging Simulator for UUV Development [PDF] 返回目录
Yifan Song, David Nakath, Mengkun She, Furkan Elibol, Kevin Köser
Abstract: Nowadays, underwater vision systems are widely applied in ocean research. However, the largest portion of the ocean, the deep sea, still remains mostly unexplored. Only relatively few image sets have been taken in the deep sea, due to the physical limitations imposed by technical challenges and enormous costs. The shortage of deep sea images, and of the corresponding ground truth data for evaluation and training, is becoming a bottleneck for the development of underwater computer vision methods. Thus, this paper presents a physical model-based image simulation solution that uses in-air imagery and depth information as inputs to generate underwater images for robotics in deep ocean scenarios. Compared to shallow water conditions, active lighting is required to illuminate the scene in the deep sea. Our radiometric image formation model considers both attenuation and scattering effects with co-moving light sources in the dark. Additionally, we incorporate geometric distortion (refraction), which is caused by the thick glass housings commonly employed in deep sea conditions. Through detailed analysis and evaluation of the underwater image formation model, we propose a 3D lookup table structure in combination with a novel rendering strategy to improve simulation performance, which enables us to integrate and implement interactive deep sea robotic vision simulation in the Gazebo-based Unmanned Underwater Vehicles simulator.
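The attenuation-plus-scattering part of such a model can be sketched with the classic exponential formation equation I = J * exp(-beta d) + B * (1 - exp(-beta d)). The function below is a strongly simplified version that assumes ambient veiling light rather than the paper's co-moving artificial sources; the coefficient values are illustrative, not calibrated.

    import numpy as np

    def render_underwater(in_air_rgb, depth_m,
                          beta=(0.8, 0.35, 0.25), b_inf=(0.05, 0.25, 0.30)):
        """Attenuation-plus-backscatter image formation, a simplified sketch.

        in_air_rgb : (H, W, 3) in-air image in [0, 1].
        depth_m    : (H, W) per-pixel distance in meters.
        beta       : per-channel attenuation coefficients (red decays fastest).
        b_inf      : per-channel veiling light (backscatter at infinite distance).
        """
        beta = np.asarray(beta, dtype=np.float32)
        b_inf = np.asarray(b_inf, dtype=np.float32)
        transmission = np.exp(-beta * depth_m[..., None])     # (H, W, 3)
        return in_air_rgb * transmission + b_inf * (1.0 - transmission)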
59. Light Pose Calibration for Camera-light Vision Systems [PDF] 返回目录
Yifan Song, Furkan Elibol, Mengkun She, David Nakath, Kevin Köser
Abstract: Illuminating a scene with artificial light is a prerequisite for seeing in dark environments. However, nonuniform and dynamic illumination can deteriorate or even break computer vision approaches, for instance when operating a robot with headlights in the darkness. This paper presents a novel light calibration approach that takes multi-view, multi-distance images of a reference plane in order to provide the poses of the employed light sources to the computer vision system. Following a physical light propagation model, under consideration of energy preservation, the light poses are estimated by minimizing the differences between real and rendered pixel intensities. In the evaluation, we show the robustness and consistency of this method by statistically analyzing the light pose estimation results under different setups. Although the results are demonstrated using a rotationally symmetric, non-isotropic light, the method is also suited for non-symmetric lights.
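The minimization of real-versus-rendered intensities can be sketched as a small nonlinear least-squares problem. The code below assumes an isotropic point light with inverse-square falloff and Lambertian reflection, which is simpler than the paper's non-isotropic model; the names and initial guess are illustrative.

    import numpy as np
    from scipy.optimize import least_squares

    def fit_light_position(surface_pts, normals, intensities):
        """Recover a point-light position (and power) from observed intensities
        on a reference plane by least squares; a simplified photometric sketch.

        surface_pts : (N, 3) 3D points on the calibration plane.
        normals     : (N, 3) unit surface normals at those points.
        intensities : (N,) observed pixel intensities (vignetting etc. ignored).
        """
        def rendered(params):
            light_pos, power = params[:3], params[3]
            to_light = light_pos - surface_pts
            dist2 = np.sum(to_light ** 2, axis=1)
            cos = np.sum(normals * to_light, axis=1) / np.sqrt(dist2)
            return power * np.clip(cos, 0, None) / dist2   # inverse-square + Lambert

        x0 = np.array([0.0, 0.0, 1.0, 1.0])                # initial guess above plane
        fit = least_squares(lambda p: rendered(p) - intensities, x0)
        return fit.x[:3], fit.x[3]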
60. MTStereo 2.0: improved accuracy of stereo depth estimation with Max-trees [PDF] 返回目录
Rafael Brandt, Nicola Strisciuglio, Nicolai Petkov
Abstract: Efficient yet accurate extraction of depth from stereo image pairs is required by systems with low power resources, such as robotics and embedded systems. State-of-the-art stereo matching methods based on convolutional neural networks require intensive computations on GPUs and are difficult to deploy on embedded systems. In this paper, we propose a stereo matching method, called MTStereo 2.0, for limited-resource systems that require efficient and accurate depth estimation. It is based on a Max-tree hierarchical representation of image pairs, which we use to identify matching regions along image scan-lines. The method includes a cost function that considers the similarity of region contextual information based on the Max-trees, and a disparity-border-preserving cost aggregation approach. MTStereo 2.0 improves on its predecessor MTStereo 1.0 as it (a) deploys a more robust cost function, (b) performs more thorough detection of incorrect matches, and (c) computes disparity maps with pixel-level rather than node-level precision. MTStereo provides accurate sparse and semi-dense depth estimation and does not require intensive GPU computations like methods based on CNNs. Thus it can run on embedded and robotics devices with low-power requirements. We tested the proposed approach on several benchmark data sets, namely KITTI 2015, Driving, FlyingThings3D, Middlebury 2014, Monkaa and the TrimBot2020 garden data sets, and achieved competitive accuracy and efficiency. The code is available at this https URL.
61. ReMarNet: Conjoint Relation and Margin Learning for Small-Sample Image Classification [PDF] 返回目录
Xiaoxu Li, Liyun Yu, Xiaochen Yang, Zhanyu Ma, Jing-Hao Xue, Jie Cao, Jun Guo
Abstract: Despite achieving state-of-the-art performance, deep learning methods generally require a large amount of labeled data during training and may suffer from overfitting when the sample size is small. To ensure good generalizability of deep networks under small sample sizes, learning discriminative features is crucial. To this end, several loss functions have been proposed to encourage large intra-class compactness and inter-class separability. In this paper, we propose to enhance the discriminative power of features from a new perspective by introducing a novel neural network termed the Relation-and-Margin learning Network (ReMarNet). Our method assembles two networks with different backbones so as to learn features that perform excellently under both of the aforementioned classification mechanisms. Specifically, a relation network is used to learn features that support classification based on the similarity between a sample and a class prototype; meanwhile, a fully connected network with a cross-entropy loss is used for classification via the decision boundary. Experiments on four image datasets demonstrate that our approach is effective in learning discriminative features from a small set of labeled samples and achieves competitive performance against state-of-the-art methods. Code is available at this https URL.
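The conjoint training of the two mechanisms can be sketched as a weighted sum of a prototype-similarity loss and the usual cross entropy over one shared embedding. The function below is an assumed simplification, with batch-mean prototypes and an illustrative mixing weight alpha:

    import torch
    import torch.nn.functional as F

    def conjoint_loss(feats, labels, fc_logits, num_classes, alpha=0.5):
        """Combine a relation-style prototype loss with the usual cross
        entropy; a sketch of training one feature space under two
        classification mechanisms.

        feats     : (B, D) embeddings from the shared backbone.
        fc_logits : (B, C) logits from the fully connected classifier head.
        """
        # Class prototypes: mean embedding of each class present in the batch
        # (absent classes fall back to a zero prototype in this sketch).
        protos = torch.stack([feats[labels == c].mean(0) if (labels == c).any()
                              else torch.zeros_like(feats[0])
                              for c in range(num_classes)])
        # Similarity of each sample to each prototype drives the relation branch.
        rel_logits = -torch.cdist(feats, protos)          # closer => higher score
        rel_loss = F.cross_entropy(rel_logits, labels)
        ce_loss = F.cross_entropy(fc_logits, labels)
        return alpha * rel_loss + (1 - alpha) * ce_loss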
62. An Evoked Potential-Guided Deep Learning Brain Representation For Visual Classification [PDF] 返回目录
Xianglin Zheng, Zehong Cao, Quan Bai
Abstract: A new perspective in visual classification aims to decode the feature representation of visual objects from human brain activities. Recording electroencephalograms (EEG) from the brain cortex has been seen as a prevalent approach to understanding the cognitive process of an image classification task. In this study, we propose a deep learning framework guided by visual evoked potentials extracted from EEG signals, called the Event-Related Potential (ERP)-Long Short-Term Memory (LSTM) framework, for visual classification. Specifically, we first extract ERP sequences from multiple EEG channels in response to image stimuli. Then, we train an LSTM network to learn the feature representation space of visual objects for classification. In the experiment, over 50,000 EEG trials were recorded from 10 subjects viewing an image dataset with 6 categories and a total of 72 exemplars. Our results show that the proposed ERP-LSTM framework achieves cross-subject classification accuracies of 66.81% and 27.08% for categories (6 classes) and exemplars (72 classes), respectively. Our results outperform existing visual classification frameworks, improving classification accuracies by 12.62% to 53.99%. Our findings suggest that decoding visual evoked potentials from EEG signals is an effective strategy for learning discriminative brain representations for visual classification.
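The ERP-LSTM pipeline reduces to a compact model: an LSTM consumes the multi-channel ERP sequence and a linear head classifies the final hidden state. The sketch below uses assumed layer sizes and channel counts purely for illustration:

    import torch
    import torch.nn as nn

    class ERPLSTM(nn.Module):
        """LSTM over ERP sequences extracted from multi-channel EEG; a minimal
        sketch of an ERP-LSTM classifier (layer sizes are assumptions)."""

        def __init__(self, n_channels=64, hidden=128, n_classes=6):
            super().__init__()
            self.lstm = nn.LSTM(input_size=n_channels, hidden_size=hidden,
                                batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, erp):                   # erp: (B, T, n_channels)
            _, (h_n, _) = self.lstm(erp)          # final hidden state summarizes the trial
            return self.head(h_n[-1])             # (B, n_classes)

    logits = ERPLSTM()(torch.randn(8, 200, 64))   # 200 time points per ERP epoch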
63. PCLNet: A Practical Way for Unsupervised Deep PolSAR Representations and Few-Shot Classification [PDF] 返回目录
Lamei Zhang, Siyu Zhang, Bin Zou, Hongwei Dong
Abstract: Deep learning and convolutional neural networks (CNNs) have made progress in polarimetric synthetic aperture radar (PolSAR) image classification over the past few years. However, a crucial issue has not been addressed: the requirement of CNNs for abundant labeled samples versus the scarcity of human annotations for PolSAR images. It is well known that following the supervised learning paradigm may lead to overfitting of the training data, and the lack of supervision information in PolSAR images undoubtedly aggravates this problem, which greatly affects the generalization performance of CNN-based classifiers in large-scale applications. To handle this problem, in this paper, learning transferable representations from unlabeled PolSAR data through convolutional architectures is explored for the first time. Specifically, a PolSAR-tailored contrastive learning network (PCLNet) is proposed for unsupervised deep PolSAR representation learning and few-shot classification. Unlike approaches that simply reuse techniques from optical image processing, a diversity stimulation mechanism is constructed to narrow the application gap between optics and PolSAR. Beyond conventional supervised methods, PCLNet develops an auxiliary pre-training phase based on the proxy objective of contrastive instance discrimination to learn useful representations from unlabeled PolSAR data. The acquired representations are transferred to the downstream task, i.e., few-shot PolSAR classification. Experiments on two widely used PolSAR benchmark datasets confirm the validity of PCLNet. Besides, this work may shed light on how to efficiently utilize massive unlabeled PolSAR data to alleviate the greedy demand of CNN-based methods for human annotations.
64. MiniNet: An extremely lightweight convolutional neural network for real-time unsupervised monocular depth estimation [PDF] 返回目录
Jun Liu, Qing Li, Rui Cao, Wenming Tang, Guoping Qiu
Abstract: Predicting depth from a single image is an attractive research topic, since it provides one more dimension of information that enables machines to better perceive the world. Recently, deep learning has emerged as an effective approach to monocular depth estimation. As obtaining labeled data is costly, there is a recent trend to move from supervised learning to unsupervised learning to obtain monocular depth. However, most unsupervised learning methods capable of achieving high depth prediction accuracy require a deep network architecture that is too heavy and complex to run on embedded devices with limited storage and memory. To address this issue, we propose a new powerful network with a recurrent module that achieves the capability of a deep network while maintaining an extremely lightweight size for real-time, high-performance unsupervised monocular depth prediction from video sequences. Besides, a novel efficient upsampling block is proposed to fuse the features from the associated encoder layer and recover the spatial size of features with a small number of model parameters. We validate the effectiveness of our approach via extensive experiments on the KITTI dataset. Our new model can run at a speed of about 110 frames per second (fps) on a single GPU, 37 fps on a single CPU, and 2 fps on a Raspberry Pi 3. Moreover, it achieves higher depth accuracy with nearly 33 times fewer model parameters than state-of-the-art models. To the best of our knowledge, this work is the first extremely lightweight neural network trained on monocular video sequences for real-time unsupervised monocular depth estimation, which opens up the possibility of implementing deep learning-based real-time unsupervised monocular depth prediction on low-cost embedded devices.
65. Compositional Video Synthesis with Action Graphs [PDF] 返回目录
Amir Bar, Roei Herzig, Xiaolong Wang, Gal Chechik, Trevor Darrell, Amir Globerson
Abstract: Videos of actions are complex spatio-temporal signals containing rich compositional structure. Current generative models are limited in their ability to generate examples of object configurations outside the range they were trained on. To this end, we introduce a generative model (AG2Vid) based on Action Graphs, a natural and convenient structure that represents the dynamics of actions between objects over time. Our AG2Vid model disentangles appearance and position features, allowing for more accurate generation. AG2Vid is evaluated on the CATER and Something-Something datasets and outperforms other baselines. Finally, we show how Action Graphs can be used to generate novel compositions of unseen actions.
摘要:动作影片是复杂的时空信号,含有丰富的组成结构。当前生成模型在其产生,他们被训练在范围之外的对象配置的例子能力有限。为此,我们推出基于行动图表,表示随着时间的推移物体之间的行动力度自然和方便的结构生成模型(AG2Vid)。我们AG2Vid模式理顺了那些纷繁外观和位置的功能,让更精确的产生。 AG2Vid是在迎合什么东西出头的数据集和优于其他基线评估。最后,我们将展示行动图表可以如何用于产生看不见的动作新颖组合物。
66. Interactive Deep Refinement Network for Medical Image Segmentation [PDF] 返回目录
Titinunt Kitrungrotsakul, Iwamoto Yutaro, Lanfen Lin, Ruofeng Tong, Jingsong Li, Yen-Wei Chen
Abstract: Deep learning techniques have successfully been employed in numerous computer vision tasks including image segmentation. The techniques have also been applied to medical image segmentation, one of the most critical tasks in computer-aided diagnosis. Compared with natural images, the medical image is a gray-scale image with low-contrast (even with some invisible parts). Because some organs have similar intensity and texture with neighboring organs, there is usually a need to refine automatic segmentation results. In this paper, we propose an interactive deep refinement framework to improve traditional semantic segmentation networks such as U-Net and the fully convolutional network. In the proposed framework, we added a refinement network to the traditional segmentation network to refine the segmentation results. Experimental results with a public dataset revealed that the proposed method could achieve higher accuracy than other state-of-the-art methods.
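A minimal sketch of the refinement idea described above (not the authors' architecture): a base network predicts a mask, and a second network re-predicts it from the image concatenated with the initial prediction. An interactive hint channel (e.g. user scribbles) could be appended the same way. Layer sizes are placeholders.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinySegNet(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.body = nn.Sequential(conv_block(in_ch, 16), conv_block(16, 16),
                                  nn.Conv2d(16, 1, 1))
    def forward(self, x):
        return self.body(x)  # logits, same spatial size as the input

base = TinySegNet(in_ch=1)     # gray-scale medical image
refine = TinySegNet(in_ch=2)   # image + initial mask probability

x = torch.randn(1, 1, 64, 64)
init_prob = torch.sigmoid(base(x))
refined_logits = refine(torch.cat([x, init_prob], dim=1))
print(refined_logits.shape)    # torch.Size([1, 1, 64, 64])
```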
67. Making DensePose fast and light [PDF] 返回目录
Ruslan Rakhimov, Emil Bogomolov, Alexandr Notchenko, Fung Mao, Alexey Artemov, Denis Zorin, Evgeny Burnaev
Abstract: The DensePose estimation task is a significant step forward for enhancing user experience in computer vision applications ranging from augmented reality to cloth fitting. Existing neural network models capable of solving this task are heavily parameterized and a long way from being transferred to an embedded or mobile device. To enable DensePose inference on an end device with current models, one needs to support an expensive server-side infrastructure and have a stable internet connection. To make things worse, mobile and embedded devices do not always have a powerful GPU inside. In this work, we target the problem of redesigning the DensePose R-CNN model's architecture so that the final network retains most of its accuracy but becomes more light-weight and fast. To achieve that, we tested and incorporated many deep learning innovations from recent years, specifically performing an ablation study on 23 efficient backbone architectures, multiple two-stage detection pipeline modifications, and custom model quantization methods. As a result, we achieved a $17\times$ model size reduction and a $2\times$ latency improvement compared to the baseline model.
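Of the ingredients listed, model quantization is the easiest to illustrate. The authors use custom quantization methods; the sketch below only shows PyTorch's off-the-shelf dynamic quantization on a toy model, as a stand-in for the general idea.

```python
import torch
import torch.nn as nn

# Dynamic quantization packs the weights of supported layers into int8 and
# quantizes activations on the fly, shrinking model size and CPU latency.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64))
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
print(quantized(x).shape)  # torch.Size([1, 64]), same interface as before

def size_mb(m):
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e6
print(f"fp32 params: {size_mb(model):.2f} MB")  # quantized weights are packed
```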
68. Region-of-interest guided Supervoxel Inpainting for Self-supervision [PDF] 返回目录
Subhradeep Kayal, Shuai Chen, Marleen de Bruijne
Abstract: Self-supervised learning has proven to be invaluable in making best use of all of the available data in biomedical image segmentation. One particularly simple and effective mechanism to achieve self-supervision is inpainting, the task of predicting arbitrary missing areas based on the rest of an image. In this work, we focus on image inpainting as the self-supervised proxy task, and propose two novel structural changes to further enhance the performance of a deep neural network. We guide the process of generating images to inpaint by using supervoxel-based masking instead of random masking, and also by focusing on the area to be segmented in the primary task, which we term as the region-of-interest. We postulate that these additions force the network to learn semantics that are more attuned to the primary task, and test our hypotheses on two applications: brain tumour and white matter hyperintensities segmentation. We empirically show that our proposed approach consistently outperforms both supervised CNNs, without any self-supervision, and conventional inpainting-based self-supervision methods on both large and small training set sizes.
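A dependency-free sketch of the pretext task: hide regions inside the region-of-interest and train a network to reconstruct them. Coarse cubic blocks stand in here for true supervoxels (which would normally come from an oversegmentation such as SLIC); the ROI, block size, and masking probability are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vol = rng.normal(size=(32, 32, 32)).astype(np.float32)  # toy 3D image
roi = np.zeros_like(vol, dtype=bool)
roi[8:24, 8:24, 8:24] = True                            # toy region-of-interest

block = 8
mask = np.zeros_like(roi)
for z in range(0, 32, block):
    for y in range(0, 32, block):
        for x in range(0, 32, block):
            sl = (slice(z, z + block), slice(y, y + block), slice(x, x + block))
            # mask only blocks overlapping the ROI, with probability 0.5
            if roi[sl].any() and rng.random() < 0.5:
                mask[sl] = True

corrupted = vol.copy()
corrupted[mask] = 0.0   # network input; `vol` is the inpainting target
print(mask.mean())      # fraction of voxels hidden
```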
69. Bookworm continual learning: beyond zero-shot learning and continual learning [PDF] 返回目录
Kai Wang, Luis Herranz, Anjan Dutta, Joost van de Weijer
Abstract: We propose bookworm continual learning(BCL), a flexible setting where unseen classes can be inferred via a semantic model, and the visual model can be updated continually. Thus BCL generalizes both continual learning (CL) and zero-shot learning (ZSL). We also propose the bidirectional imagination (BImag) framework to address BCL where features of both past and future classes are generated. We observe that conditioning the feature generator on attributes can actually harm the continual learning ability, and propose two variants (joint class-attribute conditioning and asymmetric generation) to alleviate this problem.
70. Large Deformation Diffeomorphic Image Registration with Laplacian Pyramid Networks [PDF] 返回目录
Tony C.W. Mok, Albert C.S. Chung
Abstract: Deep learning-based methods have recently demonstrated promising results in deformable image registration for a wide range of medical image analysis tasks. However, existing deep learning-based methods are usually limited to small deformation settings, and desirable properties of the transformation including bijective mapping and topology preservation are often being ignored by these approaches. In this paper, we propose a deep Laplacian Pyramid Image Registration Network, which can solve the image registration optimization problem in a coarse-to-fine fashion within the space of diffeomorphic maps. Extensive quantitative and qualitative evaluations on two MR brain scan datasets show that our method outperforms the existing methods by a significant margin while maintaining desirable diffeomorphic properties and promising registration speed.
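The abstract does not state how the diffeomorphic maps are parameterised; a standard construction in learning-based registration is to integrate a stationary velocity field $v$ by scaling and squaring, which yields the bijective, topology-preserving behaviour mentioned above:

```latex
% Scaling and squaring: integrate a stationary velocity field v over unit time.
\phi^{(1/2^{T})} = \mathrm{Id} + \frac{v}{2^{T}}, \qquad
\phi^{(1/2^{t-1})} = \phi^{(1/2^{t})} \circ \phi^{(1/2^{t})}
\quad (t = T, \dots, 1), \qquad
\phi \equiv \phi^{(1)} = \exp(v)
```

The resulting $\phi = \exp(v)$ is invertible with inverse $\exp(-v)$, which is what makes the map diffeomorphic by construction; this is offered only as the standard recipe, not as the authors' exact parameterisation.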
71. Shape from Projections via Differentiable Forward Projector [PDF] 返回目录
Jakeoung Koo, Qiongyang Chen, J. Andreas Bærentzen, Anders B. Dahl, Vedrana A. Dahl
Abstract: In tomography, forward projection of 3D meshes has been mostly studied to simulate data acquisition. However, such works did not consider an inverse process of estimating shapes from projections. In this paper, we propose a differentiable forward projector for 3D meshes, to bridge the gap between the forward model for 3D surfaces and optimization. We view the forward projection as a rendering process, and make it differentiable by extending a recent work in differentiable rasterization. We use the proposed forward projector to reconstruct 3D shapes directly from projections. Experimental results for single-object problems show that our method outperforms the traditional voxel-based methods on noisy simulated data. We also apply our method on real data from electron tomography to estimate the shapes of some nanoparticles.
72. COVID-19 detection using Residual Attention Network an Artificial Intelligence approach [PDF] 返回目录
Vishal Sharma, Curtis Dyreson
Abstract: Coronavirus Disease 2019 (COVID-19) is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus transmits rapidly; it has a basic reproductive number ($R_0$) of $2.2-2.7$. In March 2020, the World Health Organization declared the COVID-19 outbreak a pandemic. Effective testing for COVID-19 is crucial to controlling the outbreak since infected patients can be quarantined. But the demand for testing outstrips the availability of test kits that use Reverse Transcription Polymerase Chain Reaction (RT-PCR). In this paper, we present a technique to detect COVID-19 using Artificial Intelligence. Our technique takes only a few seconds to detect the presence of the virus in a patient. We collected a dataset of chest X-ray images and trained several popular deep convolutional neural network-based models (VGG, MobileNet, Xception, DenseNet, InceptionResNet) to classify chest X-rays. Unsatisfied with these models, we then designed and built a Residual Attention Network that was able to detect COVID-19 with a testing accuracy of 98\% and a validation accuracy of 100\%. Feature maps of our model show which areas within a chest X-ray are important for classification. Our work can help to increase the adoption of AI-assisted applications in clinical practice.
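The residual attention mechanism the title refers to follows the original Residual Attention Network idea (Wang et al., 2017): a soft mask branch gates a trunk branch as $(1 + M(x)) \cdot T(x)$, so an uninformative mask near zero leaves the trunk signal and its gradients intact. A minimal PyTorch sketch, with illustrative channel counts rather than the authors' design:

```python
import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.mask = nn.Sequential(          # cheap down-up soft mask branch
            nn.MaxPool2d(2),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, x):
        t = self.trunk(x)
        m = self.mask(x)
        return (1 + m) * t                  # residual attention gating

x = torch.randn(2, 8, 64, 64)               # e.g. chest X-ray feature maps
print(ResidualAttentionBlock(8)(x).shape)    # torch.Size([2, 8, 64, 64])
```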
73. Interpretable Deep Learning for Pattern Recognition in Brain Differences Between Men and Women [PDF] 返回目录
Maxim Kan, Ruslan Aliev, Anna Rudenko, Nikita Drobyshev, Nikita Petrashen, Ekaterina Kondrateva, Maxim Sharaev, Alexander Bernstein, Evgeny Burnaev
Abstract: Deep learning shows high potential for many medical image analysis tasks. Neural networks work with full-size data without extensive preprocessing and feature generation and, thus, information loss. Recent work has shown that morphological difference between specific brain regions can be found on MRI with deep learning techniques. We consider the pattern recognition task based on a large open-access dataset of healthy subjects - an exploration of brain differences between men and women. However, interpretation of the lately proposed models is based on a region of interest and can not be extended to pixel or voxel-wise image interpretation, which is considered to be more informative. In this paper, we confirm the previous findings in sex differences from diffusion-tensor imaging on T1 weighted brain MRI scans. We compare the results of three voxel-based 3D CNN interpretation methods: Meaningful Perturbations, GradCam and Guided Backpropagation and provide the open-source code.
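Of the three interpretation methods compared, Grad-CAM is straightforward to sketch: weight each channel of a chosen feature map by the mean gradient of the class score with respect to that channel, then take a ReLU of the weighted sum. The toy 2D model below stands in for the paper's 3D CNN; the recipe is identical in 3D.

```python
import torch
import torch.nn as nn

def grad_cam(model, feat_layer, x, class_idx):
    feats, grads = [], []
    h1 = feat_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = feat_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(x)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    f, g = feats[0], grads[0]             # both (1, C, H, W)
    w = g.mean(dim=(2, 3), keepdim=True)  # per-channel importance
    return torch.relu((w * f).sum(dim=1)) # (1, H, W) saliency map

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
cam = grad_cam(model, model[0], torch.randn(1, 1, 32, 32), class_idx=1)
print(cam.shape)  # torch.Size([1, 32, 32])
```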
74. Multi-level colonoscopy malignant tissue detection with adversarial CAC-UNet [PDF] 返回目录
Chuang Zhu, Ke Mei, Ting Peng, Yihao Luo, Jun Liu, Ying Wang, Mulan Jin
Abstract: The automatic and objective medical diagnostic model can be valuable to achieve early cancer detection, and thus reducing the mortality rate. In this paper, we propose a highly efficient multi-level malignant tissue detection through the designed adversarial CAC-UNet. A patch-level model with a pre-prediction strategy and a malignancy area guided label smoothing is adopted to remove the negative WSIs, with which to lower the risk of false positive detection. For the selected key patches by multi-model ensemble, an adversarial context-aware and appearance consistency UNet (CAC-UNet) is designed to achieve robust segmentation. In CAC-UNet, mirror designed discriminators are able to seamlessly fuse the whole feature maps of the skillfully designed powerful backbone network without any information loss. Besides, a mask prior is further added to guide the accurate segmentation mask prediction through an extra mask-domain discriminator. The proposed scheme achieves the best results in MICCAI DigestPath2019 challenge on colonoscopy tissue segmentation and classification task. The full implementation details and the trained models are available at this https URL.
75. Interpreting and Disentangling Feature Components of Various Complexity from DNNs [PDF] 返回目录
Jie Ren, Mingjie Li, Zexu Liu, Quanshi Zhang
Abstract: This paper aims to define, quantify, and analyze the feature complexity that is learned by a DNN. We propose a generic definition for the feature complexity. Given the feature of a certain layer in the DNN, our method disentangles feature components of different complexity orders from the feature. We further design a set of metrics to evaluate the reliability, the effectiveness, and the significance of over-fitting of these feature components. Furthermore, we successfully discover a close relationship between the feature complexity and the performance of DNNs. As a generic mathematical tool, the feature complexity and the proposed metrics can also be used to analyze the success of network compression and knowledge distillation.
76. OpenDVC: An Open Source Implementation of the DVC Video Compression Method [PDF] 返回目录
Ren Yang, Luc Van Gool, Radu Timofte
Abstract: We introduce an open-source TensorFlow implementation of the Deep Video Compression (DVC) method in this technical report. DVC is the first end-to-end optimized learned video compression method, achieving better MS-SSIM performance than the Low-Delay P (LDP) very fast setting of x265 and comparable PSNR performance with x265 (LDP very fast). At the time of writing this report, several learned video compression methods are superior to DVC, but currently none of them provides open-source code. We hope that our OpenDVC code can provide a useful model for further development and facilitate future research on learned video compression. Different from the original DVC, which is only optimized for PSNR, we release not only the PSNR-optimized re-implementation, denoted by OpenDVC (PSNR), but also the MS-SSIM-optimized model OpenDVC (MS-SSIM). Our OpenDVC (MS-SSIM) model provides a more convincing baseline for MS-SSIM-optimized methods, which previously could only compare with the PSNR-optimized DVC. The OpenDVC source code and pre-trained models are publicly released at this https URL.
77. End-to-End Differentiable Learning to HDR Image Synthesis for Multi-exposure Images [PDF] 返回目录
Jung Hee Kim, Siyeong Lee, Soyeon Jo, Suk-Ju Kang
Abstract: Recent deep learning-based methods have reconstructed a high dynamic range (HDR) image from a single low dynamic range (LDR) image by focusing on the exposure transfer task to reconstruct the multi-exposure stack. However, these methods often fail to fuse the multi-exposure stack into a perceptually pleasant HDR image, as local inversion artifacts are formed in the HDR imaging (HDRI) process. The artifacts arise from the impossibility of learning the whole HDRI process due to the non-differentiable structure of the camera response recovery. Therefore, we tackle the major challenge in stack reconstruction-based methods by proposing a novel framework with a fully differentiable HDRI process. Our framework enables a neural network to train HDR image generation based on an end-to-end structure. Hence, a deep neural network can learn the precise correlations between multi-exposure images in the HDRI process using our differentiable HDR synthesis layer. In addition, our network uses image decomposition and a recursive process to facilitate the exposure transfer task and to adaptively respond to recursion frequency. The experimental results show that the proposed network outperforms the state of the art both quantitatively and qualitatively in terms of the exposure transfer tasks and the whole HDRI process.
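For orientation, the classical multi-exposure merge that an HDR synthesis layer must express (the abstract does not give the authors' exact formulation) is the weighted fusion of Debevec and Malik (1997); recovering the inverse camera response $g$ below is the step the abstract calls non-differentiable:

```latex
% Classical merge of a multi-exposure stack (Debevec & Malik, 1997).
% Z_{ij}: pixel i in exposure j, \Delta t_j: exposure time,
% g: inverse camera response, w: weight favouring well-exposed pixels.
\ln E_i \;=\; \frac{\sum_{j} w(Z_{ij})\,\bigl(g(Z_{ij}) - \ln \Delta t_j\bigr)}
                  {\sum_{j} w(Z_{ij})}
```

This is the classical baseline formula, quoted here as context rather than as the paper's differentiable layer.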
78. Confidence-rich grid mapping [PDF] 返回目录
Ali-akbar Agha-mohammadi, Eric Heiden, Karol Hausman, Gaurav S. Sukhatme
Abstract: Representing the environment is a fundamental task in enabling robots to act autonomously in unknown environments. In this work, we present confidence-rich mapping (CRM), a new algorithm for spatial grid-based mapping of the 3D environment. CRM augments the occupancy level at each voxel by its confidence value. By explicitly storing and evolving confidence values using the CRM filter, CRM extends traditional grid mapping in three ways: first, it partially maintains the probabilistic dependence among voxels. Second, it relaxes the need for hand-engineering an inverse sensor model and proposes the concept of sensor cause model that can be derived in a principled manner from the forward sensor model. Third, and most importantly, it provides consistent confidence values over the occupancy estimation that can be reliably used in collision risk evaluation and motion planning. CRM runs online and enables mapping environments where voxels might be partially occupied. We demonstrate the performance of the method on various datasets and environments in simulation and on physical systems. We show in real-world experiments that, in addition to achieving maps that are more accurate than traditional methods, the proposed filtering scheme demonstrates a much higher level of consistency between its error and the reported confidence, hence, enabling a more reliable collision risk evaluation for motion planning.
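CRM's actual filter is more elaborate than this, but the idea of carrying both an occupancy estimate and a confidence per voxel can be sketched with a Beta posterior per cell: its mean is the occupancy, and its shrinking variance plays the role of confidence. This is an illustrative analogy, not the paper's equations.

```python
import numpy as np

class BetaVoxelGrid:
    def __init__(self, shape):
        self.a = np.ones(shape)  # pseudo-counts of "occupied" evidence
        self.b = np.ones(shape)  # pseudo-counts of "free" evidence

    def update(self, idx, p_occ):
        """Fuse one measurement: p_occ in [0,1] from the sensor model."""
        self.a[idx] += p_occ
        self.b[idx] += 1.0 - p_occ

    def occupancy(self):
        return self.a / (self.a + self.b)

    def confidence(self):
        n = self.a + self.b
        var = self.a * self.b / (n ** 2 * (n + 1))  # Beta variance
        return 1.0 - var / 0.25                     # 0.25 = max Beta variance

g = BetaVoxelGrid((4, 4, 4))
for _ in range(10):
    g.update((1, 2, 3), p_occ=0.9)  # repeated "mostly occupied" readings
print(g.occupancy()[1, 2, 3], g.confidence()[1, 2, 3])
```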
79. Motion Pyramid Networks for Accurate and Efficient Cardiac Motion Estimation [PDF] 返回目录
Hanchao Yu, Xiao Chen, Humphrey Shi, Terrence Chen, Thomas S. Huang, Shanhui Sun
Abstract: Cardiac motion estimation plays a key role in MRI cardiac feature tracking and function assessment such as myocardium strain. In this paper, we propose Motion Pyramid Networks, a novel deep learning-based approach for accurate and efficient cardiac motion estimation. We predict and fuse a pyramid of motion fields from multiple scales of feature representations to generate a more refined motion field. We then use a novel cyclic teacher-student training strategy to make the inference end-to-end and further improve the tracking performance. Our teacher model provides more accurate motion estimation as supervision through progressive motion compensations. Our student model learns from the teacher model to estimate motion in a single step while maintaining accuracy. The teacher-student knowledge distillation is performed in a cyclic way for a further performance boost. Our proposed method outperforms a strong baseline model on two public available clinical datasets significantly, evaluated by a variety of metrics and the inference time. New evaluation metrics are also proposed to represent errors in a clinically meaningful manner.
80. MIMC-VINS: A Versatile and Resilient Multi-IMU Multi-Camera Visual-Inertial Navigation System [PDF] 返回目录
Kevin Eckenhoff, Patrick Geneva, Guoquan Huang
Abstract: As cameras and inertial sensors are becoming ubiquitous in mobile devices and robots, it holds great potential to design visual-inertial navigation systems (VINS) for efficient versatile 3D motion tracking which utilize any (multiple) available cameras and inertial measurement units (IMUs) and are resilient to sensor failures or measurement depletion. To this end, rather than the standard VINS paradigm using a minimal sensing suite of a single camera and IMU, in this paper we design a real-time consistent multi-IMU multi-camera (MIMC)-VINS estimator that is able to seamlessly fuse multi-modal information from an arbitrary number of uncalibrated cameras and IMUs. Within an efficient multi-state constraint Kalman filter (MSCKF) framework, the proposed MIMC-VINS algorithm optimally fuses asynchronous measurements from all sensors, while providing smooth, uninterrupted, and accurate 3D motion tracking even if some sensors fail. The key idea of the proposed MIMC-VINS is to perform high-order on-manifold state interpolation to efficiently process all available visual measurements without increasing the computational burden due to estimating additional sensors' poses at asynchronous imaging times. In order to fuse the information from multiple IMUs, we propagate a joint system consisting of all IMU states while enforcing rigid-body constraints between the IMUs during the filter update stage. Lastly, we estimate online both spatiotemporal extrinsic and visual intrinsic parameters to make our system robust to errors in prior sensor calibration. The proposed system is extensively validated in both Monte-Carlo simulations and real-world experiments.
81. Simulation of Brain Resection for Cavity Segmentation Using Self-Supervised and Semi-Supervised Learning [PDF] 返回目录
Fernando Pérez-García, Roman Rodionov, Ali Alim-Marvasti, Rachel Sparks, John S. Duncan, Sébastien Ourselin
Abstract: Resective surgery may be curative for drug-resistant focal epilepsy, but only 40% to 70% of patients achieve seizure freedom after surgery. Retrospective quantitative analysis could elucidate patterns in resected structures and patient outcomes to improve resective surgery. However, the resection cavity must first be segmented on the postoperative MR image. Convolutional neural networks (CNNs) are the state-of-the-art image segmentation technique, but require large amounts of annotated data for training. Annotation of medical images is a time-consuming process requiring highly-trained raters, and often suffering from high inter-rater variability. Self-supervised learning can be used to generate training instances from unlabeled data. We developed an algorithm to simulate resections on preoperative MR images. We curated a new dataset, EPISURG, comprising 431 postoperative and 269 preoperative MR images from 431 patients who underwent resective surgery. In addition to EPISURG, we used three public datasets comprising 1813 preoperative MR images for training. We trained a 3D CNN on artificially resected images created on the fly during training, using images from 1) EPISURG, 2) public datasets and 3) both. To evaluate trained models, we calculate Dice score (DSC) between model segmentations and 200 manual annotations performed by three human raters. The model trained on data with manual annotations obtained a median (interquartile range) DSC of 65.3 (30.6). The DSC of our best-performing model, trained with no manual annotations, is 81.7 (14.2). For comparison, inter-rater agreement between human annotators was 84.0 (9.9). We demonstrate a training method for CNNs using simulated resection cavities that can accurately segment real resection cavities, without manual annotations.
82. A lateral semicircular canal segmentation based geometric calibration for human temporal bone CT Image [PDF] 返回目录
Xiaoguang Li, Peng Fu, Hongxia Yin, ZhenChang Wang, Li Zhuo, Hui Zhang
Abstract: Computed Tomography (CT) of the temporal bone has become an important method for diagnosing ear diseases. Due to the different posture of the subject and the settings of CT scanners, the CT image of the human temporal bone should be geometrically calibrated to ensure the symmetry of the bilateral anatomical structure. Manual calibration is a time-consuming task for radiologists and an important pre-processing step for further computer-aided CT analysis. We propose an automatic calibration algorithm for temporal bone CT images. The lateral semicircular canals (LSCs) are segmented as anchors at first. Then, we define a standard 3D coordinate system. The key step is the LSC segmentation. We design a novel 3D LSC segmentation encoder-decoder network, which introduces a 3D dilated convolution and a multi-pooling scheme for feature fusion in the encoding stage. The experimental results show that our LSC segmentation network achieved a higher segmentation accuracy. Our proposed method can help to perform calibration of temporal bone CT images efficiently.
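Once bilateral landmarks such as LSC centroids are available, a rigid transform to a standard coordinate system can be recovered with the Kabsch algorithm. The sketch below uses made-up landmark coordinates and is offered as the standard closed-form step, not as the authors' pipeline.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation t minimising sum ||R @ P_i + t - Q_i||^2."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # avoid reflections
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    t = Q.mean(0) - R @ P.mean(0)
    return R, t

# Hypothetical landmark pairs: measured positions -> standard atlas positions.
measured = np.array([[10.0, 5.0, 3.0], [-9.5, 5.2, 3.1], [0.0, 12.0, 1.0]])
standard = np.array([[10.0, 5.0, 0.0], [-10.0, 5.0, 0.0], [0.0, 12.0, 0.0]])
R, t = kabsch(measured, standard)
aligned = measured @ R.T + t
print(np.abs(aligned - standard).max())  # residual alignment error
```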
83. Fabric Image Representation Encoding Networks for Large-scale 3D Medical Image Analysis [PDF] 返回目录
Siyu Liu, Wei Dai, Craig Engstrom, Jurgen Fripp, Peter B. Greer, Stuart Crozier, Jason A. Dowling, Shekhar S Chandra
Abstract: Deep neural networks are parameterised by weights that encode feature representations, whose performance is dictated through generalisation by using large-scale feature-rich datasets. The lack of large-scale labelled 3D medical imaging datasets restricts constructing such generalised networks. In this work, a novel 3D segmentation network, Fabric Image Representation Networks (FIRENet), is proposed to extract and encode generalisable feature representations from multiple medical image datasets in a large-scale manner. FIRENet learns image-specific feature representations by way of a 3D fabric network architecture that contains an exponential number of sub-architectures to handle various protocols and coverage of anatomical regions and structures. The fabric network uses Atrous Spatial Pyramid Pooling (ASPP) extended to 3D to extract local and image-level features at a fine selection of scales. The fabric is constructed with weighted edges allowing the learnt features to dynamically adapt to the training data at an architecture level. Conditional padding modules, which are integrated into the network to reinsert voxels discarded by feature pooling, allow the network to inherently process different-size images at their original resolutions. FIRENet was trained for feature learning via automated semantic segmentation of pelvic structures and obtained a state-of-the-art median DSC score of 0.867. FIRENet was also simultaneously trained on MR (Magnetic Resonance) images acquired from 3D examinations of musculoskeletal elements in the (hip, knee, shoulder) joints and a public OAI knee dataset to perform automated segmentation of bone across anatomy. Transfer learning was used to show that the features learnt through the pelvic segmentation helped achieve improved mean DSC scores of 0.962, 0.963, 0.945 and 0.986 for automated segmentation of bone across datasets.
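The 3D extension of ASPP mentioned above can be sketched directly: parallel dilated Conv3d branches plus a global-context branch, fused by a 1x1x1 convolution. Channel counts and dilation rates here are illustrative, not FIRENet's.

```python
import torch
import torch.nn as nn

class ASPP3D(nn.Module):
    def __init__(self, cin, cout, rates=(1, 2, 4)):
        super().__init__()
        # kernel 3 with padding == dilation keeps the spatial size unchanged
        self.branches = nn.ModuleList(
            nn.Conv3d(cin, cout, 3, padding=r, dilation=r) for r in rates)
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Conv3d(cin, cout, 1))
        self.fuse = nn.Conv3d(cout * (len(rates) + 1), cout, 1)

    def forward(self, x):
        outs = [b(x) for b in self.branches]
        g = self.global_branch(x)
        g = nn.functional.interpolate(g, size=x.shape[2:], mode="nearest")
        return self.fuse(torch.cat(outs + [g], dim=1))

x = torch.randn(1, 8, 16, 16, 16)
print(ASPP3D(8, 16)(x).shape)  # torch.Size([1, 16, 16, 16, 16])
```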
84. When and How Can Deep Generative Models be Inverted? [PDF] 返回目录
Aviad Aberdam, Dror Simon, Michael Elad
Abstract: Deep generative models (e.g. GANs and VAEs) have been developed quite extensively in recent years. Lately, there has been an increased interest in the inversion of such a model, i.e. given a (possibly corrupted) signal, we wish to recover the latent vector that generated it. Building upon sparse representation theory, we define conditions that are applicable to any inversion algorithm (gradient descent, deep encoder, etc.), under which such generative models are invertible with a unique solution. Importantly, the proposed analysis is applicable to any trained model, and does not depend on Gaussian i.i.d. weights. Furthermore, we introduce two layer-wise inversion pursuit algorithms for trained generative networks of arbitrary depth, and accompany these with recovery guarantees. Finally, we validate our theoretical results numerically and show that our method outperforms gradient descent when inverting such generators, both for clean and corrupted signals.
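For concreteness, the gradient-descent inversion that the paper analyses as a baseline fits in a few lines of PyTorch. This is a generic sketch (the loss and optimizer choices are assumptions), not the authors' layer-wise pursuit algorithms.

```python
import torch
import torch.nn.functional as F

def invert_generator(G, x, latent_dim, steps=500, lr=0.05):
    """Recover a latent z with G(z) ~ x by gradient descent on the latent."""
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(G(z), x)  # reconstruction error in signal space
        loss.backward()
        opt.step()
    return z.detach()
```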
85. Enhancement of a CNN-Based Denoiser Based on Spatial and Spectral Analysis [PDF] 返回目录
Rui Zhao, Kin-Man Lam, Daniel P.K. Lun
Abstract: Convolutional neural network (CNN)-based image denoising methods have been widely studied recently because of their high-speed processing capability and good visual quality. However, most existing CNN-based denoisers learn the image prior from the spatial domain and suffer from the problem of spatially variant noise, which limits their performance in real-world image denoising tasks. In this paper, we propose a discrete wavelet denoising CNN (WDnCNN), which restores images corrupted by various types of noise with a single model. Since most of the content or energy of natural images resides in the low-frequency spectrum, their transformed coefficients in the frequency domain are highly imbalanced. To address this issue, we present a band normalization module (BNM) to normalize the coefficients from different parts of the frequency spectrum. Moreover, we employ a band discriminative training (BDT) criterion to enhance the model regression. We evaluate the proposed WDnCNN and compare it with other state-of-the-art denoisers. Experimental results show that WDnCNN achieves promising performance in both synthetic and real noise reduction, making it a potential solution to many practical image denoising applications.
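A minimal sketch of the band-normalization idea using PyWavelets: each subband of a 2D discrete wavelet transform is rescaled so that low- and high-frequency coefficients have comparable magnitude. Scaling by the per-band standard deviation is an illustrative choice, not necessarily the paper's BNM.

```python
import numpy as np
import pywt  # PyWavelets

def band_normalize(img, wavelet='haar'):
    """Normalize each DWT subband to unit scale; return scales to invert."""
    cA, (cH, cV, cD) = pywt.dwt2(img, wavelet)
    bands, scales = [], []
    for band in (cA, cH, cV, cD):
        s = band.std() + 1e-8
        bands.append(band / s)
        scales.append(s)
    return bands, scales

img = np.random.rand(64, 64)
bands, scales = band_normalize(img)
print([round(float(b.std()), 3) for b in bands])  # each close to 1.0
```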
86. Laplacian Regularized Few-Shot Learning [PDF] 返回目录
Imtiaz Masud Ziko, Jose Dolz, Eric Granger, Ismail Ben Ayed
Abstract: We propose a transductive Laplacian-regularized inference for few-shot tasks. Given any feature embedding learned from the base classes, we minimize a quadratic binary-assignment function containing two terms: (1) a unary term assigning query samples to the nearest class prototype, and (2) a pairwise Laplacian term encouraging nearby query samples to have consistent label assignments. Our transductive inference does not re-train the base model, and can be viewed as a graph clustering of the query set, subject to supervision constraints from the support set. We derive a computationally efficient bound optimizer of a relaxation of our function, which computes independent (parallel) updates for each query sample, while guaranteeing convergence. Following a simple cross-entropy training on the base classes, and without complex meta-learning strategies, we conducted comprehensive experiments over five few-shot learning benchmarks. Our LaplacianShot consistently outperforms state-of-the-art methods by significant margins across different models, settings, and data sets. Furthermore, our transductive inference is very fast, with computational times that are close to inductive inference, and can be used for large-scale few-shot tasks.
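The flavour of the transductive inference can be conveyed with a toy NumPy version: independent (parallel) soft-assignment updates mixing a unary prototype-distance term with a pairwise term that pulls graph neighbours toward the same label. The exact bound optimizer and affinity construction in the paper differ; the update rule below is a simplified assumption.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def transductive_assign(query, prototypes, W, lam=1.0, iters=20):
    """query: (n, f) features; prototypes: (k, f); W: (n, n) affinities."""
    d = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # unary costs
    y = softmax(-d)                       # initial assignments from distances
    for _ in range(iters):
        y = softmax(-d + lam * (W @ y))   # neighbours vote via the affinity W
    return y.argmax(axis=1)
```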
87. Chroma Intra Prediction with attention-based CNN architectures [PDF] 返回目录
Marc Górriz, Saverio Blasi, Alan F. Smeaton, Noel E. O'Connor, Marta Mrak
Abstract: Neural networks can be used in video coding to improve chroma intra-prediction. In particular, usage of fully-connected networks has enabled better cross-component prediction with respect to traditional linear models. Nonetheless, state-of-the-art architectures tend to disregard the location of individual reference samples in the prediction process. This paper proposes a new neural network architecture for cross-component intra-prediction. The network uses a novel attention module to model spatial relations between reference and predicted samples. The proposed approach is integrated into the Versatile Video Coding (VVC) prediction pipeline. Experimental results demonstrate compression gains over the latest VVC anchor compared with state-of-the-art chroma intra-prediction methods based on neural networks.
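A hedged sketch of the core mechanism: each position of the predicted chroma block attends over the reconstructed reference samples, so reference locations are weighted rather than ignored. The feature dimension and projection layout are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class RefSampleAttention(nn.Module):
    """Cross-attention from predicted-block positions to reference samples."""
    def __init__(self, dim=32):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries: predicted-position features
        self.k = nn.Linear(dim, dim)   # keys: reference-sample features
        self.v = nn.Linear(dim, dim)   # values: reference-sample features
        self.scale = dim ** -0.5

    def forward(self, pred_feats, ref_feats):
        # pred_feats: (B, N, dim) block positions; ref_feats: (B, M, dim) references
        attn = torch.softmax(
            self.q(pred_feats) @ self.k(ref_feats).transpose(1, 2) * self.scale,
            dim=-1)
        return attn @ self.v(ref_feats)  # (B, N, dim) attended features
```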
88. A Tool for Automatic Estimation of Patient Position in Spinal CT Data [PDF] 返回目录
Roman Jakubicek, Tomas Vicar, Jiri Chmelik
Abstract: Much of the recently available research and challenge data lack meta-data containing any information about the patient position. This paper presents a tool for automatically rotating CT data into the standardized (HFS) patient position. The proposed method is based on predicting the rotation angle with a CNN, and it achieved nearly perfect results with an accuracy of 99.55%. We provide easy-to-use implementations with examples for both Matlab and Python (PyTorch), which can be used, for example, for automatic rotation correction of the VerSe2020 challenge data.
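A minimal sketch of how a predicted angle might be applied, assuming an in-plane rotation about the axial axis (the axes and sign convention are guesses, and the angle-predicting CNN itself is omitted):

```python
import numpy as np
from scipy.ndimage import rotate

def correct_orientation(volume, predicted_angle_deg):
    """Undo the predicted in-plane rotation to reach the standard HFS pose."""
    return rotate(volume, angle=-predicted_angle_deg, axes=(1, 2),
                  reshape=False, order=1)

vol = np.random.rand(64, 128, 128)      # stand-in for a CT volume
fixed = correct_orientation(vol, 12.5)  # 12.5 deg would come from the CNN
print(fixed.shape)                      # (64, 128, 128)
```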
89. Video-Grounded Dialogues with Pretrained Generation Language Models [PDF] 返回目录
Hung Le, Steven C.H. Hoi
Abstract: Pre-trained language models have shown remarkable success in improving various downstream NLP tasks due to their ability to capture dependencies in textual data and generate natural responses. In this paper, we leverage the power of pre-trained language models for improving video-grounded dialogue, which is very challenging and involves complex features of different dynamics: (1) Video features which can extend across both spatial and temporal dimensions; and (2) Dialogue features which involve semantic dependencies over multiple dialogue turns. We propose a framework by extending GPT-2 models to tackle these challenges by formulating video-grounded dialogue tasks as a sequence-to-sequence task, combining both visual and textual representation into a structured sequence, and fine-tuning a large pre-trained GPT-2 network. Our framework allows fine-tuning language models to capture dependencies across multiple modalities over different levels of information: spatio-temporal level in video and token-sentence level in dialogue context. We achieve promising improvement on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark from DSTC7, which supports a potential direction in this line of research.
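One way to realise the structured-sequence formulation is to project video features into GPT-2's embedding space and prepend them to the token embeddings, as sketched below with HuggingFace transformers. The linear projection and token layout are assumptions for illustration, not the authors' exact scheme.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

video_feats = torch.randn(1, 8, 2048)        # 8 stand-in video segment features
proj = nn.Linear(2048, model.config.n_embd)  # would be learned during fine-tuning

ids = tok("Q: what happens next? A:", return_tensors="pt").input_ids
text_emb = model.transformer.wte(ids)        # token embeddings
fused = torch.cat([proj(video_feats), text_emb], dim=1)  # one multimodal sequence
out = model(inputs_embeds=fused)
print(out.logits.shape)  # (1, 8 + num_tokens, vocab_size)
```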
90. A Retinex based GAN Pipeline to Utilize Paired and Unpaired Datasets for Enhancing Low Light Images [PDF] 返回目录
Harshana Weligampola, Gihan Jayatilaka, Suren Sritharan, Roshan Godaliyadda, Parakrama Ekanayaka, Roshan Ragel, Vijitha Herath
Abstract: Low light image enhancement is an important challenge for the development of robust computer vision algorithms. Machine learning approaches to this problem have been either unsupervised, supervised on paired datasets, or supervised on unpaired datasets. This paper presents a novel deep learning pipeline that can learn from both paired and unpaired datasets. Convolutional Neural Networks (CNNs) optimized to minimize a standard loss, and Generative Adversarial Networks (GANs) optimized to minimize the adversarial loss, are used to achieve different steps of the low light image enhancement process. A cycle consistency loss and a patched discriminator are utilized to further improve the performance. The paper also analyses the functionality and performance of different components, hidden layers, and the entire pipeline.
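The cycle-consistency term used for the unpaired branch is standard and compact; a minimal PyTorch sketch follows (the two generator interfaces, `enhance` and `degrade`, are hypothetical names):

```python
import torch.nn.functional as F

def cycle_consistency_loss(x, enhance, degrade):
    """Translating to the other domain and back should reproduce the input."""
    return F.l1_loss(degrade(enhance(x)), x)

# Usage with hypothetical networks:
# loss = cycle_consistency_loss(low_light_batch, enhance_net, degrade_net)
```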
91. Compressive MR Fingerprinting reconstruction with Neural Proximal Gradient iterations [PDF] 返回目录
Dongdong Chen, Mike E. Davies, Mohammad Golbabaee
Abstract: Consistency of the predictions with respect to the physical forward model is pivotal for reliably solving inverse problems. This consistency is mostly un-controlled in the current end-to-end deep learning methodologies proposed for the Magnetic Resonance Fingerprinting (MRF) problem. To address this, we propose ProxNet, a learned proximal gradient descent framework that directly incorporates the forward acquisition and Bloch dynamic models within a recurrent learning mechanism. The ProxNet adopts a compact neural proximal model for de-aliasing and quantitative inference, that can be flexibly trained on scarce MRF training datasets. Our numerical experiments show that the ProxNet can achieve a superior quantitative inference accuracy, much smaller storage requirement, and a comparable runtime to the recent deep learning MRF baselines, while being much faster than the dictionary matching schemes. Code has been released at this https URL.
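The learned proximal-gradient recursion can be stated generically: alternate a physics-based gradient step on the data term with a small network acting as the proximal operator. The forward operator `A`, its adjoint `At`, and the step size below are placeholders, and the sketch omits the Bloch dynamic models the real system folds in.

```python
import torch.nn as nn

class NeuralProximalGradient(nn.Module):
    """x_{k+1} = prox_net(x_k - step * At(A(x_k) - y)), for a fixed number of iterations."""
    def __init__(self, prox_net, step=0.5, iters=10):
        super().__init__()
        self.prox = prox_net  # learned proximal / de-aliasing network
        self.step, self.iters = step, iters

    def forward(self, y, A, At):
        x = At(y)                              # adjoint (zero-filled) initialization
        for _ in range(self.iters):
            x = x - self.step * At(A(x) - y)   # gradient step on ||A x - y||^2
            x = self.prox(x)                   # learned proximal step
        return x
```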
92. Attention-Guided Generative Adversarial Network to Address Atypical Anatomy in Modality Transfer [PDF] 返回目录
Hajar Emami, Ming Dong, Carri K. Glide-Hurst
Abstract: Recently, interest in MR-only treatment planning using synthetic CTs (synCTs) has grown rapidly in radiation therapy. However, developing class solutions for medical images that contain atypical anatomy remains a major limitation. In this paper, we propose a novel spatial attention-guided generative adversarial network (attention-GAN) model to generate accurate synCTs using T1-weighted MRI images as the input to address atypical anatomy. Experimental results on fifteen brain cancer patients show that attention-GAN outperformed existing synCT models and achieved an average MAE of 85.22$\pm$12.08, 232.41$\pm$60.86, 246.38$\pm$42.67 Hounsfield units between synCT and CT-SIM across the entire head, bone and air regions, respectively. Qualitative analysis shows that attention-GAN has the ability to use spatially focused areas to better handle outliers, areas with complex anatomy or post-surgical regions, and thus offer strong potential for supporting near real-time MR-only treatment planning.
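A hedged sketch of a spatial attention gate of the kind the abstract suggests: a single-channel sigmoid map reweights feature locations so the generator can concentrate on atypical regions. The kernel size and placement are assumptions; the authors' exact attention design may differ.

```python
import torch.nn as nn

class SpatialAttentionGate(nn.Module):
    """Reweight (B, C, H, W) features with a learned spatial sigmoid mask."""
    def __init__(self, ch):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(ch, 1, kernel_size=7, padding=3),
                                  nn.Sigmoid())

    def forward(self, feats):
        return feats * self.mask(feats)  # mask broadcasts over channels
```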