Abstracts

1. GAN Compression: Efficient Architectures for Interactive Conditional GANs
Muyang Li, Ji Lin, Yaoyao Ding, Zhijian Liu, Jun-Yan Zhu, Song Han
Abstract: Conditional Generative Adversarial Networks (cGANs) have enabled controllable image synthesis for many computer vision and graphics applications. However, recent cGANs are 1-2 orders of magnitude more computationally intensive than modern recognition CNNs. For example, GauGAN consumes 281G MACs per image, compared to 0.44G MACs for MobileNet-v3, making interactive deployment difficult. In this work, we propose a general-purpose compression framework for reducing the inference time and model size of the generator in cGANs. Directly applying existing CNN compression methods yields poor performance due to the difficulty of GAN training and the differences in generator architectures. We address these challenges in two ways. First, to stabilize the GAN training, we transfer knowledge of multiple intermediate representations of the original model to its compressed model, and unify unpaired and paired learning. Second, instead of reusing existing CNN designs, our method automatically finds efficient architectures via neural architecture search (NAS). To accelerate the search process, we decouple the model training and architecture search via weight sharing. Experiments demonstrate the effectiveness of our method across different supervision settings (paired and unpaired), model architectures, and learning methods (e.g., pix2pix, GauGAN, CycleGAN). Without losing image quality, we reduce the computation of CycleGAN by more than 20X and GauGAN by 9X, paving the way for interactive image synthesis. The code and demo are publicly available.
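To put the MAC figures in the abstract in context, here is a minimal sketch of how multiply-accumulate counts are usually tallied for a dense 2D convolution. The layer shapes are hypothetical and are not taken from GauGAN or from the compressed models; only the 281G and 0.44G figures come from the abstract.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel):
    """Multiply-accumulate operations of one dense 2D convolution (bias ignored)."""
    return h_out * w_out * c_in * c_out * kernel * kernel


# Hypothetical layer: 256x256 output map, 3x3 kernel.
wide = conv2d_macs(256, 256, 64, 64, 3)    # 64 -> 64 channels
narrow = conv2d_macs(256, 256, 32, 32, 3)  # 32 -> 32 channels

# Ratio of the two per-image costs quoted in the abstract.
gaugan_vs_mobilenet = 281e9 / 0.44e9

print(wide, narrow, wide / narrow)
print(round(gaugan_vs_mobilenet))
```

Because the cost is proportional to `c_in * c_out`, halving both channel widths cuts the MACs of this layer by a factor of four, which is why channel counts are a natural target when searching for cheaper generators.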

2. Group Sparsity: The Hinge Between Filter Pruning and Decomposition for Network Compression
Yawei Li, Shuhang Gu, Christoph Mayer, Luc Van Gool, Radu Timofte
Abstract: In this paper, we analyze two popular network compression techniques, i.e. filter pruning and low-rank decomposition, in a unified sense. By simply changing the way the sparsity regularization is enforced, filter pruning and low-rank decomposition can be derived accordingly. This provides another flexible choice for network compression because the techniques complement each other. For example, in popular network architectures with shortcut connections (e.g. ResNet), filter pruning cannot deal with the last convolutional layer in a ResBlock while the low-rank decomposition methods can. In addition, we propose to compress the whole network jointly instead of in a layer-wise manner. Our approach proves its potential as it compares favorably to the state-of-the-art on several benchmarks.
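The abstract does not spell out the regularizer, so the following is only a sketch of the generic group-sparsity (L2,1) penalty, assuming a 2D weight matrix whose rows stand for output filters. It shows how the same penalty can be grouped along either axis, and how rows driven to zero can be dropped; the matrix `W` and the helper names are illustrative, not the authors' code.

```python
import math


def group_l21_norm(matrix, axis):
    """Sum of the L2 norms of each group; axis=0 groups by row, axis=1 by column."""
    groups = matrix if axis == 0 else list(zip(*matrix))
    return sum(math.sqrt(sum(v * v for v in group)) for group in groups)


def drop_zero_rows(matrix, tol=1e-8):
    """Remove rows whose L2 norm is (numerically) zero, i.e. prunable filters."""
    return [row for row in matrix if math.sqrt(sum(v * v for v in row)) > tol]


# Hypothetical 3x2 weight matrix; the middle row has been driven to zero.
W = [[3.0, 4.0],
     [0.0, 0.0],
     [0.0, 5.0]]

row_penalty = group_l21_norm(W, axis=0)
col_penalty = group_l21_norm(W, axis=1)
pruned = drop_zero_rows(W)
print(row_penalty, col_penalty, pruned)
```

Changing only the `axis` argument changes which structures are pushed to zero, which is the sense in which the way the sparsity regularization is enforced selects the compression technique.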

3. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng
Abstract: We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location $(x,y,z)$ and viewing direction $(\theta, \phi)$) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
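The "classic volume rendering" step mentioned in the abstract can be sketched as front-to-back alpha compositing of samples along one camera ray. This is a minimal illustration assuming the standard quadrature (opacity `1 - exp(-sigma * delta)` per sample); in the paper the densities and colors come from the network, whereas here they are hand-written placeholder values.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, front to back.

    densities: volume density at each sample
    colors:    (r, g, b) emitted radiance at each sample
    deltas:    distance between adjacent samples
    Returns the pixel color and the transmittance left after the last sample.
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# Placeholder ray: empty space, then an opaque red surface, then blue behind it.
pixel, remaining = render_ray(
    densities=[0.0, 1e9, 5.0],
    colors=[(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[0.1, 0.1, 0.1],
)
print(pixel, remaining)
```

Every operation here is differentiable in the densities and colors, which is what lets images with known camera poses supervise the representation.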

4. Depth Estimation by Learning Triangulation and Densification of Sparse Points for Multi-view Stereo
Ayan Sinha, Zak Murez, James Bartolozzi, Vijay Badrinarayanan, Andrew Rabinovich
Abstract: Multi-view stereo (MVS) is the golden mean between the accuracy of active depth sensing and the practicality of monocular depth estimation. Cost volume based approaches employing 3D convolutional neural networks (CNNs) have considerably improved the accuracy of MVS systems. However, this accuracy comes at a high computational cost which impedes practical adoption. Distinct from cost volume approaches, we propose an efficient depth estimation approach by first (a) detecting and evaluating descriptors for interest points, then (b) learning to match and triangulate a small set of interest points, and finally (c) densifying this sparse set of 3D points using CNNs. An end-to-end network efficiently performs all three steps within a deep learning framework and is trained with intermediate 2D image and 3D geometric supervision, along with depth supervision. Crucially, our first step complements pose estimation using interest point detection and descriptor learning. We demonstrate state-of-the-art results on depth estimation with lower compute for different scene lengths. Furthermore, our method generalizes to newer environments and the descriptors output by our network compare favorably to strong baselines.

5. GIQA: Generated Image Quality Assessment
Shuyang Gu, Jianmin Bao, Dong Chen, Fang Wen
Abstract: Generative adversarial networks (GANs) have achieved impressive results today, but not all generated images are perfect. A number of quantitative criteria have recently emerged for generative model, but none of them are designed for a single generated image. In this paper, we propose a new research topic, Generated Image Quality Assessment (GIQA), which quantitatively evaluates the quality of each generated image. We introduce three GIQA algorithms from two perspectives: learning-based and data-based. We evaluate a number of images generated by various recent GAN models on different datasets and demonstrate that they are consistent with human assessments. Furthermore, GIQA is available to many applications, like separately evaluating the realism and diversity of generative models, and enabling online hard negative mining (OHEM) in the training of GANs to improve the results.

6. Normalized and Geometry-Aware Self-Attention Network for Image Captioning
Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, Hanqing Lu
Abstract: Self-attention (SA) networks have shown profound value in image captioning. In this paper, we improve SA from two aspects to promote the performance of image captioning. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization was previously only applied outside SA, we introduce a novel normalization method and demonstrate that it is both possible and beneficial to perform it on the hidden activations inside SA. Second, to compensate for the major limitation of the Transformer, namely that it fails to model the geometric structure of the input objects, we propose a class of Geometry-aware Self-Attention (GSA) that extends SA to explicitly and efficiently consider the relative geometric relations between the objects in the image. To construct our image captioning model, we combine the two modules and apply them to the vanilla self-attention network. We extensively evaluate our proposals on the MS-COCO image captioning dataset, and superior results are achieved compared to state-of-the-art approaches. Further experiments on three challenging tasks, i.e. video captioning, machine translation, and visual question answering, show the generality of our methods.

7. Local Rotation Invariance in 3D CNNs
Vincent Andrearczyk, Julien Fageot, Valentin Oreiller, Xavier Montet, Adrien Depeursinge
Abstract: Locally Rotation Invariant (LRI) image analysis was shown to be fundamental in many applications and in particular in medical imaging where local structures of tissues occur at arbitrary rotations. LRI constituted the cornerstone of several breakthroughs in texture analysis, including Local Binary Patterns (LBP), Maximum Response 8 (MR8) and steerable filterbanks. Whereas globally rotation invariant Convolutional Neural Networks (CNN) were recently proposed, LRI has been little investigated in the context of deep learning. LRI designs allow learning filters accounting for all orientations, which enables a drastic reduction of trainable parameters and training data when compared to standard 3D CNNs. In this paper, we propose and compare several methods to obtain LRI CNNs with directional sensitivity. Two methods use orientation channels (responses to rotated kernels), either by explicitly rotating the kernels or using steerable filters. These orientation channels constitute a locally rotation equivariant representation of the data. Local pooling across orientations yields LRI image analysis. Steerable filters are used to achieve a fine and efficient sampling of 3D rotations as well as a reduction of trainable parameters and operations, thanks to a parametric representation involving solid Spherical Harmonics (SH), which are products of SH with associated learned radial profiles. Finally, we investigate a third strategy to obtain LRI based on rotational invariants calculated from responses to a learned set of solid SHs. The proposed methods are evaluated and compared to standard CNNs on 3D datasets including synthetic textured volumes composed of rotated patterns, and pulmonary nodule classification in CT. The results show the importance of LRI image analysis while resulting in a drastic reduction of trainable parameters, outperforming standard 3D CNNs trained with data augmentation.
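The idea of orientation channels followed by pooling can be illustrated in a much simpler setting than the paper's. The sketch below is a 2D toy with only four 90-degree rotations, not the 3D steerable-filter construction of the abstract: the responses to the rotated kernels are permuted when the input patch is rotated (equivariance), so taking their maximum gives a rotation-invariant value. The kernel and patch values are arbitrary.

```python
def rotate90(grid):
    """Rotate a square grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def response(patch, kernel):
    """Inner product between a patch and a kernel of the same size."""
    return sum(p * k for prow, krow in zip(patch, kernel) for p, k in zip(prow, krow))


def orientation_channels(patch, kernel):
    """Responses of the patch to the kernel at 0, 90, 180 and 270 degrees."""
    channels = []
    rotated = kernel
    for _ in range(4):
        channels.append(response(patch, rotated))
        rotated = rotate90(rotated)
    return channels


def lri_response(patch, kernel):
    """Pool across orientations to obtain a rotation-invariant response."""
    return max(orientation_channels(patch, kernel))


# Arbitrary example values.
kernel = [[1, 2, 0], [0, 0, 0], [0, -1, 0]]
patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print(orientation_channels(patch, kernel))
print(orientation_channels(rotate90(patch), kernel))
print(lri_response(patch, kernel), lri_response(rotate90(patch), kernel))
```

Only one kernel is stored, yet it responds to a structure at every sampled orientation, which hints at why such designs can need fewer trainable parameters than a network that must learn each orientation separately.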

8. Unique Geometry and Texture from Corresponding Image Patches
Dor Verbin, Steven J. Gortler, Todd Zickler
Abstract: We present a sufficient condition for the recovery of a unique texture process and a unique set of viewpoints from a set of image patches that are generated by observing a flat texture process from unknown directions and orientations. We show that four image patches are sufficient in general, and we characterize the ambiguities that arise when this condition is not satisfied. The results are applicable to the perception of shape from texture and to texture-based structure from motion.

9. Across Scales & Across Dimensions: Temporal Super-Resolution using Deep Internal Learning
Liad Pollak Zuckerman, Shai Bagon, Eyal Naor, George Pisha, Michal Irani
Abstract: When a very fast dynamic event is recorded with a low-framerate camera, the resulting video suffers from severe motion blur (due to exposure time) and motion aliasing (due to low sampling rate in time). True Temporal Super-Resolution (TSR) is more than just Temporal-Interpolation (increasing framerate). It can also recover new high temporal frequencies beyond the temporal Nyquist limit of the input video, thus resolving both motion-blur and motion-aliasing effects that temporal frame interpolation (as sophisticated as it may be) cannot undo. In this paper we propose a "Deep Internal Learning" approach for true TSR. We train a video-specific CNN on examples extracted directly from the low-framerate input video. Our method exploits the strong recurrence of small space-time patches inside a single video sequence, both within and across different spatio-temporal scales of the video. We further observe (for the first time) that small space-time patches also recur across dimensions of the video sequence, i.e., by swapping the spatial and temporal dimensions. In particular, the higher spatial resolution of video frames provides strong examples as to how to increase the temporal resolution of that video. Such internal video-specific examples give rise to strong self-supervision, requiring no data but the input video itself. This results in Zero-Shot Temporal-SR of complex videos, which removes both motion blur and motion aliasing, outperforming previous supervised methods trained on external video datasets.
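The motion aliasing described in the abstract follows from the sampling theorem: a periodic motion faster than half the frame rate (the temporal Nyquist limit) is recorded as a slower one. A minimal sketch of that folding, with hypothetical frequencies chosen only for illustration:

```python
def apparent_frequency(true_hz, frame_rate):
    """Frequency observed after sampling at frame_rate, folded into [0, frame_rate / 2]."""
    folded = true_hz % frame_rate
    return min(folded, frame_rate - folded)


frame_rate = 30            # frames per second (hypothetical camera)
nyquist = frame_rate / 2   # highest frequency the video can represent

for true_hz in (10, 25, 30):
    print(true_hz, "Hz is seen as", apparent_frequency(true_hz, frame_rate), "Hz")
```

A 25 Hz motion filmed at 30 frames per second is indistinguishable from a 5 Hz motion in the recorded frames, so interpolating between those frames cannot recover it; that is the gap that true temporal super-resolution aims to close.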

10. Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation
Zhenda Xie, Zheng Zhang, Xizhou Zhu, Gao Huang, Stephen Lin
Abstract: In the feature maps of CNNs, there commonly exists considerable spatial redundancy that leads to much repetitive processing. Towards reducing this superfluous computation, we propose to compute features only at sparsely sampled locations, which are probabilistically chosen according to activation responses, and then densely reconstruct the feature map with an efficient interpolation procedure. With this sampling-interpolation scheme, our network avoids expending computation on spatial locations that can be effectively interpolated, while being robust to activation prediction errors through broadly distributed sampling. A technical challenge of this sampling-based approach is that the binary decision variables for representing discrete sampling locations are non-differentiable, making them incompatible with backpropagation. To circumvent this issue, we make use of a reparameterization trick based on the Gumbel-Softmax distribution, with which backpropagation can iterate these variables towards binary values. The presented network is experimentally shown to save substantial computation while maintaining accuracy over a variety of computer vision tasks.
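The Gumbel-Softmax reparameterization mentioned in the abstract can be sketched for a single spatial location with two options, "compute" and "skip". This is a generic illustration of the trick, assuming hypothetical logits and temperature; how the paper predicts the logits or schedules the temperature is not stated in the abstract.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed one-hot sample; lower temperatures push it towards binary values."""
    # Gumbel(0, 1) noise via inverse transform; clamp u away from 0 to keep log finite.
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scores = [(logit + g) / temperature for logit, g in zip(logits, gumbels)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical logits for one location: [compute, skip].
logits = [math.log(0.7), math.log(0.3)]
soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax_sample(logits, temperature=0.1, rng=rng)
print(soft, sharp)
```

The sample is a smooth function of the logits for a fixed noise draw, so gradients can flow through it, while the position of its largest entry is distributed like a draw from the softmax of the logits.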

11. JAA-Net: Joint Facial Action Unit Detection and Face Alignment via Adaptive Attention
Zhiwen Shao, Zhilei Liu, Jianfei Cai, Lizhuang Ma
Abstract: Facial action unit (AU) detection and face alignment are two highly correlated tasks, since facial landmarks can provide precise AU locations to facilitate the extraction of meaningful local features for AU detection. However, most existing AU detection works handle the two tasks independently by treating face alignment as a preprocessing, and often use landmarks to predefine a fixed region or attention for each AU. In this paper, we propose a novel end-to-end deep learning framework for joint AU detection and face alignment, which has not been explored before. In particular, multi-scale shared feature is learned firstly, and high-level feature of face alignment is fed into AU detection. Moreover, to extract precise local features, we propose an adaptive attention learning module to refine the attention map of each AU adaptively. Finally, the assembled local features are integrated with face alignment feature and global feature for AU detection. Extensive experiments demonstrate that our framework (i) significantly outperforms the state-of-the-art AU detection methods on the challenging BP4D, DISFA, GFT and BP4D+ benchmarks, (ii) can adaptively capture the irregular region of each AU, (iii) achieves competitive performance for face alignment, and (iv) also works well under partial occlusions and non-frontal poses. The code for our method is available at this https URL.

12. DHOG: Deep Hierarchical Object Grouping
Luke Nicholas Darlow, Amos Storkey
Abstract: Recently, a number of competitive methods have tackled unsupervised representation learning by maximising the mutual information between the representations produced from augmentations. The resulting representations are then invariant to stochastic augmentation strategies, and can be used for downstream tasks such as clustering or classification. Yet data augmentations preserve many properties of an image and so there is potential for a suboptimal choice of representation that relies on matching easy-to-find features in the data. We demonstrate that greedy or local methods of maximising mutual information (such as stochastic gradient optimisation) discover local optima of the mutual information criterion; the resulting representations are also less-ideally suited to complex downstream tasks. Earlier work has not specifically identified or addressed this issue. We introduce deep hierarchical object grouping (DHOG) that computes a number of distinct discrete representations of images in a hierarchical order, eventually generating representations that better optimise the mutual information objective. We also find that these representations align better with the downstream task of grouping into underlying object classes. We tested DHOG on unsupervised clustering, which is a natural downstream test as the target representation is a discrete labelling of the data. We achieved new state-of-the-art results on the three main benchmarks without any prefiltering or Sobel-edge detection that proved necessary for many previous methods to work. We obtain accuracy improvements of: 4.3% on CIFAR-10, 1.5% on CIFAR-100-20, and 7.2% on SVHN.
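The objective discussed in the abstract is the mutual information between discrete representations of two augmentations of the same image. As a minimal sketch, assuming the joint distribution over the two cluster assignments has already been estimated as a table (the tables below are made up), the quantity itself can be computed as follows:

```python
import math


def mutual_information(joint):
    """I(A; B) in nats from a joint probability table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:  # terms with p = 0 contribute nothing
                mi += p * math.log(p / (p_a[a] * p_b[b]))
    return mi


# Made-up joints over two clusters.
agreeing = [[0.5, 0.0], [0.0, 0.5]]          # both views always get the same cluster
independent = [[0.25, 0.25], [0.25, 0.25]]   # assignments carry no shared information

print(mutual_information(agreeing), mutual_information(independent))
```

With two equally likely clusters the maximum is `log 2`, reached when both views always agree; the abstract's concern is that gradient-based optimisation of this criterion can settle on local optima that rely on easy-to-match features.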

13. Brain MRI-based 3D Convolutional Neural Networks for Classification of Schizophrenia and Controls
Mengjiao Hu, Kang Sim, Juan Helen Zhou, Xudong Jiang, Cuntai Guan
Abstract: Convolutional Neural Network (CNN) has been successfully applied to the classification of both natural images and medical images but has not yet been applied to differentiating patients with schizophrenia from healthy controls. Given the subtle, mixed, and sparsely distributed brain atrophy patterns of schizophrenia, the capability of automatic feature learning makes CNN a powerful tool for classifying schizophrenia from controls as it removes the subjectivity in selecting relevant spatial features. To examine the feasibility of applying CNN to classification of schizophrenia and controls based on structural Magnetic Resonance Imaging (MRI), we built 3D CNN models with different architectures and compared their performance with a handcrafted feature-based machine learning approach. Support vector machine (SVM) was used as classifier and Voxel-based Morphometry (VBM) was used as feature for handcrafted feature-based machine learning. 3D CNN models with sequential architecture, inception module and residual module were trained from scratch. CNN models achieved higher cross-validation accuracy than handcrafted feature-based machine learning. Moreover, testing on an independent dataset, 3D CNN models greatly outperformed handcrafted feature-based machine learning. This study underscored the potential of CNN for identifying patients with schizophrenia using 3D brain MR images and paved the way for imaging-based individual-level diagnosis and prognosis in psychiatric disorders.

14. Functional Data Analysis and Visualisation of Three-dimensional Surface Shape
Stanislav Katina, Liberty Vittert, Adrian W. Bowman
Abstract: The advent of high resolution imaging has made data on surface shape widespread. Methods for the analysis of shape based on landmarks are well established but high resolution data require a functional approach. The starting point is a systematic and consistent description of each surface shape. Three innovative forms of analysis are then introduced. The first uses surface integration to address issues of registration, principal component analysis and the measurement of asymmetry, all in functional form. Computational issues are handled through discrete approximations to integrals, based in this case on appropriate surface area weighted sums. The second innovation is to focus on sub-spaces where interesting behaviour such as group differences are exhibited, rather than on individual principal components. The third innovation concerns the comparison of individual shapes with a relevant control set, where the concept of a normal range is extended to the highly multivariate setting of surface shape. This has particularly strong applications to medical contexts where the assessment of individual patients is very important. All of these ideas are developed and illustrated in the important context of human facial shape, with a strong emphasis on the effective visual communication of effects of interest.
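The "surface area weighted sums" used as discrete approximations to integrals can be sketched on a triangle mesh. This is an assumed, generic scheme (mean of the vertex values on each triangle, weighted by the triangle's area); the abstract does not specify the authors' exact weighting, and the flat square mesh below is a made-up test surface.

```python
import math


def triangle_area(a, b, c):
    """Area of a 3D triangle from the cross product of two edge vectors."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    cross = (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def surface_integral(vertices, triangles, values):
    """Approximate the integral of f over the surface from values of f at the vertices."""
    total = 0.0
    for i, j, k in triangles:
        area = triangle_area(vertices[i], vertices[j], vertices[k])
        total += area * (values[i] + values[j] + values[k]) / 3.0
    return total


# Made-up mesh: the unit square in the plane z = 0, split into two triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
triangles = [(0, 1, 2), (0, 2, 3)]

area = surface_integral(vertices, triangles, [1.0] * 4)                   # f = 1
x_integral = surface_integral(vertices, triangles, [v[0] for v in vertices])  # f = x
print(area, x_integral)
```

Weighting by area keeps the result independent of how densely a region happens to be sampled, which matters when the same computation is used for registration, principal components, or asymmetry measures across differently meshed surfaces.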

15. Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, Rongrong Ji
Abstract: Referring expression comprehension (REC) and segmentation (RES) are two highly-related tasks, which both aim at identifying the referent according to a natural language expression. In this paper, we propose a novel Multi-task Collaborative Network (MCN) to achieve a joint learning of REC and RES for the first time. In MCN, RES can help REC to achieve better language-vision alignment, while REC can help RES to better locate the referent. In addition, we address a key challenge in this multi-task setup, i.e., the prediction conflict, with two innovative designs, namely Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS). Specifically, CEM enables REC and RES to focus on similar visual regions by maximizing the consistency energy between two tasks. ASNLS suppresses the response of unrelated regions in RES based on the prediction of REC. To validate our model, we conduct extensive experiments on three benchmark datasets of REC and RES, i.e., RefCOCO, RefCOCO+ and RefCOCOg. The experimental results report the significant performance gains of MCN over all existing methods, i.e., up to +7.13% for REC and +11.50% for RES over SOTA, which well confirm the validity of our model for joint REC and RES learning.

16. Deep Learning for Automatic Tracking of Tongue Surface in Real-time Ultrasound Videos, Landmarks instead of Contours
M. Hamed Mozaffari, Won-Sook Lee
Abstract: One usage of medical ultrasound imaging is to visualize and characterize human tongue shape and motion during real-time speech to study healthy or impaired speech production. Due to the low-contrast characteristic and noisy nature of ultrasound images, it might require expertise for non-expert users to recognize tongue gestures in applications such as visual training of a second language. Moreover, quantitative analysis of tongue motion needs the tongue dorsum contour to be extracted, tracked, and visualized. Manual tongue contour extraction is a cumbersome, subjective, and error-prone task. Furthermore, it is not a feasible solution for real-time applications. The growth of deep learning has been vigorously exploited in various computer vision tasks, including ultrasound tongue contour tracking. In the current methods, the process of tongue contour extraction comprises two steps of image segmentation and post-processing. This paper presents a novel approach to automatic and real-time tongue contour tracking using deep neural networks. In the proposed method, instead of the two-step procedure, landmarks of the tongue surface are tracked. This novel idea enables researchers in this field to benefit from previously annotated databases to achieve high accuracy results. Our experiments showed the outstanding performance of the proposed technique in terms of generalization, performance, and accuracy.

17. MagicEyes: A Large Scale Eye Gaze Estimation Dataset for Mixed Reality
Zhengyang Wu, Srivignesh Rajendran, Tarrence van As, Joelle Zimmermann, Vijay Badrinarayanan, Andrew Rabinovich
Abstract: With the emergence of Virtual and Mixed Reality (XR) devices, eye tracking has received significant attention in the computer vision community. Eye gaze estimation is a crucial component in XR -- enabling energy efficient rendering, multi-focal displays, and effective interaction with content. In head-mounted XR devices, the eyes are imaged off-axis to avoid blocking the field of view. This leads to increased challenges in inferring eye related quantities and simultaneously provides an opportunity to develop accurate and robust learning based approaches. To this end, we present MagicEyes, the first large scale eye dataset collected using real MR devices with comprehensive ground truth labeling. MagicEyes includes 587 subjects with 80,000 images of human-labeled ground truth and over 800,000 images with gaze target labels. We evaluate several state-of-the-art methods on MagicEyes and also propose a new multi-task EyeNet model designed for detecting the cornea, glints and pupil along with eye segmentation in a single forward pass.

18. Deep Object Detection based Mitosis Analysis in Breast Cancer Histopathological Images
Anabia Sohail, Muhammad Ahsan Mukhtar, Asifullah Khan, Muhammad Mohsin Zafar, Aneela Zameer, Saranjam Khan
Abstract: Empirical evaluation of breast tissue biopsies for mitotic nuclei detection is considered an important prognostic biomarker in tumor grading and cancer progression. However, automated mitotic nuclei detection poses several challenges because of the unavailability of pixel-level annotations, different morphological configurations of mitotic nuclei, their sparse representation, and close resemblance with non-mitotic nuclei. These challenges undermine the precision of the automated detection model and thus make detection difficult in a single phase. This work proposes an end-to-end detection system for mitotic nuclei identification in breast cancer histopathological images. Deep object detection-based Mask R-CNN is adapted for mitotic nuclei detection that initially selects the candidate mitotic region with maximum recall. However, in the second phase, these candidate regions are refined by multi-object loss function to improve the precision. The performance of the proposed detection model shows improved discrimination ability (F-score of 0.86) for mitotic nuclei with significant precision (0.86) as compared to the two-stage detection models (F-score of 0.701) on TUPAC16 dataset. Promising results suggest that the deep object detection-based model has the potential to learn the characteristic features of mitotic nuclei from weakly annotated data and suggests that it can be adapted for the identification of other nuclear bodies in histopathological images.
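Summary: Evaluating breast tissue biopsies for mitotic nuclei is an important prognostic biomarker for tumor grading and cancer progression, but automated detection is hampered by missing pixel-level annotations, varied nuclear morphology, sparse occurrence, and close resemblance to non-mitotic nuclei. The paper proposes an end-to-end system that adapts Mask R-CNN to select candidate mitotic regions with maximum recall and then refines them in a second phase with a multi-object loss function to improve precision. On TUPAC16 it reports an F-score of 0.86 with precision 0.86, compared with an F-score of 0.701 for two-stage detection models. The authors suggest the approach can learn from weakly annotated data and be adapted to identify other nuclear bodies in histopathological images.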

19. Dynamic Multiscale Graph Neural Networks for 3D Skeleton-Based Human Motion Prediction [PDF] Back to contents
Maosen Li, Siheng Chen, Yangheng Zhao, Ya Zhang, Yanfeng Wang, Qi Tian
Abstract: We propose novel dynamic multiscale graph neural networks (DMGNN) to predict 3D skeleton-based human motions. The core idea of DMGNN is to use a multiscale graph to comprehensively model the internal relations of a human body for motion feature learning. This multiscale graph is adaptive during training and dynamic across network layers. Based on this graph, we propose a multiscale graph computational unit (MGCU) to extract features at individual scales and fuse features across scales. The entire model is action-category-agnostic and follows an encoder-decoder framework. The encoder consists of a sequence of MGCUs to learn motion features. The decoder uses a proposed graph-based gate recurrent unit to generate future poses. Extensive experiments show that the proposed DMGNN outperforms state-of-the-art methods in both short and long-term predictions on the datasets of Human 3.6M and CMU Mocap. We further investigate the learned multiscale graphs for the interpretability. The codes could be downloaded from this https URL.
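Summary: The paper proposes dynamic multiscale graph neural networks (DMGNN) for 3D skeleton-based human motion prediction. A multiscale graph, adaptive during training and dynamic across network layers, models the internal relations of the human body; a multiscale graph computational unit (MGCU) extracts features at individual scales and fuses them across scales. The model is action-category-agnostic and follows an encoder-decoder framework, with a sequence of MGCUs as the encoder and a graph-based gate recurrent unit as the decoder that generates future poses. Experiments on Human 3.6M and CMU Mocap show that DMGNN outperforms state-of-the-art methods in both short- and long-term prediction. The learned multiscale graphs are also examined for interpretability, and the code can be downloaded from the linked URL.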

20. Pedestrian Detection: The Elephant In The Room [PDF] Back to contents
Irtiza Hasan, Shengcai Liao, Jinpeng Li, Saad Ullah Akram, Ling Shao
Abstract: Pedestrian detection is used in many vision-based applications ranging from video surveillance to autonomous driving. Despite achieving high performance, it is still largely unknown how well existing detectors generalize to unseen data. To this end, we conduct a comprehensive study in this paper, using a general principle of direct cross-dataset evaluation. Through this study, we find that existing state-of-the-art pedestrian detectors generalize poorly from one dataset to another. We demonstrate that there are two reasons for this trend. Firstly, they over-fit on popular datasets in a traditional single-dataset training and test pipeline. Secondly, the training source is generally not dense in pedestrians and diverse in scenarios. Accordingly, through experiments we find that a general purpose object detector works better in direct cross-dataset evaluation compared with state-of-the-art pedestrian detectors and we illustrate that diverse and dense datasets, collected by crawling the web, serve to be an efficient source of pre-training for pedestrian detection. Furthermore, we find that a progressive training pipeline works well for autonomous-driving-oriented detectors. We improve upon previous state-of-the-art on reasonable/heavy subsets of CityPersons dataset by 1.3%/1.7% and on Caltech by 1.8%/14.9% in terms of log average miss rate (MR^2) points without any fine-tuning on the test set. The detector trained through the proposed pipeline achieves top rank at the leaderboards of CityPersons [42] and ECP [4]. Code and models will be available at this https URL.
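Summary: Pedestrian detection underpins applications from video surveillance to autonomous driving, yet how well existing detectors generalize to unseen data is largely unknown. Using direct cross-dataset evaluation, the study finds that state-of-the-art pedestrian detectors generalize poorly from one dataset to another, for two reasons: they over-fit popular datasets in single-dataset train/test pipelines, and training sources are generally neither dense in pedestrians nor diverse in scenarios. A general-purpose object detector does better in cross-dataset evaluation, diverse and dense web-crawled datasets are an efficient pre-training source, and a progressive training pipeline suits detectors aimed at autonomous driving. Without fine-tuning on the test set, the approach improves on the previous state of the art by 1.3%/1.7% on the reasonable/heavy subsets of CityPersons and by 1.8%/14.9% on Caltech in log-average miss rate (MR^2) points, and ranks first on the CityPersons [42] and ECP [4] leaderboards. Code and models will be available at the linked URL.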

21. Incremental Object Detection via Meta-Learning [PDF] Back to contents
K J Joseph, Jathushan Rajasegaran, Salman Khan, Fahad Shahbaz Khan, Vineeth Balasubramanian, Ling Shao
Abstract: In a real-world setting, object instances from new classes may be continuously encountered by object detectors. When existing object detectors are applied to such scenarios, their performance on old classes deteriorates significantly. A few efforts have been reported to address this limitation, all of which apply variants of knowledge distillation to avoid catastrophic forgetting. We note that although distillation helps to retain previous learning, it obstructs fast adaptability to new tasks, which is a critical requirement for incremental learning. In this pursuit, we propose a meta-learning approach that learns to reshape model gradients, such that information across incremental tasks is optimally shared. This ensures a seamless information transfer via a meta-learned gradient preconditioning that minimizes forgetting and maximizes knowledge transfer. In comparison to existing meta-learning methods, our approach is task-agnostic, allows incremental addition of new-classes and scales to large-sized models for object detection. We evaluate our approach on a variety of incremental settings defined on PASCAL-VOC and MS COCO datasets, demonstrating significant improvements over state-of-the-art.
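Summary: Object detectors in real-world settings continually encounter instances of new classes, and applying existing detectors in such scenarios significantly degrades their performance on old classes. The few reported efforts all use variants of knowledge distillation to avoid catastrophic forgetting, but distillation obstructs fast adaptation to new tasks, a critical requirement for incremental learning. The paper proposes a meta-learning approach that learns to reshape model gradients so that information is optimally shared across incremental tasks; this meta-learned gradient preconditioning minimizes forgetting and maximizes knowledge transfer. Compared with existing meta-learning methods, the approach is task-agnostic, allows incremental addition of new classes, and scales to large object detection models. It shows significant improvements over the state of the art in a variety of incremental settings defined on PASCAL-VOC and MS COCO.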

22. Teacher-Student chain for efficient semi-supervised histology image classification [PDF] Back to contents
Shayne Shaw, Maciej Pajak, Aneta Lisowska, Sotirios A Tsaftaris, Alison Q O'Neil
Abstract: Deep learning shows great potential for the domain of digital pathology. An automated digital pathology system could serve as a second reader, perform initial triage in large screening studies, or assist in reporting. However, it is expensive to exhaustively annotate large histology image databases, since medical specialists are a scarce resource. In this paper, we apply the semi-supervised teacher-student knowledge distillation technique proposed by Yalniz et al. (2019) to the task of quantifying prognostic features in colorectal cancer. We obtain accuracy improvements through extending this approach to a chain of students, where each student's predictions are used to train the next student i.e. the student becomes the teacher. Using the chain approach, and only 0.5% labelled data (the remaining 99.5% in the unlabelled pool), we match the accuracy of training on 100% labelled data. At lower percentages of labelled data, similar gains in accuracy are seen, allowing some recovery of accuracy even from a poor initial choice of labelled training set. In conclusion, this approach shows promise for reducing the annotation burden, thus increasing the affordability of automated digital pathology systems.
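Summary: Deep learning shows great potential for digital pathology, where an automated system could act as a second reader, perform initial triage in large screening studies, or assist in reporting. Exhaustively annotating large histology image databases is expensive, however, because medical specialists are a scarce resource. The paper applies the semi-supervised teacher-student knowledge distillation technique of Yalniz et al. (2019) to quantifying prognostic features in colorectal cancer, and extends it to a chain of students in which each student's predictions are used to train the next student, so that the student becomes the teacher. With only 0.5% labelled data (the remaining 99.5% in the unlabelled pool), the chain matches the accuracy of training on 100% labelled data. Similar gains appear at lower percentages of labelled data, allowing some recovery of accuracy even from a poor initial choice of labelled training set. The approach shows promise for reducing the annotation burden and so making automated digital pathology systems more affordable.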
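
The chain idea in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the nearest-centroid "model", the function names, and the choice to retrain each student on the labelled set plus the pseudo-labelled pool stand in for the deep networks and training schedule of the paper, which the abstract does not specify. The sketch also assumes every class appears at least once in the labelled set.

```python
import numpy as np

def fit_centroids(x, y, n_classes):
    """Toy 'model': one mean vector per class (stand-in for a deep network)."""
    return np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])

def predict(centroids, x):
    """Assign each sample to the class of its closest centroid."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def teacher_student_chain(x_lab, y_lab, x_unlab, n_classes, n_students=3):
    """Train a teacher on the labelled data, then a chain of students.

    Each student is trained on the labelled data plus pseudo-labels that the
    previous model predicted for the unlabelled pool; the student then acts
    as the teacher for the next link in the chain.
    """
    model = fit_centroids(x_lab, y_lab, n_classes)  # initial teacher
    for _ in range(n_students):
        pseudo = predict(model, x_unlab)            # teacher labels the pool
        x_all = np.concatenate([x_lab, x_unlab])
        y_all = np.concatenate([y_lab, pseudo])
        model = fit_centroids(x_all, y_all, n_classes)  # student becomes teacher
    return model
```

In this toy setting the benefit comes from the unlabelled pool pulling each centroid toward the true class mean; whether a chain helps a real histology classifier depends on details the abstract leaves open.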

23. Deep Active Learning for Remote Sensing Object Detection [PDF] Back to contents
Zhenshen Qu, Jingda Du, Yong Cao, Qiuyu Guan, Pengbo Zhao
Abstract: Recently, CNN object detectors have achieved high accuracy on remote sensing images but require huge labor and time costs for annotation. In this paper, we propose a new uncertainty-based active learning method that selects the more informative images for annotation, so that the detector can still reach high performance with a fraction of the training images. Our method not only analyzes objects' classification uncertainty to find the least confident objects but also considers their regression uncertainty to declare outliers. In addition, we introduce two extra weights to overcome two difficulties in remote sensing datasets: class imbalance and differences in the number of objects per image. We evaluate our active learning algorithm on the DOTA dataset with CenterNet as the object detector. We achieve the same level of performance as full supervision with only half of the images. We even surpass full supervision with 55% of the images and augmented weights on the least confident images.
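Summary: CNN object detectors achieve high accuracy on remote sensing images but require huge annotation labor and time. The paper proposes an uncertainty-based active learning method that selects the more informative images for annotation, so the detector reaches high performance with a fraction of the training images. It uses objects' classification uncertainty to find the least confident objects and their regression uncertainty to declare outliers, and adds two weights to address class imbalance and differences in the number of objects per image. On DOTA with CenterNet as the detector, the method matches full supervision using only half of the images, and surpasses it using 55% of the images with augmented weights on the least confident images.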
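
The classification-uncertainty part of this selection rule can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the scoring function, the names `image_uncertainty` and `select_for_annotation`, and the dictionary input format are hypothetical, and the regression uncertainty and the two extra weights mentioned in the abstract are left out.

```python
import numpy as np

def image_uncertainty(class_probs):
    """Score one image by its least confident detected object.

    class_probs: (n_objects, n_classes) class scores for one image.
    Returns 1 - (lowest top-class score among the objects); 0.0 if the
    image has no detections (an arbitrary choice for this sketch).
    """
    if len(class_probs) == 0:
        return 0.0
    top = np.asarray(class_probs).max(axis=1)  # confidence of each object
    return float(1.0 - top.min())

def select_for_annotation(pool, budget):
    """Return the `budget` image names with the highest uncertainty score.

    pool: dict mapping image name -> per-object class scores.
    """
    scores = {name: image_uncertainty(p) for name, p in pool.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget]
```

A typical loop would train the detector on the current labelled set, score the unlabelled pool, send the selected images for annotation, and repeat until the budget is spent.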

24. High-Resolution Daytime Translation Without Domain Labels [PDF] Back to contents
Ivan Anokhin, Pavel Solovev, Denis Korzhenkov, Alexey Kharlamov, Taras Khakhulin, Gleb Sterkin, Alexey Silvestrov, Sergey Nikolenko, Victor Lempitsky
Abstract: Modeling daytime changes in high resolution photographs, e.g., re-rendering the same scene under different illuminations typical for day, night, or dawn, is a challenging image manipulation task. We present the high-resolution daytime translation (HiDT) model for this task. HiDT combines a generative image-to-image model and a new upsampling scheme that allows image translation to be applied at high resolution. The model demonstrates competitive results in terms of both commonly used GAN metrics and human evaluation. Uniquely, this good performance comes as a result of training on a dataset of still landscape images with no daytime labels available. Our results are available at this https URL.
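Summary: Modeling daytime changes in high-resolution photographs, for example re-rendering the same scene under illumination typical of day, night, or dawn, is a challenging image manipulation task. The high-resolution daytime translation (HiDT) model combines a generative image-to-image model with a new upsampling scheme that allows image translation at high resolution. It achieves competitive results on commonly used GAN metrics and in human evaluation, despite being trained on a dataset of still landscape images with no daytime labels. Results are available at the linked URL.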

25. Child Face Age-Progression via Deep Feature Aging [PDF] Back to contents
Debayan Deb, Divyansh Aggarwal, Anil K. Jain
Abstract: Given a gallery of face images of missing children, state-of-the-art face recognition systems fall short in identifying a child (probe) recovered at a later age. We propose a feature aging module that can age-progress deep face features output by a face matcher. In addition, the feature aging module guides age-progression in the image space such that synthesized aged faces can be utilized to enhance longitudinal face recognition performance of any face matcher without requiring any explicit training. For time lapses larger than 10 years (the missing child is found after 10 or more years), the proposed age-progression module improves the closed-set identification accuracy of FaceNet from 16.53% to 21.44% and CosFace from 60.72% to 66.12% on a child celebrity dataset, namely ITWCC. The proposed method also outperforms state-of-the-art approaches with a rank-1 identification rate of 95.91%, compared to 94.91%, on a public aging dataset, FG-NET, and 99.58%, compared to 99.50%, on CACD-VS. These results suggest that aging face features enhances the ability to identify young children who are possible victims of child trafficking or abduction.
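Summary: Given a gallery of face images of missing children, state-of-the-art face recognition systems fall short in identifying a child (probe) recovered at a later age. The paper proposes a feature aging module that age-progresses the deep face features output by a face matcher. The module also guides age progression in image space, so that synthesized aged faces can improve the longitudinal recognition performance of any face matcher without explicit training. For time lapses larger than 10 years, the module raises closed-set identification accuracy on the child celebrity dataset ITWCC from 16.53% to 21.44% for FaceNet and from 60.72% to 66.12% for CosFace. It also reaches rank-1 identification rates of 95.91% (versus 94.91%) on the public aging dataset FG-NET and 99.58% (versus 99.50%) on CACD-VS. The results suggest that aging face features improves the ability to identify young children who may be victims of child trafficking or abduction.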

26. Self-Guided Adaptation: Progressive Representation Alignment for Domain Adaptive Object Detection [PDF] Back to contents
Zongxian Li, Qixiang Ye, Chong Zhang, Jingjing Liu, Shijian Lu, Yonghong Tian
Abstract: Unsupervised domain adaptation (UDA) has achieved unprecedented success in improving the cross-domain robustness of object detection models. However, existing UDA methods largely ignore the instantaneous data distribution during model learning, which could deteriorate the feature representation given large domain shift. In this work, we propose a Self-Guided Adaptation (SGA) model, which targets aligning feature representations and transferring object detection models across domains while considering the instantaneous alignment difficulty. The core of SGA is to calculate "hardness" factors for sample pairs indicating domain distance in a kernel space. With the hardness factor, the proposed SGA adaptively indicates the importance of samples and assigns them different constraints. Indicated by hardness factors, Self-Guided Progressive Sampling (SPS) is implemented in an "easy-to-hard" way during model adaptation. Using multi-stage convolutional features, SGA is further aggregated to fully align hierarchical representations of detection models. Extensive experiments on commonly used benchmarks show that SGA improves the state-of-the-art methods with significant margins, while demonstrating the effectiveness on large domain shift.
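Summary: Unsupervised domain adaptation (UDA) has greatly improved the cross-domain robustness of object detection models, but existing methods largely ignore the instantaneous data distribution during model learning, which can degrade feature representations under large domain shift. The proposed Self-Guided Adaptation (SGA) model aligns feature representations and transfers detection models across domains while accounting for the instantaneous alignment difficulty. Its core is a "hardness" factor computed for sample pairs, indicating domain distance in a kernel space. The factor indicates the importance of samples, assigns them different constraints, and drives Self-Guided Progressive Sampling (SPS) in an "easy-to-hard" order during adaptation. SGA is aggregated over multi-stage convolutional features to align hierarchical representations. Experiments on commonly used benchmarks show significant margins over state-of-the-art methods, including under large domain shift.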

27. Measuring and improving the quality of visual explanations [PDF] Back to contents
Agnieszka Grabska-Barwińska
Abstract: The ability to explain neural network decisions goes hand in hand with their safe deployment. Several methods have been proposed to highlight features important for a given network decision. However, there is no consensus on how to measure effectiveness of these methods. We propose a new procedure for evaluating explanations. We use it to investigate visual explanations extracted from a range of possible sources in a neural network. We quantify the benefit of combining these sources and challenge a recent appeal for taking bias parameters into account. We support our conclusions with a general assessment of the impact of bias parameters in ImageNet classifiers.
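Summary: The ability to explain neural network decisions goes hand in hand with their safe deployment. Several methods highlight the features important for a given network decision, but there is no consensus on how to measure their effectiveness. The paper proposes a new procedure for evaluating explanations and uses it to investigate visual explanations extracted from a range of possible sources in a neural network. It quantifies the benefit of combining these sources, challenges a recent appeal for taking bias parameters into account, and supports its conclusions with a general assessment of the impact of bias parameters in ImageNet classifiers.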

28. Do CNNs Encode Data Augmentations? [PDF] Back to contents
Eddie Yan, Yanping Huang
Abstract: Data augmentations are an important ingredient in the recipe for training robust neural networks, especially in computer vision. A fundamental question is whether neural network features explicitly encode data augmentation transformations. To answer this question, we introduce a systematic approach to investigate which layers of neural networks are the most predictive of augmentation transformations. Our approach uses layer features in pre-trained vision models with minimal additional processing to predict common properties transformed by augmentation (scale, aspect ratio, hue, saturation, contrast, brightness). Surprisingly, neural network features not only predict data augmentation transformations, but they predict many transformations with high accuracy. After validating that neural networks encode features corresponding to augmentation transformations, we show that these features are primarily encoded in the early layers of modern CNNs.
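Summary: Data augmentations are an important ingredient in training robust neural networks, especially in computer vision. To test whether network features explicitly encode augmentation transformations, the paper introduces a systematic approach for finding which layers are most predictive of them. It uses layer features from pre-trained vision models, with minimal additional processing, to predict properties changed by augmentation (scale, aspect ratio, hue, saturation, contrast, brightness). The features predict many transformations with high accuracy, and these features are primarily encoded in the early layers of modern CNNs.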

29. "Who is Driving around Me?" Unique Vehicle Instance Classification using Deep Neural Features [PDF] Back to contents
Tim Oosterhuis, Lambert Schomaker
Abstract: Being aware of other traffic is a prerequisite for self-driving cars to operate in the real world. In this paper, we show how the intrinsic feature maps of an object detection CNN can be used to uniquely identify vehicles from a dash-cam feed. Feature maps of a pretrained 'YOLO' network are used to create 700 deep integrated feature signatures (DIFS) from 20 different images of 35 vehicles from a high resolution dataset and 340 signatures from 20 different images of 17 vehicles of a lower resolution tracking benchmark dataset. The YOLO network was trained to classify general object categories, e.g. classify a detected object as a 'car' or 'truck'. 5-fold nearest neighbor (1NN) classification was used on DIFS created from feature maps in the middle layers of the network to correctly identify unique vehicles at a rate of 96.7% for the high resolution data and with a rate of 86.8% for the lower resolution data. We conclude that a deep neural detection network trained to distinguish between different classes can be successfully used to identify different instances belonging to the same class, through the creation of deep integrated feature signatures (DIFS).
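Summary: Awareness of other traffic is a prerequisite for self-driving cars operating in the real world. The paper shows that the intrinsic feature maps of an object detection CNN can uniquely identify vehicles from a dash-cam feed. Feature maps of a pretrained YOLO network, which was trained only to classify general object categories such as 'car' or 'truck', are used to build 700 deep integrated feature signatures (DIFS) from 20 images each of 35 vehicles in a high-resolution dataset, and 340 signatures from 20 images each of 17 vehicles in a lower-resolution tracking benchmark. With 5-fold nearest-neighbor (1NN) classification on DIFS built from the middle layers of the network, unique vehicles are identified at a rate of 96.7% on the high-resolution data and 86.8% on the lower-resolution data. The authors conclude that a detection network trained to distinguish classes can also be used to identify different instances of the same class.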
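
The evaluation protocol named in the abstract, 5-fold 1NN classification over fixed-length signatures, can be sketched as below. This is a generic illustration under stated assumptions: the abstract does not say how DIFS vectors are built from the YOLO feature maps or which distance is used, so the sketch takes the signatures as given arrays and assumes Euclidean distance; the function names are hypothetical.

```python
import numpy as np

def one_nn_predict(train_x, train_y, test_x):
    """Label each test signature with the label of its nearest training one."""
    dists = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    return train_y[dists.argmin(axis=1)]

def k_fold_accuracy(x, y, n_folds=5, seed=0):
    """Cross-validated 1NN accuracy over shuffled, near-equal folds.

    x: (n_signatures, dim) signature vectors; y: (n_signatures,) vehicle ids.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    correct = 0
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        pred = one_nn_predict(x[train_idx], y[train_idx], x[test_idx])
        correct += int((pred == y[test_idx]).sum())
    return correct / len(x)
```

With 20 signatures per vehicle, each fold holds out roughly four signatures of every vehicle, so the classifier always has other views of the same vehicle to match against.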

30. ElixirNet: Relation-aware Network Architecture Adaptation for Medical Lesion Detection [PDF] Back to contents
Chenhan Jiang, Shaoju Wang, Hang Xu, Xiaodan Liang, Nong Xiao
Abstract: Most advances in medical lesion detection networks are limited to subtle modifications of the conventional detection networks designed for natural images. However, there exists a vast domain gap between medical images and natural images, where medical image detection often suffers from several domain-specific challenges, such as high lesion/background similarity, dominant tiny lesions, and severe class imbalance. Is a hand-crafted detection network tailored for natural images undoubtedly good enough for a discrepant medical lesion domain? Are there more powerful operations, filters, and sub-networks that better fit the medical lesion detection problem to be discovered? In this paper, we introduce a novel ElixirNet that includes three components: 1) TruncatedRPN balances positive and negative data for false positive reduction; 2) Auto-lesion Block is automatically customized for medical images to incorporate relation-aware operations among region proposals, and leads to more suitable and efficient classification and localization; 3) Relation transfer module incorporates the semantic relationship and transfers the relevant contextual information with an interpretable graph, thus alleviating the problem of lack of annotations for all types of lesions. Experiments on DeepLesion and Kits19 prove the effectiveness of ElixirNet, achieving improvement of both sensitivity and precision over FPN with fewer parameters.
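Summary: Most advances in medical lesion detection make only subtle modifications to detection networks designed for natural images, despite a vast domain gap: medical image detection suffers from high lesion/background similarity, dominant tiny lesions, and severe class imbalance. ElixirNet has three components. (1) TruncatedRPN balances positive and negative data to reduce false positives. (2) The Auto-lesion Block is automatically customized for medical images to incorporate relation-aware operations among region proposals, giving more suitable and efficient classification and localization. (3) The relation transfer module incorporates semantic relationships and transfers relevant contextual information through an interpretable graph, alleviating the lack of annotations for all lesion types. Experiments on DeepLesion and Kits19 show improved sensitivity and precision over FPN with fewer parameters.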

31. Personalized Taste and Cuisine Preference Modeling via Images [PDF] Back to contents
Nitish Nag, Bindu Rajanna, Ramesh Jain
Abstract: With the exponential growth in the usage of social media to share live updates about life, taking pictures has become an unavoidable phenomenon. Individuals unknowingly create a unique knowledge base with these images. The food images, in particular, are of interest as they contain a plethora of information. From the image metadata and using computer vision tools, we can extract distinct insights for each user to build a personal profile. Using the underlying connection between cuisines and their inherent tastes, we attempt to develop such a profile for an individual based solely on the images of his food. Our study provides insights about an individual's inclination towards particular cuisines. Interpreting these insights can lead to the development of a more precise recommendation system. Such a system would avoid the generic approach in favor of a personalized recommendation system.
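Summary: With the exponential growth of social media for sharing live updates about life, taking pictures has become an unavoidable phenomenon, and individuals unknowingly build a unique knowledge base with their images. Food images are of particular interest because they contain a wealth of information. From image metadata and computer vision tools, the authors extract insights for each user and, using the connection between cuisines and their inherent tastes, attempt to build a personal profile based solely on images of the user's food. The study gives insight into an individual's inclination towards particular cuisines, which could support a more precise, personalized recommendation system in place of a generic one.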

32. A Review of Computational Approaches for Evaluation of Rehabilitation Exercises [PDF] Back to contents
Yalin Liao, Aleksandar Vakanski, Min Xian, David Paul, Russell Baker
Abstract: Recent advances in data analytics and computer-aided diagnostics stimulate the vision of patient-centric precision healthcare, where treatment plans are customized based on the health records and needs of every patient. In physical rehabilitation, the progress in machine learning and the advent of affordable and reliable motion capture sensors have been conducive to the development of approaches for automated assessment of patient performance and progress toward functional recovery. The presented study reviews computational approaches for evaluating patient performance in rehabilitation programs using motion capture systems. Such approaches will play an important role in supplementing traditional rehabilitation assessment performed by trained clinicians, and in assisting patients participating in home-based rehabilitation. The reviewed computational methods for exercise evaluation are grouped into three main categories: discrete movement score, rule-based, and template-based approaches. The review places an emphasis on the application of machine learning methods for movement evaluation in rehabilitation. Related work in the literature on data representation, feature engineering, movement segmentation, and scoring functions is presented. The study also reviews existing sensors for capturing rehabilitation movements and provides an informative listing of pertinent benchmark datasets. The significance of this paper is in being the first to provide a comprehensive review of computational methods for evaluation of patient performance in rehabilitation programs.
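Summary: Advances in data analytics and computer-aided diagnostics motivate patient-centric precision healthcare, where treatment plans are customized to each patient's health records and needs. In physical rehabilitation, progress in machine learning and affordable, reliable motion capture sensors has enabled automated assessment of patient performance and progress toward functional recovery. This study reviews computational approaches for evaluating patient performance in rehabilitation programs using motion capture systems; such approaches can supplement assessment by trained clinicians and assist patients in home-based rehabilitation. The reviewed methods are grouped into three main categories: discrete movement score, rule-based, and template-based approaches. The review emphasizes machine learning methods for movement evaluation, covers related work on data representation, feature engineering, movement segmentation, and scoring functions, and surveys existing sensors and pertinent benchmark datasets. The authors present it as the first comprehensive review of computational methods for evaluating patient performance in rehabilitation programs.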

33. Dense Crowds Detection and Surveillance with Drones using Density Maps [PDF] Back to contents
Javier Gonzalez-Trejo, Diego Mercado-Ravell
Abstract: Detecting and counting people in a human crowd from a moving drone presents challenging problems that arise from the constant changing in the image perspective and camera angle. In this paper, we test two different state-of-the-art approaches, density map generation with VGG19 trained with the Bayes loss function and detect-then-count with Faster RCNN with ResNet50-FPN as backbone, in order to compare their precision for counting and detecting people in different real scenarios taken from a drone flight. We show empirically that both proposed methodologies perform especially well for detecting and counting people in sparse crowds when the drone is near the ground. Nevertheless, VGG19 provides better precision on both tasks while also being lighter than Faster RCNN. Furthermore, VGG19 outperforms Faster RCNN when dealing with dense crowds, proving to be more robust to scale variations and strong occlusions, being more suitable for surveillance applications using drones.
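Summary: Detecting and counting people in a crowd from a moving drone is challenging because the image perspective and camera angle change constantly. The paper compares two state-of-the-art approaches on real drone-flight scenarios: density map generation with VGG19 trained with the Bayes loss function, and detect-then-count with Faster RCNN using a ResNet50-FPN backbone. Both perform especially well for detecting and counting people in sparse crowds when the drone is near the ground. VGG19 gives better precision on both tasks while being lighter than Faster RCNN, and it outperforms Faster RCNN on dense crowds, proving more robust to scale variations and strong occlusions and therefore more suitable for drone surveillance.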
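
For readers unfamiliar with density-map counting, the sketch below shows the relation it relies on: each person contributes unit mass to the map, so the crowd count is the sum over all pixels. This is only an illustration of that relation under stated assumptions. In the paper the map is predicted by a VGG19 network trained with the Bayes loss; here a ground-truth-style map is built directly from hypothetical head coordinates, and the function names and the Gaussian width `sigma` are made up for the example.

```python
import numpy as np

def density_map(points, shape, sigma=2.0):
    """Build a density map with one unit-mass Gaussian per annotated person.

    points: iterable of (row, col) head positions; shape: (height, width).
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dmap = np.zeros(shape, dtype=np.float64)
    for py, px in points:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()  # normalise so each person adds exactly 1
    return dmap

def count_from_density(dmap):
    """Estimated number of people: the integral (sum) of the density map."""
    return float(dmap.sum())
```

Because the count is an integral rather than a tally of boxes, heavily occluded or very small people can still contribute to the total, which is one common argument for density maps in dense crowds.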

34. Salient Facial Features from Humans and Deep Neural Networks [PDF] Back to contents
Shanmeng Sun, Wei Zhen Teoh, Michael Guerzhoy
Abstract: In this work, we explore the features that are used by humans and by convolutional neural networks (ConvNets) to classify faces. We use Guided Backpropagation (GB) to visualize the facial features that influence the output of a ConvNet the most when identifying specific individuals; we explore how to best use GB for that purpose. We use a human intelligence task to find out which facial features humans find to be the most important for identifying specific individuals. We explore the differences between the saliency information gathered from humans and from ConvNets. Humans develop biases in employing available information on facial features to discriminate across faces. Studies show these biases are influenced both by neurological development and by each individual's social experience. In recent years the computer vision community has achieved human-level performance in many face processing tasks with deep neural network-based models. These face processing systems are also subject to systematic biases due to model architectural choices and training data distribution.
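Summary: The paper explores the features that humans and convolutional neural networks (ConvNets) use to classify faces. Guided Backpropagation (GB) is used to visualize the facial features that most influence a ConvNet's output when identifying specific individuals, and the authors explore how best to use GB for this purpose. A human intelligence task identifies the facial features humans find most important, and the saliency information gathered from humans and from ConvNets is compared. Humans develop biases in using facial information to discriminate between faces, shaped by neurological development and individual social experience. Face processing systems based on deep neural networks, which have reached human-level performance on many tasks, are likewise subject to systematic biases from architectural choices and training data distribution.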

35. Shape retrieval of non-rigid 3d human models [PDF] Back to contents
David Pickup, Xianfang Sun, Paul L Rosin, Ralph R Martin, Z Cheng, Zhouhui Lian, Masaki Aono, A Ben Hamza, A Bronstein, M Bronstein, S Bu, Umberto Castellani, S Cheng, Valeria Garro, Andrea Giachetti, Afzal Godil, Luca Isaia, J Han, Henry Johan, L Lai, Bo Li, C Li, Haisheng Li, Roee Litman, X Liu, Z Liu, Yijuan Lu, L Sun, G Tam, Atsushi Tatsuma, J Ye
Abstract: 3D models of humans are commonly used within computer graphics and vision, and so the ability to distinguish between body shapes is an important shape retrieval problem. We extend our recent paper which provided a benchmark for testing non-rigid 3D shape retrieval algorithms on 3D human models. This benchmark provided a far stricter challenge than previous shape benchmarks. We have added 145 new models for use as a separate training set, in order to standardise the training data used and provide a fairer comparison. We have also included experiments with the FAUST dataset of human scans. All participants of the previous benchmark study have taken part in the new tests reported here, many providing updated results using the new data. In addition, further participants have also taken part, and we provide extra analysis of the retrieval results. A total of 25 different shape retrieval methods are compared.
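Summary: 3D models of humans are common in computer graphics and vision, so distinguishing between body shapes is an important shape retrieval problem. The paper extends the authors' recent benchmark for non-rigid 3D shape retrieval on human models, which posed a far stricter challenge than previous shape benchmarks. It adds 145 new models as a separate training set, to standardise the training data and allow a fairer comparison, and includes experiments with the FAUST dataset of human scans. All participants of the previous study took part in the new tests, many with updated results, alongside further participants and extra analysis of the retrieval results. In total, 25 different shape retrieval methods are covered.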

36. Exemplar Normalization for Learning Deep Representation [PDF] Back to contents
Ruimao Zhang, Zhanglin Peng, Lingyun Wu, Zhen Li, Ping Luo
Abstract: Normalization techniques are important in different advanced neural networks and different tasks. This work investigates a novel dynamic learning-to-normalize (L2N) problem by proposing Exemplar Normalization (EN), which is able to learn different normalization methods for different convolutional layers and image samples of a deep network. EN significantly improves flexibility of the recently proposed switchable normalization (SN), which solves a static L2N problem by linearly combining several normalizers in each normalization layer (the combination is the same for all samples). Instead of directly employing a multi-layer perceptron (MLP) to learn data-dependent parameters as conditional batch normalization (cBN) did, the internal architecture of EN is carefully designed to stabilize its optimization, leading to many appealing benefits. (1) EN enables different convolutional layers, image samples, categories, benchmarks, and tasks to use different normalization methods, shedding light on analyzing them in a holistic view. (2) EN is effective for various network architectures and tasks. (3) It could replace any normalization layers in a deep network and still produce stable model training. Extensive experiments demonstrate the effectiveness of EN in a wide spectrum of tasks including image recognition, noisy label learning, and semantic segmentation. For example, by replacing BN in the ordinary ResNet50, improvement produced by EN is 300% more than that of SN on both ImageNet and the noisy WebVision dataset.
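Summary: Normalization techniques are important across advanced neural networks and tasks. The paper studies a dynamic learning-to-normalize (L2N) problem and proposes Exemplar Normalization (EN), which learns different normalization methods for different convolutional layers and image samples of a deep network. EN improves the flexibility of the recently proposed switchable normalization (SN), which solves a static L2N problem by linearly combining several normalizers in each normalization layer, with the same combination for all samples. Rather than directly using a multi-layer perceptron (MLP) to learn data-dependent parameters, as conditional batch normalization (cBN) did, EN's internal architecture is carefully designed to stabilize optimization. (1) EN lets different convolutional layers, image samples, categories, benchmarks, and tasks use different normalization methods. (2) EN is effective for various network architectures and tasks. (3) It can replace any normalization layer in a deep network and still give stable training. Experiments cover image recognition, noisy label learning, and semantic segmentation; for example, when replacing BN in the ordinary ResNet50, the improvement produced by EN is 300% more than that of SN on both ImageNet and the noisy WebVision dataset.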
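
The difference between a static combination (one set of weights for all samples, as the abstract describes for SN) and a per-sample combination (the setting EN targets) can be made concrete with a simplified sketch. Several things here are assumptions for illustration only: the sketch mixes the outputs of three normalizers with softmax weights, the choice of batch-, instance-, and layer-style normalizers is mine, and the logits are passed in by hand. The abstract does not describe how EN actually produces its sample-dependent weights, so that part is deliberately left out.

```python
import numpy as np

def _normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the std computed over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def combined_norm(x, logits):
    """Weighted mix of three normalizers applied to x of shape (N, C, H, W).

    logits of shape (3,)   -> the same mix for every sample (static case).
    logits of shape (N, 3) -> a separate mix per sample (dynamic case).
    """
    outs = np.stack([
        _normalize(x, (0, 2, 3)),  # batch-norm style statistics
        _normalize(x, (2, 3)),     # instance-norm style statistics
        _normalize(x, (1, 2, 3)),  # layer-norm style statistics
    ], axis=1)                     # shape (N, 3, C, H, W)
    w = softmax(np.broadcast_to(logits, (x.shape[0], 3)))
    return (w[:, :, None, None, None] * outs).sum(axis=1)
```

Passing a `(3,)` vector reproduces the "same combination for all samples" behaviour, while an `(N, 3)` array lets each image pick its own mix; learnable scale and shift parameters are omitted to keep the example short.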

37. A Question-Centric Model for Visual Question Answering in Medical Imaging [PDF] Back to contents
Minh H. Vu, Tommy Löfstedt, Tufve Nyholm, Raphael Sznitman
Abstract: Deep learning methods have proven extremely effective at performing a variety of medical image analysis tasks. With their potential use in clinical routine, their lack of transparency has however been one of their few weak points, raising concerns regarding their behavior and failure modes. While most research to infer model behavior has focused on indirect strategies that estimate prediction uncertainties and visualize model support in the input image space, the ability to explicitly query a prediction model regarding its image content offers a more direct way to determine the behavior of trained models. To this end, we present a novel Visual Question Answering approach that allows an image to be queried by means of a written question. Experiments on a variety of medical and natural image datasets show that by fusing image and question features in a novel way, the proposed approach achieves an equal or higher accuracy compared to current methods.
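Summary: Deep learning methods are highly effective for a variety of medical image analysis tasks, but their lack of transparency raises concerns about their behavior and failure modes in clinical routine. Most work on inferring model behavior uses indirect strategies that estimate prediction uncertainties or visualize model support in the input image space, whereas explicitly querying a prediction model about image content offers a more direct way to determine the behavior of trained models. The paper presents a Visual Question Answering approach that lets an image be queried with a written question. On a variety of medical and natural image datasets, fusing image and question features in a novel way achieves accuracy equal to or higher than current methods.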

38. FePh: An Annotated Facial Expression Dataset for the RWTH-PHOENIX-Weather 2014 Dataset [PDF] Back to contents
Marie Alaghband, Niloofar Yousefi, Ivan Garibay
Abstract: Facial expressions are important parts of both gesture and sign language recognition systems. Despite the recent advances in both fields, annotated facial expression datasets in the context of sign language are still a scarce resource. In this manuscript, we introduce a continuous sign language facial expression dataset, comprising over 3000 annotated images of the RWTH-PHOENIX-Weather 2014 development set. Unlike the majority of currently existing facial expression datasets, FePh provides sequenced semi-blurry facial images with different head poses, orientations, and movements. In addition, in the majority of images, identities are mouthing the words, which makes the data more challenging. To annotate this dataset we consider primary, secondary, and tertiary dyads of seven basic emotions of "sad", "surprise", "fear", "angry", "neutral", "disgust", and "happy". We also considered the "None" class if the image's facial expression could not be described by any of the aforementioned emotions. Although we provide FePh in the context of facial expression and sign language, it has a wider application in gesture recognition and Human Computer Interaction (HCI) systems. The dataset will be publicly available.

39. A Compact Spectral Descriptor for Shape Deformations
Skylar Sible, Rodrigo Iza-Teran, Jochen Garcke, Nikola Aulig, Patricia Wollstadt
Abstract: Modern product design in the engineering domain is increasingly driven by computational analysis including finite-element based simulation, computational optimization, and modern data analysis techniques such as machine learning. To apply these methods, suitable data representations for components under development as well as for related design criteria have to be found. While a component's geometry is typically represented by a polygon surface mesh, it is often not clear how to parameterize critical design properties in order to enable efficient computational analysis. In the present work, we propose a novel methodology to obtain a parameterization of a component's plastic deformation behavior under stress, which is an important design criterion in many application domains, for example, when optimizing the crash behavior in the automotive context. Existing parameterizations limit computational analysis to relatively simple deformations and typically require extensive input by an expert, making the design process time-intensive and costly. Hence, we propose a way to derive a compact descriptor of deformation behavior that is based on spectral mesh processing and enables a low-dimensional representation of even complex deformations. We demonstrate the descriptor's ability to represent relevant deformation behavior by applying it in a nearest-neighbor search to identify similar simulation results in a filtering task. The proposed descriptor provides a novel approach to the parameterization of geometric deformation behavior and enables the application of state-of-the-art data analysis techniques such as machine learning to engineering tasks concerned with plastic deformation behavior.

40. Adversarial Camouflage: Hiding Physical-World Attacks with Natural Styles
Ranjie Duan, Xingjun Ma, Yisen Wang, James Bailey, A. K. Qin, Yun Yang
Abstract: Deep neural networks (DNNs) are known to be vulnerable to adversarial examples. Existing works have mostly focused on either digital adversarial examples created via small and imperceptible perturbations, or physical-world adversarial examples created with large and less realistic distortions that are easily identified by human observers. In this paper, we propose a novel approach, called Adversarial Camouflage (AdvCam), to craft and camouflage physical-world adversarial examples into natural styles that appear legitimate to human observers. Specifically, AdvCam transfers large adversarial perturbations into customized styles, which are then "hidden" on the target object or in the off-target background. Experimental evaluation shows that, in both digital and physical-world scenarios, adversarial examples crafted by AdvCam are well camouflaged and highly stealthy, while remaining effective in fooling state-of-the-art DNN image classifiers. Hence, AdvCam is a flexible approach that can help craft stealthy attacks to evaluate the robustness of DNNs. AdvCam can also be used to protect private information from being detected by deep learning systems.

41. Deep Neural Network Perception Models and Robust Autonomous Driving Systems
Mohammad Javad Shafiee, Ahmadreza Jeddi, Amir Nazemi, Paul Fieguth, Alexander Wong
Abstract: This paper analyzes the robustness of deep learning models in autonomous driving applications and discusses the practical solutions to address that.

42. Adaptive binarization based on fuzzy integrals
Francesco Bardozzo, Borja De La Osa, Lubomira Horanska, Javier Fumanal-Idocin, Mattia delli Priscoli, Luigi Troiano, Roberto Tagliaferri, Javier Fernandez, Humberto Bustince
Abstract: Adaptive binarization methodologies threshold the intensity of the pixels with respect to adjacent pixels by exploiting integral images. In turn, the integral images are generally computed optimally using the summed-area-table algorithm (SAT). This document presents a new adaptive binarization technique based on fuzzy integral images through an efficient design of a modified SAT for fuzzy integrals. We define this new methodology as FLAT (Fuzzy Local Adaptive Thresholding). The experimental results show that the proposed methodology produces thresholded images whose quality is often better than that of traditional algorithms and saliency neural networks. We propose a new generalization of the Sugeno and CF 1,2 integrals to improve existing results with an efficient integral image computation. Therefore, these new generalized fuzzy integrals can be used as a tool for grayscale processing in real-time and deep-learning applications. Index Terms: Image Thresholding, Image Processing, Fuzzy Integrals, Aggregation Functions
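
The classical building blocks named in this abstract can be sketched as follows. This is a minimal illustration of a plain summed-area table and a local-mean threshold, not the fuzzy-integral FLAT method itself; the function names and the `radius` and `ratio` parameters are assumptions chosen for the example.

```python
def integral_image(img):
    """Build a summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def window_sum(sat, y0, x0, y1, x1):
    """Sum of the pixels in rows y0..y1-1 and columns x0..x1-1, in O(1)."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


def adaptive_threshold(img, radius=1, ratio=0.9):
    """Set a pixel to 1 if it exceeds ratio * (mean of its local window)."""
    h, w = len(img), len(img[0])
    sat = integral_image(img)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clip the window at the image border.
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            area = (y1 - y0) * (x1 - x0)
            local_sum = window_sum(sat, y0, x0, y1, x1)
            # Equivalent to pixel > ratio * local_sum / area, without dividing.
            out[y][x] = 1 if img[y][x] * area > ratio * local_sum else 0
    return out
```

Because every window sum costs four table lookups, the per-pixel cost does not grow with the window size, which is the property that makes integral images attractive for adaptive thresholding.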

43. SAM: The Sensitivity of Attribution Methods to Hyperparameters
Naman Bansal, Chirag Agarwal, Anh Nguyen
Abstract: Attribution methods can provide powerful insights into the reasons for a classifier's decision. We argue that a key desideratum of an explanation method is its robustness to input hyperparameters which are often randomly set or empirically tuned. High sensitivity to arbitrary hyperparameter choices does not only impede reproducibility but also questions the correctness of an explanation and impairs the trust of end-users. In this paper, we provide a thorough empirical study on the sensitivity of existing attribution methods. We found an alarming trend that many methods are highly sensitive to changes in their common hyperparameters e.g. even changing a random seed can yield a different explanation! Interestingly, such sensitivity is not reflected in the average explanation accuracy scores over the dataset as commonly reported in the literature. In addition, explanations generated for robust classifiers (i.e. which are trained to be invariant to pixel-wise perturbations) are surprisingly more robust than those generated for regular classifiers.

44. FineHand: Learning Hand Shapes for American Sign Language Recognition
Al Amin Hosain, Panneer Selvam Santhalingam, Parth Pathak, Huzefa Rangwala, Jana Kosecka
Abstract: American Sign Language recognition is a difficult gesture recognition problem, characterized by fast, highly articulate gestures. These are composed of arm movements with different hand shapes, facial expressions, and head movements. Among these components, hand shape is vital and often the most discriminative part of a gesture. In this work, we present an approach for effective learning of hand shape embeddings, which are discriminative for ASL gestures. For hand shape recognition, our method uses a mix of manually labelled hand shapes and high-confidence predictions to train a deep convolutional neural network (CNN). The sequential gesture component is captured by a recursive neural network (RNN) trained on the embeddings learned in the first stage. We demonstrate that higher quality hand shape models can significantly improve the accuracy of final video gesture classification in challenging conditions with a variety of speakers, different illumination, and significant motion blur. We compare our model to alternative approaches exploiting different modalities and representations of the data and show improved video gesture recognition accuracy on the GMU-ASL51 benchmark dataset.

45. Hierarchical Modes Exploring in Generative Adversarial Networks
Mengxiao Hu, Jinlong Li, Maolin Hu, Tao Hu
Abstract: In conditional Generative Adversarial Networks (cGANs), when two different initial noises are concatenated with the same conditional information, the distance between their outputs is relatively small, which makes minor modes likely to collapse into large modes. To prevent this from happening, we propose a hierarchical mode exploring method that alleviates mode collapse in cGANs by introducing a diversity measurement into the objective function as a regularization term. We also introduce the Expected Ratios of Expansion (ERE) into the regularization term. By minimizing the sum of differences between the real change of distance and the ERE, we can control the diversity of generated images with respect to specific-level features. We validate the proposed algorithm on four conditional image synthesis tasks, including categorical generation, paired and unpaired image translation, and text-to-image generation. Both qualitative and quantitative results show that the proposed method is effective in alleviating the mode collapse problem in cGANs and can control the diversity of output images with respect to specific-level features.

46. Unique Class Group Based Multi-Label Balancing Optimizer for Action Unit Detection
Ines Rieger, Jaspar Pahl, Dominik Seuss
Abstract: Balancing methods for single-label data cannot be applied to multi-label problems as they would also resample the samples with high occurrences. We propose to reformulate this problem as an optimization problem in order to balance multi-label data. We apply this balancing algorithm to training datasets for detecting isolated facial movements, so-called Action Units. Several Action Units can describe combined emotions or physical states such as pain. As datasets in this area are limited and mostly imbalanced, we show how optimized balancing and then augmentation can improve Action Unit detection. At the IEEE Conference on Face and Gesture Recognition 2020, we ranked third in the Affective Behavior Analysis in-the-wild (ABAW) challenge for the Action Unit detection task.

47. Longevity Associated Geometry Identified in Satellite Images: Sidewalks, Driveways and Hiking Trails
Joshua J. Levy, Rebecca M. Lebeaux, Anne G. Hoen, Brock C. Christensen, Louis J. Vaickus, Todd A. MacKenzie
Abstract: Importance: Following a century of increase, life expectancy in the United States has stagnated and begun to decline in recent decades. Using satellite images and street view images prior work has demonstrated associations of the built environment with income, education, access to care and health factors such as obesity. However, assessment of learned image feature relationships with variation in crude mortality rate across the United States has been lacking. Objective: Investigate prediction of county-level mortality rates in the U.S. using satellite images. Design: Satellite images were extracted with the Google Static Maps application programming interface for 430 counties representing approximately 68.9% of the US population. A convolutional neural network was trained using crude mortality rates for each county in 2015 to predict mortality. Learned image features were interpreted using Shapley Additive Feature Explanations, clustered, and compared to mortality and its associated covariate predictors. Main Outcomes and Measures: County mortality was predicted using satellite images. Results: Predicted mortality from satellite images in a held-out test set of counties was strongly correlated to the true crude mortality rate (Pearson r=0.72). Learned image features were clustered, and we identified 10 clusters that were associated with education, income, geographical region, race and age. Conclusion and Relevance: The application of deep learning techniques to remotely-sensed features of the built environment can serve as a useful predictor of mortality in the United States. Tools that are able to identify image features associated with health-related outcomes can inform targeted public health interventions.
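
The headline result above is reported as a Pearson correlation between predicted and true mortality rates. As a reminder of what that statistic measures, here is a minimal sketch of the sample Pearson coefficient; the function name is an assumption for this example.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered cross-products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means predictions rise and fall with the true rates across counties; it does not by itself say how large the absolute prediction errors are.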

48. Toward Enabling a Reliable Quality Monitoring System for Additive Manufacturing Process using Deep Convolutional Neural Networks
Yaser Banadaki, Nariman Razaviarab, Hadi Fekrmandi, Safura Sharifi
Abstract: Additive Manufacturing (AM) is a crucial component of the smart industry. In this paper, we propose an automated quality grading system for the AM process using a deep convolutional neural network (CNN) model. The CNN model is trained offline using images of the internal and surface defects in the layer-by-layer deposition of materials and tested online by studying its performance in detecting and classifying failures in the AM process at different extruder speeds and temperatures. The model demonstrates an accuracy of 94% and a specificity of 96%, as well as above 75% in three classifier measures (F-score, sensitivity, and precision) for classifying the quality of the printing process into five grades in real time. The proposed online model adds an automated, consistent, and non-contact quality control signal to the AM process that eliminates the manual inspection of parts after they are entirely built. The quality monitoring signal can also be used by the machine to suggest remedial actions by adjusting the parameters in real time. The proposed quality predictive model serves as a proof-of-concept for any type of AM machine to produce reliable parts with fewer quality hiccups while limiting the waste of both time and materials.
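
The abstract quotes accuracy, specificity, sensitivity, precision, and F-score. The sketch below shows how these relate to the four confusion-matrix counts in the binary case. How the paper aggregates them over its five grades is not stated here, so treating one grade against the rest is an assumption, and the counts in the usage note are made up.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from binary confusion counts.

    Assumes every denominator is non-zero, i.e. there is at least one
    actual positive, one actual negative, and one predicted positive.
    """
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "sensitivity": sensitivity,
        "precision": precision,
        "f_score": f_score,
    }
```

For example, `binary_metrics(tp=8, fp=2, tn=88, fn=2)` gives high accuracy and specificity alongside lower sensitivity and precision, which shows why a rare class can score well on the first two measures and noticeably worse on the others.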

49. Reduction of Surgical Risk Through the Evaluation of Medical Imaging Diagnostics
Marco A. V. M. Grinet, Nuno M. Garcia, Ana I. R. Gouveia, Jose A. F. Moutinho, Abel J. P. Gomes
Abstract: Computer aided diagnosis (CAD) of Breast Cancer (BRCA) images has been an active area of research in recent years. The main goal of this research is to develop reliable automatic methods for detecting and diagnosing different types of BRCA from diagnostic images. In this paper, we present a review of the state-of-the-art CAD methods applied to magnetic resonance (MRI) and mammography images of BRCA patients. The review aims to provide an extensive introduction to different features extracted from BRCA images through texture and statistical analysis and to categorize deep learning frameworks and data structures capable of using metadata to aggregate relevant information to assist oncologists and radiologists. We divide the existing literature according to the imaging modality and into radiomics, machine learning, or a combination of both. We also emphasize the differences between the modalities and the strengths and weaknesses of each method, and analyze their performance in detecting BRCA through a quantitative comparison. We compare the results of various approaches for implementing CAD systems for the detection of BRCA. Each approach's standard workflow components are reviewed and summary tables are provided. We present an extensive literature review of radiomics feature extraction techniques and machine learning methods applied in BRCA diagnosis and detection, focusing on data preparation, data structures, pre-processing, and post-processing strategies available in the literature. There is a growing interest in radiomic feature extraction and machine learning methods for BRCA detection through histopathological images, MRI, and mammography images. However, there is no CAD method able to combine distinct data types to provide the best diagnostic results. Applying data fusion techniques to medical images and patient data could lead to improved detection and classification results.

50. IROF: a low resource evaluation metric for explanation methods
Laura Rieger, Lars Kai Hansen
Abstract: The adoption of machine learning in health care hinges on the transparency of the algorithms used, which calls for explanation methods. However, despite a growing literature on explaining neural networks, no consensus has been reached on how to evaluate those explanation methods. We propose IROF, a new approach to evaluating explanation methods that circumvents the need for manual evaluation. Compared to other recent work, our approach requires several orders of magnitude fewer computational resources and no human input, making it accessible to lower resource groups and robust to human bias.

51. On the Road with 16 Neurons: Mental Imagery with Bio-inspired Deep Neural Networks
Alice Plebe, Mauro Da Lio
Abstract: This paper proposes a strategy for visual prediction in the context of autonomous driving. Humans, when not distracted or drunk, are still the best drivers you can currently find. For this reason we take inspiration from two theoretical ideas about the human mind and its neural organization. The first idea concerns how the brain uses a hierarchical structure of neuron ensembles to extract abstract concepts from visual experience and code them into compact representations. The second idea suggests that these neural perceptual representations are not neutral but functional to the prediction of the future state of affairs in the environment. Similarly, the prediction mechanism is not neutral but oriented to the current planning of a future action. We identify within the deep learning framework two artificial counterparts of the aforementioned neurocognitive theories. We find a correspondence between the first theoretical idea and the architecture of convolutional autoencoders, while we translate the second theory into a training procedure that learns compact representations which are not neutral but oriented to driving tasks, from two distinct perspectives. From a static perspective, we force groups of neural units in the compact representations to distinctly represent specific concepts crucial to the driving task. From a dynamic perspective, we encourage the compact representations to be predictive of how the current road scenario will change in the future. We successfully learn compact representations that use as few as 16 neural units for each of the two basic driving concepts we consider: car and lane. We prove the efficiency of our proposed perceptual representations on the SYNTHIA dataset. Our source code is available at this https URL

52. PLOP: Probabilistic poLynomial Objects trajectory Planning for autonomous driving
Thibault Buhet, Emilie Wirbel, Xavier Perrotton
Abstract: To navigate safely in an urban environment, an autonomous vehicle (ego vehicle) needs to understand and anticipate its surroundings, in particular the behavior of other road users (neighbors). However, multiple choices are often acceptable (e.g. turn right or left, or different ways of avoiding an obstacle). We focus here on predicting multiple feasible future trajectories for both the ego vehicle and neighbors through a probabilistic framework. We use a conditional imitation learning algorithm, conditioned by a navigation command for the ego vehicle (e.g. "turn right"). It takes as input the ego car's front camera image, a Lidar point cloud in a bird's-eye view grid, and present and past object detections, and outputs possible trajectories for the ego vehicle and neighbors, with semantic segmentation as an auxiliary loss. We evaluate our method on the publicly available dataset nuScenes, showing state-of-the-art performance and investigating the impact of our architecture choices.

53. Generative Multi-Stream Architecture For American Sign Language Recognition
Dom Huh, Sai Gurrapu, Frederick Olson, Huzefa Rangwala, Parth Pathak, Jana Kosecka
Abstract: With advancements in deep model architectures, tasks in computer vision can reach optimal convergence given proper data preprocessing and model parameter initialization. However, training on datasets with low feature-richness for complex applications limits optimal convergence to below human performance. In past works, researchers have provided external sources of complementary data at the cost of supplementary hardware, which are fed in streams to counteract this limitation and boost performance. We propose a generative multi-stream architecture, eliminating the need for additional hardware with the intent to improve feature richness without risking impracticability. We also introduce the compact spatio-temporal residual block to the standard 3-dimensional convolutional model, C3D. Our rC3D model performs comparably to the top C3D residual variant architecture, the pseudo-3D model, on the FASL-RGB dataset. Our methods have achieved 95.62% validation accuracy with a variance of 1.42% from training, outperforming past models by 0.45% in validation accuracy and 5.53% in variance.

54. A CNN-based Patent Image Retrieval Method for Design Ideation
Shuo Jiang, Jianxi Luo, Guillermo Ruiz Pava, Jie Hu, Christopher L. Magee
Abstract: The patent database is often used in searches of inspirational stimuli for innovative design opportunities because of its large size, extensive variety, and the rich design information in patent documents. However, most patent mining research only focuses on textual information and ignores visual information. Herein, we propose a convolutional neural network (CNN)-based patent image retrieval method. The core of this approach is a novel neural network architecture named Dual-VGG that aims to accomplish two tasks: visual material type prediction and International Patent Classification (IPC) class label prediction. In turn, the trained neural network provides the deep features in the image embedding vectors that can be utilized for patent image retrieval. The accuracy of both training tasks and the patent image embedding space are evaluated to show the performance of our model. This approach is also illustrated in a case study of robot arm design retrieval. Compared to traditional keyword-based searching and Google image searching, the proposed method discovers more useful visual information for engineering design.

55. Out-of-Distribution Detection in Multi-Label Datasets using Latent Space of $β$-VAE
Vijaya Kumar Sundar, Shreyas Ramakrishna, Zahra Rahiminasab, Arvind Easwaran, Abhishek Dubey
Abstract: Learning Enabled Components (LECs) are widely used in a variety of perception-based autonomy tasks like image segmentation, object detection, end-to-end driving, etc. These components are trained with large image datasets with multimodal factors like weather conditions, time-of-day, traffic-density, etc. The LECs learn from these factors during training, and if any of these factors varies at test time, the components get confused, resulting in low-confidence predictions. Images with factors not seen during training are commonly referred to as Out-of-Distribution (OOD). For safe autonomy it is important to identify the OOD images, so that a suitable mitigation strategy can be performed. Classical one-class classifiers like SVM and SVDD are used to perform OOD detection. However, the multiple labels attached to the images in these datasets restrict the direct application of these techniques. We address this problem using the latent space of the $\beta$-Variational Autoencoder ($\beta$-VAE). We use the fact that a compact latent space generated by an appropriately selected $\beta$-VAE will encode the information about these factors in a few latent variables, which can be used for computationally inexpensive detection. We evaluate our approach on the nuScenes dataset, and our results show that the latent space of the $\beta$-VAE is sensitive to changes in the values of the generative factors.
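
For readers unfamiliar with the $\beta$-VAE, the standard training objective is reproduced below. This is the general formulation, not a detail taken from this paper; the notation ($q_\phi$ for the encoder, $p_\theta$ for the decoder, $d$ latent dimensions, a diagonal Gaussian posterior, and a standard normal prior) is assumed for illustration.

```latex
% beta-VAE objective (to be maximized); beta = 1 recovers the ordinary VAE
\mathcal{L}(\theta, \phi; x) =
  \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  - \beta \, D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)

% Closed-form KL term for q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))
% and p(z) = \mathcal{N}(0, I); it decomposes into one term per latent dimension
D_{\mathrm{KL}} = \frac{1}{2} \sum_{i=1}^{d}
  \bigl( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 \bigr)
```

Because the KL term is a sum over latent dimensions, each dimension's contribution can be inspected separately, which is one way a small number of informative latent variables could be singled out.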

56. A Matlab Toolbox for Feature Importance Ranking
Shaode Yu, Zhicheng Zhang, Xiaokun Liang, Junjie Wu, Erlei Zhang, Wenjian Qin, Yaoqin Xie
Abstract: More attention is being paid to feature importance ranking (FIR), in particular when thousands of features can be extracted for intelligent diagnosis and personalized medicine. A large number of FIR approaches have been proposed, while few are integrated for comparison and real-life applications. In this study, a MATLAB toolbox is presented and a total of 30 algorithms are collected. Moreover, the toolbox is evaluated on a database of 163 ultrasound images. For each breast mass lesion, 15 features are extracted. To figure out the optimal subset of features for classification, all combinations of features are tested and a linear support vector machine is used for the malignancy prediction of lesions annotated in ultrasound images. Finally, the effectiveness of FIR is analyzed according to performance comparison. The toolbox is online (this https URL). In our future work, more FIR methods, feature selection methods, and machine learning classifiers will be integrated.
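
Testing all combinations of 15 features means evaluating every non-empty subset, which is 2^15 - 1 = 32767 candidates. The sketch below enumerates them in Python rather than MATLAB and is not taken from the toolbox; the function name is an assumption, and the classifier evaluation that would run on each subset is omitted.

```python
from itertools import combinations


def all_feature_subsets(features):
    """Yield every non-empty subset of the features, smallest subsets first."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


# Number of candidate subsets for 15 features.
num_subsets = sum(1 for _ in all_feature_subsets(range(15)))
```

The count doubles with every added feature, so exhaustive search is feasible at 15 features but quickly becomes impractical for the thousands of features mentioned at the start of the abstract.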

57. Real-Time High-Performance Semantic Image Segmentation of Urban Street Scenes
Genshun Dong, Yan Yan, Chunhua Shen, Hanzi Wang
Abstract: Deep Convolutional Neural Networks (DCNNs) have recently shown outstanding performance in semantic image segmentation. However, state-of-the-art DCNN-based semantic segmentation methods usually suffer from high computational complexity due to the use of complex network architectures. This greatly limits their applications in the real-world scenarios that require real-time processing. In this paper, we propose a real-time high-performance DCNN-based method for robust semantic segmentation of urban street scenes, which achieves a good trade-off between accuracy and speed. Specifically, a Lightweight Baseline Network with Atrous convolution and Attention (LBN-AA) is firstly used as our baseline network to efficiently obtain dense feature maps. Then, the Distinctive Atrous Spatial Pyramid Pooling (DASPP), which exploits the different sizes of pooling operations to encode the rich and distinctive semantic information, is developed to detect objects at multiple scales. Meanwhile, a Spatial detail-Preserving Network (SPN) with shallow convolutional layers is designed to generate high-resolution feature maps preserving the detailed spatial information. Finally, a simple but practical Feature Fusion Network (FFN) is used to effectively combine both shallow and deep features from the semantic branch (DASPP) and the spatial branch (SPN), respectively. Extensive experimental results show that the proposed method respectively achieves the accuracy of 73.6% and 68.0% mean Intersection over Union (mIoU) with the inference speed of 51.0 fps and 39.3 fps on the challenging Cityscapes and CamVid test datasets (by only using a single NVIDIA TITAN X card). This demonstrates that the proposed method offers excellent performance at the real-time speed for semantic segmentation of urban street scenes.
Summary: The paper proposes a real-time DCNN-based method for semantic segmentation of urban street scenes that balances accuracy and speed. A lightweight baseline network with atrous convolution and attention (LBN-AA) produces dense feature maps; DASPP encodes multi-scale semantic information; a spatial detail-preserving network (SPN) keeps high-resolution spatial detail; and a feature fusion network (FFN) combines the two branches. On a single NVIDIA TITAN X card, the method reaches 73.6% mIoU at 51.0 fps on Cityscapes and 68.0% mIoU at 39.3 fps on CamVid.
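
The abstract above reports accuracy as mean Intersection over Union (mIoU). A minimal sketch of how that metric is commonly computed from a confusion matrix follows; the function name `mean_iou` and the toy counts are illustrative assumptions and are not taken from the paper.

```python
def mean_iou(confusion):
    """Mean Intersection over Union from a square confusion matrix.

    confusion[i][j] counts pixels whose true class is i and whose
    predicted class is j. Classes that appear in neither the ground
    truth nor the prediction are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]                                # correctly labeled pixels
        fn = sum(confusion[c]) - tp                         # class c pixels that were missed
        fp = sum(confusion[r][c] for r in range(n)) - tp    # pixels wrongly labeled c
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example (hypothetical counts, not results from the paper).
toy_confusion = [[50, 10],
                 [5, 35]]
toy_miou = mean_iou(toy_confusion)
```

Per-class IoU is averaged with equal weight, so a rare class counts as much as a frequent one.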

58. Fast Distance-based Anomaly Detection in Images Using an Inception-like Autoencoder
Natasa Sarafijanovic-Djukic, Jesse Davis
Abstract: The goal of anomaly detection is to identify examples that deviate from normal or expected behavior. We tackle this problem for images. We consider a two-phase approach. First, using normal examples, a convolutional autoencoder (CAE) is trained to extract a low-dimensional representation of the images. Here, we propose a novel architectural choice when designing the CAE, an Inception-like CAE. It combines convolutional filters of different kernel sizes and it uses a Global Average Pooling (GAP) operation to extract the representations from the CAE's bottleneck layer. Second, we employ a distance-based anomaly detector in the low-dimensional space of the learned representation for the images. However, instead of computing the exact distance, we compute an approximate distance using product quantization. This alleviates the high memory and prediction time costs of distance-based anomaly detectors. We compare our proposed approach to a number of baselines and state-of-the-art methods on four image datasets, and we find that our approach results in improved predictive performance.
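Summary: A two-phase approach to image anomaly detection. An Inception-like convolutional autoencoder, which mixes convolutional filters of different kernel sizes and applies Global Average Pooling at the bottleneck, is trained on normal examples to learn a low-dimensional representation. A distance-based detector then works in that space, using product quantization to approximate distances and cut memory and prediction-time costs. On four image datasets the approach improves predictive performance over several baselines and state-of-the-art methods.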

59. Customized Video QoE Estimation with Algorithm-Agnostic Transfer Learning
Selim Ickin, Markus Fiedler, Konstantinos Vandikas
Abstract: The development of QoE models by means of Machine Learning (ML) is challenging, amongst others due to small-size datasets, lack of diversity in user profiles in the source domain, and too much diversity in the target domains of QoE models. Furthermore, datasets can be hard to share between research entities, as the machine learning models and the collected user data from the user studies may be IPR- or GDPR-sensitive. This makes a decentralized learning-based framework appealing for sharing and aggregating learned knowledge in-between the local models that map the obtained metrics to the user QoE, such as Mean Opinion Scores (MOS). In this paper, we present a transfer learning-based ML model training approach, which allows decentralized local models to share generic indicators on MOS to learn a generic base model, and then customize the generic base model further using additional features that are unique to those specific localized (and potentially sensitive) QoE nodes. We show that the proposed approach is agnostic to specific ML algorithms, stacked upon each other, as it does not necessitate the collaborating localized nodes to run the same ML algorithm. Our reproducible results reveal the advantages of stacking various generic and specific models with corresponding weight factors. Moreover, we identify the optimal combination of algorithms and weight factors for the corresponding localized QoE nodes.
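Summary: Building ML-based QoE models is hard because datasets are small, source-domain user profiles lack diversity, target domains are too diverse, and data or models may be IPR- or GDPR-sensitive. The paper presents a transfer-learning approach in which decentralized local models share generic MOS indicators to learn a generic base model, which each localized QoE node then customizes with its own features. The approach does not require collaborating nodes to run the same ML algorithm. The reproducible results show the benefit of stacking generic and specific models with suitable weight factors and identify the optimal combination of algorithms and weights per node.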

60. Dynamic Spatiotemporal Graph Neural Network with Tensor Network
Chengcheng Jia, Bo Wu, Xiao-Ping Zhang
Abstract: Dynamic spatial graph construction is a challenge for graph neural networks (GNNs) on time series data. Although some adaptive graphs are conceivable, only a 2D graph is embedded in the network to reflect the current spatial relation, regardless of all the previous situations. In this work, we generate a spatial tensor graph (STG) to collect all the dynamic spatial relations, as well as a temporal tensor graph (TTG) to find the latent pattern along time at each node. These two tensor graphs share the same nodes and edges, which leads us to explore their entangled correlations with Projected Entangled Pair States (PEPS) to optimize the two graphs. We experimentally compare the accuracy and time cost with state-of-the-art GNN-based methods on public traffic datasets.
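Summary: Existing adaptive graphs in GNNs for time series embed only a 2D graph of the current spatial relation and ignore earlier states. The paper builds a spatial tensor graph (STG) that collects all dynamic spatial relations and a temporal tensor graph (TTG) that captures the latent pattern over time at each node. Because the two graphs share nodes and edges, their entangled correlations are explored with Projected Entangled Pair States (PEPS) to optimize both. Accuracy and time cost are compared with state-of-the-art GNN-based methods on public traffic datasets.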

61. Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding
Thierry Deruyttere, Guillem Collell, Marie-Francine Moens
Abstract: We propose a new spatial memory module and a spatial reasoner for the Visual Grounding (VG) task. The goal of this task is to find a certain object in an image based on a given textual query. Our work focuses on integrating the regions of a Region Proposal Network (RPN) into a new multi-step reasoning model which we have named a Multimodal Spatial Region Reasoner (MSRR). The introduced model uses the object regions from an RPN as initialization of a 2D spatial memory and then implements a multi-step reasoning process scoring each region according to the query, hence why we call it a multimodal reasoner. We evaluate this new model on challenging datasets and our experiments show that our model that jointly reasons over the object regions of the image and words of the query largely improves accuracy compared to current state-of-the-art models.
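Summary: The paper proposes a spatial memory module and a spatial reasoner for Visual Grounding, the task of finding an object in an image from a textual query. The Multimodal Spatial Region Reasoner (MSRR) initializes a 2D spatial memory with object regions from a Region Proposal Network and then scores each region against the query over several reasoning steps. Experiments on challenging datasets show that reasoning jointly over image regions and query words largely improves accuracy over current state-of-the-art models.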

62. Leveraging Frequency Analysis for Deep Fake Image Recognition
Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, Thorsten Holz
Abstract: Deep neural networks can generate images that are astonishingly realistic, so much so that it is often hard for humans to distinguish them from actual photos. These achievements have been largely made possible by Generative Adversarial Networks (GANs). While these deep fake images have been thoroughly investigated in the image domain (a classical approach from the area of image forensics), an analysis in the frequency domain has been missing so far. In this paper, we address this shortcoming and our results reveal that in frequency space, GAN-generated images exhibit severe artifacts that can be easily identified. We perform a comprehensive analysis, showing that these artifacts are consistent across different neural network architectures, data sets, and resolutions. In a further investigation, we demonstrate that these artifacts are caused by upsampling operations found in all current GAN architectures, indicating a structural and fundamental problem in the way images are generated via GANs. Based on this analysis, we demonstrate how the frequency representation can be used to identify deep fake images in an automated way, surpassing state-of-the-art methods.
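Summary: GAN-generated deep fake images have been studied thoroughly in the image domain but not in the frequency domain. The paper shows that in frequency space such images exhibit severe, easily identified artifacts that are consistent across network architectures, datasets, and resolutions. The artifacts are traced to the upsampling operations found in all current GAN architectures, which points to a structural problem in how GANs generate images. The frequency representation is then used to detect deep fake images automatically, surpassing state-of-the-art methods.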

63. Efficient and Robust Shape Correspondence via Sparsity-Enforced Quadratic Assignment
Rui Xiang, Rongjie Lai, Hongkai Zhao
Abstract: In this work, we introduce a novel local pairwise descriptor and then develop a simple, effective iterative method to solve the resulting quadratic assignment through sparsity control for shape correspondence between two approximate isometric surfaces. Our pairwise descriptor is based on the stiffness and mass matrix of finite element approximation of the Laplace-Beltrami differential operator, which is local in space, sparse to represent, and extremely easy to compute while containing global information. It allows us to deal with open surfaces, partial matching, and topological perturbations robustly. To solve the resulting quadratic assignment problem efficiently, the two key ideas of our iterative algorithm are: 1) select pairs with good (approximate) correspondence as anchor points, 2) solve a regularized quadratic assignment problem only in the neighborhood of selected anchor points through sparsity control. These two ingredients can improve and increase the number of anchor points quickly while reducing the computation cost in each quadratic assignment iteration significantly. With enough high-quality anchor points, one may use various pointwise global features with reference to these anchor points to further improve the dense shape correspondence. We use various experiments to show the efficiency, quality, and versatility of our method on large data sets, patches, and point clouds (without global meshes).
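Summary: The paper introduces a local pairwise descriptor and a simple iterative method that solves the resulting quadratic assignment problem through sparsity control, for shape correspondence between two approximately isometric surfaces. The descriptor is built from the stiffness and mass matrices of a finite element approximation of the Laplace-Beltrami operator. It is local, sparse, easy to compute, and still carries global information, which makes it robust to open surfaces, partial matching, and topological perturbations. The algorithm (1) selects pairs with good approximate correspondence as anchor points and (2) solves a regularized quadratic assignment problem only in the neighborhood of those anchors. Experiments cover large data sets, patches, and point clouds without global meshes.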

64. LANCE: efficient low-precision quantized Winograd convolution for neural networks based on graphics processing units
Guangli Li, Lei Liu, Xueying Wang, Xiu Ma, Xiaobing Feng
Abstract: Accelerating deep convolutional neural networks has become an active topic and sparked an interest in academia and industry. In this paper, we propose an efficient low-precision quantized Winograd convolution algorithm, called LANCE, which combines the advantages of fast convolution and quantization techniques. By embedding linear quantization operations into the Winograd-domain, the fast convolution can be performed efficiently under low-precision computation on graphics processing units. We test neural network models with LANCE on representative image classification datasets, including SVHN, CIFAR, and ImageNet. The experimental results show that our 8-bit quantized Winograd convolution improves the performance by up to 2.40x over the full-precision convolution with trivial accuracy loss.
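Summary: LANCE is a low-precision quantized Winograd convolution algorithm that combines fast convolution with quantization. Linear quantization operations are embedded in the Winograd domain so that the fast convolution runs efficiently under low-precision computation on graphics processing units. On SVHN, CIFAR, and ImageNet, the 8-bit quantized Winograd convolution improves performance by up to 2.40x over full-precision convolution with trivial accuracy loss.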
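
The abstract above mentions linear quantization to 8 bits. The sketch below shows one common form of it, symmetric linear quantization, on a plain list of values. It does not reproduce the Winograd-domain embedding that LANCE describes; the function names and sample values are illustrative assumptions.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization to signed integer codes.

    Returns (codes, scale) such that value is approximately code * scale.
    This is a generic scheme, assumed here for illustration only.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


# Hypothetical weights, not taken from the paper.
weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
```

With round-to-nearest, each restored value differs from the original by at most half of `scale`, which is the source of the small accuracy loss that low-precision schemes trade for speed.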

65. Detecting Deepfakes with Metric Learning
Akash Kumar, Arnav Bhavsar
Abstract: With the arrival of several face-swapping applications such as FaceApp, SnapChat, MixBooth, FaceBlender and many more, the authenticity of digital media content is hanging by a very loose thread. On social media platforms, videos are often widely circulated at a high compression factor. In this work, we analyze several deep learning approaches to deepfake classification in the high-compression scenario and demonstrate that a proposed approach based on metric learning can be very effective in performing such a classification. Using fewer frames per video to assess its realism, the metric learning approach with a triplet network architecture proves to be fruitful. It learns to enlarge the feature-space distance between the clusters of real and fake video embedding vectors. We validated our approaches on two datasets to analyze the behavior in different environments. We achieved a state-of-the-art AUC score of 99.2% on the Celeb-DF dataset and an accuracy of 90.71% on a highly compressed Neural Texture dataset. Our approach is especially helpful on social media platforms where data compression is inevitable.
Summary: The paper analyzes several deep learning approaches to deepfake classification under high compression, the setting typical of videos circulated on social media. A metric learning approach with a triplet network, which enlarges the feature-space distance between clusters of real and fake video embeddings, proves effective while using fewer frames per video. It achieves a state-of-the-art AUC of 99.2% on Celeb-DF and an accuracy of 90.71% on a highly compressed Neural Texture dataset.
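
The abstract above relies on a triplet network to push real and fake embeddings apart. A minimal sketch of the standard triplet margin loss on plain vectors follows. The Euclidean distance, the margin of 1.0, and the toy embeddings are assumptions for illustration; the abstract does not state the paper's exact loss or settings.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings (hypothetical): anchor and positive share a class.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])   # negative is far away
violated = triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0])    # negative is too close
```

Minimizing this loss over many triplets pulls same-class embeddings together and separates the two clusters, which matches the behavior the abstract describes.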

66. Photo-Realistic Video Prediction on Natural Videos of Largely Changing Frames
Osamu Shouno
Abstract: Recent advances in deep learning have significantly improved performance of video prediction. However, state-of-the-art methods still suffer from blurriness and distortions in their future predictions, especially when there are large motions between frames. To address these issues, we propose a deep residual network with the hierarchical architecture where each layer makes a prediction of future state at different spatial resolution, and these predictions of different layers are merged via top-down connections to generate future frames. We trained our model with adversarial and perceptual loss functions, and evaluated it on a natural video dataset captured by car-mounted cameras. Our model quantitatively outperforms state-of-the-art baselines in future frame prediction on video sequences of both largely and slightly changing frames. Furthermore, our model generates future frames with finer details and textures that are perceptually more realistic than the baselines, especially under fast camera motions.
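Summary: State-of-the-art video prediction still produces blurry and distorted frames, especially under large motion between frames. The paper proposes a deep residual network with a hierarchical architecture in which each layer predicts the future state at a different spatial resolution, and the predictions are merged through top-down connections to generate future frames. Trained with adversarial and perceptual losses and evaluated on natural video from car-mounted cameras, the model quantitatively outperforms state-of-the-art baselines on both largely and slightly changing sequences and yields perceptually more realistic details and textures, especially under fast camera motion.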

67. Unsupervised text line segmentation
Berat Kurar Barakat, Ahmad Droby, Rym Alasam, Boraq Madi, Irina Rabaev, Raed Shammes, Jihad El-Sana
Abstract: We present an unsupervised text line segmentation method that is inspired by the relative variance between text lines and spaces among text lines. Handwritten text line segmentation is important for the efficiency of further processing. A common method is to train a deep learning network for embedding the document image into an image of blob lines that are tracing the text lines. Previous methods learned such embedding in a supervised manner, requiring the annotation of many document images. This paper presents an unsupervised embedding of document image patches without a need for annotations. The main idea is that the number of foreground pixels over the text lines is relatively different from the number of foreground pixels over the spaces among text lines. Generating similar and different pairs relying on this principle definitely leads to outliers. However, as the results show, the outliers do not harm the convergence and the network learns to discriminate the text lines from the spaces between text lines. We experimented with a challenging Arabic handwritten text line segmentation dataset, VML-AHTE, and achieved a superior performance even over the supervised methods.
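Summary: The paper presents an unsupervised text line segmentation method that embeds document image patches without annotations. It relies on the observation that the number of foreground pixels over text lines differs from the number over the spaces between them. Similar and different patch pairs generated on this principle include outliers, but these do not harm convergence. On the challenging Arabic handwritten dataset VML-AHTE, the method outperforms even supervised methods.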

68. Foldover Features for Dynamic Object Behavior Description in Microscopic Videos
Xialin Li, Chen Li, Wenwei Zhao
Abstract: Behavior description is conducive to the analysis of tiny objects, similar objects, objects with weak visual information and objects with similar visual information, playing a fundamental role in the identification and classification of dynamic objects in microscopic videos. To this end, we propose foldover features to describe the behavior of dynamic objects. First, we generate foldover for each object in microscopic videos in X, Y and Z directions, respectively. Then, we extract foldover features from the X, Y and Z directions with statistical methods, respectively. Finally, we use four different classifiers to test the effectiveness of the proposed foldover features. In the experiment, we use a sperm microscopic video dataset to evaluate the proposed foldover features, including three types of 1374 sperms, and obtain the highest classification accuracy of 96.5%.
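Summary: The paper proposes foldover features to describe the behavior of dynamic objects in microscopic videos. A foldover is generated for each object in the X, Y, and Z directions, features are extracted from each direction with statistical methods, and four classifiers are used to test them. On a sperm microscopic video dataset with 1374 sperms of three types, the highest classification accuracy is 96.5%.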

69. Domain-Adaptive Few-Shot Learning
An Zhao, Mingyu Ding, Zhiwu Lu, Tao Xiang, Yulei Niu, Jiechao Guan, Ji-Rong Wen, Ping Luo
Abstract: Existing few-shot learning (FSL) methods make the implicit assumption that the few target class samples are from the same domain as the source class samples. However, in practice this assumption is often invalid -- the target classes could come from a different domain. This poses an additional challenge of domain adaptation (DA) with few training samples. In this paper, the problem of domain-adaptive few-shot learning (DA-FSL) is tackled, which requires solving FSL and DA in a unified framework. To this end, we propose a novel domain-adversarial prototypical network (DAPN) model. It is designed to address a specific challenge in DA-FSL: the DA objective means that the source and target data distributions need to be aligned, typically through a shared domain-adaptive feature embedding space; but the FSL objective dictates that the target domain per class distribution must be different from that of any source domain class, meaning aligning the distributions across domains may harm the FSL performance. How to achieve global domain distribution alignment whilst maintaining source/target per-class discriminativeness thus becomes the key. Our solution is to explicitly enhance the source/target per-class separation before domain-adaptive feature embedding learning in the DAPN, in order to alleviate the negative effect of domain alignment on FSL. Extensive experiments show that our DAPN outperforms the state-of-the-art FSL and DA models, as well as their naïve combinations. The code is available at this https URL.
Summary: Few-shot learning (FSL) methods assume that target-class samples come from the same domain as source-class samples, which often fails in practice. The paper tackles domain-adaptive few-shot learning (DA-FSL) with a domain-adversarial prototypical network (DAPN). Because aligning distributions across domains can harm per-class discrimination, DAPN explicitly enhances source/target per-class separation before learning the domain-adaptive feature embedding. DAPN outperforms state-of-the-art FSL and DA models and their naive combinations. The code is available at this https URL.

70. PT2PC: Learning to Generate 3D Point Cloud Shapes from Part Tree Conditions
Kaichun Mo, He Wang, Xinchen Yan, Leonidas J. Guibas
Abstract: 3D generative shape modeling is a fundamental research area in computer vision and interactive computer graphics, with many real-world applications. This paper investigates the novel problem of generating 3D shape point cloud geometry from a symbolic part tree representation. In order to learn such a conditional shape generation procedure in an end-to-end fashion, we propose a conditional GAN "part tree"-to-"point cloud" model (PT2PC) that disentangles the structural and geometric factors. The proposed model incorporates the part tree condition into the architecture design by passing messages top-down and bottom-up along the part tree hierarchy. Experimental results and user study demonstrate the strengths of our method in generating perceptually plausible and diverse 3D point clouds, given the part tree condition. We also propose a novel structural measure for evaluating if the generated shape point clouds satisfy the part tree conditions.
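Summary: The paper studies the generation of 3D point cloud geometry from a symbolic part tree. It proposes PT2PC, a conditional GAN that disentangles structural and geometric factors and incorporates the part tree condition by passing messages top-down and bottom-up along the tree hierarchy. Experiments and a user study show that it generates perceptually plausible and diverse point clouds, and a new structural measure evaluates whether generated shapes satisfy the part tree conditions.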

71. AQPDBJUT Dataset: Picture-Based PM2.5 Monitoring in the Campus of BJUT
Yonghui Zhang, Ke Gu, Zhifang Xia, Junfei Qiao
Abstract: Keeping students in good physical condition is imperative for their future health. In recent years, the continually growing concentration of Particulate Matter (PM) has done increasingly serious harm to student health. Hence, it is highly necessary to prevent and control PM concentrations on campus. As the basis of PM prevention and control, developing a good model for PM monitoring is extremely urgent and poses a big challenge. Prior work has found that photo-based methods can be used for PM monitoring. To verify the effectiveness of existing PM monitoring methods on campus, we establish a new dataset that includes 1,500 photos collected at the Beijing University of Technology. Experiments show that state-of-the-art methods are far from ideal for PM2.5 monitoring on campus.
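Summary: Rising Particulate Matter (PM) concentrations increasingly harm student health, so PM monitoring on campus is an urgent basis for prevention and control. Prior work has shown that photo-based methods can monitor PM. To test existing methods on campus, the authors build a dataset of 1,500 photos collected at the Beijing University of Technology. Experiments show that state-of-the-art methods are far from ideal for campus PM2.5 monitoring.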

72. Depth Potentiality-Aware Gated Attention Network for RGB-D Salient Object Detection
Zuyao Chen, Qingming Huang
Abstract: There are two main issues in RGB-D salient object detection: (1) how to effectively integrate the complementarity of the cross-modal RGB-D data; (2) how to prevent contamination from an unreliable depth map. In fact, these two problems are linked and intertwined, but previous methods tend to focus only on the first problem and ignore depth map quality, which may leave the model in a sub-optimal state. In this paper, we address these two issues synergistically in a holistic model, and propose a novel network named DPANet to explicitly model the potentiality of the depth map and effectively integrate the cross-modal complementarity. By introducing depth potentiality perception, the network can perceive the potentiality of depth information in a learning-based manner and guide the fusion of the two modalities to prevent contamination. The gated multi-modality attention module in the fusion process exploits the attention mechanism with a gate controller to capture long-range dependencies from a cross-modal perspective. Experimental comparisons with 15 state-of-the-art methods on 8 datasets demonstrate the validity of the proposed approach both quantitatively and qualitatively.
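Summary: RGB-D salient object detection must both integrate cross-modal complementarity and avoid contamination from unreliable depth maps, yet previous methods focus on the first issue only. DPANet addresses both: depth potentiality perception learns how useful the depth information is and guides the fusion of the two modalities, and a gated multi-modality attention module captures long-range dependencies from a cross-modal perspective. Comparisons with 15 state-of-the-art methods on 8 datasets support the approach quantitatively and qualitatively.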

73. Unsupervised Domain Adaptation via Structurally Regularized Deep Clustering
Hui Tang, Ke Chen, Kui Jia
Abstract: Unsupervised domain adaptation (UDA) is to make predictions for unlabeled data on a target domain, given labeled data on a source domain whose distribution shifts from the target one. Mainstream UDA methods learn aligned features between the two domains, such that a classifier trained on the source features can be readily applied to the target ones. However, such a transferring strategy has a potential risk of damaging the intrinsic discrimination of target data. To alleviate this risk, we are motivated by the assumption of structural domain similarity, and propose to directly uncover the intrinsic target discrimination via discriminative clustering of target data. We constrain the clustering solutions using structural source regularization that hinges on our assumed structural domain similarity. Technically, we use a flexible framework of deep network based discriminative clustering that minimizes the KL divergence between predictive label distribution of the network and an introduced auxiliary one; replacing the auxiliary distribution with that formed by ground-truth labels of source data implements the structural source regularization via a simple strategy of joint network training. We term our proposed method as Structurally Regularized Deep Clustering (SRDC), where we also enhance target discrimination with clustering of intermediate network features, and enhance structural regularization with soft selection of less divergent source examples. Careful ablation studies show the efficacy of our proposed SRDC. Notably, with no explicit domain alignment, SRDC outperforms all existing methods on three UDA benchmarks.
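Summary: Mainstream unsupervised domain adaptation (UDA) aligns features across domains, which risks damaging the intrinsic discrimination of target data. Assuming structural similarity between domains, SRDC instead uncovers target discrimination directly through discriminative clustering of target data, constrained by structural source regularization. It minimizes the KL divergence between the network's predictive label distribution and an auxiliary distribution; replacing the auxiliary distribution with one formed from source ground-truth labels implements the regularization through joint network training. With no explicit domain alignment, SRDC outperforms all existing methods on three UDA benchmarks.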
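
The abstract above builds its clustering objective on the KL divergence between a predictive label distribution and an auxiliary one. A minimal sketch of the KL divergence for two discrete distributions follows. The argument order, the function name, and the toy distributions are assumptions for illustration; the abstract does not specify which distribution comes first in the paper's objective.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same labels.

    Terms with p_i == 0 contribute nothing. The caller must ensure
    q_i > 0 wherever p_i > 0, otherwise the divergence is infinite.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical two-class distributions, not values from the paper.
auxiliary = [0.5, 0.5]
predicted = [0.9, 0.1]
divergence = kl_divergence(auxiliary, predicted)
```

The divergence is zero only when the two distributions match, so driving it down pulls the network's predictions toward the auxiliary targets.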

74. End-to-End Deep Diagnosis of X-ray Images
Kudaibergen Urinbayev, Yerassyl Orazbek, Yernur Nurambek, Almas Mirzakhmetov, Huseyin Atakan Varol
Abstract: In this work, we present an end-to-end deep learning framework for X-ray image diagnosis. As the first step, our system determines whether a submitted image is an X-ray or not. After it classifies the type of the X-ray, it runs the dedicated abnormality classification network. In this work, we only focus on chest X-rays for abnormality classification. However, the system can be extended to other X-ray types easily. Our deep learning classifiers are based on the DenseNet-121 architecture. The test set accuracies obtained for the 'X-ray or Not', 'X-ray Type Classification', and 'Chest Abnormality Classification' tasks are 0.987, 0.976, and 0.947, respectively, resulting in an end-to-end accuracy of 0.91. To achieve better results than the state of the art in 'Chest Abnormality Classification', we utilize the new RAdam optimizer. We also use Gradient-weighted Class Activation Mapping for visual explanation of the results. Our results show the feasibility of a generalized online projectional radiography diagnosis system.
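Summary: The paper presents an end-to-end deep learning framework for X-ray image diagnosis. The system first decides whether a submitted image is an X-ray, then classifies the X-ray type, and finally runs a dedicated abnormality classifier, currently for chest X-rays only. The DenseNet-121 classifiers reach test accuracies of 0.987, 0.976, and 0.947 on the three tasks, for an end-to-end accuracy of 0.91. The RAdam optimizer is used to improve chest abnormality classification, and Gradient-weighted Class Activation Mapping gives a visual explanation of the results.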
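
The abstract above reports three per-stage accuracies and an end-to-end accuracy of 0.91. One reading consistent with those figures, assumed here and not stated in the abstract, is that a sample is counted correct only if every stage of the cascade is correct and that stage errors are independent, so the stage accuracies multiply. The dictionary keys are illustrative labels.

```python
# Per-stage test accuracies quoted in the abstract.
stage_accuracy = {
    "xray_or_not": 0.987,
    "xray_type": 0.976,
    "chest_abnormality": 0.947,
}

# Assumption: stages fail independently and all must succeed.
end_to_end = 1.0
for accuracy in stage_accuracy.values():
    end_to_end *= accuracy

rounded = round(end_to_end, 2)
```

Under that assumption the product is about 0.912, which rounds to the reported 0.91.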

75. Deep convolutional embedding for digitized painting clustering
Giovanna Castellano, Gennaro Vessio
Abstract: Clustering artworks is difficult for several reasons. On the one hand, recognizing meaningful patterns in accordance with domain knowledge and visual perception is extremely hard. On the other hand, applying traditional clustering and feature reduction techniques to the high-dimensional pixel space can be ineffective. To address these issues, we propose a deep convolutional embedding model for clustering digital paintings, in which the task of mapping the raw input data to an abstract, latent space is optimized jointly with the task of finding a set of cluster centroids in this latent feature space. Quantitative and qualitative experimental results show the effectiveness of the proposed method. The model is also able to outperform other state-of-the-art deep clustering approaches to the same problem. The proposed method may be beneficial to several art-related tasks, particularly visual link retrieval and historical knowledge discovery in painting datasets.
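Summary: Clustering artworks is hard because meaningful patterns are difficult to recognize and traditional clustering and feature reduction work poorly in high-dimensional pixel space. The paper proposes a deep convolutional embedding model that jointly optimizes the mapping of raw input to a latent space and the search for cluster centroids in that space. Quantitative and qualitative results show its effectiveness, and it outperforms other state-of-the-art deep clustering approaches on the same problem. It may benefit art-related tasks such as visual link retrieval and historical knowledge discovery in painting datasets.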

76. Curriculum DeepSDF
Yueqi Duan, Haidong Zhu, He Wang, Li Yi, Ram Nevatia, Leonidas J. Guibas
Abstract: When learning to sketch, beginners start with simple and flexible shapes, and then gradually strive for more complex and accurate ones in the subsequent training sessions. In this paper, we design a "shape curriculum" for learning continuous Signed Distance Function (SDF) on shapes, namely Curriculum DeepSDF. Inspired by how humans learn, Curriculum DeepSDF organizes the learning task in ascending order of difficulty according to the following two criteria: surface accuracy and sample difficulty. The former considers stringency in supervising with ground truth, while the latter regards the weights of hard training samples near complex geometry and fine structure. More specifically, Curriculum DeepSDF learns to reconstruct coarse shapes at first, and then gradually increases the accuracy and focuses more on complex local details. Experimental results show that a carefully-designed curriculum leads to significantly better shape reconstructions with the same training data, training epochs and network architecture as DeepSDF. We believe that the application of shape curricula can benefit the training process of a wide variety of 3D shape representation learning methods.
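Summary: Curriculum DeepSDF is a shape curriculum for learning continuous Signed Distance Functions (SDFs). It orders the learning task by increasing difficulty along two criteria: surface accuracy (how strictly the ground truth supervises) and sample difficulty (the weight given to hard samples near complex geometry and fine structure). The network first learns coarse shapes and then gradually raises accuracy and focuses on local detail, giving significantly better reconstructions than DeepSDF with the same training data, training epochs, and network architecture.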

77. High Accuracy Face Geometry Capture using a Smartphone Video
Shubham Agrawal, Anuj Pahuja, Simon Lucey
Abstract: What's the most accurate 3D model of your face you can obtain while sitting at your desk? We attempt to answer this question in our work. High fidelity face reconstructions have so far been limited to either studio settings or through expensive 3D scanners. On the other hand, unconstrained reconstruction methods are typically limited by low-capacity models. Our method reconstructs accurate face geometry of a subject using a video shot from a smartphone in an unconstrained environment. Our approach takes advantage of recent advances in visual SLAM, keypoint detection, and object detection to improve accuracy and robustness. By not being constrained to a model subspace, our reconstructed meshes capture important details while being robust to noise and being topologically consistent. Our evaluations show that our method outperforms current single and multi-view baselines by a significant margin, both in terms of geometric accuracy and in capturing person-specific details important for making realistic looking models.
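Summary: High-fidelity face reconstruction has so far required studio settings or expensive 3D scanners, while unconstrained methods are limited by low-capacity models. This method reconstructs accurate face geometry from a smartphone video shot in an unconstrained environment, drawing on recent advances in visual SLAM, keypoint detection, and object detection. Because it is not constrained to a model subspace, the reconstructed meshes capture important details while staying robust to noise and topologically consistent. It outperforms current single- and multi-view baselines by a significant margin in geometric accuracy and in person-specific detail.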

78. CPR-GCN: Conditional Partial-Residual Graph Convolutional Network in Automated Anatomical Labeling of Coronary Arteries
Han Yang, Xingjian Zhen, Ying Chi, Lei Zhang, Xian-Sheng Hua
Abstract: Automated anatomical labeling plays a vital role in the diagnosis of coronary artery disease. The main challenge in this problem is the large individual variability inherent in human anatomy. Existing methods usually rely on the position information and the prior knowledge of the topology of the coronary artery tree, which may lead to unsatisfactory performance when the main branches are confusing. Motivated by the wide application of graph neural networks to structured data, we propose a conditional partial-residual graph convolutional network (CPR-GCN), which takes both position and CT image into consideration, since the CT image contains abundant information such as branch size and spanning direction. CPR-GCN has two major parts: a Partial-Residual GCN and a conditions extractor. The conditions extractor is a hybrid model containing a 3D CNN and an LSTM, which can extract 3D spatial image features along the branches. On the technical side, the Partial-Residual GCN takes the position features of the branches, with the 3D spatial image features as conditions, to predict the label for each branch. On the mathematical side, our approach brings the partial differential equation (PDE) into the graph modeling. A dataset with 511 subjects is collected from the clinic and annotated by two experts with a two-phase annotation process. According to five-fold cross-validation, our CPR-GCN yields 95.8% meanRecall, 95.4% meanPrecision and 0.955 meanF1, which outperforms state-of-the-art approaches.
Summary: CPR-GCN labels coronary artery branches using both branch position and CT image features. A conditions extractor (3D CNN plus LSTM) extracts 3D spatial image features along each branch, and a Partial-Residual GCN uses them as conditions on the position features to predict each branch's label. On a 511-subject clinical dataset with five-fold cross-validation it reaches 95.8% meanRecall, 95.4% meanPrecision, and 0.955 meanF1.

79. Quality Control of Neuron Reconstruction Based on Deep Learning
Donghuan Lu, Sujun Zhao, Peng Xie, Kai Ma, Lijuan Liu, Yefeng Zheng
Abstract: Neuron reconstruction is essential to generate an exquisite neuron connectivity map for understanding brain function. Despite the significant amount of effort that has been put into automatic reconstruction methods, manual tracing by well-trained human annotators is still necessary. To ensure the quality of reconstructed neurons and provide guidance for annotators to improve their efficiency, we propose a deep learning-based quality control method for neuron reconstruction in this paper. By formulating the quality control problem as a binary classification task on each single point, the proposed approach overcomes the technical difficulties resulting from the large image size and complex neuron morphology. Not only does it provide an evaluation of reconstruction quality, but it can also locate exactly where the wrong tracing begins. This work presents one of the first comprehensive studies of whole-brain-scale quality control of neuron reconstructions. Experiments using five-fold cross-validation on a large dataset demonstrate that the proposed approach can detect 74.7% of errors with only 1.4% false alerts.
Summary: The paper proposes a deep learning-based quality control method for neuron reconstruction, formulated as binary classification of each single point, which both evaluates reconstruction quality and locates where a wrong tracing begins. In five-fold cross-validation on a large dataset it detects 74.7% of errors with only 1.4% false alerts.

80. Detecting Lane and Road Markings at A Distance with Perspective Transformer Layers
Zhuoping Yu, Xiaozhou Ren, Yuyao Huang, Wei Tian, Junqiao Zhao
Abstract: Accurate detection of lane and road markings is a task of great importance for intelligent vehicles. In existing approaches, the detection accuracy often degrades with increasing distance. This is because distant lane and road markings occupy a small number of pixels in the image, and the scales of lane and road markings are inconsistent at various distances and perspectives. Inverse Perspective Mapping (IPM) can be used to eliminate the perspective distortion, but the inherent interpolation can lead to artifacts, especially around distant lane and road markings, and thus has a negative impact on the accuracy of lane marking detection and segmentation. To solve this problem, we adopt the Encoder-Decoder architecture in Fully Convolutional Networks and leverage the idea of Spatial Transformer Networks to introduce a novel semantic segmentation neural network. This approach decomposes the IPM process into multiple consecutive differentiable homographic transform layers, which are called "Perspective Transformer Layers". Furthermore, the interpolated feature map is refined by subsequent convolutional layers, thus reducing the artifacts and improving the accuracy. The effectiveness of the proposed method in lane marking detection is validated on two public datasets: TuSimple and ApolloScape.
Summary: Distant lane and road markings occupy few pixels, and IPM interpolation introduces artifacts around them. The proposed segmentation network decomposes IPM into multiple consecutive differentiable homographic transform layers ("Perspective Transformer Layers") and refines the interpolated feature maps with subsequent convolutional layers. It is validated on TuSimple and ApolloScape.
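The idea of splitting one perspective mapping into consecutive homographic steps rests on the fact that homographies compose by matrix multiplication. The sketch below shows only that fact on 2D points; the matrices `SCALE` and `TILT` are made-up illustrations, and the helper names are hypothetical. It does not reproduce the paper's layers, which operate on feature maps and are interleaved with convolutions.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply_homography(h, point):
    """Map a 2D point through a 3x3 homography (homogeneous divide included)."""
    x, y = point
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (xh / w, yh / w)


def compose(steps):
    """Collapse homographies applied first-to-last into a single matrix."""
    total = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for h in steps:
        total = matmul3(h, total)  # later steps multiply on the left
    return total


# Made-up example steps: a uniform scaling followed by a perspective tilt.
SCALE = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
TILT = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.5, 1.0]]

point = (1.0, 1.0)
step_by_step = apply_homography(TILT, apply_homography(SCALE, point))
all_at_once = apply_homography(compose([SCALE, TILT]), point)
```

Applying the steps one after another and applying their product give the same point, which is what lets a large warp be spread over several smaller ones.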

81. Stereo Endoscopic Image Super-Resolution Using Disparity-Constrained Parallel Attention
Tianyi Zhang, Yun Gu, Xiaolin Huang, Enmei Tu, Jie Yang
Abstract: With the popularity of stereo cameras in computer-assisted surgery techniques, a second viewpoint would provide additional information in surgery. However, how to effectively access and use stereo information for super-resolution (SR) is often a challenge. In this paper, we propose a disparity-constrained stereo super-resolution network (DCSSRnet) to simultaneously compute a super-resolved image in a stereo image pair. In particular, we incorporate a disparity-based constraint mechanism into the generation of SR images in a deep neural network framework with additional atrous parallax-attention modules. Experimental results on laparoscopic images demonstrate that the proposed framework outperforms current SR methods in both quantitative and qualitative evaluations. Our DCSSRnet provides a promising solution for enhancing the spatial resolution of stereo image pairs, which will be extremely beneficial for endoscopic surgery.
Summary: DCSSRnet is a disparity-constrained stereo super-resolution network that adds a disparity-based constraint mechanism and atrous parallax-attention modules to the generation of SR images. On laparoscopic images it outperforms current SR methods in both quantitative and qualitative evaluations.

82. Pose Augmentation: Class-agnostic Object Pose Transformation for Object Recognition
Yunhao Ge, Jiaping Zhao, Laurent Itti
Abstract: Object pose increases interclass object variance, which makes object recognition from 2D images harder. To render a classifier robust to pose variations, most deep neural networks try to eliminate the influence of pose by using large datasets with many poses for each class. Here, we propose a different approach: a class-agnostic object pose transformation network (OPT-Net) can transform an image along 3D yaw and pitch axes to synthesize additional poses continuously. Synthesized images lead to better training of an object classifier. We design a novel eliminate-add structure to explicitly disentangle pose from object identity: first eliminate pose information of the input image and then add target pose information (regularized as continuous variables) to synthesize any target pose. We trained OPT-Net on images of toy vehicles shot on a turntable from the iLab-20M dataset. After training on unbalanced discrete poses (5 classes with 6 poses per object instance, plus 5 classes with only 2 poses), we show that OPT-Net can synthesize balanced continuous new poses along yaw and pitch axes with high quality. Training a ResNet-18 classifier with original plus synthesized poses improves mAP accuracy by 9% over training on original poses only. Further, the pre-trained OPT-Net can generalize to new object classes, which we demonstrate on both iLab-20M and RGB-D. We also show that the learned features can generalize to ImageNet.
Summary: OPT-Net is a class-agnostic object pose transformation network that uses an eliminate-add structure to disentangle pose from object identity and synthesize continuous new poses along the yaw and pitch axes. Trained on unbalanced discrete poses from iLab-20M, its synthesized poses improve a ResNet-18 classifier's mAP by 9% over training on original poses only, and the pre-trained network generalizes to new object classes.

83. SAPIEN: A SimulAted Part-based Interactive ENvironment
Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, Hao Su
Abstract: Building home assistant robots has long been a pursuit for vision and robotics researchers. To achieve this task, a simulated environment with physically realistic simulation, sufficient articulated objects, and transferability to the real robot is indispensable. Existing environments achieve these requirements for robotics simulation with different levels of simplification and focus. We take one step further in constructing an environment that supports household tasks for training robot learning algorithms. Our work, SAPIEN, is a realistic and physics-rich simulated environment that hosts a large-scale set of articulated objects. SAPIEN enables various robotic vision and interaction tasks that require detailed part-level understanding. We evaluate state-of-the-art vision algorithms for part detection and motion attribute recognition, and demonstrate robotic interaction tasks using heuristic approaches and reinforcement learning algorithms. We hope that SAPIEN can open many research directions yet to be explored, including learning cognition through interaction, part motion discovery, and construction of robotics-ready simulated game environments.
Summary: SAPIEN is a realistic, physics-rich simulated environment hosting a large-scale set of articulated objects, built to support household tasks for robot learning. The authors evaluate state-of-the-art vision algorithms for part detection and motion attribute recognition, and demonstrate robotic interaction tasks with heuristic approaches and reinforcement learning.

84. Evaluating Salient Object Detection in Natural Images with Multiple Objects having Multi-level Saliency
Gökhan Yildirim, Debashis Sen, Mohan Kankanhalli, Sabine Süsstrunk
Abstract: Salient object detection is evaluated using binary ground truth with the labels being salient object class and background. In this paper, we corroborate based on three subjective experiments on a novel image dataset that objects in natural images are inherently perceived to have varying levels of importance. Our dataset, named SalMoN (saliency in multi-object natural images), has 588 images containing multiple objects. The subjective experiments performed record spontaneous attention and perception through eye fixation duration, point clicking and rectangle drawing. As object saliency in a multi-object image is inherently multi-level, we propose that salient object detection must be evaluated for the capability to detect all multi-level salient objects apart from the salient object class detection capability. For this purpose, we generate multi-level maps as ground truth corresponding to all the dataset images using the results of the subjective experiments, with the labels being multi-level salient objects and background. We then propose the use of mean absolute error, Kendall's rank correlation and average area under precision-recall curve to evaluate existing salient object detection methods on our multi-level saliency ground truth dataset. Approaches that represent saliency detection on images as local-global hierarchical processing of a graph perform well in our dataset.
Summary: Based on three subjective experiments (eye fixation duration, point clicking, and rectangle drawing) on the 588-image SalMoN dataset, the authors show that objects in natural images are perceived at varying levels of importance. They build multi-level ground truth maps and propose mean absolute error, Kendall's rank correlation, and average area under the precision-recall curve for evaluating salient object detection methods against them.
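Two of the proposed measures are easy to state concretely. The sketch below assumes per-object saliency levels given as plain lists and uses the tau-a form of Kendall's rank correlation, which ignores ties in the denominator; the abstract does not say which variant the authors use, so treat that choice and the function names as assumptions.

```python
def mean_absolute_error(pred, truth):
    """Average absolute difference between predicted and ground-truth values."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def kendall_tau_a(a, b):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(a)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1   # both rankings order the pair the same way
            elif sign < 0:
                discordant += 1   # the rankings disagree on this pair
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up example: ground-truth saliency levels and two predicted score lists.
truth_levels = [3, 2, 1]
same_order = [0.9, 0.5, 0.1]
reversed_order = [0.1, 0.5, 0.9]
```

A detector that ranks objects in the ground-truth order scores 1 on the rank correlation even if its absolute values are off, which is why the two measures are complementary.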

85. A Metric Learning Reality Check
Kevin Musgrave, Serge Belongie, Ser-Nam Lim
Abstract: Deep metric learning papers from the past four years have consistently claimed great advances in accuracy, often more than doubling the performance of decade-old methods. In this paper, we take a closer look at the field to see if this is actually true. We find flaws in the experimental setup of these papers, and propose a new way to evaluate metric learning algorithms. Finally, we present experimental results that show that the improvements over time have been marginal at best.
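Summary: The paper examines claims of large accuracy gains in deep metric learning over the past four years, finds flaws in the experimental setups, proposes a new way to evaluate metric learning algorithms, and reports that the improvements over time have been marginal at best.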

86. Reconstructing Sinus Anatomy from Endoscopic Video -- Towards a Radiation-free Approach for Quantitative Longitudinal Assessment
Xingtong Liu, Maia Stiber, Jindan Huang, Masaru Ishii, Gregory D. Hager, Russell H. Taylor, Mathias Unberath
Abstract: Reconstructing accurate 3D surface models of sinus anatomy directly from an endoscopic video is a promising avenue for cross-sectional and longitudinal analysis to better understand the relationship between sinus anatomy and surgical outcomes. We present a patient-specific, learning-based method for 3D reconstruction of sinus surface anatomy directly and only from endoscopic videos. We demonstrate the effectiveness and accuracy of our method on in and ex vivo data where we compare to sparse reconstructions from Structure from Motion, dense reconstruction from COLMAP, and ground truth anatomy from CT. Our textured reconstructions are watertight and enable measurement of clinically relevant parameters in good agreement with CT. The source code will be made publicly available upon publication.
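Summary: A patient-specific, learning-based method reconstructs 3D sinus surface anatomy directly and only from endoscopic video. It is compared on in vivo and ex vivo data against sparse Structure from Motion reconstructions, dense COLMAP reconstructions, and CT ground truth. The textured reconstructions are watertight and support measurements of clinically relevant parameters in good agreement with CT.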

87. Gaze-Sensing LEDs for Head Mounted Displays
Kaan Akşit, Jan Kautz, David Luebke
Abstract: We introduce a new gaze tracker for Head Mounted Displays (HMDs). We modify two off-the-shelf HMDs to be gaze-aware using Light Emitting Diodes (LEDs). Our key contribution is to exploit the sensing capability of LEDs to create a low-power gaze tracker for virtual reality (VR) applications. This yields a simple approach using minimal hardware to achieve good accuracy and low latency using light-weight supervised Gaussian Process Regression (GPR) running on a mobile device. With our hardware, we show that a GPR implementation based on a Minkowski distance measure outperforms the commonly used radial basis function-based support vector regression (SVR) without the need to precisely determine free parameters. We show that our gaze estimation method does not require complex dimension reduction techniques, feature extraction, or distortion corrections due to off-axis optical paths. We demonstrate two complete HMD prototypes with a sample eye-tracked application, and report on a series of subjective tests using our prototypes.
Summary: Two off-the-shelf HMDs are modified to be gaze-aware by exploiting the sensing capability of LEDs, giving a low-power gaze tracker for VR. Lightweight supervised Gaussian Process Regression with a Minkowski distance measure runs on a mobile device and outperforms radial basis function-based support vector regression without the need to precisely determine free parameters.
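For reference, the Minkowski distance of order p between two vectors is (sum_i |x_i - y_i|^p)^(1/p). The sketch below implements that definition, plus a purely hypothetical way of turning it into a similarity score between two sensor readings. How the authors actually use the distance inside their GPR is not stated in the abstract, so `minkowski_similarity` and its `length_scale` parameter are assumptions for illustration only.

```python
import math


def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    if p < 1:
        raise ValueError("p must be >= 1 for a valid distance")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def minkowski_similarity(x, y, p=1.0, length_scale=1.0):
    """Hypothetical similarity: decays exponentially with Minkowski distance.

    Not taken from the paper; whether this is a valid covariance function
    for a given p is not checked here.
    """
    return math.exp(-minkowski_distance(x, y, p) / length_scale)
```

With `p=1` and `p=2` the function reduces to the familiar Manhattan and Euclidean distances, so the order p acts as a single knob between them.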

88. Self-Supervised Contextual Bandits in Computer Vision
Aniket Anand Deshmukh, Abhimanu Kumar, Levi Boyles, Denis Charles, Eren Manavoglu, Urun Dogan
Abstract: Contextual bandits are a common problem faced by machine learning practitioners in domains as diverse as hypothesis testing and product recommendations. There have been many approaches to exploiting rich data representations for contextual bandit problems, with varying degrees of success. Self-supervised learning is a promising approach to find rich data representations without explicit labels. In a typical self-supervised learning scheme, the primary task is defined by the problem objective (e.g. clustering, classification, embedding generation etc.) and the secondary task is defined by the self-supervision objective (e.g. rotation prediction, words in neighborhood, colorization, etc.). In the usual self-supervision, we learn implicit labels from the training data for a secondary task. However, in the contextual bandit setting, we don't have the advantage of getting implicit labels due to lack of data in the initial phase of learning. We provide a novel approach to tackle this issue by combining a contextual bandit objective with a self-supervision objective. By augmenting contextual bandit learning with self-supervision we get a better cumulative reward. Our results on eight popular computer vision datasets show substantial gains in cumulative reward. We provide cases where the proposed scheme doesn't perform optimally and give alternative methods for better learning in these cases.
Summary: The authors combine a contextual bandit objective with a self-supervision objective to compensate for the lack of data in the initial phase of learning. The combination yields substantial gains in cumulative reward on eight popular computer vision datasets. The paper also reports cases where the scheme does not perform optimally and gives alternative methods for those cases.

89. Visual link retrieval and knowledge discovery in painting datasets
Giovanna Castellano, Eufemia Lella, Gennaro Vessio
Abstract: Visual arts have invaluable importance for the cultural, historic and economic growth of our societies. One of the building blocks of most analyses in visual arts is to find similarities among paintings of different artists and painting schools. To help art historians better understand visual arts, the present paper presents a framework for visual link retrieval and knowledge discovery in digital painting datasets. The proposed framework is based on a deep convolutional neural network to perform feature extraction and on a fully unsupervised nearest neighbor approach to retrieve visual links among digitized paintings. The fully unsupervised strategy makes the proposed method attractive, especially in those cases where metadata are scarce, unavailable, or difficult to collect. In addition, the proposed framework includes a graph analysis that makes it possible to study influences among artists, thus providing historical knowledge discovery.
Summary: The framework uses a deep convolutional neural network for feature extraction and a fully unsupervised nearest neighbor approach to retrieve visual links among digitized paintings, which is attractive when metadata are scarce, unavailable, or difficult to collect. A graph analysis on top makes it possible to study influences among artists.

90. Semi-supervised few-shot learning for medical image segmentation
Abdur R Feyjie, Reza Azad, Marco Pedersoli, Claude Kauffman, Ismail Ben Ayed, Jose Dolz
Abstract: Recent years have witnessed the great progress of deep neural networks on semantic segmentation, particularly in medical imaging. Nevertheless, training high-performing models requires large amounts of pixel-level ground truth masks, which can be prohibitive to obtain in the medical domain. Furthermore, training such models in a low-data regime greatly increases the risk of overfitting. Recent attempts to alleviate the need for large annotated datasets have developed training strategies under the few-shot learning paradigm, which addresses this shortcoming by learning a novel class from only a few labeled examples. In this context, a segmentation model is trained on episodes, which represent different segmentation problems, each of them trained with a very small labeled dataset. In this work, we propose a novel few-shot learning framework for semantic segmentation, where unlabeled images are also made available at each episode. To handle this new learning paradigm, we propose to include surrogate tasks that can leverage very powerful supervisory signals, derived from the data itself, for semantic feature learning. We show that including unlabeled surrogate tasks in the episodic training leads to more powerful feature representations, which ultimately results in better generalizability to unseen tasks. We demonstrate the efficiency of our method in the task of skin lesion segmentation in two publicly available datasets. Furthermore, our approach is general and model-agnostic, and can be combined with different deep architectures.
Summary: The paper proposes a few-shot learning framework for semantic segmentation in which unlabeled images are also available in each episode. Surrogate tasks derive supervisory signals from the data itself, leading to stronger feature representations and better generalization to unseen tasks. The method is model-agnostic and is evaluated on skin lesion segmentation in two publicly available datasets.

91. Detecting Pancreatic Adenocarcinoma in Multi-phase CT Scans via Alignment Ensemble
Yingda Xia, Qihang Yu, Wei Shen, Yuyin Zhou, Alan L. Yuille
Abstract: Pancreatic ductal adenocarcinoma (PDAC) is one of the most lethal cancers in the population. Screening for PDACs in dynamic contrast-enhanced CT is beneficial for early diagnosis. In this paper, we investigate the problem of automatically detecting PDACs in multi-phase (arterial and venous) CT scans. Multiple phases provide more information than a single phase, but they are unaligned and inhomogeneous in texture, making it difficult to combine cross-phase information seamlessly. We study multiple phase alignment strategies, i.e., early alignment (image registration), late alignment (high-level feature registration) and slow alignment (multi-level feature registration), and suggest an ensemble of all these alignments as a promising way to boost the performance of PDAC detection. We provide an extensive empirical evaluation on two PDAC datasets and show that the proposed alignment ensemble significantly outperforms previous state-of-the-art approaches, illustrating strong potential for clinical use.
Summary: For PDAC detection in multi-phase (arterial and venous) CT scans, the authors study early alignment (image registration), late alignment (high-level feature registration), and slow alignment (multi-level feature registration), and propose an ensemble of all three. On two PDAC datasets the ensemble significantly outperforms previous state-of-the-art approaches.

92. Synthesize then Compare: Detecting Failures and Anomalies for Semantic Segmentation
Yingda Xia, Yi Zhang, Fengze Liu, Wei Shen, Alan Yuille
Abstract: The ability to detect failures and anomalies is a fundamental requirement for building reliable systems for computer vision applications, especially safety-critical applications of semantic segmentation, such as autonomous driving and medical image analysis. In this paper, we systematically study failure and anomaly detection for semantic segmentation and propose a unified framework, consisting of two modules, to address these two related problems. The first module is an image synthesis module, which generates a synthesized image from a segmentation layout map, and the second is a comparison module, which computes the difference between the synthesized image and the input image. We validate our framework on three challenging datasets and improve on the state of the art by large margins, i.e., 6% AUPR-Error on Cityscapes, 10% DSC correlation on pancreatic tumor segmentation in MSD and 20% AUPR on StreetHazards anomaly segmentation.
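Summary: A unified framework for failure and anomaly detection in semantic segmentation pairs an image synthesis module, which generates an image from the segmentation layout map, with a comparison module, which computes the difference between the synthesized and input images. It improves on the state of the art by 6% AUPR-Error on Cityscapes, 10% DSC correlation on pancreatic tumor segmentation in MSD, and 20% AUPR on StreetHazards anomaly segmentation.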

93. Collaborative Distillation for Ultra-Resolution Universal Style Transfer
Huan Wang, Yijun Li, Yuehai Wang, Haoji Hu, Ming-Hsuan Yang
Abstract: Universal style transfer methods typically leverage rich representations from deep Convolutional Neural Network (CNN) models (e.g., VGG-19) pre-trained on large collections of images. Despite the effectiveness, its application is heavily constrained by the large model size to handle ultra-resolution images given limited memory. In this work, we present a new knowledge distillation method (named Collaborative Distillation) for encoder-decoder based neural style transfer to reduce the convolutional filters. The main idea is underpinned by a finding that the encoder-decoder pairs construct an exclusive collaborative relationship, which is regarded as a new kind of knowledge for style transfer models. Moreover, to overcome the feature size mismatch when applying collaborative distillation, a linear embedding loss is introduced to drive the student network to learn a linear embedding of the teacher's features. Extensive experiments show the effectiveness of our method when applied to different universal style transfer approaches (WCT and AdaIN), even if the model size is reduced by 15.5 times. Especially, on WCT with the compressed models, we achieve ultra-resolution (over 40 megapixels) universal style transfer on a 12GB GPU for the first time. Further experiments on optimization-based stylization scheme show the generality of our algorithm on different stylization paradigms. Our code and trained models are available at this https URL.
Summary: Collaborative Distillation is a knowledge distillation method for encoder-decoder based neural style transfer that reduces the convolutional filters, with a linear embedding loss that drives the student network to learn a linear embedding of the teacher's features. It remains effective on WCT and AdaIN with a model 15.5 times smaller, and enables ultra-resolution (over 40 megapixels) universal style transfer on a 12GB GPU.
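The feature size mismatch mentioned in the abstract arises because the compressed student has fewer channels than the teacher. A minimal sketch of a linear embedding loss follows, assuming a squared-error penalty between the teacher feature and a linear map of the student feature. The exact form used by the authors is not given in the abstract, and in practice the matrix `weight` would be learned rather than fixed as it is here.

```python
def linear_embed(student_feat, weight):
    """Map a student feature vector into the teacher's feature dimension.

    `weight` has one row per teacher dimension and one column per
    student dimension.
    """
    return [sum(w * s for w, s in zip(row, student_feat)) for row in weight]


def linear_embedding_loss(student_feat, teacher_feat, weight):
    """Sum of squared differences between teacher and embedded student features
    (an assumed form, chosen for illustration)."""
    embedded = linear_embed(student_feat, weight)
    return sum((e - t) ** 2 for e, t in zip(embedded, teacher_feat))


# Made-up example: a 2-dimensional student feature and a 3-dimensional teacher.
weight = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0]]
student = [1.0, 2.0]
```

The loss is zero exactly when the teacher feature can be written as the chosen linear map of the student feature, so minimizing it pushes the student toward features from which the teacher's can be linearly recovered.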

94. STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
Ali Athar, Sabarinath Mahadevan, Aljoša Ošep, Laura Leal-Taixé, Bastian Leibe
Abstract: Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames, and then associate these detections over time. Hence, these methods are often non-end-to-end trainable and highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings which are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks.
Summary: STEm-Seg models a video clip as a single 3D spatio-temporal volume and segments and tracks instances across space and time in a single stage. Spatio-temporal embeddings are trained to cluster the pixels of each object instance over the whole clip, using novel mixing functions and a single-stage, proposal-free network trained end-to-end. The method achieves state-of-the-art results across multiple datasets and tasks.

95. Volumetric parcellation of the right ventricle for regional geometric and functional assessment
Gabriel Bernardino, Amir Hodzic, Helene Langet, Mathieu De Craene, Miguel Angel González Ballester, Eric Saloux, Bart Bijnens
Abstract: In clinical practice, assessment of the right ventricle (RV) is primarily done through its global volume, given that it is a standardised measurement and has good reproducibility in 3D modalities such as MRI and 3D echocardiography. However, many illnesses produce regional changes, and therefore a local analysis could provide a better tool for understanding and diagnosing them. Current regional clinical measurements are 2D linear dimensions and suffer from low reproducibility due to the difficulty of identifying landmarks in the RV, especially in echocardiographic images due to their noise and artefacts. We propose an automatic method for parcellating the RV cavity and computing regional volumes and ejection fractions in three regions: apex, inlet and outflow. We tested the reproducibility in 3D echocardiographic images. We also present a synthetic mesh-deformation method to generate a ground truth dataset for validating the ability of the method to capture different types of remodelling. Results showed an acceptable intra-observer reproducibility (<10%) but a higher inter-observer one (>10%). The synthetic dataset made it possible to establish that the method was capable of assessing global dilatations and local dilatations in the circumferential direction, but not longitudinal elongations.
Summary: An automatic method parcellates the right ventricle cavity into apex, inlet, and outflow regions and computes regional volumes and ejection fractions. On 3D echocardiographic images, intra-observer reproducibility was acceptable (<10%) while the inter-observer figure was higher (>10%). On a synthetic mesh-deformation dataset the method captured global dilatations and local circumferential dilatations, but not longitudinal elongations.
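Once the cavity is split into regions, a regional ejection fraction follows from the standard definition EF = (EDV - ESV) / EDV, expressed as a percentage, where EDV and ESV are the end-diastolic and end-systolic volumes. The sketch below applies that definition per region. The volumes are made-up numbers and the function names are hypothetical; the parcellation itself, which is the paper's contribution, is not shown.

```python
def ejection_fraction(edv, esv):
    """Ejection fraction in percent from end-diastolic and end-systolic volume."""
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv - esv) / edv


def regional_ejection_fractions(ed_volumes, es_volumes):
    """Apply the definition to each region (keys must match in both dicts)."""
    return {region: ejection_fraction(ed_volumes[region], es_volumes[region])
            for region in ed_volumes}


# Made-up regional volumes in millilitres, for illustration only.
ed = {"apex": 20.0, "inlet": 50.0, "outflow": 30.0}
es = {"apex": 10.0, "inlet": 20.0, "outflow": 15.0}
regional = regional_ejection_fractions(ed, es)
global_ef = ejection_fraction(sum(ed.values()), sum(es.values()))
```

In this toy case the inlet region has a higher ejection fraction than the global value, the kind of regional difference a single global volume would hide.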

96. Oral-3D: Reconstructing the 3D Bone Structure of Oral Cavity from 2D Panoramic X-ray
Weinan Song, Yuan Liang, Kun Wang, Lei He
Abstract: Panoramic X-ray and Cone Beam Computed Tomography (CBCT) are two of the most general imaging methods in digital dentistry. While CBCT can provide higher-dimension information, the panoramic X-ray has the advantages of lower radiation dose and cost. Consequently, generating 3D information of bony tissues from the X-ray that can reflect dental diseases is of great interest. This technique can be even more helpful for developing areas where CBCT is not always available due to the lack of screening machines or high screening cost. In this paper, we present Oral-3D to reconstruct the bone structure of the oral cavity from a single panoramic X-ray image by taking advantage of some prior knowledge of oral structure, which conventionally can only be obtained by a 3D imaging method like CBCT. Specifically, we first train a generative network to back-project the 2D X-ray image into 3D space, then restore the bone structure by registering the generated 3D image with the prior shape of the dental arch. Notably, Oral-3D can restore both the density of bony tissues and the curved mandible surface. Experimental results show that our framework can reconstruct the 3D structure with significantly high quality. To the best of our knowledge, this is the first work that explores 3D reconstruction from a 2D image in dental health.
摘要：全景X射线和锥束CT（CBCT）是两个在数字牙科最一般的成像方法。虽然CBCT可以提供更高的维度信息，全景X射线具有较低的辐射剂量和成本的优点。因此，由X射线能反映牙齿疾病产生骨组织的3D信息是非常令人感兴趣的。在CBCT并不总是可用由于缺乏筛选机或高筛选成本的这种技术可以为发展中地区更是有帮助的。在本文中，我们本\ textit {口服-3D}通过在口腔结构，这通常只能由3D成像来获得采取的一些现有知识优点重建从单个全景X射线图像口腔的骨结构方法像CBCT。具体地讲，我们首先训练生成网络到背面投射二维X射线图像为3D空间中，然后通过与牙弓的现有形状登记所生成的3D图像恢复骨结构。应注意，\ textit {口服-3D}可以恢复骨组织两者的密度和弯曲表面下颌骨。实验结果表明，我们的框架可以显著高质量的重建三维结构。据我们所知，这是从牙齿健康2D图像探讨三维重建的第一部作品。

97. A Content Transformation Block For Image Style Transfer [PDF]
Dmytro Kotovenko, Artsiom Sanakoyeu, Pingchuan Ma, Sabine Lang, Björn Ommer
Abstract: Style transfer has recently received a lot of attention, since it allows the study of fundamental challenges in image understanding and synthesis. Recent work has significantly improved the representation of color and texture, computational speed, and image resolution. The explicit transformation of image content has, however, been mostly neglected: while artistic style affects formal characteristics of an image, such as color, shape or texture, it also deforms, adds or removes content details. This paper explicitly focuses on a content- and style-aware stylization of a content image. Therefore, we introduce a content transformation module between the encoder and decoder. Moreover, we utilize similar content appearing in photographs and style samples to learn how style alters content details, and we generalize this to other class details. Additionally, this work presents a novel normalization layer critical for high-resolution image synthesis. The robustness and speed of our model enable video stylization in real time and high definition. We perform extensive qualitative and quantitative evaluations to demonstrate the validity of our approach.
摘要：风格转移最近收到了很多的关注，因为它允许研究图像理解和综合根本性的挑战。最近的工作显著改善的颜色和纹理，并计算速度和图像分辨率的表示。图像内容的明确的转型，然而却被忽略大部分：虽然艺术风格影响图像的形式特征，如颜色，形状或纹理，它也变形，添加或删除内容的详细信息。本文着重明确上的内容图像的内容和风格的感知风格化。因此，我们引入编码器和解码器之间的内容变换模块。此外，我们利用出现在照片和风格样品的风格会改变内容的细节了解如何相似的内容，我们推广为其他类的细节。此外，此工作提出了一种归一化层为高分辨率图像合成的关键。我们的模型的鲁棒性和速度实现实时和高清视频的风格化。我们进行大量的定性和定量评估，以证明我们的方法的有效性。

98. Adversarial Texture Optimization from RGB-D Scans [PDF]
Jingwei Huang, Justus Thies, Angela Dai, Abhijit Kundu, Chiyu Max Jiang, Leonidas Guibas, Matthias Nießner, Thomas Funkhouser
Abstract: Realistic color texture generation is an important step in RGB-D surface reconstruction, but remains challenging in practice due to inaccuracies in reconstructed geometry, misaligned camera poses, and view-dependent imaging artifacts. In this work, we present a novel approach for color texture generation using a conditional adversarial loss obtained from weakly-supervised views. Specifically, we propose an approach to produce photorealistic textures for approximate surfaces, even from misaligned images, by learning an objective function that is robust to these errors. The key idea of our approach is to learn a patch-based conditional discriminator which guides the texture optimization to be tolerant to misalignments. Our discriminator takes a synthesized view and a real image, and evaluates whether the synthesized one is realistic, under a broadened definition of realism. We train the discriminator by providing, as 'real' examples, pairs of input views and their misaligned versions, so that the learned adversarial loss will tolerate errors from the scans. Experiments on synthetic and real data under quantitative or qualitative evaluation demonstrate the advantage of our approach over the state of the art. Our code is publicly available with a video demonstration.
摘要：逼真的色彩纹理生成是在RGB-d表面重建的一个重要步骤，但由于在重构的几何不准确，未对准照相机姿势，和视点相关成像伪影保持在实践挑战。在这项工作中，我们提出使用从弱监督的观点而得到的条件对抗性损失色彩纹理生成一种新的方法。具体来说，我们提出了一个方法来产生逼真的纹理近似的表面，甚至是错位的图像，通过学习的目标函数是稳健的这些错误。我们的方法的核心思想是要学会基于块拼贴的条件鉴别器引导质地优化海纳失调。我们鉴别需要合成的视点与真实的图像，并评估合成一个是现实的，现实主义的扩大定义下。使学习对抗损失将从扫描容忍的错误 - 我们提供的'输入看法和他们的错位版本真正的”实例对培养鉴别。对在定量或定性评价模型和实际数据实验表明相比，最先进的我们的方法的优点。我们的代码是公开的视频演示。

99. Overinterpretation reveals image classification model pathologies [PDF]
Brandon Carter, Siddhartha Jain, Jonas Mueller, David Gifford
Abstract: Image classifiers are typically scored on their test set accuracy, but high accuracy can mask a subtle type of model failure. We find that high-scoring convolutional neural networks (CNNs) exhibit troubling pathologies that allow them to display high accuracy even in the absence of semantically salient features. When a model provides a high-confidence decision without salient supporting input features, we say that the classifier has overinterpreted its input, finding too much class evidence in patterns that appear nonsensical to humans. Here, we demonstrate that state-of-the-art neural networks for CIFAR-10 and ImageNet suffer from overinterpretation, and find that CIFAR-10 trained models make confident predictions even when 95% of an input image has been masked and humans are unable to discern salient features in the remaining pixel subset. Although these patterns portend potential model fragility in real-world deployment, they are in fact valid statistical patterns of the image classification benchmark that alone suffice to attain high test accuracy. We find that ensembling strategies can help mitigate model overinterpretation, and that classifiers which rely on more semantically meaningful features can improve accuracy over both the test set and out-of-distribution images from a different source than the training data.
摘要：图像分类器通常得分他们的测试集的精度，但是精度高可掩盖了微妙的类型的模型故障。我们发现，高得分卷积神经网络（CNN）表现出令人不安的疾病，让他们即使在没有语义显着特征显示精度高。当一个模型提供了一个高可信度的决定没有显着的支持输入功能我们说的分类已经过分解释它的输入，发现在出现无意义人类的模式太多类的证据。在这里，我们展示的艺术神经网络的这个状态CIFAR-10和ImageNet从过度阐释受苦，找CIFAR-10训练的模型，自信地预测，即使在输入图像的95％已经被屏蔽，人类是无法辨别的显着特征在剩余的像素子集。尽管这些模式在实际部署预示着潜在的模型脆弱性，它们实际上是图像分类基准的有效统计模式，仅仅足以达到测试精度高。我们发现，ensembling策略可以帮助减轻模型过度阐释，并依靠更多的语义上有意义的功能分类可以提高准确度都从不同的来源除训练数据的测试集和外的分布图像。
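
To make the "95% of an input image has been masked" setting in the abstract above concrete, here is a minimal sketch that keeps only 5% of the pixels of a 32x32 image (the CIFAR-10 resolution) and blanks the rest. It is an illustration under stated assumptions: the kept pixels are chosen at random, which is a stand-in for the paper's own pixel-selection procedure (not described in the abstract), and `mask_pixels` is a hypothetical helper name.

```python
import random


def mask_pixels(image, keep_fraction, fill=0, seed=0):
    """Keep a random subset of pixels and replace all others with `fill`.

    `image` is a list of rows (a 2D grid of pixel values). Random selection
    is an assumption made for illustration only.
    """
    height, width = len(image), len(image[0])
    coords = [(r, c) for r in range(height) for c in range(width)]
    n_keep = round(keep_fraction * len(coords))
    kept = set(random.Random(seed).sample(coords, n_keep))
    masked = [
        [image[r][c] if (r, c) in kept else fill for c in range(width)]
        for r in range(height)
    ]
    return masked, kept


# A dummy 32x32 single-channel image filled with ones.
image = [[1] * 32 for _ in range(32)]
masked, kept = mask_pixels(image, keep_fraction=0.05)
```

Of the 1024 pixels, only 51 survive. The abstract's claim is that a classifier can remain confident on such a sparse subset even though a human cannot recognise the object from it.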

100. Brain tumor segmentation with missing modalities via latent multi-source correlation representation [PDF]
Tongxue Zhou, Stephane Canu, Pierre Vera, Su Ruan
Abstract: Multimodal MR images can provide complementary information for accurate brain tumor segmentation. However, imaging modalities are commonly missing in clinical practice. Since there is a strong correlation between the modalities, a novel correlation representation block is proposed to discover the latent multi-source correlation. Thanks to the obtained correlation representation, the segmentation becomes more robust when modalities are missing. The model parameter estimation module first maps the individual representation produced by each encoder to obtain independent parameters; then, under these parameters, the correlation expression module transforms all the individual representations to form a latent multi-source correlation representation. Finally, the correlation representations across modalities are fused via an attention mechanism into a shared representation to emphasize the most important features for segmentation. We evaluate our model on the BraTS 2018 datasets, where it outperforms the current state-of-the-art method and produces robust results when one or more modalities are missing.
摘要：多模式MR图像可以提供用于精确脑肿瘤分割的补充信息。然而，这是常见的有在临床实践中缺少的成像方式。因为它存在多模态之间的强相关性，一种新的相关性表示块拟特别发现潜多源的相关性。由于所获得的相关表示，分割变得缺少模态的情况下更坚固。该模型参数估计模块第一映射由每个编码器以获得独立的参数，然后所产生的个体表示，这些参数下，相关性表达模块变换所有单个的表示，以形成潜多源相关的表示。最后，跨模态的相关表述是通过关注机制融合成一个共同的表示，强调最重要的功能进行分割。我们评估我们的臭小子型号2018点的数据集，它优于当前国家的最先进的方法，当一个或多个模态缺失产生稳定的结果。

101. DRST: Deep Residual Shearlet Transform for Densely Sampled Light Field Reconstruction [PDF]
Yuan Gao, Robert Bregovic, Reinhard Koch, Atanas Gotchev
Abstract: The Image-Based Rendering (IBR) approach using the Shearlet Transform (ST) is one of the most effective methods for Densely-Sampled Light Field (DSLF) reconstruction. ST-based DSLF reconstruction typically relies on an iterative thresholding algorithm for Epipolar-Plane Image (EPI) sparse regularization in the shearlet domain, involving dozens of transformations between the image domain and the shearlet domain, which are in general time-consuming. To overcome this limitation, a novel learning-based ST approach, referred to as Deep Residual Shearlet Transform (DRST), is proposed in this paper. Specifically, for an input sparsely-sampled EPI, DRST employs a deep fully Convolutional Neural Network (CNN) to predict the residuals of the shearlet coefficients in the shearlet domain in order to reconstruct a densely-sampled EPI in the image domain. The DRST network is trained on synthetic Sparsely-Sampled Light Field (SSLF) data only, by leveraging elaborately designed masks. Experimental results on three challenging real-world light field evaluation datasets with varying moderate disparity ranges (8 - 16 pixels) demonstrate the superiority of the proposed learning-based DRST approach over the non-learning-based ST method for DSLF reconstruction. Moreover, DRST provides at least a 2.4x speedup over ST.
摘要：使用剪切波变换（ST）的基于图像的渲染（IBR）的方法是最有效的方法为密集采样的光场（DSLF）重建一个。基于ST-DSLF重建通常依赖于在剪切波域核线平面图像（EPI）稀疏正则化的迭代阈值算法，涉及数十个图像域和剪切波域之间的变换，这是一般费时的。为了克服这个限制，一种新颖的基于学习的方法ST，称为深残余剪切波变换（DRST），在本文中提出了。具体地，对于输入稀疏采样的EPI，DRST采用深充分卷积神经网络（CNN），以预测在剪切波域中的剪切波系数的残差，以便在图像域中重构密集采样的EPI。所述DRST网络仅通过利用精心设计的掩模的合成训练稀疏采样的光场（SSLF）数据。在三个实验结果具有不同视差中等范围（8 - 16个像素）具有挑战性的真实世界的光场数据集的评价验证了基于学习的方法DRST以上为DSLF重建非基于学习-ST方法的优越性。此外，DRST提供了2.4倍的加速超过ST，至少。

102. Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives [PDF]
Oliver Groth, Chia-Man Hung, Andrea Vedaldi, Ingmar Posner
Abstract: Visuomotor control (VMC) is an effective means of achieving basic manipulation tasks such as pushing or pick-and-place from raw images. Conditioning VMC on desired goal states is a promising way of achieving versatile skill primitives. However, common conditioning schemes rely either on task-specific fine-tuning (e.g. using meta-learning) or on sampling approaches using a forward model of scene dynamics (i.e. model-predictive control), leaving deployability and planning horizon severely limited. In this paper we propose a conditioning scheme which avoids these pitfalls by learning the controller and its conditioning in an end-to-end manner. Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion and the distance to a given target observation. In contrast to related works, this enables our approach to efficiently perform complex pushing and pick-and-place tasks from raw image observations without predefined control primitives. We report significant improvements in task success over a representative model-predictive controller and also demonstrate our model's generalisation capabilities in challenging, unseen tasks handling unfamiliar objects.
摘要：视觉运动控制（VMC）是实现基本操作的任务，例如推动或拾取和放置从原始图像的一种有效手段。空调VMC期望的目标状态是实现通用的技能元的有希望的途径。然而，常见的调理方案要么依靠任务的具体微调（例如，使用元学习）或采样使用现场动力学即模型预测控制，使可部署和规划水平严重限制的正向模型方法。在本文中，我们提出一种通过学习结束到终端的方式向控制器及其调理避免这些陷阱调理方案。我们的模型预测直接基于机器人运动的动态图像表示和距离给定的目标观测复杂的动作序列。相较于相关工作，这使我们的方法能够有效地执行复杂的推动和原始图像的观察拾放任务，而无需预先控制原语。我们报道了有代表性的模型预测控制器在任务成功显著的改善，也证明具有挑战性的，看不见的任务处理不熟悉的对象我们的模型的泛化能力。

103. Addressing the Memory Bottleneck in AI Model Training [PDF]
David Ojika, Bhavesh Patel, G. Anthony Reina, Trent Boyer, Chad Martin, Prashant Shah
Abstract: Using medical imaging as a case study, we demonstrate how Intel-optimized TensorFlow on an x86-based server equipped with 2nd Generation Intel Xeon Scalable Processors and large system memory allows for the training of memory-intensive AI/deep-learning models in a scale-up server configuration. We believe our work represents the first training of a deep neural network with a large memory footprint (~1 TB) on a single-node server. We recommend this configuration to scientists and researchers who wish to develop large, state-of-the-art AI models but are currently limited by memory.
摘要：利用医学影像作为案例研究，我们展示英特尔如何优化TensorFlow配备了第二代智能英特尔®至强®可扩展处理器与大系统内存在基于x86的服务器上允许的内存密集型AI /深学习模型在训练规模扩大的服务器配置。我们相信，我们的工作表示具有大容量内存（约1个TB）单节点服务器上的深层神经网络的第一次训练。我们建议使用此配置，谁希望发展大，国家的最先进的AI模型，但目前受内存限制的科学家和研究人员。

104. RADIOGAN: Deep Convolutional Conditional Generative adversarial Network To Generate PET Images [PDF]
Amine Amyar, Su Ruan, Pierre Vera, Pierre Decazes, Romain Modzelewski
Abstract: One of the main challenges in medical imaging is the lack of data. Classical data augmentation methods have proven useful but remain limited due to the huge variation in images. Using generative adversarial networks (GANs) is a promising way to address this problem; however, it is challenging to train one model to generate different classes of lesions. In this paper, we propose a deep convolutional conditional generative adversarial network to generate maximum intensity projection (MIP) positron emission tomography (PET) images, which are 2D images that represent a 3D volume for fast interpretation, according to different lesions or non-lesion (normal). The advantage of our proposed method is that a single model is capable of generating different classes of lesions when trained on a small sample size for each class of lesion, and it shows very promising results. In addition, we show that a walk through the latent space can be used as a tool to evaluate the generated images.
摘要：一个在医学成像中最挑战之一是缺乏数据。事实证明，传统的数据增强方法由于图像中的巨大变化是有用的，但仍然受到限制。使用生成对抗网络（GAN）是解决这一问题的一个希望的办法，但是，它是具有挑战性的训练一个模型来生成不同类型的病变。在本文中，我们提出了一个深卷积条件生成对抗网络生成MIP正电子发射断层图像（PET），其是表示用于解释快速的3D体积，可根据不同的病变或非病变（正常）的2D图像。我们提出的方法的优点在于一个模型，其能够产生不同类的病变上训练为每个类病变的一个小样本大小的，并示出一个非常有前途的结果的。此外，我们表明，通过潜在空间散步可以作为评估产生的图像的工具。

105. Backdooring and Poisoning Neural Networks with Image-Scaling Attacks [PDF]
Erwin Quiring, Konrad Rieck
Abstract: Backdoors and poisoning attacks are a major threat to the security of machine-learning and vision systems. Often, however, these attacks leave visible artifacts in the images that can be visually detected and weaken the efficacy of the attacks. In this paper, we propose a novel strategy for hiding backdoor and poisoning attacks. Our approach builds on a recent class of attacks against image scaling. These attacks enable manipulating images such that they change their content when scaled to a specific resolution. By combining poisoning and image-scaling attacks, we can conceal the trigger of backdoors as well as hide the overlays of clean-label poisoning. Furthermore, we consider the detection of image-scaling attacks and derive an adaptive attack. In an empirical evaluation, we demonstrate the effectiveness of our strategy. First, we show that backdoors and poisoning work equally well when combined with image-scaling attacks. Second, we demonstrate that current detection defenses against image-scaling attacks are insufficient to uncover our manipulations. Overall, our work provides a novel means for hiding traces of manipulations, being applicable to different poisoning approaches.
摘要：后门和中毒攻击是机器学习和视觉系统的安全构成重大威胁。然而，通常这些攻击留下可以肉眼观察到，削弱的攻击效果的图像可见的假象。在本文中，我们提出了隐藏后门和病毒攻击的新策略。我们的方法建立在最近的类的对图像缩放攻击。这些攻击能够处理图像，这样当缩放到特定分辨率他们改变的内容。通过结合中毒和图像缩放攻击，我们可以隐藏后门的触发以及隐藏清洁标签中毒的覆盖。此外，我们考虑的图像缩放攻击的检测，并得出自适应攻击。在实证分析中，我们证明了我们策略的有效性。首先，我们表明，后门程序和中毒工作同样在与图像缩放的攻击组合。其次，我们证明针对图像缩放攻击的电流检测防御不足以揭开我们的操作。总体而言，我们的工作提供了新的手段隐藏操作痕迹，适用于不同的中毒途径。

106. HyNNA: Improved Performance for Neuromorphic Vision Sensor based Surveillance using Hybrid Neural Network Architecture [PDF]
Deepak Singla, Soham Chatterjee, Lavanya Ramapantulu, Andres Ussa, Bharath Ramesh, Arindam Basu
Abstract: Applications in the Internet of Video Things (IoVT) domain have very tight constraints with respect to power and area. While neuromorphic vision sensors (NVS) may offer advantages over traditional imagers in this domain, existing NVS systems either do not meet the power constraints or have not demonstrated end-to-end system performance. To address this, we improve on a recently proposed hybrid event-frame approach by using morphological image processing algorithms for region proposal, and address the low-power requirement for object detection and classification by exploring various convolutional neural network (CNN) architectures. Specifically, we compare the results obtained from our object detection framework against the state-of-the-art low-power NVS surveillance system and show an accuracy improved from 63.1% to 82.16%. Moreover, we show that using multiple bits does not improve accuracy, and thus system designers can save power and area by using only single-bit event polarity information. In addition, we explore the CNN architecture space for object classification and show useful insights for trading off accuracy for lower power using less memory and fewer arithmetic operations.
摘要：应用在视频物联网（IoVT）域的互联网具有对于功耗和面积非常严格的限制。虽然神经形态视觉传感器（NVS）可以提供在这一领域比传统的成像优势，现有的NVS系统或者不能满足功率限制，或者没有证明终端到终端的系统性能。为了解决这个问题，我们通过使用形态学图像处理算法区域提案在最近提出的混合事件帧的方法提高和通过探索各种卷积神经网络（CNN）架构地址为目标检测和分类的低功率需求。具体而言，我们比较来自我们的抵靠状态的最先进的低功率NVS监视系统对象检测框架中获得的结果，并显示82.16％的改进的精确度从63.1％。此外，我们表明，使用多个位不提高精度，因此，系统设计人员可以仅使用单个位事件极性信息节省功耗和面积。此外，我们还探索对象分类CNN建筑空间与展示使用较少的内存和算术运算有益的见解，以权衡精度低功耗。

107. Ensemble learning in CNN augmented with fully connected subnetworks [PDF]
Daiki Hirata, Norikazu Takahashi
Abstract: Convolutional Neural Networks (CNNs) have shown remarkable performance in general object recognition tasks. In this paper, we propose a new model called EnsNet which is composed of one base CNN and multiple Fully Connected SubNetworks (FCSNs). In this model, the set of feature maps generated by the last convolutional layer in the base CNN is divided along channels into disjoint subsets, and these subsets are assigned to the FCSNs. Each of the FCSNs is trained independently of the others so that it can predict the class label from the subset of the feature maps assigned to it. The output of the overall model is determined by majority vote of the base CNN and the FCSNs. Experimental results using the MNIST, Fashion-MNIST and CIFAR-10 datasets show that the proposed approach further improves the performance of CNNs. In particular, an EnsNet achieves a state-of-the-art error rate of 0.16% on MNIST.
摘要：卷积神经网络（细胞神经网络）表明，在一般对象识别任务骄人的业绩。在本文中，我们提出了一个名为EnsNet新的模型，它是由一个基地CNN和多个完全连接的子网（FCSNs）的。在这个模型中，-地图功能由最后卷积层中的碱CNN沿着通道分成分离子集所生成的集合的，并且这些子集被分配给FCSNs。如登录，那么每个FCSNs的训练独立他人，以便它可以从特征地图的子集预测类的标签。总体模型的输出由基地CNN和FCSNs的多数表决来确定。使用MNIST，时尚MNIST和CIFAR-10数据集实验结果表明，该方法进一步提高细胞神经网络的性能。特别是，EnsNet达到0.16％MNIST上的状态下的最先进的错误率。
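
The two mechanical steps described in the abstract above, splitting the feature-map channels into disjoint subsets and combining predictions by majority vote, can be sketched in a few lines. This is only an illustration under assumptions: the contiguous, near-equal split and the helper names `split_channels` and `majority_vote` are hypothetical, and the abstract does not say how ties are broken, so the sketch does not treat them specially.

```python
from collections import Counter


def split_channels(num_channels, num_subsets):
    """Partition channel indices 0..num_channels-1 into contiguous, disjoint subsets."""
    base, extra = divmod(num_channels, num_subsets)
    subsets, start = [], 0
    for i in range(num_subsets):
        # The first `extra` subsets get one additional channel.
        size = base + (1 if i < extra else 0)
        subsets.append(list(range(start, start + size)))
        start += size
    return subsets


def majority_vote(predictions):
    """Return the class label predicted most often."""
    return Counter(predictions).most_common(1)[0][0]


# Ten feature-map channels shared among three subnetworks.
subsets = split_channels(10, 3)

# Hypothetical class predictions: base CNN first, then four subnetworks.
final_label = majority_vote([3, 3, 7, 3, 7])
```

Because each subnetwork sees a different slice of the channels, their errors are less likely to coincide, which is what makes the vote useful.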

108. Lifelong Learning with Searchable Extension Units [PDF]
Wenjin Wang, Yunqing Hu, Yin Zhang
Abstract: Lifelong learning remains an open problem. One of its main difficulties is catastrophic forgetting. Many dynamic expansion approaches have been proposed to address this problem, but they all use homogeneous models of predefined structure for all tasks. The common original model and expansion structures ignore the fact that different tasks require different model structures, which leads to a less compact model for multiple tasks and causes the model size to increase rapidly as the number of tasks increases. Moreover, they cannot perform best on all tasks. To solve these problems, we propose a new lifelong learning framework named Searchable Extension Units (SEU), which introduces Neural Architecture Search into lifelong learning. It removes the need for a predefined original model and searches for specific extension units for different tasks, without compromising the performance of the model on different tasks. Our approach can obtain a much more compact model without catastrophic forgetting. The experimental results on PMNIST, the split CIFAR10 dataset, the split CIFAR100 dataset, and the Mixture dataset empirically show that our method can achieve higher accuracy with a much smaller model, whose size is about 25-33 percent of that of the state-of-the-art methods.
摘要：终身学习仍然是一个悬而未决的问题。它的一个主要困难是灾难性的遗忘。许多动态扩展的方法被提出来解决这个问题，但预定义结构的所有任务都使用均质模型。公共原模型和扩张结构忽略不同任务的不同模型结构，其引线的用于多个任务的较少紧凑模型的要求，并且使得模型尺寸迅速地增加随着任务的数量增加。此外，他们还不能在所有的任务表现最佳。为了解决这些问题，在本文中，我们通过引入神经结构搜索到终身学习，建议命名为检索扩展单元（SEU）一个新的终身学习框架，打破了需要有一个预定义的原始模型，并搜索特定扩展名的单位不同任务，不影响任务的不同模型的性能。我们的方法可以得到未造成灾难性遗忘一个更加小巧的机型。在PMNIST的实验结果，分割CIFAR10数据集，分裂CIFAR100数据集，以及混合数据集经验证明，我们的方法可以用更小的机型，其大小大约是该州的25-33比例达到更高的精度-the-技术的方法。

109. Extremal Region Analysis based Deep Learning Framework for Detecting Defects [PDF]
Zelin Deng, Xiaolong Yan, Shengjun Zhang, Colleen P. Bailey
Abstract: A unified defect detection framework that combines maximally stable extremal region (MSER) analysis with a convolutional neural network (CNN) is proposed in this paper. Our proposed framework utilizes the generality and stability of MSER to generate the desired defect candidates. Then a specifically trained binary CNN classifier is applied to the defect candidates to produce the final defect set. Defect datasets over different categories are used in the experiments. More generally, the parameter settings in MSER can be adjusted to satisfy different requirements in various industries (high precision, high recall, etc.). Extensive experimental results have shown the efficacy of the proposed framework.
摘要：统一缺陷检测框架的一个最大稳定极端区域（MSER）分析基于卷积神经网络（CNN）在本文提出。我们提出的架构利用MSER的通用性和稳定性，以产生所希望的缺陷候选。然后具体训练的二元分类CNN采用了缺陷考生产生最终的缺陷集。在不同的类别\蓝色缺陷数据集{用于}在实验中。更一般地，在MSER参数设置可以调整，以满足各种行业不同要求（高精度，高召回等）。广泛的实验结果表明所提出的框架的功效。

110. Efficiently Calibrating Cable-Driven Surgical Robots With RGBD Sensing, Temporal Windowing, and Linear and Recurrent Neural Network Compensation [PDF]
Minho Hwang, Brijen Thananjeyan, Samuel Paradis, Daniel Seita, Jeffrey Ichnowski, Danyal Fer, Thomas Low, Ken Goldberg
Abstract: Automation of surgical subtasks using cable-driven robotic surgical assistants (RSAs) such as Intuitive Surgical's da Vinci Research Kit (dVRK) is challenging due to imprecision in control from cable-related effects such as backlash, stretch, and hysteresis. We propose a novel approach to efficiently calibrate a dVRK by placing a 3D-printed fiducial coordinate frame on the arm and end-effector that is tracked using RGBD sensing. To measure the coupling effects between joints and history-dependent effects, we analyze data from sampled trajectories and consider 13 modeling approaches using LSTM recurrent neural networks and linear models with varying temporal window length to provide corrective feedback. With the proposed method, data collection takes 31 minutes to produce 1800 samples and model training takes less than a minute. Results suggest that the resulting model can reduce the mean tracking error of the physical robot from 2.96mm to 0.65mm on a test set of reference trajectories. We evaluate the model by executing open-loop trajectories of the FLS peg transfer surgeon training task. Results suggest that the best approach increases the success rate from 39.4% to 96.7%, comparable to the performance of an expert surgical resident. Supplementary material, including 3D-printable models, is available at this https URL.
摘要：使用电缆驱动的机器人手术助理体（RSA），诸如Intuitive Surgical公司的达芬奇研究试剂盒（dVRK）是具有挑战性的，由于在从电缆相关的效果，如侧隙，拉伸和滞后控制不精确外科子任务自动化。我们提出了一种新颖的方法通过将3D有效地校准印刷dVRK基准坐标上的臂和被使用RGBD感测跟踪端部执行器框架。为了测量关节和历史依赖效应之间的耦合效应，我们分析来自取样轨迹数据，并考虑使用LSTM回归神经网络，并与不同的时间窗口长度的线性模型，以提供正确的反馈13的建模方法。利用所提出的方法中，数据收集需要31分钟以产生1800个样品和模型训练需要不到一分钟。结果表明，所得到的模型可以从2.96毫米减少物理机器人的平均跟踪误差到0.65毫米在测试组参考轨迹。我们评估通过执行FLS PEG转移外科医生培训任务的开环轨迹模型。结果表明，最好的方法，从39.4％提高成功率96.7％，堪比专家外科住院医师的性能。补充材料，包括3D打印的机型，可在此HTTPS URL。

111. MINT: Deep Network Compression via Mutual Information-based Neuron Trimming [PDF]
Madan Ravi Ganesh, Jason J. Corso, Salimeh Yasaei Sekeh
Abstract: Most approaches to deep neural network compression via pruning either evaluate a filter's importance using its weights or optimize an alternative objective function with sparsity constraints. While these methods offer a useful way to approximate contributions from similar filters, they often either ignore the dependency between layers or solve a more difficult optimization objective than standard cross-entropy. Our method, Mutual Information-based Neuron Trimming (MINT), approaches deep compression via pruning by enforcing sparsity based on the strength of the relationship between filters of adjacent layers, across every pair of layers. The relationship is calculated using conditional geometric mutual information which evaluates the amount of similar information exchanged between the filters using a graph-based criterion. When pruning a network, we ensure that retained filters contribute the majority of the information towards succeeding layers which ensures high performance. Our novel approach outperforms existing state-of-the-art compression-via-pruning methods on the standard benchmarks for this task: MNIST, CIFAR-10, and ILSVRC2012, across a variety of network architectures. In addition, we discuss our observations of a common denominator between our pruning methodology's response to adversarial attacks and calibration statistics when compared to the original network.
摘要：大多数通过修剪接近深层神经网络压缩或者评估使用它的权重过滤器的重要性或优化与稀疏约束的一个替代的目标函数。尽管这些方法提供了一种有用的方法，以从类似的过滤器近似的贡献，它们经常要么忽略层之间的依赖性或解决比标准交叉熵更困难的优化目标。我们的方法，基于信息的相互神经元剪裁（MINT），经由通过强制执行基于相邻层的滤光器之间的关系的强度稀疏性，跨越每对层的修剪接近深度压缩。的关系是使用其评估的使用基于图形的标准过滤器之间交换的类似的信息量条件几何互信息计算。当修剪网络，我们确保保留过滤器对后续确保高性能层有助于大多数人的信息。状态的最先进的我们的新颖方法比现有压缩通修剪上此任务的标准基准的方法：MNIST，CIFAR-10，和ILSVRC2012，跨各种网络架构。另外，我们比较原来的网络时，我们讨论我们的修剪方法的应对敌对攻击和校准统计之间的共同点的意见。

112. Two Tier Prediction of Stroke Using Artificial Neural Networks and Support Vector Machines [PDF]
Jerrin Thomas Panachakel, Jeena R.S
Abstract: Cerebrovascular accident (CVA) or stroke is the rapid loss of brain function due to disturbance in the blood supply to the brain. Statistically, stroke is the second leading cause of death. This has motivated us to suggest a two-tier system for predicting stroke. The first tier makes use of an Artificial Neural Network (ANN) to predict the chances of a person suffering a stroke. The ANN is trained using the values of various stroke risk factors of several patients who had a stroke. Once a person is classified as having a high risk of stroke, s/he undergoes a second, tier-2 classification test in which his/her neuro MRI (magnetic resonance imaging) is analysed to predict the chances of stroke. Tier 2 uses Non-negative Matrix Factorization and Haralick textural features for feature extraction and an SVM classifier for classification. We have obtained an accuracy of 96.67% in tier-1 and an accuracy of 70% in tier-2.
摘要：脑血管意外（CVA）或中风是脑功能的迅速损失，由于在血液供应到大脑紊乱。据统计，中风是导致死亡的第二大原因。这促使我们提出预测中风两层体系;第一层是利用人工神经网络（ANN）的预测中风的人的痛苦的机会。人工神经网络使用的几个病人谁曾中风中风的各种风险因素的价值观训练。一个人一旦被归类为中风的高风险，他/她经历了另一个他的神经她MRI（磁共振成像）进行分析，预测中风的机会/在二级分类考试。所述二级用途非负矩阵因子分解和Haralick纹理特征进行特征提取和SVM分类器进行分类。我们已获得的96.67％在一线的精度和70％在二级的精度。

113. Fusion-Aware Point Convolution for Online Semantic 3D Scene Segmentation [PDF]
Jiazhao Zhang, Chenyang Zhu, Lintao Zheng, Kai Xu
Abstract: Online semantic 3D segmentation alongside real-time RGB-D reconstruction poses special challenges, such as how to perform 3D convolution directly over the progressively fused 3D geometric data, and how to smartly fuse information from frame to frame. We propose a novel fusion-aware 3D point convolution which operates directly on the geometric surface being reconstructed and effectively exploits the inter-frame correlation for high-quality 3D feature learning. This is enabled by a dedicated dynamic data structure which organizes the online acquired point cloud with global-local trees. Globally, we compile the online reconstructed 3D points into an incrementally growing coordinate interval tree, enabling fast point insertion and neighborhood query. Locally, we maintain the neighborhood information for each point using an octree whose construction benefits from the fast query of the global tree. Both levels of trees update dynamically and help the 3D convolution effectively exploit the temporal coherence for effective information fusion across RGB-D frames.
摘要：在线语义3D分割在公司与实时RGB-d重建遇到特殊挑战，比如如何直接在逐步融合的三维几何数据进行三维卷积，以及如何巧妙保险丝帧与帧的信息。我们提出一种直接操作几何表面被重构并有效地利用高品质的三维特征学习帧间相关性的新的融合感知3D点卷积。这是通过一个专用的动态数据结构组织与全球和当地的树木网上获得的点云启用。从全球来看，我们编译了在线三维重建点到逐渐增长的坐标区间树，可实现快速插入点和邻里查询。本地方面，我们采用八叉树，其建设从动态树木更新和帮助全球tree.Both水平的快速查询福利3D卷积有效地利用有效的信息融合跨RGB-d帧时间相干性保持每个点附近信息。

Note: the Chinese text is machine-translated!

WITH LOVE OF WORLD

[arXiv papers] Computer Vision and Pattern Recognition 2020-03-20

Contents

Abstracts