Contents
1. Time-Travel Rephotography [PDF] Abstract
2. YolactEdge: Real-time Instance Segmentation on the Edge (Jetson AGX Xavier: 30 FPS, RTX 2080 Ti: 170 FPS) [PDF] Abstract
3. Underwater image filtering: methods, datasets and evaluation [PDF] Abstract
4. Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Deforming Scene from Monocular Video [PDF] Abstract
5. Unadversarial Examples: Designing Objects for Robust Vision [PDF] Abstract
6. Training Convolutional Neural Networks With Hebbian Principal Component Analysis [PDF] Abstract
7. Geometric robust descriptor for 3D point cloud [PDF] Abstract
8. Labels Are Not Perfect: Inferring Spatial Uncertainty in Object Detection [PDF] Abstract
9. Latent Feature Representation via Unsupervised Learning for Pattern Discovery in Massive Electron Microscopy Image Volumes [PDF] Abstract
10. Convolutional Recurrent Network for Road Boundary Extraction [PDF] Abstract
11. Image to Bengali Caption Generation Using Deep CNN and Bidirectional Gated Recurrent Unit [PDF] Abstract
12. Explainable Abstract Trains Dataset [PDF] Abstract
13. MOCCA: Multi-Layer One-Class Classification for Anomaly Detection [PDF] Abstract
14. Noise-Equipped Convolutional Neural Networks [PDF] Abstract
15. Convolutional Neural Networks from Image Markers [PDF] Abstract
16. Warped Gaussian Processes in Remote Sensing Parameter Estimation and Causal Inference [PDF] Abstract
17. A Deep Reinforcement Learning Approach for Ramp Metering Based on Traffic Video Data [PDF] Abstract
18. Predicting survival outcomes using topological features of tumor pathology images [PDF] Abstract
19. Estimating Crop Primary Productivity with Sentinel-2 and Landsat 8 using Machine Learning Methods Trained with Radiative Transfer Simulations [PDF] Abstract
20. Limitation of Acyclic Oriented Graphs Matching as Cell Tracking Accuracy Measure when Evaluating Mitosis [PDF] Abstract
25. A Hybrid VDV Model for Automatic Diagnosis of Pneumothorax using Class-Imbalanced Chest X-rays Dataset [PDF] Abstract
34. Graph and Temporal Convolutional Networks for 3D Multi-person Pose Estimation in Monocular Videos [PDF] Abstract
38. Contraband Materials Detection Within Volumetric 3D Computed Tomography Baggage Security Screening Imagery [PDF] Abstract
41. Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge [PDF] Abstract
43. Natural vs Balanced Distribution in Deep Learning on Whole Slide Images for Cancer Detection [PDF] Abstract
45. Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised Video Object Segmentation [PDF] Abstract
46. FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations [PDF] Abstract
48. Cloud removal in remote sensing images using generative adversarial networks and SAR-to-optical image translation [PDF] Abstract
49. Deep Learning of Cell Classification using Microscope Images of Intracellular Microtubule Networks [PDF] Abstract
51. Unsupervised Spatial-spectral Network Learning for Hyperspectral Compressive Snapshot Reconstruction [PDF] Abstract
53. A Feasibility study for Deep learning based automated brain tumor segmentation using Magnetic Resonance Images [PDF] Abstract
54. This is not the Texture you are looking for! Introducing Novel Counterfactual Explanations for Non-Experts using Generative Adversarial Learning [PDF] Abstract
55. Deep learning-based virtual refocusing of images using an engineered point-spread function [PDF] Abstract
56. Efficient and Visualizable Convolutional Neural Networks for COVID-19 Classification Using Chest CT [PDF] Abstract
60. Towards an Automatic System for Extracting Planar Orientations from Software Generated Point Clouds [PDF] Abstract
Abstracts
1. Time-Travel Rephotography [PDF] Back to Contents
Xuan Luo, Xuaner Zhang, Paul Yoo, Ricardo Martin-Brualla, Jason Lawrence, Steven M. Seitz
Abstract: Many historical people are captured only in old, faded, black-and-white photos that have been distorted by the limitations of early cameras and the passage of time. This paper simulates traveling back in time with a modern camera to rephotograph famous subjects. Unlike conventional image restoration filters which apply independent operations like denoising, colorization, and superresolution, we leverage the StyleGAN2 framework to project old photos into the space of modern high-resolution photos, achieving all of these effects in a unified framework. A unique challenge with this approach is capturing the identity and pose of the photo's subject and not the many artifacts in low-quality antique photos. Our comparisons to current state-of-the-art restoration filters show significant improvements and compelling results for a variety of important historical people.
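The unifying mechanism here is latent-space projection: rather than chaining independent filters, the old photo is matched against a generator's output after simulating the degradation. Below is a minimal sketch of that idea, assuming a frozen pretrained generator; the toy generator and grayscale degradation operator are stand-ins, not the paper's actual models.

```python
import torch
import torch.nn.functional as F

# Stand-in generator: maps a 512-d latent code to a 3x64x64 image. In the
# paper's setting this would be a frozen, pretrained StyleGAN2 generator.
class ToyGenerator(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(512, 3 * 64 * 64)

    def forward(self, w):
        return torch.tanh(self.fc(w)).view(3, 64, 64)

def degrade(img):
    # Hypothetical degradation operator (grayscale only, for illustration);
    # comparing in the degraded domain lets the optimizer match identity and
    # pose while ignoring artifacts the old camera would have produced anyway.
    return img.mean(dim=0, keepdim=True)

gen = ToyGenerator().eval()
old_photo = torch.rand(3, 64, 64)           # placeholder for the input photo
w = torch.zeros(512, requires_grad=True)    # latent code being optimized
opt = torch.optim.Adam([w], lr=1e-2)

for _ in range(200):
    opt.zero_grad()
    loss = F.mse_loss(degrade(gen(w)), degrade(old_photo))
    loss.backward()
    opt.step()

modern_photo = gen(w).detach()              # the "rephotographed" output
```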
2. YolactEdge: Real-time Instance Segmentation on the Edge (Jetson AGX Xavier: 30 FPS, RTX 2080 Ti: 170 FPS) [PDF] Back to Contents
Haotian Liu, Rafael A. Rivera Soto, Fanyi Xiao, Yong Jae Lee
Abstract: We propose YolactEdge, the first competitive instance segmentation approach that runs on small edge devices at real-time speeds. Specifically, YolactEdge runs at up to 30.8 FPS on a Jetson AGX Xavier (and 172.7 FPS on an RTX 2080 Ti) with a ResNet-101 backbone on 550x550 resolution images. To achieve this, we make two improvements to the state-of-the-art image-based real-time method YOLACT: (1) TensorRT optimization while carefully trading off speed and accuracy, and (2) a novel feature warping module to exploit temporal redundancy in videos. Experiments on the YouTube VIS and MS COCO datasets demonstrate that YolactEdge produces a 3-5x speed up over existing real-time methods while producing competitive mask and box detection accuracy. We also conduct ablation studies to dissect our design choices and modules. Code and models are available at this https URL.
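The feature warping module is the more unusual of the two improvements: backbone features computed on a keyframe are transformed to the current frame instead of being recomputed. A minimal sketch of feature warping given a flow field follows; the flow estimation and the learned reliability weighting of the actual module are omitted.

```python
import torch
import torch.nn.functional as F

def warp_features(feats, flow):
    """Warp keyframe features to the current frame along a 2-D flow field.

    feats: (N, C, H, W) backbone features computed on a keyframe.
    flow:  (N, 2, H, W) per-pixel displacement (in pixels, x then y).
    """
    n, _, h, w = feats.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                                       # shifted sample points
    # Normalize to [-1, 1] for grid_sample, which expects (N, H, W, 2) grids.
    coords[:, 0] = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords[:, 1] = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = coords.permute(0, 2, 3, 1)
    return F.grid_sample(feats, grid, align_corners=True)

feats = torch.randn(1, 256, 34, 34)   # e.g. one FPN level on a 550x550 input
flow = torch.zeros(1, 2, 34, 34)      # zero flow -> identity warp
assert torch.allclose(warp_features(feats, flow), feats, atol=1e-5)
```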
3. Underwater image filtering: methods, datasets and evaluation [PDF] Back to Contents
Chau Yi Li, Riccardo Mazzon, Andrea Cavallaro
Abstract: Underwater images are degraded by the selective attenuation of light that distorts colours and reduces contrast. The degradation extent depends on the water type, the distance between an object and the camera, and the depth under the water surface the object is at. Underwater image filtering aims to restore or to enhance the appearance of objects captured in an underwater image. Restoration methods compensate for the actual degradation, whereas enhancement methods improve either the perceived image quality or the performance of computer vision algorithms. The growing interest in underwater image filtering methods--including learning-based approaches used for both restoration and enhancement--and the associated challenges call for a comprehensive review of the state of the art. In this paper, we review the design principles of filtering methods and revisit the oceanology background that is fundamental to identify the degradation causes. We discuss image formation models and the results of restoration methods in various water types. Furthermore, we present task-dependent enhancement methods and categorise datasets for training neural networks and for method evaluation. Finally, we discuss evaluation strategies, including subjective tests and quality assessment measures. We complement this survey with a platform ( this https URL ), which hosts state-of-the-art underwater filtering methods and facilitates comparisons.
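To make the degradation concrete, the restoration methods the survey covers build on image formation models of roughly the following attenuation-plus-backscatter form; the sketch below uses illustrative coefficients and is not a specific method from the paper.

```python
import numpy as np

def underwater_formation(J, depth_map, beta, B):
    """Simplified underwater image formation model.

    J:         (H, W, 3) scene radiance (the image restoration aims to recover).
    depth_map: (H, W) object-to-camera distance in metres.
    beta:      (3,) per-channel attenuation coefficient of the water type.
    B:         (3,) background (veiling) light.
    """
    t = np.exp(-depth_map[..., None] * beta)   # per-channel transmission
    return J * t + B * (1.0 - t)               # direct signal + backscatter

J = np.random.rand(4, 4, 3)
I = underwater_formation(J, np.full((4, 4), 5.0),
                         beta=np.array([0.3, 0.1, 0.05]),  # red attenuates fastest
                         B=np.array([0.1, 0.3, 0.4]))      # bluish veiling light
```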
4. Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Deforming Scene from Monocular Video [PDF] Back to Contents
Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, Christian Theobalt
Abstract: In this tech report, we present the current state of our ongoing work on reconstructing Neural Radiance Fields (NeRF) of general non-rigid scenes via ray bending. Non-rigid NeRF (NR-NeRF) takes RGB images of a deforming object (e.g., from a monocular video) as input and then learns a geometry and appearance representation that not only allows us to reconstruct the input sequence but also to re-render any time step into novel camera views with high fidelity. In particular, we show that a consumer-grade camera is sufficient to synthesize convincing bullet-time videos of short and simple scenes. In addition, the resulting representation enables correspondence estimation across views and time, and provides rigidity scores for each point in the scene. We urge the reader to watch the supplemental videos for qualitative results.
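A minimal sketch of the ray-bending idea: a per-frame latent code conditions an MLP that offsets each ray sample before it is queried against the canonical (static) field. All networks and sizes below are illustrative stand-ins; the real model additionally handles positional encoding, view directions, volume rendering, and the rigidity scores.

```python
import torch

bend_net = torch.nn.Sequential(             # predicts per-point offsets
    torch.nn.Linear(3 + 32, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 3),
)
canonical_nerf = torch.nn.Sequential(       # stand-in for the static NeRF MLP
    torch.nn.Linear(3, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 4),                # RGB + density
)

def query_points(points, frame_latent):
    # points: (N, 3) samples along camera rays for one time step.
    z = frame_latent.expand(points.shape[0], -1)
    offsets = bend_net(torch.cat([points, z], dim=-1))  # learned ray bending
    return canonical_nerf(points + offsets)             # query canonical field

out = query_points(torch.rand(1024, 3), torch.zeros(1, 32))
```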
5. Unadversarial Examples: Designing Objects for Robust Vision [PDF] Back to Contents
Hadi Salman, Andrew Ilyas, Logan Engstrom, Sai Vemprala, Aleksander Madry, Ashish Kapoor
Abstract: We study a class of realistic computer vision settings wherein one can influence the design of the objects being recognized. We develop a framework that leverages this capability to significantly improve vision models' performance and robustness. This framework exploits the sensitivity of modern machine learning algorithms to input perturbations in order to design "robust objects," i.e., objects that are explicitly optimized to be confidently detected or classified. We demonstrate the efficacy of the framework on a wide variety of vision-based tasks ranging from standard benchmarks, to (in-simulation) robotics, to real-world experiments. Our code can be found at this https URL .
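Conceptually, this is an adversarial attack with the sign flipped: the object's own texture is optimized so the true class is predicted more confidently under varying conditions. A toy sketch under that reading, with a placeholder classifier and a naive compositing step:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
patch = torch.zeros(3, 8, 8, requires_grad=True)   # texture being designed
opt = torch.optim.SGD([patch], lr=0.1)
true_class = torch.tensor([3])

mask = torch.zeros(1, 3, 32, 32)
mask[:, :, :8, :8] = 1.0                           # where the patch lands

for _ in range(100):
    background = torch.rand(1, 3, 32, 32)          # random scene (stand-in for
                                                   # varying pose/lighting)
    canvas = F.pad(patch, (0, 24, 0, 24)).unsqueeze(0)
    scene = background * (1 - mask) + canvas * mask
    # Minimize (not maximize) the true-class loss: the "unadversarial" step.
    loss = F.cross_entropy(model(scene), true_class)
    opt.zero_grad()
    loss.backward()
    opt.step()
```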
6. Training Convolutional Neural Networks With Hebbian Principal Component Analysis [PDF] Back to Contents
Gabriele Lagani, Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro
Abstract: Recent work has shown that biologically plausible Hebbian learning can be integrated with backpropagation learning (backprop), when training deep convolutional neural networks. In particular, it has been shown that Hebbian learning can be used for training the lower or the higher layers of a neural network. For instance, Hebbian learning is effective for re-training the higher layers of a pre-trained deep neural network, achieving comparable accuracy w.r.t. SGD, while requiring fewer training epochs, suggesting potential applications for transfer learning. In this paper we build on these results and we further improve Hebbian learning in these settings, by using a nonlinear Hebbian Principal Component Analysis (HPCA) learning rule, in place of the Hebbian Winner Takes All (HWTA) strategy used in previous work. We test this approach in the context of computer vision. In particular, the HPCA rule is used to train Convolutional Neural Networks in order to extract relevant features from the CIFAR-10 image dataset. The HPCA variant that we explore further improves the previous results, motivating further interest towards biologically plausible learning algorithms.
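For orientation, the classic linear Hebbian PCA rule is Sanger's generalized Hebbian algorithm, of which the paper's nonlinear HPCA rule is a descendant. A small sketch of that linear ancestor (the paper's exact nonlinear rule is not reproduced here):

```python
import numpy as np

def sanger_update(W, x, lr=1e-3):
    """One step of the generalized Hebbian (Sanger) rule: purely local
    updates whose rows converge to the leading principal components.
    W: (k, d) weight matrix (k neurons, d inputs); x: (d,) input sample."""
    y = W @ x                                   # neuron outputs
    # Each neuron learns the residual left over by the neurons above it.
    W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 8)) @ rng.normal(size=(8, 8))  # correlated inputs
W = rng.normal(scale=0.1, size=(3, 8))
for x in data:
    W = sanger_update(W, x)
# Rows of W now approximate the top-3 principal directions of the data.
```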
7. Geometric robust descriptor for 3D point cloud [PDF] Back to Contents
Seung Hwan Jung, Yeong-Gil Shin, Minyoung Chung
Abstract: We propose a rotation-robust and density-robust local geometric descriptor. Local geometric features of point clouds are used in many applications, for example, to find correspondences in 3D registration and to segment local regions. Usually, application accuracy depends on the discriminative power of the local geometric features. However, there are some problems such as point sparsity, rotated point clouds, and so on. In this paper, we present a new local feature generation method that yields a rotation-robust and density-robust descriptor. First, we place kernel points around each point and align them to the point's normal. To avoid the sign ambiguity of the normal vector, we use a kernel point distribution that is symmetric with respect to the tangent plane. Next, from each kernel point, we estimate geometric information that is rotation-robust and discriminative. Finally, we apply a convolution process that takes the kernel point structure into account and aggregate all kernel features. We evaluate our local descriptors on the ModelNet40 dataset for registration and classification, and on the ShapeNet part dataset for segmentation. Our descriptor shows discriminative power regardless of point distribution.
8. Labels Are Not Perfect: Inferring Spatial Uncertainty in Object Detection [PDF] Back to Contents
Di Feng, Zining Wang, Yiyang Zhou, Lars Rosenbaum, Fabian Timm, Klaus Dietmayer, Masayoshi Tomizuka, Wei Zhan
Abstract: The availability of many real-world driving datasets is a key reason behind the recent progress of object detection algorithms in autonomous driving. However, there exist ambiguities or even failures in object labels due to the error-prone annotation process or sensor observation noise. Current public object detection datasets only provide deterministic object labels without considering their inherent uncertainty, as does the common training process or evaluation metrics for object detectors. As a result, an in-depth evaluation among different object detection methods remains challenging, and the training process of object detectors is sub-optimal, especially in probabilistic object detection. In this work, we infer the uncertainty in bounding box labels from LiDAR point clouds based on a generative model, and define a new representation of the probabilistic bounding box through a spatial uncertainty distribution. Comprehensive experiments show that the proposed model reflects complex environmental noises in LiDAR perception and the label quality. Furthermore, we propose Jaccard IoU (JIoU) as a new evaluation metric that extends IoU by incorporating label uncertainty. We conduct an in-depth comparison among several LiDAR-based object detectors using the JIoU metric. Finally, we incorporate the proposed label uncertainty in a loss function to train a probabilistic object detector and to improve its detection accuracy. We verify our proposed methods on two public datasets (KITTI, Waymo), as well as on simulation data. Code is released at this https URL.
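One way to picture JIoU: treat each (possibly uncertain) box as a probability mass over grid cells and take the ratio of overlapping to combined mass, which reduces to ordinary IoU for binary masks. A toy sketch under that reading; the paper's exact formulation may differ in detail.

```python
import numpy as np

def jiou(p, q):
    """Jaccard IoU between two spatial box distributions on a shared grid.

    p, q: (H, W) arrays; each cell holds the probability that the cell
    belongs to the (uncertain) box. For 0/1 masks this is the standard IoU.
    """
    inter = np.minimum(p, q).sum()
    union = np.maximum(p, q).sum()
    return inter / union

hard = np.zeros((10, 10)); hard[2:6, 2:6] = 1.0   # deterministic prediction
soft = np.zeros((10, 10)); soft[3:7, 3:7] = 0.8   # uncertain label
print(jiou(hard, hard))   # 1.0
print(jiou(hard, soft))   # penalizes both the offset and the uncertainty
```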
9. Latent Feature Representation via Unsupervised Learning for Pattern Discovery in Massive Electron Microscopy Image Volumes [PDF] Back to Contents
Gary B Huang, Huei-Fang Yang, Shin-ya Takemura, Pat Rivlin, Stephen M Plaza
Abstract: We propose a method to facilitate exploration and analysis of new large data sets. In particular, we give an unsupervised deep learning approach to learning a latent representation that captures semantic similarity in the data set. The core idea is to use data augmentations that preserve semantic meaning to generate synthetic examples of elements whose feature representations should be close to one another. We demonstrate the utility of our method applied to nano-scale electron microscopy data, where even relatively small portions of animal brains can require terabytes of image data. Although supervised methods can be used to predict and identify known patterns of interest, the scale of the data makes it difficult to mine and analyze patterns that are not known a priori. We show the ability of our learned representation to enable query by example, so that if a scientist notices an interesting pattern in the data, they can be presented with other locations with matching patterns. We also demonstrate that clustering of data in the learned space correlates with biologically-meaningful distinctions. Finally, we introduce a visualization tool and software ecosystem to facilitate user-friendly interactive analysis and uncover interesting biological patterns. In short, our work opens possible new avenues in understanding of and discovery in large data sets, arising in domains such as EM analysis.
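The training signal can be sketched as a contrastive setup: two semantics-preserving augmentations of the same patch should map to nearby latent points while other patches are pushed away. A minimal, illustrative version in which the encoder, augmentations, and loss are all stand-ins:

```python
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32, 64))

def augment(batch):
    # Semantics-preserving augmentation stand-in: random flip + small noise.
    flipped = torch.flip(batch, dims=[-1]) if torch.rand(()) < 0.5 else batch
    return flipped + 0.05 * torch.randn_like(flipped)

batch = torch.rand(16, 1, 32, 32)                 # EM patches (placeholder)
z1 = F.normalize(encoder(augment(batch)), dim=1)  # view 1 embeddings
z2 = F.normalize(encoder(augment(batch)), dim=1)  # view 2 embeddings
logits = z1 @ z2.t() / 0.1                        # scaled cosine similarities
loss = F.cross_entropy(logits, torch.arange(16))  # match view i with view i
loss.backward()
```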
10. Convolutional Recurrent Network for Road Boundary Extraction [PDF] Back to Contents
Justin Liang, Namdar Homayounfar, Wei-Chiu Ma, Shenlong Wang, Raquel Urtasun
Abstract: Creating high definition maps that contain precise information of static elements of the scene is of utmost importance for enabling self driving cars to drive safely. In this paper, we tackle the problem of drivable road boundary extraction from LiDAR and camera imagery. Towards this goal, we design a structured model where a fully convolutional network obtains deep features encoding the location and direction of road boundaries and then, a convolutional recurrent network outputs a polyline representation for each one of them. Importantly, our method is fully automatic and does not require a user in the loop. We showcase the effectiveness of our method on a large North American city where we obtain perfect topology of road boundaries 99.3% of the time at a high precision and recall.
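A rough skeleton of the two-stage design, with a convolutional backbone initializing a recurrent head that emits polyline vertices one at a time; all sizes, heads, and the fixed-length rollout below are illustrative rather than the paper's architecture.

```python
import torch

backbone = torch.nn.Sequential(                  # deep boundary features
    torch.nn.Conv2d(3, 16, 3, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(8), torch.nn.Flatten(),
)
rnn = torch.nn.GRUCell(input_size=2, hidden_size=16 * 8 * 8)
to_xy = torch.nn.Linear(16 * 8 * 8, 2)           # next vertex coordinates

image = torch.rand(1, 3, 128, 128)               # fused LiDAR/camera input (stand-in)
h = backbone(image)                              # features seed the hidden state
vertex = torch.zeros(1, 2)                       # start token
polyline = []
for _ in range(20):                              # fixed rollout for the sketch
    h = rnn(vertex, h)
    vertex = to_xy(h)
    polyline.append(vertex)
```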
11. Image to Bengali Caption Generation Using Deep CNN and Bidirectional Gated Recurrent Unit [PDF] Back to Contents
Al Momin Faruk, Hasan Al Faraby, Md. Muzahidul Azad, Md. Riduyan Fedous, Md. Kishor Morol
Abstract: There is very little notable research on generating descriptions in the Bengali language. About 243 million people speak Bengali, and it is the 7th most spoken language on the planet. The purpose of this research is to propose a CNN and Bidirectional GRU based architecture model that generates natural language captions in the Bengali language from an image. Bengali people can use this research to break the language barrier and better understand each other's perspectives. It will also help many blind people in their everyday lives. This paper used an encoder-decoder approach to generate captions. We used a pre-trained deep convolutional neural network (DCNN), the InceptionV3 image embedding model, as the encoder for analysis, classification, and annotation of the dataset's images, and a Bidirectional Gated Recurrent Unit (BGRU) layer as the decoder to generate captions. Argmax and beam search are used to produce the highest possible quality of captions. A new dataset called BNATURE is used, which comprises 8000 images with five captions per image. It is used for training and testing the proposed model. We obtained BLEU-1, BLEU-2, BLEU-3, BLEU-4, and Meteor scores of 42.6, 27.95, 23.66, 16.41, and 28.7, respectively.
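The decoding pipeline can be skeletonized as below. Dimensions, vocabulary size, and the way image features seed the GRU state are assumptions, and a random feature vector stands in for the InceptionV3 encoder output.

```python
import torch

vocab_size, embed_dim, hidden = 8000, 256, 256
img_proj = torch.nn.Linear(2048, hidden)              # project pooled CNN features
embed = torch.nn.Embedding(vocab_size, embed_dim)
bgru = torch.nn.GRU(embed_dim, hidden, bidirectional=True, batch_first=True)
out = torch.nn.Linear(2 * hidden, vocab_size)

img_feat = torch.rand(1, 2048)                        # stand-in encoder output
tokens = torch.randint(0, vocab_size, (1, 12))        # caption prefix so far
h0 = img_proj(img_feat).unsqueeze(0).repeat(2, 1, 1)  # seed both GRU directions
dec_out, _ = bgru(embed(tokens), h0)
next_word = out(dec_out)[:, -1].argmax(-1)            # greedy (argmax) decoding;
                                                      # beam search would keep k paths
```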
12. Explainable Abstract Trains Dataset [PDF] Back to Contents
Manuel de Sousa Ribeiro, Ludwig Krippahl, Joao Leite
Abstract: The Explainable Abstract Trains Dataset is an image dataset containing simplified representations of trains. It aims to provide a platform for the application and research of algorithms for justification and explanation extraction. The dataset is accompanied by an ontology that conceptualizes and classifies the depicted trains based on their visual characteristics, allowing for a precise understanding of how each train was labeled. Each image in the dataset is annotated with multiple attributes describing the trains' features and with bounding boxes for the train elements.
13. MOCCA: Multi-Layer One-Class Classification for Anomaly Detection [PDF] Back to Contents
Fabio Valerio Massoli, Fabrizio Falchi, Alperen Kantarci, Şeymanur Akti, Hazim Kemal Ekenel, Giuseppe Amato
Abstract: Anomalies are ubiquitous in all scientific fields and can express an unexpected event due to incomplete knowledge about the data distribution or an unknown process that suddenly comes into play and distorts the observations. Due to such events' rarity, it is common to train deep learning models on "normal", i.e. non-anomalous, datasets only, thus letting the neural network model the distribution beneath the input data. In this context, we propose our deep learning approach to the anomaly detection problem named Multi-Layer One-Class Classification (MOCCA). We explicitly leverage the piece-wise nature of deep neural networks by exploiting information extracted at different depths to detect abnormal data instances. We show how combining the representations extracted from multiple layers of a model leads to higher discrimination performance than typical approaches proposed in the literature that are based on the neural network's final output only. We propose to train the model by minimizing the $L_2$ distance between the input representation and a reference point, the anomaly-free training data centroid, at each considered layer. We conduct extensive experiments on publicly available datasets for anomaly detection, namely CIFAR10, MVTec AD, and ShanghaiTech, considering both the single-image and video-based scenarios. We show that our method achieves superior performance compared to the state-of-the-art approaches available in the literature. Moreover, we provide a model analysis to give insight on how our approach works.
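The objective is compact in code: features tapped at several depths are each pulled toward that layer's anomaly-free training centroid. A minimal sketch with stand-in layers and fixed centroids; in practice the centroids are computed from normal training data, and the same per-layer distance can serve as the anomaly score at test time.

```python
import torch

layers = torch.nn.ModuleList([
    torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU()),
    torch.nn.Sequential(torch.nn.Conv2d(8, 16, 3, padding=1), torch.nn.ReLU()),
])
# Per-layer centroids, e.g. mean features of anomaly-free data (fixed here).
centroids = [torch.zeros(8), torch.zeros(16)]

def mocca_loss(x):
    loss = 0.0
    for layer, c in zip(layers, centroids):
        x = layer(x)
        feat = x.mean(dim=(2, 3))                 # global-average-pooled features
        loss = loss + ((feat - c) ** 2).sum(dim=1).mean()  # L2 to the centroid
    return loss

loss = mocca_loss(torch.rand(4, 3, 32, 32))
loss.backward()
```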
14. Noise-Equipped Convolutional Neural Networks [PDF] Back to Contents
Menghan Xia, Tien-Tsin Wong
Abstract: As a generic modeling tool, the Convolutional Neural Network (CNN) has been widely employed in image synthesis and translation tasks. However, when a CNN model is fed with a flat input, the transformation degrades into a scaling operation due to the spatial sharing nature of convolution kernels. This inherent problem has barely been studied, nor raised as an application restriction. In this paper, we point out that such convolution degradation actually hinders some specific image generation tasks that expect value-variant output from a flat input. We study the cause behind it and propose a generic solution to tackle it. Our key idea is to break the flat input condition through a proxy input module that perturbs the input data symmetrically with a noise map and reassembles them in the feature domain. We call it the noise-equipped CNN model and study its behavior through multiple analyses. Our experiments show that our model is free of degradation and hence serves as a superior alternative to standard CNN models. We further demonstrate improved performance when applying our model to existing applications, e.g. semantic photo synthesis and color-encoded grayscale generation.
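Reading the description literally, one plausible form of the proxy input module is a pair of symmetric noise perturbations whose encodings are reassembled in feature space, so the noise cancels in expectation while the flat-input degeneracy is broken. The sketch below is a guess at that mechanism, not the paper's verified design.

```python
import torch

encoder = torch.nn.Conv2d(3, 16, 3, padding=1)   # stand-in for the first stage

def proxy_forward(x):
    n = torch.randn_like(x)
    f_pos = encoder(x + n)                       # symmetric perturbations
    f_neg = encoder(x - n)
    return 0.5 * (f_pos + f_neg)                 # reassemble in feature domain

flat = torch.full((1, 3, 32, 32), 0.5)           # a flat input no longer collapses
features = proxy_forward(flat)                   # to a pure scaling of the kernels
```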
15. Convolutional Neural Networks from Image Markers [PDF] Back to Contents
Barbara C. Benato, Italos E. de Souza, Felipe L. Galvão, Alexandre X. Falcão
Abstract: A technique named Feature Learning from Image Markers (FLIM) was recently proposed to estimate convolutional filters, with no backpropagation, from strokes drawn by a user on very few images (e.g., 1-3) per class, and demonstrated for coconut-tree image classification. This paper extends FLIM for fully connected layers and demonstrates it on different image classification problems. The work evaluates marker selection from multiple users and the impact of adding a fully connected layer. The results show that FLIM-based convolutional neural networks can outperform the same architecture trained from scratch by backpropagation.
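The backprop-free filter estimation can be sketched as patch extraction around the user-marked pixels followed by clustering, with cluster centers taken as kernels. An illustrative version; the normalization and filter counts are assumptions rather than FLIM's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.signal import correlate2d

def filters_from_markers(image, marker_coords, k=3, n_filters=4):
    """Estimate k x k convolutional kernels from marked pixels, no backprop."""
    r = k // 2
    patches = []
    for (y, x) in marker_coords:
        p = image[y - r:y + r + 1, x - r:x + r + 1].ravel()
        patches.append((p - p.mean()) / (p.std() + 1e-8))  # normalize each patch
    centers = KMeans(n_clusters=n_filters, n_init=10).fit(np.array(patches))
    return centers.cluster_centers_.reshape(n_filters, k, k)

image = np.random.rand(32, 32)
markers = [(5, 5), (5, 6), (20, 20), (20, 21), (10, 25), (25, 10)]
kernels = filters_from_markers(image, markers)
responses = [correlate2d(image, f, mode="same") for f in kernels]  # first layer
```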
16. Warped Gaussian Processes in Remote Sensing Parameter Estimation and Causal Inference [PDF] Back to Contents
Anna Mateo-Sanchis, Jordi Muñoz-Marí, Adrián Pérez-Suay, Gustau Camps-Valls
Abstract: This paper introduces warped Gaussian processes (WGP) regression in remote sensing applications. WGP models output observations as a parametric nonlinear transformation of a GP. The parameters of such prior model are then learned via standard maximum likelihood. We show the good performance of the proposed model for the estimation of oceanic chlorophyll content from multispectral data, vegetation parameters (chlorophyll, leaf area index, and fractional vegetation cover) from hyperspectral data, and in the detection of the causal direction in a collection of 28 bivariate geoscience and remote sensing causal problems. The model consistently performs better than the standard GP and the more advanced heteroscedastic GP model, both in terms of accuracy and more sensible confidence intervals.
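Concretely, a warped GP evaluates a standard GP likelihood on the warped targets g(y) and adds the Jacobian term log|g'(y)| so the density stays correct. A sketch with the common tanh warp of Snelson et al. and an RBF kernel; the maximum-likelihood optimizer that fits the warp and kernel parameters is omitted.

```python
import numpy as np

def warp(y, a, b, c):
    """Monotone tanh warp g(y) and its derivative g'(y), per-observation."""
    g = y + np.sum(a * np.tanh(b * (y[:, None] + c)), axis=1)
    dg = 1.0 + np.sum(a * b / np.cosh(b * (y[:, None] + c)) ** 2, axis=1)
    return g, dg

def warped_gp_loglik(X, y, a, b, c, lengthscale=1.0, noise=0.1):
    g, dg = warp(y, a, b, c)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2 / lengthscale**2) + noise * np.eye(len(y))  # RBF + noise
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, g))              # K^{-1} g
    return (-0.5 * g @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2 * np.pi)
            + np.log(dg).sum())                                      # Jacobian term

X, y = np.random.rand(20, 2), np.random.rand(20)
ll = warped_gp_loglik(X, y, a=np.full(2, 0.1), b=np.ones(2), c=np.zeros(2))
```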
17. A Deep Reinforcement Learning Approach for Ramp Metering Based on Traffic Video Data [PDF] Back to Contents
Bing Liu, Yu Tang, Yuxiong Ji, Yu Shen, Yuchuan Du
Abstract: Ramp metering that uses traffic signals to regulate vehicle flows from the on-ramps has been widely implemented to improve vehicle mobility of the freeway. Previous studies generally update signal timings in real-time based on predefined traffic measures collected by point detectors, such as traffic volumes and occupancies. Comparing with point detectors, traffic cameras-which have been increasingly deployed on road networks-could cover larger areas and provide more detailed traffic information. In this work, we propose a deep reinforcement learning (DRL) method to explore the potential of traffic video data in improving the efficiency of ramp metering. The proposed method uses traffic video frames as inputs and learns the optimal control strategies directly from the high-dimensional visual inputs. A real-world case study demonstrates that, in comparison with a state-of-the-practice method, the proposed DRL method results in 1) lower travel times in the mainline, 2) shorter vehicle queues at the on-ramp, and 3) higher traffic flows downstream of the merging area. The results suggest that the proposed method is able to extract useful information from the video data for better ramp metering controls.
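The agent's value network can be skeletonized as a small CNN over stacked video frames that scores the discrete signal actions. The frame shapes and the two-action set below are illustrative, not the paper's exact architecture, and training would follow a standard DQN-style loop.

```python
import torch

q_net = torch.nn.Sequential(
    torch.nn.Conv2d(4, 16, 8, stride=4), torch.nn.ReLU(),   # 4 stacked frames in
    torch.nn.Conv2d(16, 32, 4, stride=2), torch.nn.ReLU(),
    torch.nn.Flatten(),
    torch.nn.LazyLinear(128), torch.nn.ReLU(),
    torch.nn.Linear(128, 2),            # Q-values, e.g. keep red / turn green
)

frames = torch.rand(1, 4, 84, 84)       # recent grayscale frames of the merge area
action = q_net(frames).argmax(dim=1)    # greedy ramp-signal choice
```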
18. Predicting survival outcomes using topological features of tumor pathology images [PDF] Back to Contents
Chul Moon, Qiwei Li, Guanghua Xiao
Abstract: Tumor shape and size have been used as important markers for cancer diagnosis and treatment. Recent developments in medical imaging technology enable more detailed segmentation of tumor regions in high resolution. This paper proposes a topological feature to characterize tumor progression from digital pathology images and examine its effect on the time-to-event data. We develop distance transform for pathology images and show that a topological summary statistic computed by persistent homology quantifies tumor shape, size, distribution, and connectivity. The topological features are represented in functional space and used as functional predictors in a functional Cox regression model. A case study is conducted using non-small cell lung cancer pathology images. The results show that the topological features predict survival prognosis after adjusting for age, sex, smoking status, stage, and size of tumors. Also, the topological features with non-zero effects correspond to the shapes that are known to be related to tumor progression. Our study provides a new perspective for understanding tumor shape and patient prognosis.
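The feature pipeline is, in outline: a (signed) distance transform of the tumor mask defines a filtration, and persistent homology of its sublevel sets summarizes shape, size, and connectivity. The sketch below uses scipy's Euclidean distance transform; the gudhi cubical-complex call is an assumption about that library's API, and the downstream functional Cox regression is omitted.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
import gudhi  # assumed available; the API usage below is our assumption

mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:45] = True                       # toy "tumor" region

dist_in = distance_transform_edt(mask)          # distance to boundary, inside
dist_out = distance_transform_edt(~mask)        # distance to tumor, outside
filtration = dist_out - dist_in                 # signed distance: negative inside

cc = gudhi.CubicalComplex(top_dimensional_cells=filtration)
diagram = cc.persistence()                      # list of (dim, (birth, death))
```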
19. Estimating Crop Primary Productivity with Sentinel-2 and Landsat 8 using Machine Learning Methods Trained with Radiative Transfer Simulations [PDF] 返回目录
Aleksandra Wolanin, Gustau Camps-Valls, Luis Gómez-Chova, Gonzalo Mateo-García, Christiaan van der Tol, Yongguang Zhang, Luis Guanter
Abstract: Satellite remote sensing has been widely used in the last decades for agricultural applications, both for assessing vegetation condition and for subsequent yield prediction. Existing remote sensing-based methods to estimate gross primary productivity (GPP), which is an important variable to indicate crop photosynthetic function and stress, typically rely on empirical or semi-empirical approaches, which tend to over-simplify photosynthetic mechanisms. In this work, we take advantage of all parallel developments in mechanistic photosynthesis modeling and satellite data availability for advanced monitoring of crop productivity. In particular, we combine process-based modeling with the soil-canopy energy balance radiative transfer model (SCOPE) with Sentinel-2 and Landsat 8 optical remote sensing data and machine learning methods in order to estimate crop GPP. Our model successfully estimates GPP across a variety of C3 crop types and environmental conditions even though it does not use any local information from the corresponding sites. This highlights its potential to map crop productivity from new satellite sensors at a global scale with the help of current Earth observation cloud computing platforms.
摘要:在过去的几十年中,卫星遥感已广泛用于农业应用,既用于评估植被状况,也用于随后的产量预测。现有的基于遥感的总初级生产力(GPP)估算方法通常依靠经验或半经验方法,往往过度简化了光合机制,而GPP是指示作物光合功能和胁迫的重要变量。在这项工作中,我们利用机理光合作用建模和卫星数据可用性方面的并行进展,对作物生产力进行高级监测。特别是,我们将基于过程的土壤-冠层能量平衡辐射传输模型(SCOPE)与Sentinel-2和Landsat 8光学遥感数据及机器学习方法相结合,以估算作物GPP。我们的模型成功地估计了各种C3作物类型和环境条件下的GPP,即使它没有使用来自相应站点的任何本地信息。这凸显了其借助当前地球观测云计算平台,在全球范围内利用新型卫星传感器绘制作物生产力的潜力。
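The train-on-simulations idea can be summarized in a few lines: fit a regressor on simulated (reflectance, GPP) pairs, then apply it to real satellite band vectors. The sketch below uses synthetic stand-in data and a gradient-boosting regressor as assumptions; the paper's SCOPE simulations and exact learner may differ.

```python
# Hedged sketch: regressor trained on simulated reflectance -> GPP pairs,
# then applied to (stand-in) satellite pixels. Data here is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_bands = 10                                        # e.g. the optical bands used
X_sim = rng.uniform(0.0, 0.6, size=(5000, n_bands)) # simulated reflectances
gpp_sim = 30 * X_sim[:, 7] - 10 * X_sim[:, 3] + rng.normal(0, 0.5, 5000)

model = GradientBoostingRegressor().fit(X_sim, gpp_sim)

X_satellite = rng.uniform(0.0, 0.6, size=(3, n_bands))  # stand-in real pixels
print(model.predict(X_satellite))                   # GPP estimate per pixel
```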
20. Limitation of Acyclic Oriented Graphs Matching as Cell Tracking Accuracy Measure when Evaluating Mitosis [PDF] 返回目录
Ye Chen, Yuankai Huo
Abstract: Multi-object tracking (MOT) in computer vision and cell tracking in biomedical image analysis are two similar research fields, whose common aim is to achieve instance-level object detection/segmentation and associate such objects across different video frames. However, one major difference between these two tasks is that cell tracking also aims to detect mitosis (cell division), which is typically not considered in MOT tasks. Therefore, acyclic oriented graphs matching (AOGM) has been used as the de facto standard evaluation metric for cell tracking, rather than directly using the evaluation metrics from computer vision, such as multiple object tracking accuracy (MOTA), ID Switches (IDS), ID F1 Score (IDF1), etc. However, based on our experiments, we realized that AOGM did not always function as expected for mitosis events. In this paper, we exhibit the limitations of evaluating mitosis with AOGM using both simulated and real cell tracking data.
摘要:计算机视觉中的多目标跟踪(MOT)和生物医学图像分析中的细胞跟踪是两个相似的研究领域,其共同目标是实现实例级别的对象检测/分割,并将这些对象跨不同的视频帧进行关联。但是,这两个任务之间的一个主要区别在于,细胞跟踪还旨在检测有丝分裂(细胞分裂),这在MOT任务中通常不予考虑。因此,非循环定向图匹配(AOGM)已被用作细胞跟踪的事实标准评估指标,而不是直接使用计算机视觉中的评估指标,例如多目标跟踪精度(MOTA)、ID切换(IDS)、ID F1分数(IDF1)等。但是,基于我们的实验,我们认识到AOGM并非总能对有丝分裂事件按预期发挥作用。在本文中,我们使用模拟和真实细胞跟踪数据,展示了用AOGM评估有丝分裂的局限性。
21. Disentangling images with Lie group transformations and sparse coding [PDF] 返回目录
Ho Yin Chau, Frank Qiu, Yubei Chen, Bruno Olshausen
Abstract: Discrete spatial patterns and their continuous transformations are two important regularities contained in natural signals. Lie groups and representation theory are mathematical tools that have been used in previous works to model continuous image transformations. On the other hand, sparse coding is an important tool for learning dictionaries of patterns in natural signals. In this paper, we combine these ideas in a Bayesian generative model that learns to disentangle spatial patterns and their continuous transformations in a completely unsupervised manner. Images are modeled as a sparse superposition of shape components followed by a transformation that is parameterized by n continuous variables. The shape components and transformations are not predefined, but are instead adapted to learn the symmetries in the data, with the constraint that the transformations form a representation of an n-dimensional torus. Training the model on a dataset consisting of controlled geometric transformations of specific MNIST digits shows that it can recover these transformations along with the digits. Training on the full MNIST dataset shows that it can learn both the basic digit shapes and the natural transformations such as shearing and stretching that are contained in this data.
摘要:离散空间模式及其连续变换是自然信号中包含的两个重要规律。李群和表示论是先前工作中用来对连续图像变换建模的数学工具。另一方面,稀疏编码是学习自然信号中模式字典的重要工具。在本文中,我们将这些思想组合到一个贝叶斯生成模型中,该模型学习以完全无监督的方式解开空间模式及其连续变换。图像被建模为形状成分的稀疏叠加,然后经过由n个连续变量参数化的变换。形状成分和变换不是预先定义的,而是通过学习适应数据中的对称性,其约束条件是这些变换构成n维环面的一个表示。在由特定MNIST数字的受控几何变换组成的数据集上训练模型表明,该模型可以连同数字一起恢复这些变换。在完整MNIST数据集上的训练表明,它既可以学习基本的数字形状,又可以学习该数据中包含的自然变换,例如剪切和拉伸。
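The transformation stage of such a model can be made concrete with a toy example: a Lie-group generator A is exponentiated so that a continuous angle theta (a point on a 1-D torus) acts on a pattern vector. Here A is a fixed so(2) rotation generator for illustration, not a learned one.

```python
# Toy illustration of a Lie-group action: exp(theta * A) transforms a pattern.
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])                         # so(2) generator: exp(theta*A) is a rotation

def transform(x, theta):
    return expm(theta * A) @ x                      # group action on the pattern vector

x = np.array([1.0, 0.0])                            # a "shape component" coefficient vector
for theta in (0.0, np.pi / 4, np.pi / 2):
    print(theta, transform(x, theta).round(3))
```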
22. Do We Really Need Scene-specific Pose Encoders? [PDF] 返回目录
Yoli Shavit, Ron Ferens
Abstract: Visual pose regression models estimate the camera pose from a query image with a single forward pass. Current models learn pose encoding from an image using deep convolutional networks which are trained per scene. The resulting encoding is typically passed to a multi-layer perceptron in order to regress the pose. In this work, we propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead. In order to test our hypothesis, we take a shallow architecture of several fully connected layers and train it with pre-computed encodings from a generic image retrieval model. We find that these encodings are not only sufficient to regress the camera pose, but that, when provided to a branching fully connected architecture, a trained model can achieve competitive results and even surpass current state-of-the-art pose regressors in some cases. Moreover, we show that for outdoor localization, the proposed architecture is the only pose regressor, to date, consistently localizing in under 2 meters and 5 degrees.
摘要:视觉姿态回归模型通过一次前向传递从查询图像估计相机姿态。当前的模型使用按场景训练的深度卷积网络从图像中学习姿态编码。所得编码通常被传递给多层感知器以回归姿态。在这项工作中,我们提出姿态回归并不需要特定于场景的姿态编码器,而可以改用针对视觉相似性训练的编码。为了检验我们的假设,我们采用由若干全连接层组成的浅层架构,并使用来自通用图像检索模型的预先计算的编码对其进行训练。我们发现这些编码不仅足以回归相机姿态,而且当提供给分支式全连接架构时,训练后的模型可以取得有竞争力的结果,在某些情况下甚至超过当前最先进的姿态回归器。此外,我们表明,对于室外定位,所提出的架构是迄今为止唯一能够始终将定位误差保持在2米和5度以内的姿态回归器。
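The hypothesis test is simple to sketch: a shallow fully connected regressor over pre-computed retrieval encodings, with separate branches for position and orientation. Sizes and the quaternion parameterization below are assumptions, not the paper's exact architecture.

```python
# Sketch (PyTorch): pose regression from frozen image-retrieval encodings.
import torch
import torch.nn as nn

class EncodingPoseRegressor(nn.Module):
    def __init__(self, enc_dim=2048, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(enc_dim, hidden), nn.ReLU())
        self.t_head = nn.Linear(hidden, 3)          # translation branch
        self.q_head = nn.Linear(hidden, 4)          # rotation (quaternion) branch

    def forward(self, enc):
        h = self.trunk(enc)
        q = self.q_head(h)
        return self.t_head(h), q / q.norm(dim=-1, keepdim=True)

model = EncodingPoseRegressor()
t, q = model(torch.randn(8, 2048))                  # batch of retrieval encodings
print(t.shape, q.shape)                             # (8, 3) (8, 4)
```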
23. Multiple Instance Segmentation in Brachial Plexus Ultrasound Image Using BPMSegNet [PDF] 返回目录
Yi Ding, Qiqi Yang, Guozheng Wu, Jian Zhang, Zhiguang Qin
Abstract: The identification of nerves is difficult, as nerve structures are challenging to image and to detect in ultrasound images. Nevertheless, nerve identification in ultrasound images is a crucial step to improve the performance of regional anesthesia. In this paper, a network called Brachial Plexus Multi-instance Segmentation Network (BPMSegNet) is proposed to identify different tissues (nerves, arteries, veins, muscles) in ultrasound images. The BPMSegNet has three novel modules. The first is the spatial local contrast feature, which computes contrast features at different scales. The second is the self-attention gate, which reweighs the channels in feature maps by their importance. The third is the addition of a skip concatenation with transposed convolution within a feature pyramid network. The proposed BPMSegNet is evaluated by conducting experiments on our constructed Ultrasound Brachial Plexus Dataset (UBPD). Quantitative experimental results show the proposed network can segment multiple tissues from ultrasound images with good performance.
摘要:由于神经结构难以在超声图像中成像和检测,因此难以识别神经。然而,超声图像中的神经识别是提高局部麻醉性能的关键步骤。在本文中,提出了一种称为臂丛多实例分割网络(BPMSegNet)的网络,以识别超声图像中的不同组织(神经,动脉,静脉,肌肉)。 BPMSegNet具有三个新颖的模块。第一个是空间局部对比度特征,它可以计算不同比例的对比度特征。第二个是自我注意门,它通过其重要性重新称量特征图中的通道。第三是在特征金字塔网络中添加带有转置卷积的跳过级联。通过对我们构建的超声臂丛神经数据集(UBPD)进行实验,对提出的BPMSegNet进行了评估。定量实验结果表明,所提出的网络可以很好地分割超声图像中的多个组织。
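The "self-attention gate" as described, channels reweighted by learned importance, resembles squeeze-and-excitation gating; a minimal sketch of that idea follows. The exact gate in BPMSegNet may differ.

```python
# Minimal channel-reweighting gate (squeeze-and-excitation style), assumed
# to approximate the described self-attention gate; not BPMSegNet's code.
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))             # per-channel importance in [0, 1]
        return x * w[:, :, None, None]              # reweight each channel

x = torch.randn(2, 64, 32, 32)
print(ChannelGate(64)(x).shape)                     # torch.Size([2, 64, 32, 32])
```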
24. 3D Point-to-Keypoint Voting Network for 6D Pose Estimation [PDF] 返回目录
Weitong Hua, Jiaxin Guo, Yue Wang, Rong Xiong
Abstract: Object 6D pose estimation is an important research topic in the field of computer vision due to its wide application requirements and the challenges brought by complexity and changes in the real world. We think fully exploring the characteristics of spatial relationships between points will help to improve the pose estimation performance, especially in scenes with background clutter and partial occlusion. However, this information was usually ignored in previous work using RGB images or RGB-D data. In this paper, we propose a framework for 6D pose estimation from RGB-D data based on the spatial structure characteristics of 3D keypoints. We adopt point-wise dense feature embedding to vote for 3D keypoints, which makes full use of the structure information of the rigid body. After the direction vectors pointing to the keypoints are predicted by a CNN, we use RANSAC voting to calculate the coordinates of the 3D keypoints, and the pose transformation can then be easily obtained by the least-squares method. In addition, a spatial dimension sampling strategy for points is employed, which enables the method to achieve excellent performance on small training sets. The proposed method is verified on two benchmark datasets, LINEMOD and OCCLUSION LINEMOD. The experimental results show that our method outperforms state-of-the-art approaches, achieving ADD(-S) accuracy of 98.7% on the LINEMOD dataset and 52.6% on the OCCLUSION LINEMOD dataset in real time.
摘要:由于6D姿态估计的广泛应用需求以及现实世界中复杂性和变化带来的挑战,因此6D姿态估计是计算机视觉领域的重要研究课题。我们认为,充分探索点之间空间关系的特征将有助于改善姿势估计性能,尤其是在背景混乱和部分遮挡的场景中。但是,在以前使用RGB图像或RGB-D数据的工作中,通常会忽略此信息。在本文中,我们提出了一种基于3D关键点的空间结构特征的RGB-D数据6D姿态估计框架。我们采用逐点密集特征嵌入来投票支持3D关键点,从而充分利用刚体的结构信息。在CNN预测到指向关键点的方向向量之后,我们使用RANSAC投票计算3D关键点的坐标,然后可以通过最小二乘法轻松获得姿势变换。另外,采用了针对点的空间维度采样策略,这使得该方法在小型训练集上可获得出色的性能。在两个基准数据集LINEMOD和OCCLUSION LINEMOD上验证了该方法的有效性。实验结果表明,我们的方法优于最新方法,在LINEMOD数据集上实时达到ADD(-S)精度为98.7%,在OCCLUSION LINEMOD数据集上达到52.6%。
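The final least-squares step is standard: given voted 3D keypoints in the camera frame and their model-frame counterparts, the rigid pose follows from a Kabsch/Umeyama fit. The sketch below shows that step on synthetic correspondences; the RANSAC voting itself is omitted.

```python
# Least-squares rigid pose (Kabsch) from 3D keypoint correspondences.
import numpy as np

def least_squares_pose(model_kp, cam_kp):
    mu_m, mu_c = model_kp.mean(0), cam_kp.mean(0)
    H = (model_kp - mu_m).T @ (cam_kp - mu_c)       # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1, 1, d]) @ U.T
    t = mu_c - R @ mu_m
    return R, t                                     # cam_kp ~= model_kp @ R.T + t

model_kp = np.random.rand(8, 3)                     # 8 keypoints in object frame
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
t_true = np.array([0.1, -0.2, 0.5])
cam_kp = model_kp @ R_true.T + t_true               # noise-free camera-frame points
R, t = least_squares_pose(model_kp, cam_kp)
print(np.allclose(R, R_true), np.round(t, 3))       # True [ 0.1 -0.2  0.5]
```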
25. A Hybrid VDV Model for Automatic Diagnosis of Pneumothorax using Class-Imbalanced Chest X-rays Dataset [PDF] 返回目录
Tahira Iqbal, Arslan Shaukat, Usman Akram, Zartasha Mustansar, Yung-Cheol Byun
Abstract: Pneumothorax, a life-threatening disease, needs to be diagnosed immediately and efficiently. The prognosis in this case is not only time-consuming but also prone to human error, so an automatic way of accurate diagnosis using chest X-rays is the utmost requirement. To date, most of the available medical image datasets have a class-imbalance issue. The main theme of this study is to solve this problem along with proposing an automated way of detecting pneumothorax. We first compare the existing approaches to tackle the class-imbalance issue and find that data-level ensembles (i.e., ensembles of subsets of the dataset) outperform other approaches. Thus, we propose a novel framework named the VDV model, which is a complex model-level ensemble of data-level ensembles and uses three convolutional neural networks (CNNs), including VGG16, VGG-19 and DenseNet-121, as fixed feature extractors. In each data-level ensemble, features extracted from one of the pre-defined CNNs are fed to a support vector machine (SVM) classifier, and the output from each data-level ensemble is calculated using a voting method. Once the outputs from the three data-level ensembles with three different CNN architectures are obtained, the voting method is again used to calculate the final prediction. Our proposed framework is tested on the SIIM ACR Pneumothorax dataset and a Random Sample of the NIH Chest X-ray dataset (RS-NIH). For the first dataset, 85.17% recall with 86.0% area under the receiver operating characteristic curve (AUC) is attained. For the second dataset, 90.9% recall with 95.0% AUC is achieved with a random split of the data, while 85.45% recall with 77.06% AUC is obtained with a patient-wise split of the data. For RS-NIH, the obtained results are higher than previous results from the literature. However, for the first dataset, a direct comparison cannot be made, since this dataset has not previously been used for pneumothorax classification.
摘要:气胸是一种威胁生命的疾病,需要立即进行有效诊断。这种情况下的预后不仅费时,而且容易发生人为错误。因此,使用胸部X光进行准确诊断的自动方法是最高要求。迄今为止,大多数可用的医学图像数据集都存在类不平衡问题。这项研究的主要主题是解决这一问题,并提出一种自动检测气胸的方法。我们首先比较解决类不平衡问题的现有方法,然后发现数据级集合(即数据集子集的集合)优于其他方法。因此,我们提出了一种名为VDV模型的新颖框架,该框架是数据级集合的复杂模型级集合,并使用包括VGG16,VGG-19和DenseNet-121的三个卷积神经网络(CNN)作为固定特征提取器。在从一个预定义的CNN中提取的每个数据级别集合特征中,将特征馈入支持向量机(SVM)分类器,并使用表决方法计算每个数据级别集合的输出。一旦获得了具有三个不同CNN架构的三个数据级别集合的输出,然后再次使用表决方法来计算最终预测。我们提出的框架已在SIIM ACR气胸数据集和NIH胸部X射线数据集(RS-NIH)随机样本上进行了测试。对于第一个数据集,在接收器工作特性曲线(AUC)下获得85.17%的查全率和86.0%的面积。对于第二个数据集,通过随机拆分数据可实现90.9%的召回率和95.0%的AUC,而通过对患者进行数据拆分可获得85.45%的召回率和77.06%的AUC。对于RS-NIH,获得的结果要比文献中的先前结果更高。但是,对于第一个数据集,无法进行直接比较,因为该数据集尚未被用于气胸分类。
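One building block of the described pipeline, frozen-CNN features feeding an SVM, with majority voting across members trained on balanced data subsets, can be sketched as below. The features are random stand-ins for VGG/DenseNet activations, and subset sizes are assumptions.

```python
# Hedged sketch of a data-level ensemble: per-member balanced subsets,
# SVM classifiers, and a majority vote. Features are synthetic stand-ins.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 512))                     # stand-in deep features
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

members = []
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
for _ in range(3):                                  # each SVM sees a balanced subset
    idx = np.concatenate([rng.choice(pos, 80), rng.choice(neg, 80)])
    members.append(SVC().fit(X[idx], y[idx]))

votes = np.stack([m.predict(X) for m in members])   # (n_members, n_samples)
final = (votes.mean(axis=0) >= 0.5).astype(int)     # majority vote
print("train accuracy:", (final == y).mean())
```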
26. FcaNet: Frequency Channel Attention Networks [PDF] 返回目录
Zequn Qin, Pengyi Zhang, Fei Wu, Xi Li
Abstract: Attention mechanism, especially channel attention, has gained great success in the computer vision field. Many works focus on how to design efficient channel attention mechanisms while ignoring a fundamental problem, i.e., using global average pooling (GAP) as the unquestionable pre-processing method. In this work, we start from a different view and rethink channel attention using frequency analysis. Based on the frequency analysis, we mathematically prove that the conventional GAP is a special case of the feature decomposition in the frequency domain. With the proof, we naturally generalize the pre-processing of channel attention mechanism in the frequency domain and propose FcaNet with novel multi-spectral channel attention. The proposed method is simple but effective. We can change only one line of code in the calculation to implement our method within existing channel attention methods. Moreover, the proposed method achieves state-of-the-art results compared with other channel attention methods on image classification, object detection, and instance segmentation tasks. Our method could improve by 1.8% in terms of Top-1 accuracy on ImageNet compared with the baseline SENet-50, with the same number of parameters and the same computational cost. Our code and models will be made publicly available.
摘要:注意力机制,尤其是通道注意力,在计算机视觉领域取得了巨大的成功。许多工作专注于如何设计高效的通道注意力机制,却忽略了一个基本问题,即把全局平均池化(GAP)当作毋庸置疑的预处理方法。在这项工作中,我们从不同的角度出发,使用频率分析重新思考通道注意力。基于频率分析,我们在数学上证明了传统的GAP是频域中特征分解的特例。有了这一证明,我们自然地在频域中推广了通道注意力机制的预处理,并提出了具有新颖多光谱通道注意力的FcaNet。所提出的方法简单而有效。在现有通道注意力方法中,只需更改一行计算代码即可实现我们的方法。此外,在图像分类、目标检测和实例分割任务上,与其他通道注意力方法相比,该方法取得了最先进的结果。与基线SENet-50相比,在参数数量和计算成本相同的情况下,我们的方法在ImageNet上的Top-1准确率可提高1.8%。我们的代码和模型将公开提供。
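The paper's key observation can be verified numerically in a few lines: global average pooling equals the lowest-frequency (DC) coefficient of the 2D DCT up to a constant scale, so GAP discards all other frequency components.

```python
# Numerical check: GAP == DC term of the orthonormal 2D DCT / sqrt(N).
import numpy as np
from scipy.fft import dctn

x = np.random.rand(7, 7)                            # one channel of a feature map
gap = x.mean()
dc = dctn(x, norm="ortho")[0, 0]                    # lowest-frequency DCT coefficient
print(gap, dc / np.sqrt(x.size))                    # identical up to float error
```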
27. Human Action Recognition from Various Data Modalities: A Review [PDF] 返回目录
Zehua Sun, Jun Liu, Qiuhong Ke, Hossein Rahmani
Abstract: Human Action Recognition (HAR), aiming to understand human behaviors and then assign category labels, has a wide range of applications, and thus has been attracting increasing attention in the field of computer vision. Generally, human actions can be represented using various data modalities, such as RGB, skeleton, depth, infrared sequence, point cloud, event stream, audio, acceleration, radar, and WiFi, etc., which encode different sources of useful yet distinct information and have various advantages and application scenarios. Consequently, lots of existing works have attempted to investigate different types of approaches for HAR using various modalities. In this paper, we give a comprehensive survey for HAR from the perspective of the input data modalities. Specifically, we review both the hand-crafted feature-based and deep learning-based methods for single data modalities, and also review the methods based on multiple modalities, including the fusion-based frameworks and the co-learning-based approaches. The current benchmark datasets for HAR are also introduced. Finally, we discuss some potentially important research directions in this area.
摘要:人类行为识别(HAR)旨在理解人类行为并为其指定类别标签,具有广泛的应用范围,因此在计算机视觉领域引起了越来越多的关注。通常,可以使用各种数据模式来表示人类行为,例如RGB,骨架,深度,红外序列,点云,事件流,音频,加速度,雷达和WiFi等,这些数据会编码有用但独特的信息的不同来源并具有各种优势和应用场景。因此,许多现有工作已尝试使用各种方式来研究用于HAR的不同类型的方法。在本文中,我们将从输入数据模式的角度对HAR进行了全面的调查。具体来说,我们回顾了针对单个数据模式的基于手工特征的方法和基于深度学习的方法,并且还回顾了基于多种模式的方法,包括基于融合的框架和基于共同学习的方法。还介绍了HAR的当前基准数据集。最后,我们讨论了该领域中一些潜在的重要研究方向。
28. GuidedStyle: Attribute Knowledge Guided Style Manipulation for Semantic Face Editing [PDF] 返回目录
Xianxu Hou, Xiaokang Zhang, Linlin Shen, Zhihui Lai, Jun Wan
Abstract: Although significant progress has been made in synthesizing high-quality and visually realistic face images with unconditional Generative Adversarial Networks (GANs), there is still a lack of control over the generation process needed to achieve semantic face editing. In addition, it remains very challenging to keep other face information untouched while editing the target attributes. In this paper, we propose a novel learning framework, called GuidedStyle, to achieve semantic face editing on StyleGAN by guiding the image generation process with a knowledge network. Furthermore, we allow an attention mechanism in the StyleGAN generator to adaptively select a single layer for style manipulation. As a result, our method is able to perform disentangled and controllable edits along various attributes, including smiling, eyeglasses, gender, mustache and hair color. Both qualitative and quantitative results demonstrate the superiority of our method over other competing methods for semantic face editing. Moreover, we show that our model can also be applied to different types of real and artistic face editing, demonstrating strong generalization ability.
摘要:尽管通过无条件生成对抗网络(GANs)合成高质量和视觉逼真的人脸图像方面已取得了重大进展,但仍然缺乏对生成过程的控制以实现语义人脸编辑。另外,在编辑目标属性时保持其他面部信息保持不变仍然非常具有挑战性。在本文中,我们提出了一种新颖的学习框架,称为GuidedStyle,该框架通过使用知识网络指导图像生成过程来在StyleGAN上实现语义人脸编辑。此外,我们允许StyleGAN生成器中的注意力机制自适应地选择单个图层以进行样式操作。结果,我们的方法能够沿着各种属性执行纠缠且可控制的编辑,包括微笑,眼镜,性别,胡子和头发的颜色。定性和定量结果都证明了我们的方法在语义面孔编辑方面优于其他竞争方法。此外,我们证明了我们的模型还可以应用于不同类型的真实和艺术面孔编辑,证明了强大的泛化能力。
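The layer-selection idea can be illustrated as attention weights over per-layer style codes, pushed toward a single layer by a low softmax temperature. This is a speculative gating sketch, not GuidedStyle's actual design.

```python
# Illustrative layer-selection gate over StyleGAN-like per-layer style codes.
import torch
import torch.nn as nn

class LayerSelector(nn.Module):
    def __init__(self, n_layers=18):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_layers))

    def forward(self, styles, edit_direction, temperature=0.1):
        # styles: (B, n_layers, 512); a low temperature makes the softmax
        # nearly one-hot, i.e. the edit is applied to a single layer.
        w = torch.softmax(self.logits / temperature, dim=0)
        return styles + w[None, :, None] * edit_direction[None, None, :]

sel = LayerSelector()
edited = sel(torch.randn(2, 18, 512), torch.randn(512))
print(edited.shape)                                 # torch.Size([2, 18, 512])
```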
29. Predicting Online Video Advertising Effects with Multimodal Deep Learning [PDF] 返回目录
Jun Ikeda, Hiroyuki Seshime, Xueting Wang, Toshihiko Yamasaki
Abstract: With the expansion of the video advertising market, research to predict the effects of video advertising is getting more attention. Although effect prediction for image advertising has been explored extensively, prediction for video advertising is still challenging and has seldom been researched. In this research, we propose a method for predicting the click-through rate (CTR) of video advertisements and analyzing the factors that determine the CTR. In this paper, we demonstrate an optimized framework for accurately predicting the effects by taking advantage of the multimodal nature of online video advertisements, including video, text, and metadata features. In particular, the two types of metadata, i.e., categorical and continuous, are properly separated and normalized. To avoid overfitting, which is crucial in our task because the training data are not very rich, additional regularization layers are inserted. Experimental results show that our approach can achieve a correlation coefficient as high as 0.695, which is a significant improvement over the baseline (0.487).
摘要:随着视频广告市场的扩大,预测视频广告效果的研究越来越受到关注。尽管已经大量研究了图像广告的效果预测,但是很少进行研究来预测视频广告仍然具有挑战性。在这项研究中,我们提出了一种预测视频广告点击率(CTR)并分析决定点击率的因素的方法。在本文中,我们演示了一个优化的框架,可以利用在线视频广告的多模式特性(包括视频,文本和元数据功能)来准确地预测效果。特别地,适当地分离和规范化了两种类型的元数据,即分类的和连续的。为了避免过度拟合,这对于我们的任务至关重要,因为训练数据不是非常丰富,因此会插入其他正则化层。实验结果表明,我们的方法可以实现高达0.695的相关系数,与基线(0.487)相比有显着改善。
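A fusion model of the described shape, video and text features plus metadata, with categorical metadata embedded and continuous metadata normalized before concatenation, might look like the sketch below. Dimensions, field names, and the dropout regularizer are assumptions.

```python
# Hedged multimodal CTR sketch (PyTorch) with separated metadata handling.
import torch
import torch.nn as nn

class CTRNet(nn.Module):
    def __init__(self, video_dim=512, text_dim=300, n_categories=50, cont_dim=4):
        super().__init__()
        self.cat_emb = nn.Embedding(n_categories, 16)   # categorical metadata
        self.cont_bn = nn.BatchNorm1d(cont_dim)         # normalize continuous metadata
        self.mlp = nn.Sequential(
            nn.Linear(video_dim + text_dim + 16 + cont_dim, 128),
            nn.ReLU(), nn.Dropout(0.5),                 # regularization against overfitting
            nn.Linear(128, 1),
        )

    def forward(self, video, text, cat_id, cont):
        z = torch.cat([video, text, self.cat_emb(cat_id), self.cont_bn(cont)], dim=1)
        return self.mlp(z).squeeze(1)                   # predicted CTR score

net = CTRNet()
out = net(torch.randn(8, 512), torch.randn(8, 300),
          torch.randint(0, 50, (8,)), torch.randn(8, 4))
print(out.shape)                                        # torch.Size([8])
```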
30. Adversarial Multiscale Feature Learning for Overlapping Chromosome Segmentation [PDF] 返回目录
Liye Mei, Yalan Yu, Yueyun Weng, Xiaopeng Guo, Yan Liu, Du Wang, Sheng Liu, Fuling Zhou, Cheng Lei
Abstract: Chromosome karyotype analysis is of great clinical importance in the diagnosis and treatment of diseases, especially genetic diseases. Since manual analysis is highly time- and effort-consuming, computer-assisted automatic chromosome karyotype analysis based on images is routinely used to improve the efficiency and accuracy of the analysis. Due to the strip shape of chromosomes, they easily overlap with each other when imaged, significantly affecting the accuracy of the subsequent analysis. Conventional overlapping chromosome segmentation methods are usually based on manually tagged features; hence, their performance is easily affected by the quality, such as resolution and brightness, of the images. To address this problem, in this paper we present an adversarial multiscale feature learning framework to improve the accuracy and adaptability of overlapping chromosome segmentation. Specifically, we first adopt a nested U-shape network with dense skip connections as the generator to explore the optimal representation of chromosome images by exploiting multiscale features. Then we use the conditional generative adversarial network (cGAN) to generate images similar to the original ones, whose training stability is enhanced by applying the least-squares GAN objective. Finally, we employ Lovasz-Softmax to help the model converge in a continuous optimization setting. Compared with established algorithms, the performance of our framework is proven superior on public datasets under eight evaluation criteria, showing its great potential in overlapping chromosome segmentation.
摘要:染色体核型分析对疾病尤其是遗传疾病的诊断和治疗具有重要的临床意义。由于手动分析非常耗时且费力,因此通常使用基于图像的计算机辅助自动染色体核型分析来提高分析的效率和准确性。由于染色体呈条带状,它们在成像时很容易彼此重叠,从而显著影响后续分析的准确性。传统的重叠染色体分割方法通常基于手动标记的特征,因此其性能容易受到图像质量(例如分辨率和亮度)的影响。为了解决这个问题,本文提出了一种对抗式多尺度特征学习框架,以提高重叠染色体分割的准确性和适应性。具体而言,我们首先采用具有密集跳过连接的嵌套U形网络作为生成器,通过利用多尺度特征来探索染色体图像的最佳表示。然后我们使用条件生成对抗网络(cGAN)生成与原始图像相似的图像,并通过应用最小二乘GAN目标来增强其训练稳定性。最后,我们使用Lovasz-Softmax帮助模型在连续优化设置中收敛。与已有算法相比,在公共数据集的八项评估标准下,我们框架的性能被证明更优,显示了其在重叠染色体分割中的巨大潜力。
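The least-squares GAN objective the paper uses for training stability is the standard LSGAN formulation (Mao et al.): the discriminator regresses real samples toward 1 and fake samples toward 0, while the generator pushes fake samples toward 1.

```python
# Minimal LSGAN losses; inputs are stand-in discriminator outputs.
import torch

def lsgan_d_loss(d_real, d_fake):
    return 0.5 * ((d_real - 1) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    return 0.5 * ((d_fake - 1) ** 2).mean()

d_real, d_fake = torch.rand(4, 1), torch.rand(4, 1)
print(lsgan_d_loss(d_real, d_fake), lsgan_g_loss(d_fake))
```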
31. Subject-independent Human Pose Image Construction with Commodity Wi-Fi [PDF] 返回目录
Shuang Zhou, Lingchao Guo, Zhaoming Lu, Xiangming Wen, Wei Zheng, Yiming Wang
Abstract: Recently, commodity Wi-Fi devices have been shown to be able to construct human pose images, i.e., human skeletons, as fine-grained as cameras. Existing papers achieve good results when constructing the images of subjects who are in the prior training samples. However, the performance drops when it comes to new subjects, i.e., the subjects who are not in the training samples. This paper focuses on solving the subject-generalization problem in human pose image construction. To this end, we define the subject as the domain. Then we design a Domain-Independent Neural Network (DINN) to extract subject-independent features and convert them into fine-grained human pose images. We also propose a novel training method to train the DINN and it has no re-training overhead comparing with the domain-adversarial approach. We build a prototype system and experimental results demonstrate that our system can construct fine-grained human pose images of new subjects with commodity Wi-Fi in both the visible and through-wall scenarios, which shows the effectiveness and the subject-generalization ability of our model.
摘要:最近,已证明商品Wi-Fi设备能够构建与相机一样细粒度的人体姿态图像,即人体骨架。现有论文在构建训练样本中已有对象的图像时取得了良好的效果。但是,当涉及新对象(即不在训练样本中的对象)时,性能会下降。本文着重解决人体姿态图像构建中的对象泛化问题。为此,我们将对象定义为域。然后,我们设计了一个域无关神经网络(DINN)来提取与对象无关的特征,并将其转换为细粒度的人体姿态图像。我们还提出了一种新颖的训练方法来训练DINN,与域对抗方法相比,它没有重新训练的开销。我们构建了一个原型系统,实验结果表明,该系统可以在可见和穿墙两种场景下,使用商品Wi-Fi构建新对象的细粒度人体姿态图像,这表明了我们模型的有效性和对象泛化能力。
32. Progressive One-shot Human Parsing [PDF] 返回目录
Haoyu He, Jing Zhang, Bhavani Thuraisingham, Dacheng Tao
Abstract: Prior human parsing models are limited to parsing humans into classes pre-defined in the training data, which is not flexible enough to generalize to unseen classes, e.g., new clothing in fashion analysis. In this paper, we propose a new problem named one-shot human parsing (OSHP) that requires parsing humans into an open set of reference classes defined by any single reference example. During training, only the base classes defined in the training set are exposed, which can overlap with part of the reference classes. To address this, we devise a novel Progressive One-shot Parsing network (POPNet) to tackle two critical challenges, i.e., testing bias and small sizes. POPNet consists of two collaborative metric learning modules, named the Attention Guidance Module and the Nearest Centroid Module, which can learn representative prototypes for base classes and quickly transfer this ability to unseen classes during testing, thereby reducing testing bias. Moreover, POPNet adopts a progressive human parsing framework that can incorporate the learned knowledge of parent classes at the coarse granularity to help recognize descendant classes at the fine granularity, thereby handling the small-sizes issue. Experiments on the ATR-OS benchmark tailored for OSHP demonstrate that POPNet outperforms other representative one-shot segmentation models by large margins and establishes a strong baseline. Source code can be found at this https URL.
摘要:先前的人体解析模型仅限于将人体解析为训练数据中预先定义的类别,因此无法灵活地推广到未见过的类别,例如时尚分析中的新服装。在本文中,我们提出了一个名为单样本人体解析(OSHP)的新问题,该问题要求将人体解析为由任意单个参考示例定义的开放参考类别集。在训练过程中,仅暴露训练集中定义的基础类别,这些基础类别可能与部分参考类别重叠。为此,我们设计了一种新颖的渐进式单样本解析网络(POPNet),以解决两个关键挑战,即测试偏差和小尺寸。POPNet由两个协作的度量学习模块组成,分别称为注意力引导模块和最近质心模块,它们可以学习基础类别的代表性原型,并在测试期间快速将该能力迁移到未见过的类别,从而减少测试偏差。此外,POPNet采用了一种渐进式人体解析框架,该框架可以在粗粒度上结合所学到的父类知识,以帮助在细粒度上识别后代类别,从而处理小尺寸问题。在为OSHP量身定制的ATR-OS基准上的实验表明,POPNet大幅优于其他有代表性的单样本分割模型,并建立了一个强大的基线。源代码可在此https URL中找到。
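The Nearest Centroid idea can be sketched generically: build one prototype per class from support-pixel features, then label query features by nearest prototype. This is a plain prototype classifier for illustration, not POPNet's exact module.

```python
# Generic nearest-centroid (prototype) labeling over feature vectors.
import torch

def nearest_centroid_labels(support_feats, support_labels, query_feats):
    classes = support_labels.unique()
    protos = torch.stack([support_feats[support_labels == c].mean(0) for c in classes])
    dists = torch.cdist(query_feats, protos)        # (n_query, n_classes)
    return classes[dists.argmin(dim=1)]

support = torch.randn(100, 32)                      # features of labeled pixels
labels = torch.randint(0, 3, (100,))                # 3 base/reference classes
query = torch.randn(10, 32)
print(nearest_centroid_labels(support, labels, query))
```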
33. Learning Disentangled Semantic Representation for Domain Adaptation [PDF] 返回目录
Ruichu Cai, Zijian Li, Pengfei Wei, Jie Qiao, Kun Zhang, Zhifeng Hao
Abstract: Domain adaptation is an important but challenging task. Most of the existing domain adaptation methods struggle to extract the domain-invariant representation on the feature space with entangling domain information and semantic information. Different from previous efforts on the entangled feature space, we aim to extract the domain invariant semantic information in the latent disentangled semantic representation (DSR) of the data. In DSR, we assume the data generation process is controlled by two independent sets of variables, i.e., the semantic latent variables and the domain latent variables. Under the above assumption, we employ a variational auto-encoder to reconstruct the semantic latent variables and domain latent variables behind the data. We further devise a dual adversarial network to disentangle these two sets of reconstructed latent variables. The disentangled semantic latent variables are finally adapted across the domains. Experimental studies testify that our model yields state-of-the-art performance on several domain adaptation benchmark datasets.
摘要:领域适应是一项重要但具有挑战性的任务。现有的大多数领域自适应方法都难以通过缠结领域信息和语义信息来提取特征空间上的领域不变表示。与先前对纠缠特征空间所做的努力不同,我们旨在提取数据的潜在非纠缠语义表示(DSR)中的领域不变语义信息。在DSR中,我们假设数据生成过程由两组独立的变量控制,即语义潜在变量和领域潜在变量。在上述假设下,我们采用变分自动编码器来重构数据背后的语义潜在变量和领域潜在变量。我们进一步设计了一个双重对抗网络来解开这两组重构的潜在变量。解开的语义潜在变量最终会在各个域之间进行调整。实验研究证明,我们的模型在多个领域自适应基准数据集上均具有最先进的性能。
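The generative assumption can be sketched as an encoder that splits the latent code into independent semantic and domain parts, in the usual VAE style. The architecture and dimensions below are assumptions, and the dual adversarial disentanglement losses are omitted.

```python
# Hedged sketch of a two-latent VAE encoder: semantic vs. domain codes.
import torch
import torch.nn as nn

class DSREncoder(nn.Module):
    def __init__(self, in_dim=784, sem_dim=32, dom_dim=8):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.sem_head = nn.Linear(256, 2 * sem_dim)     # mu and log-variance
        self.dom_head = nn.Linear(256, 2 * dom_dim)

    def forward(self, x):
        h = self.backbone(x)
        mu_s, logv_s = self.sem_head(h).chunk(2, dim=1)
        mu_d, logv_d = self.dom_head(h).chunk(2, dim=1)
        z_s = mu_s + torch.randn_like(mu_s) * (0.5 * logv_s).exp()  # reparameterize
        z_d = mu_d + torch.randn_like(mu_d) * (0.5 * logv_d).exp()
        return z_s, z_d

z_sem, z_dom = DSREncoder()(torch.randn(4, 784))
print(z_sem.shape, z_dom.shape)                         # (4, 32) (4, 8)
```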
34. Graph and Temporal Convolutional Networks for 3D Multi-person Pose Estimation in Monocular Videos [PDF] 返回目录
Yu Cheng, Bo Wang, Bo Yang, Robby T. Tan
Abstract: Despite the recent progress, 3D multi-person pose estimation from monocular videos is still challenging due to the commonly encountered problem of missing information caused by occlusion, partially out-of-frame target persons, and inaccurate person detection. To tackle this problem, we propose a novel framework integrating graph convolutional networks (GCNs) and temporal convolutional networks (TCNs) to robustly estimate camera-centric multi-person 3D poses that do not require camera parameters. In particular, we introduce a human-joint GCN, which unlike the existing GCN, is based on a directed graph that employs the 2D pose estimator's confidence scores to improve the pose estimation results. We also introduce a human-bone GCN, which models the bone connections and provides more information beyond human joints. The two GCNs work together to estimate the spatial frame-wise 3D poses and can make use of both visible joint and bone information in the target frame to estimate the occluded or missing human-part information. To further refine the 3D pose estimation, we use our temporal convolutional networks (TCNs) to enforce the temporal and human-dynamics constraints. We use a joint-TCN to estimate person-centric 3D poses across frames, and propose a velocity-TCN to estimate the speed of 3D joints to ensure the consistency of the 3D pose estimation in consecutive frames. Finally, to estimate the 3D human poses for multiple persons, we propose a root-TCN that estimates camera-centric 3D poses without requiring camera parameters. Quantitative and qualitative evaluations demonstrate the effectiveness of the proposed method.
摘要:尽管最近取得了进展,从单目视频进行3D多人姿态估计仍然具有挑战性,原因在于常见的信息缺失问题,它由遮挡、目标人物部分出画以及不准确的人体检测所造成。为了解决此问题,我们提出了一种新颖的框架,将图卷积网络(GCN)和时间卷积网络(TCN)集成在一起,以可靠地估计不需要相机参数的以相机为中心的多人3D姿态。特别是,我们介绍了与现有GCN不同的人体关节GCN,它基于一个有向图,利用2D姿态估计器的置信度得分来改善姿态估计结果。我们还介绍了人体骨骼GCN,它对骨骼连接进行建模,并提供人体关节以外的更多信息。两个GCN协同工作以逐帧估计空间3D姿态,并且可以利用目标帧中的可见关节和骨骼信息来估计被遮挡或缺失的人体部位信息。为了进一步完善3D姿态估计,我们使用时间卷积网络(TCN)来施加时间和人体动力学约束。我们使用关节TCN来估计跨帧的以人为中心的3D姿态,并提出速度TCN来估计3D关节的速度,以确保连续帧中3D姿态估计的一致性。最后,为了估计多个人的3D人体姿态,我们提出了一个根TCN,无需相机参数即可估计以相机为中心的3D姿态。定量和定性评估证明了该方法的有效性。
35. Modeling Deep Learning Based Privacy Attacks on Physical Mail [PDF] 返回目录
Bingyao Huang, Ruyi Lian, Dimitris Samaras, Haibin Ling
Abstract: Mail privacy protection aims to prevent unauthorized access to hidden content within an envelope since normal paper envelopes are not as safe as we think. In this paper, for the first time, we show that with a well designed deep learning model, the hidden content may be largely recovered without opening the envelope. We start by modeling deep learning-based privacy attacks on physical mail content as learning the mapping from the camera-captured envelope front face image to the hidden content, then we explicitly model the mapping as a combination of perspective transformation, image dehazing and denoising using a deep convolutional neural network, named Neural-STE (See-Through-Envelope). We show experimentally that hidden content details, such as texture and image structure, can be clearly recovered. Finally, our formulation and model allow us to design envelopes that can counter deep learning-based privacy attacks on physical mail.
摘要:邮件隐私保护旨在防止未经授权地访问信封内的隐藏内容,因为普通纸质信封并不像我们想象的那样安全。在本文中,我们首次展示了借助精心设计的深度学习模型,无需打开信封即可在很大程度上恢复隐藏内容。我们首先将针对实体邮件内容的基于深度学习的隐私攻击建模为学习从相机拍摄的信封正面图像到隐藏内容的映射,然后使用名为Neural-STE(See-Through-Envelope)的深度卷积神经网络,将该映射显式建模为透视变换、图像去雾和去噪的组合。我们通过实验表明,隐藏内容的细节(例如纹理和图像结构)可以被清晰地恢复。最后,我们的建模和模型使我们能够设计可抵御针对实体邮件的基于深度学习的隐私攻击的信封。
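As a rough illustration of decomposing the mapping into a geometric warp followed by dehazing and denoising stages, the PyTorch sketch below chains a learnable affine warp (a crude stand-in for the full perspective transform) with two small convolutional blocks; the module name, layer sizes, and affine simplification are our assumptions, not the published Neural-STE architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeeThroughSketch(nn.Module):
    """Warp -> dehaze -> denoise pipeline, in toy form."""
    def __init__(self):
        super().__init__()
        # Learnable 2x3 affine as a simplified stand-in for the perspective warp.
        self.theta = nn.Parameter(torch.tensor([[1., 0., 0.], [0., 1., 0.]]))
        self.dehaze = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(16, 3, 3, padding=1))
        self.denoise = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, 3, 3, padding=1))

    def forward(self, x):                       # x: (N, 3, H, W) envelope photo
        theta = self.theta.unsqueeze(0).expand(x.size(0), -1, -1)
        grid = F.affine_grid(theta, list(x.size()), align_corners=False)
        x = F.grid_sample(x, grid, align_corners=False)
        return self.denoise(self.dehaze(x))     # estimate of the hidden content

print(SeeThroughSketch()(torch.rand(1, 3, 64, 64)).shape)
```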
36. SERV-CT: A disparity dataset from CT for validation of endoscopic 3D reconstruction [PDF] 返回目录
P.J. "Eddie'' Edwards, Dimitris Psychogyios, Stefanie Speidel, Lena Maier-Hein, Danail Stoyanov
Abstract: In computer vision, reference datasets have been highly successful in promoting algorithmic development in stereo reconstruction. Surgical scenes give rise to specific problems, including the lack of clear corner features, highly specular surfaces and the presence of blood and smoke. Publicly available datasets have been produced using CT and either phantom images or biological tissue samples covering a relatively small region of the endoscope field-of-view. We present a stereo-endoscopic reconstruction validation dataset based on CT (SERV-CT). Two ex vivo small porcine full torso cadavers were placed within the view of the endoscope with both the endoscope and target anatomy visible in the CT scan. Orientation of the endoscope was manually aligned to the stereoscopic view. Reference disparities and occlusions were calculated for 8 stereo pairs from each sample. For the second sample an RGB surface was acquired to aid alignment of smooth, featureless surfaces. Repeated manual alignments showed an RMS disparity accuracy of ~2 pixels and a depth accuracy of ~2 mm. The reference dataset includes endoscope image pairs with corresponding calibration, disparities, depths and occlusions covering the majority of the endoscopic image and a range of tissue types. Smooth specular surfaces and images with significant variation of depth are included. We assessed the performance of various stereo algorithms from online available repositories. There is significant variation between algorithms, highlighting some of the challenges of surgical endoscopic images. The SERV-CT dataset provides an easy-to-use stereoscopic validation for surgical applications with smooth reference disparities and depths with coverage over the majority of the endoscopic images. This complements existing resources well and we hope it will aid the development of surgical endoscopic anatomical reconstruction algorithms.
摘要:在计算机视觉中,参考数据集在推动立体重建算法的发展方面非常成功。手术场景会带来一些特有的问题,包括缺乏清晰的角点特征、高度镜面反射的表面以及血液和烟雾的存在。现有的公开数据集使用CT以及体模图像或生物组织样本制作,仅覆盖内窥镜视野中相对较小的区域。我们提出了一个基于CT的立体内窥镜重建验证数据集(SERV-CT)。将两具离体小型猪全躯干尸体置于内窥镜视野内,内窥镜和目标解剖结构在CT扫描中均可见。内窥镜的朝向被手动对准到立体视图。从每个样本中为8个立体像对计算了参考视差和遮挡。对于第二个样本,还采集了RGB表面以帮助对齐光滑、无特征的表面。重复的手动对齐显示RMS视差精度约为2个像素,深度精度约为2mm。该参考数据集包括带有相应标定、视差、深度和遮挡的内窥镜图像对,覆盖了大部分内窥镜图像和多种组织类型,并包含光滑的镜面表面和深度变化显著的图像。我们评估了来自在线公开代码库的多种立体匹配算法的性能。这些算法之间存在显著差异,凸显了手术内窥镜图像的一些挑战。SERV-CT数据集为手术应用提供了易于使用的立体验证,具有平滑的参考视差和深度,覆盖大部分内窥镜图像。它很好地补充了现有资源,我们希望它能助力手术内窥镜解剖重建算法的开发。
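A small helper of the kind one might use against this dataset's reference disparities and occlusion masks; the function and variable names are ours, not part of any SERV-CT toolkit, and the arrays below are synthetic stand-ins.

```python
import numpy as np

def rms_disparity_error(pred, ref, occluded):
    """RMS disparity error (pixels) over valid, non-occluded reference pixels."""
    valid = (~occluded) & np.isfinite(ref)
    return float(np.sqrt(np.mean((pred[valid] - ref[valid]) ** 2)))

ref = np.random.rand(256, 320) * 64             # stand-in reference disparities
pred = ref + np.random.randn(256, 320) * 2      # stand-in prediction, ~2 px error
occluded = np.zeros_like(ref, dtype=bool)
print(rms_disparity_error(pred, ref, occluded))
```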
P.J. "Eddie'' Edwards, Dimitris Psychogyios, Stefanie Speidel, Lena Maier-Hein, Danail Stoyanov
Abstract: In computer vision, reference datasets have been highly successful in promoting algorithmic development in stereo reconstruction. Surgical scenes gives rise to specific problems, including the lack of clear corner features, highly specular surfaces and the presence of blood and smoke. Publicly available datasets have been produced using CT and either phantom images or biological tissue samples covering a relatively small region of the endoscope field-of-view. We present a stereo-endoscopic reconstruction validation dataset based on CT (SERV-CT). Two {\it ex vivo} small porcine full torso cadavers were placed within the view of the endoscope with both the endoscope and target anatomy visible in the CT scan. Orientation of the endoscope was manually aligned to the stereoscopic view. Reference disparities and occlusions were calculated for 8 stereo pairs from each sample. For the second sample an RGB surface was acquired to aid alignment of smooth, featureless surfaces. Repeated manual alignments showed an RMS disparity accuracy of ~2 pixels and a depth accuracy of ~2mm. The reference dataset includes endoscope image pairs with corresponding calibration, disparities, depths and occlusions covering the majority of the endoscopic image and a range of tissue types. Smooth specular surfaces and images with significant variation of depth are included. We assessed the performance of various stereo algorithms from online available repositories. There is a significant variation between algorithms, highlighting some of the challenges of surgical endoscopic images. The SERV-CT dataset provides an easy to use stereoscopic validation for surgical applications with smooth reference disparities and depths with coverage over the majority of the endoscopic images. This complements existing resources well and we hope will aid the development of surgical endoscopic anatomical reconstruction algorithms.
摘要:在计算机视觉中,参考数据集在促进立体重建算法开发方面非常成功。手术现场会引起一些具体问题,包括缺乏清晰的角部特征,高度镜面的表面以及血液和烟雾的存在。使用CT和覆盖内窥镜视野相对较小区域的幻像图像或生物组织样本,已经产生了可公开获得的数据集。我们提出了基于CT(SERV-CT)的立体内窥镜重建验证数据集。将两个{离体}小型猪全躯干尸体放在内窥镜视野内,在CT扫描中可以看到内窥镜和目标解剖结构。内窥镜的方向手动对准立体视图。从每个样本中计算了8对立体对的参考差异和遮挡。对于第二个样本,获取RGB表面以帮助对齐平滑无特征的表面。重复的手动对齐显示RMS视差精度约为2个像素,深度精度约为2mm。参考数据集包括具有相应校准,视差,深度和遮挡物的内窥镜图像对,其覆盖了大部分内窥镜图像和一系列组织类型。包括光滑的镜面表面和深度变化很大的图像。我们从在线可用存储库中评估了各种立体声算法的性能。这些算法之间存在重大差异,突出了外科内窥镜图像的一些挑战。 SERV-CT数据集为手术应用提供了易于使用的立体验证,具有平滑的参考差异和深度,覆盖了大部分内窥镜图像。这很好地补充了现有资源,我们希望将有助于外科内窥镜解剖重建算法的开发。
37. Power-SLIC: Diagram-based superpixel generation [PDF] 返回目录
Maximilian Fiedler, Andreas Alpers
Abstract: Superpixel algorithms, which group pixels similar in color and other low-level properties, are increasingly used for pre-processing in image segmentation. Commonly important criteria for the computation of superpixels are boundary adherence, speed, and regularity. Boundary adherence and regularity are typically contradictory goals. Most recent algorithms have focused on improving boundary adherence. Motivated by improving superpixel regularity, we propose a diagram-based superpixel generation method called Power-SLIC. On the BSDS500 data set, Power-SLIC outperforms other state-of-the-art algorithms in terms of compactness and boundary precision, and its boundary adherence is the most robust against varying levels of Gaussian noise. In terms of speed, Power-SLIC is competitive with SLIC.
摘要:超像素算法将颜色及其他低级属性相似的像素分组,越来越多地用于图像分割的预处理。计算超像素时常用的重要标准是边界依从性、速度和规则性。边界依从性和规则性通常是相互矛盾的目标,而最新的算法大多集中在改善边界依从性上。为了提高超像素的规则性,我们提出了一种基于图解(diagram)的超像素生成方法,称为Power-SLIC。在BSDS500数据集上,就紧凑性和边界精度而言,Power-SLIC优于其他最新算法,并且其边界依从性对不同水平的高斯噪声最为鲁棒。在速度方面,Power-SLIC可与SLIC相媲美。
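Power-SLIC itself has no implementation referenced in the abstract, but the SLIC baseline it is compared against is readily available; a minimal scikit-image example (parameter values are arbitrary) shows the compactness knob that trades boundary adherence for regularity.

```python
from skimage import data, segmentation

img = data.astronaut()
# Higher compactness -> more regular (grid-like) superpixels,
# lower compactness -> better boundary adherence.
labels = segmentation.slic(img, n_segments=400, compactness=10.0, start_label=0)
print(labels.max() + 1, "superpixels")
```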
38. Contraband Materials Detection Within Volumetric 3D Computed Tomography Baggage Security Screening Imagery [PDF] 返回目录
Qian Wang, Toby P. Breckon
Abstract: Automatic prohibited object detection within 2D/3D X-ray Computed Tomography (CT) has been studied in literature to enhance the aviation security screening at checkpoints. Deep Convolutional Neural Networks (CNN) have demonstrated superior performance in 2D X-ray imagery. However, there exists very limited proof of how deep neural networks perform in materials detection within volumetric 3D CT baggage screening imagery. We attempt to close this gap by applying Deep Neural Networks in 3D contraband substance detection based on their material signatures. Specifically, we formulate it as a 3D semantic segmentation problem to identify material types for all voxels based on which contraband materials can be detected. To this end, we firstly investigate 3D CNN based semantic segmentation algorithms such as 3D U-Net and its variants. In contrast to the original dense representation form of volumetric 3D CT data, we propose to convert the CT volumes into sparse point clouds which allows the use of point cloud processing approaches such as PointNet++ towards more efficient processing. Experimental results on a publicly available dataset (NEU ATR) demonstrate the effectiveness of both 3D U-Net and PointNet++ in materials detection in 3D CT imagery for baggage security screening.
摘要:为了加强检查点的航空安检,文献中已经研究了2D/3D X射线计算机断层扫描(CT)中的违禁物体自动检测。深度卷积神经网络(CNN)在2D X射线图像中表现出卓越的性能。但是,关于深度神经网络在体积3D CT行李安检图像的材料检测中表现如何,现有证据非常有限。我们尝试通过将深度神经网络应用于基于材料特征的3D违禁物质检测来弥补这一空白。具体来说,我们将其建模为3D语义分割问题,识别所有体素的材料类型,进而据此检测违禁材料。为此,我们首先研究基于3D CNN的语义分割算法,例如3D U-Net及其变体。与体积3D CT数据原有的密集表示形式不同,我们提出将CT体数据转换为稀疏点云,从而可以使用PointNet++等点云处理方法实现更高效的处理。在公开数据集(NEU ATR)上的实验结果证明了3D U-Net和PointNet++在用于行李安检的3D CT图像材料检测中的有效性。
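The dense-to-sparse conversion the abstract describes can be sketched in a few lines of NumPy; the intensity threshold and the subsampling budget below are arbitrary assumptions, not values from the paper.

```python
import numpy as np

def ct_volume_to_point_cloud(volume, threshold=0.1, max_points=50_000):
    """Keep voxels above a threshold as (x, y, z, intensity) points."""
    mask = volume > threshold
    coords = np.argwhere(mask).astype(np.float32)        # occupied voxel indices
    points = np.hstack([coords, volume[mask][:, None]])  # (N, 4)
    if len(points) > max_points:                         # budget for PointNet++-style nets
        keep = np.random.choice(len(points), max_points, replace=False)
        points = points[keep]
    return points

print(ct_volume_to_point_cloud(np.random.rand(64, 64, 64)).shape)
```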
39. Training DNNs in O(1) memory with MEM-DFA using Random Matrices [PDF] 返回目录
Tien Chu, Kamil Mykitiuk, Miron Szewczyk, Adam Wiktor, Zbigniew Wojna
Abstract: This work presents a method for reducing memory consumption to a constant complexity when training deep neural networks. The algorithm is based on the more biologically plausible alternatives of the backpropagation (BP): direct feedback alignment (DFA) and feedback alignment (FA), which use random matrices to propagate error. The proposed method, memory-efficient direct feedback alignment (MEM-DFA), uses higher independence of layers in DFA and allows avoiding storing at once all activation vectors, unlike standard BP, FA, and DFA. Thus, our algorithm's memory usage is constant regardless of the number of layers in a neural network. The method increases the computational cost only by a constant factor of one extra forward pass. The MEM-DFA, BP, FA, and DFA were evaluated along with their memory profiles on MNIST and CIFAR-10 datasets on various neural network models. Our experiments agree with our theoretical results and show a significant decrease in the memory cost of MEM-DFA compared to the other algorithms.
摘要:这项工作提出了一种在训练深度神经网络时将内存消耗降低到恒定复杂度的方法。该算法基于反向传播(BP)的在生物学上更合理的替代方案:直接反馈对齐(DFA)和反馈对齐(FA),它们使用随机矩阵来传播误差。与标准BP、FA和DFA不同,所提出的内存高效直接反馈对齐(MEM-DFA)利用DFA中各层更高的独立性,避免了一次性存储所有激活向量。因此,无论神经网络有多少层,我们算法的内存使用量都是恒定的。该方法仅以一次额外前向传播这一常数因子增加计算成本。我们在MNIST和CIFAR-10数据集上的多种神经网络模型中评估了MEM-DFA、BP、FA和DFA及其内存占用情况。实验与我们的理论结果相符,并表明与其他算法相比,MEM-DFA的内存开销显著降低。
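To ground the idea of propagating error through fixed random matrices, here is one plain DFA update for a two-layer network in NumPy. MEM-DFA's constant-memory trick (updating layers during a second forward pass so activations need not all be stored) is omitted for brevity; the sizes and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, lr = 784, 256, 10, 0.1
W1 = rng.normal(0.0, 0.05, (n_in, n_hid))
W2 = rng.normal(0.0, 0.05, (n_hid, n_out))
B1 = rng.normal(0.0, 0.05, (n_out, n_hid))    # fixed random feedback matrix

x = rng.normal(size=(32, n_in))               # dummy batch
y = np.eye(n_out)[rng.integers(0, n_out, 32)] # dummy one-hot targets

h = np.maximum(x @ W1, 0.0)                   # forward, hidden layer (ReLU)
e = h @ W2 - y                                # output error
dh = (e @ B1) * (h > 0)                       # DFA: random projection instead of W2.T
W2 -= lr * (h.T @ e) / len(x)
W1 -= lr * (x.T @ dh) / len(x)
```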
40. HDNET: Exploiting HD Maps for 3D Object Detection [PDF] 返回目录
Bin Yang, Ming Liang, Raquel Urtasun
Abstract: In this paper we show that High-Definition (HD) maps provide strong priors that can boost the performance and robustness of modern 3D object detectors. Towards this goal, we design a single stage detector that extracts geometric and semantic features from the HD maps. As maps might not be available everywhere, we also propose a map prediction module that estimates the map on the fly from raw LiDAR data. We conduct extensive experiments on KITTI as well as a large-scale 3D detection benchmark containing 1 million frames, and show that the proposed map-aware detector consistently outperforms the state-of-the-art in both mapped and un-mapped scenarios. Importantly the whole framework runs at 20 frames per second.
摘要:在本文中,我们展示了高清(HD)地图提供了强大的先验信息,可以提高现代3D目标检测器的性能和鲁棒性。为了实现这一目标,我们设计了一个单阶段检测器,从高清地图中提取几何和语义特征。由于地图并非处处可用,我们还提出了一个地图预测模块,可根据原始LiDAR数据实时估计地图。我们在KITTI以及一个包含100万帧的大规模3D检测基准上进行了广泛的实验,结果表明,所提出的地图感知检测器在有地图和无地图的场景中均始终优于最新技术。重要的是,整个框架以每秒20帧的速度运行。
41. Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge [PDF] 返回目录
Pierre Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia Rigotti, Jarret Ross, Yair Schiff, Richard A. Young, Brian Belgodere
Abstract: Image captioning has recently demonstrated impressive progress largely owing to the introduction of neural network algorithms trained on curated dataset like MS-COCO. Often work in this field is motivated by the promise of deployment of captioning systems in practical applications. However, the scarcity of data and contexts in many competition datasets renders the utility of systems trained on these datasets limited as an assistive technology in real-world settings, such as helping visually impaired people navigate and accomplish everyday tasks. This gap motivated the introduction of the novel VizWiz dataset, which consists of images taken by the visually impaired and captions that have useful, task-oriented information. In an attempt to help the machine learning computer vision field realize its promise of producing technologies that have positive social impact, the curators of the VizWiz dataset host several competitions, including one for image captioning. This work details the theory and engineering from our winning submission to the 2020 captioning competition. Our work provides a step towards improved assistive image captioning systems.
摘要:图像字幕最近取得了令人瞩目的进展,这主要得益于在MS-COCO等精选数据集上训练的神经网络算法。该领域的工作常常以在实际应用中部署字幕系统的前景为动力。但是,许多竞赛数据集中数据和场景的稀缺,使得在这些数据集上训练的系统作为现实环境中的辅助技术(例如帮助视障人士导航和完成日常任务)的实用性受限。这一差距促成了新颖的VizWiz数据集的推出,该数据集由视障者拍摄的图像和包含有用的、面向任务信息的字幕组成。为了帮助机器学习和计算机视觉领域兑现其产出具有积极社会影响的技术的承诺,VizWiz数据集的策划者举办了多场竞赛,其中包括一项图像字幕竞赛。这项工作详细介绍了我们在2020年字幕竞赛中获胜方案背后的理论与工程实践。我们的工作朝着改进辅助性图像字幕系统迈出了一步。
42. Alleviating Noisy Data in Image Captioning with Cooperative Distillation [PDF] 返回目录
Pierre Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia Rigotti, Jarret Ross, Yair Schiff
Abstract: Image captioning systems have made substantial progress, largely due to the availability of curated datasets like Microsoft COCO or Vizwiz that have accurate descriptions of their corresponding images. Unfortunately, the scarce availability of such cleanly labeled data results in trained algorithms producing captions that can be terse and idiosyncratically specific to details in the image. We propose a new technique, cooperative distillation, that combines clean curated datasets with the web-scale automatically extracted captions of the Google Conceptual Captions dataset (GCC), which can have poor descriptions of images but is abundant in size and therefore provides a rich vocabulary, resulting in more expressive captions.
摘要:图像字幕系统已取得长足进步,这主要归功于Microsoft COCO或Vizwiz等能准确描述对应图像的精选数据集。不幸的是,这种干净标注数据的稀缺,导致训练出的算法生成的字幕可能简短,并且过分拘泥于图像中的细节。我们提出了一种称为合作蒸馏的新技术,将干净的精选数据集与Google Conceptual Captions数据集(GCC)从网络规模自动提取的字幕相结合;GCC的图像描述质量可能不高,但规模庞大,因而提供了丰富的词汇,从而产生更具表现力的字幕。
43. Natural vs Balanced Distribution in Deep Learning on Whole Slide Images for Cancer Detection [PDF] 返回目录
Ismat Ara Reshma, Sylvain Cussat-Blanc, Radu Tudor Ionescu, Hervé Luga, Josiane Mothe
Abstract: The class distribution of data is one of the factors that regulates the performance of machine learning models. However, investigations on the impact of different distributions available in the literature are very few, sometimes absent for domain-specific tasks. In this paper, we analyze the impact of natural and balanced distributions of the training set in deep learning (DL) models applied on histological images, also known as whole slide images (WSIs). WSIs are considered as the gold standard for cancer diagnosis. In recent years, researchers have turned their attention to DL models to automate and accelerate the diagnosis process. In the training of such DL models, filtering out the non-regions-of-interest from the WSIs and adopting an artificial distribution (usually, a balanced distribution) is a common trend. In our analysis, we show that keeping the WSIs data in their usual distribution (which we call natural distribution) for DL training produces fewer false positives (FPs) with comparable false negatives (FNs) than the artificially-obtained balanced distribution. We conduct an empirical comparative study with 10 random folds for each distribution, comparing the resulting average performance levels in terms of five different evaluation metrics. Experimental results show the effectiveness of the natural distribution over the balanced one across all the evaluation metrics.
摘要:数据的类别分布是影响机器学习模型性能的因素之一。但是,文献中关于不同分布影响的研究很少,对某些特定领域的任务甚至是空白。在本文中,我们分析了训练集的自然分布与平衡分布对应用于组织学图像(也称为全切片图像,WSI)的深度学习(DL)模型的影响。WSI被视为癌症诊断的金标准。近年来,研究人员将注意力转向DL模型,以实现诊断过程的自动化和加速。在训练此类DL模型时,从WSI中滤除非感兴趣区域并采用人工分布(通常是平衡分布)是一种常见做法。在我们的分析中,我们表明,在DL训练中保持WSI数据的原有分布(我们称之为自然分布),与人工获得的平衡分布相比,在假阴性(FN)相当的情况下产生更少的假阳性(FP)。我们对每种分布进行了10个随机折的实证比较研究,依据五种不同的评估指标比较所得的平均性能水平。实验结果表明,在所有评估指标上,自然分布都优于平衡分布。
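The two training regimes being compared can be made concrete with a few lines of NumPy: the "natural" setting keeps the data as collected, while the "balanced" setting undersamples the majority class. The class ratio below is made up for illustration.

```python
import numpy as np

def balance_indices(labels, rng):
    """Undersample every class to the size of the rarest one."""
    classes, counts = np.unique(labels, return_counts=True)
    n = counts.min()
    return np.concatenate([rng.choice(np.where(labels == c)[0], n, replace=False)
                           for c in classes])

rng = np.random.default_rng(0)
labels = rng.choice([0, 1], size=10_000, p=[0.95, 0.05])   # natural skew
balanced = balance_indices(labels, rng)
print("natural:", np.bincount(labels), "balanced:", np.bincount(labels[balanced]))
```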
44. Smoothed Gaussian Mixture Models for Video Classification and Recommendation [PDF] 返回目录
Sirjan Kafle, Aman Gupta, Xue Xia, Ananth Sankar, Xi Chen, Di Wen, Liang Zhang
Abstract: Cluster-and-aggregate techniques such as Vector of Locally Aggregated Descriptors (VLAD), and their end-to-end discriminatively trained equivalents like NetVLAD have recently been popular for video classification and action recognition tasks. These techniques operate by assigning video frames to clusters and then representing the video by aggregating residuals of frames with respect to the mean of each cluster. Since some clusters may see very little video-specific data, these features can be noisy. In this paper, we propose a new cluster-and-aggregate method which we call smoothed Gaussian mixture model (SGMM), and its end-to-end discriminatively trained equivalent, which we call deep smoothed Gaussian mixture model (DSGMM). SGMM represents each video by the parameters of a Gaussian mixture model (GMM) trained for that video. Low-count clusters are addressed by smoothing the video-specific estimates with a universal background model (UBM) trained on a large number of videos. The primary benefit of SGMM over VLAD is smoothing which makes it less sensitive to small number of training samples. We show, through extensive experiments on the YouTube-8M classification task, that SGMM/DSGMM is consistently better than VLAD/NetVLAD by a small but statistically significant margin. We also show results using a dataset created at LinkedIn to predict if a member will watch an uploaded video.
摘要:诸如局部聚合描述符向量(VLAD)之类的聚类-聚合技术,及其可端到端判别式训练的等价方法(如NetVLAD),最近在视频分类和动作识别任务中很流行。这些技术将视频帧分配到各个簇,然后通过聚合帧相对于每个簇均值的残差来表示视频。由于某些簇可能只能看到很少的视频特定数据,这些特征可能带有噪声。在本文中,我们提出了一种新的聚类-聚合方法,称为平滑高斯混合模型(SGMM),及其可端到端判别式训练的等价方法,称为深度平滑高斯混合模型(DSGMM)。SGMM用针对每个视频训练的高斯混合模型(GMM)的参数来表示该视频。对于数据量少的簇,通过使用在大量视频上训练的通用背景模型(UBM)来平滑视频特定的估计值加以解决。SGMM相对于VLAD的主要优点是平滑,这使其对训练样本数量少的情况不那么敏感。通过在YouTube-8M分类任务上的大量实验,我们表明SGMM/DSGMM以微小但统计上显著的优势始终优于VLAD/NetVLAD。我们还在LinkedIn创建的数据集上展示了预测会员是否会观看上传视频的结果。
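The smoothing idea, shrinking a video's cluster statistics toward a universal background model when a cluster sees few frames, can be sketched as a MAP-style interpolation of the means. The hard assignments and the relevance factor `tau` below are simplifying assumptions; the paper's exact adaptation rule may differ.

```python
import numpy as np

def smooth_gmm_means(frames, ubm_means, assign, tau=10.0):
    """Interpolate per-video cluster means with UBM means.

    frames    : (N, D) frame features of one video
    ubm_means : (K, D) universal-background-model means
    assign    : (N,)   hard cluster index per frame
    tau       : relevance factor; low-count clusters stay near the UBM
    """
    smoothed = ubm_means.copy()
    for k in range(len(ubm_means)):
        fk = frames[assign == k]
        if len(fk):
            alpha = len(fk) / (len(fk) + tau)   # MAP interpolation weight
            smoothed[k] = alpha * fk.mean(0) + (1 - alpha) * ubm_means[k]
    return smoothed

rng = np.random.default_rng(0)
print(smooth_gmm_means(rng.normal(size=(40, 8)), rng.normal(size=(4, 8)),
                       rng.integers(0, 4, 40)).shape)
```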
45. Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised Video Object Segmentation [PDF] 返回目录
Hyojin Park, Jayeon Yoo, Seohyeong Jeong, Ganesh Venkatesh, Nojun Kwak
Abstract: Current state-of-the-art approaches for Semi-supervised Video Object Segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame. This results in high-quality segmentation across challenging scenarios such as changes in appearance and occlusion. But it also leads to unnecessary computations for stationary or slow-moving objects where the change across frames is minimal. In this work, we exploit this observation by using temporal information to quickly identify frames with minimal change and skip the heavyweight mask generation step. To realize this efficiency, we propose a novel dynamic network that estimates change across frames and decides which path -- computing a full network or reusing the previous frame's features -- to choose depending on the expected similarity. Experimental results show that our approach significantly improves inference speed without much accuracy degradation on challenging Semi-VOS datasets -- DAVIS 16, DAVIS 17, and YouTube-VOS. Furthermore, our approach can be applied to multiple Semi-VOS methods, demonstrating its generality.
摘要:当前最先进的半监督视频目标分割(Semi-VOS)方法传播来自先前帧的信息,为当前帧生成分割掩码。这在外观变化和遮挡等具有挑战性的场景中带来了高质量的分割。但对于帧间变化极小的静止或缓慢移动的目标,这也会导致不必要的计算。在这项工作中,我们利用这一观察,通过时间信息快速识别变化极小的帧,并跳过重量级的掩码生成步骤。为了实现这种效率,我们提出了一种新颖的动态网络,该网络估计帧间变化,并根据预期的相似度决定选择哪条路径:计算完整网络,还是重用前一帧的特征。实验结果表明,在具有挑战性的Semi-VOS数据集(DAVIS 16、DAVIS 17和YouTube-VOS)上,我们的方法显著提高了推理速度,而精度几乎没有下降。此外,我们的方法可以应用于多种Semi-VOS方法,证明了其通用性。
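A toy version of the compute-or-reuse control flow: the frame-difference gate, the threshold, and the stand-in networks below are our assumptions (the paper learns the gate rather than hand-coding it).

```python
import torch

def reuse_gate(prev, cur, thresh=0.02):
    """Toy gate: fire the heavy path when consecutive frames differ enough."""
    return (cur - prev).abs().mean().item() > thresh

def segment_video(frames, heavy_net, light_update):
    masks, prev, feat = [], None, None
    for f in frames:                               # f: (1, 3, H, W)
        if prev is None or reuse_gate(prev, f):
            feat = heavy_net(f)                    # full mask computation
        else:
            feat = light_update(feat, f)           # cheap reuse path
        masks.append(feat.argmax(dim=1))
        prev = f
    return masks

heavy = torch.nn.Conv2d(3, 2, 3, padding=1)        # dummy stand-in network
light = lambda feat, f: feat                       # reuse features unchanged
print(len(segment_video([torch.rand(1, 3, 64, 64) for _ in range(5)], heavy, light)))
```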
46. FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations [PDF] 返回目录
Yichi Zhang, Junhao Pan, Xinheng Liu, Hongzheng Chen, Deming Chen, Zhiru Zhang
Abstract: Binary neural networks (BNNs) have 1-bit weights and activations. Such networks are well suited for FPGAs, as their dominant computations are bitwise arithmetic and the memory requirement is also significantly reduced. However, compared to state-of-the-art compact convolutional neural network (CNN) models, BNNs tend to produce a much lower accuracy on realistic datasets such as ImageNet. In addition, the input layer of BNNs has gradually become a major compute bottleneck, because it is conventionally excluded from binarization to avoid a large accuracy loss. This work proposes FracBNN, which exploits fractional activations to substantially improve the accuracy of BNNs. Specifically, our approach employs a dual-precision activation scheme to compute features with up to two bits, using an additional sparse binary convolution. We further binarize the input layer using a novel thermometer encoding. Overall, FracBNN preserves the key benefits of conventional BNNs, where all convolutional layers are computed in pure binary MAC operations (BMACs). We design an efficient FPGA-based accelerator for our novel BNN model that supports the fractional activations. To evaluate the performance of FracBNN under a resource-constrained scenario, we implement the entire optimized network architecture on an embedded FPGA (Xilinx Ultra96v2). Our experiments on ImageNet show that FracBNN achieves an accuracy comparable to MobileNetV2, surpassing the best-known BNN design on FPGAs with an increase of 28.9% in top-1 accuracy and a 2.5x reduction in model size. FracBNN also outperforms a recently introduced BNN model with an increase of 2.4% in top-1 accuracy while using the same model size. On the embedded FPGA device, FracBNN demonstrates the capability of real-time image classification.
摘要:二进制神经网络(BNN)具有1位权重和激活。这样的网络非常适合FPGA,因为它们的主要计算是按位算术的,并且存储器的需求也大大减少了。但是,与最先进的紧凑型卷积神经网络(CNN)模型相比,BNN在诸如ImageNet之类的真实数据集上产生的准确性要低得多。另外,BNN的输入层已逐渐成为主要的计算瓶颈,因为常规上将其排除在二值化之外,以避免大的精度损失。这项工作提出了FracBNN,它利用分数激活来显着提高BNN的准确性。具体来说,我们的方法采用双精度激活方案,通过使用额外的稀疏二进制卷积来计算最多两位的特征。我们使用新颖的温度计编码进一步将输入层二值化。总体而言,FracBNN保留了传统BNN的关键优势,在传统BNN中,所有卷积层均以纯二进制MAC运算(BMAC)计算。我们为支持分数激活的新型BNN模型设计了一种基于FPGA的高效加速器。为了评估在资源受限的情况下FracBNN的性能,我们在嵌入式FPGA(Xilinx Ultra96v2)上实现了整个优化的网络架构。我们在ImageNet上进行的实验表明,FracBNN的精度可与MobileNetV2媲美,超越了FPGA上最著名的BNN设计,其top-1精度提高了28.9%,模型尺寸减小了2.5倍。 FracBNN还优于最近推出的BNN模型,在使用相同模型大小的情况下,top-1准确性提高了2.4%。在嵌入式FPGA器件上,FracBNN演示了实时图像分类的功能。
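The thermometer encoding used to binarize the input layer is easy to illustrate: each 8-bit intensity becomes a cumulative code of threshold bits. The number of levels below is arbitrary, and this sketch is not the paper's exact encoder.

```python
import numpy as np

def thermometer_encode(img, levels=8):
    """Map uint8 intensities to `levels` cumulative {0,1} threshold bits."""
    thresholds = np.linspace(0, 255, levels + 2)[1:-1]       # interior cut points
    bits = (img[..., None] > thresholds).astype(np.uint8)    # e.g. 130 -> 1,1,1,1,0,0,0,0
    return bits.reshape(*img.shape[:-1], -1) if img.ndim == 3 else bits

x = (np.random.rand(4, 4, 3) * 255).astype(np.uint8)
print(thermometer_encode(x, levels=8).shape)   # (4, 4, 24): 8 bits per RGB channel
```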
47. Automated Multi-Channel Segmentation for the 4D Myocardial Velocity Mapping Cardiac MR [PDF] 返回目录
Yinzhe Wu, Suzan Hatipoglu, Diego Alonso-Álvarez, Peter Gatehouse, David Firmin, Jennifer Keegan, Guang Yang
Abstract: Four-dimensional (4D) left ventricular myocardial velocity mapping (MVM) is a cardiac magnetic resonance (CMR) technique that allows assessment of cardiac motion in three orthogonal directions. Accurate and reproducible delineation of the myocardium is crucial for accurate analysis of peak systolic and diastolic myocardial velocities. In addition to the conventionally available magnitude CMR data, 4D MVM also acquires three velocity-encoded phase datasets which are used to generate velocity maps. These can be used to facilitate and improve myocardial delineation. Based on the success of deep learning in medical image processing, we propose a novel automated framework that improves the standard U-Net based methods on these CMR multi-channel data (magnitude and phase) by cross-channel fusion with attention module and shape information based post-processing to achieve accurate delineation of both epicardium and endocardium contours. To evaluate the results, we employ the widely used Dice scores and the quantification of myocardial longitudinal peak velocities. Our proposed network trained with multi-channel data shows enhanced performance compared to standard U-Net based networks trained with single-channel data. Based on the results, our method provides compelling evidence for the design and application for the multi-channel image analysis of the 4D MVM CMR data.
摘要:四维(4D)左心室心肌速度映射(MVM)是一种心脏磁共振(CMR)技术,可以评估三个正交方向上的心脏运动。准确且可重复的心肌勾画对于准确分析收缩期和舒张期心肌峰值速度至关重要。除了常规的幅度CMR数据外,4D MVM还采集三个速度编码的相位数据集,用于生成速度图。这些数据可用于促进和改善心肌勾画。基于深度学习在医学图像处理中的成功,我们提出了一种新颖的自动化框架,该框架通过带注意力模块的跨通道融合以及基于形状信息的后处理,改进了标准U-Net方法在这些CMR多通道数据(幅度和相位)上的表现,从而实现心外膜和心内膜轮廓的准确勾画。为了评估结果,我们采用广泛使用的Dice分数以及心肌纵向峰值速度的量化。与使用单通道数据训练的标准U-Net网络相比,我们提出的使用多通道数据训练的网络表现出更好的性能。基于这些结果,我们的方法为4D MVM CMR数据的多通道图像分析的设计和应用提供了有力的证据。
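One common way to realize "cross-channel fusion with an attention module" is a squeeze-and-excitation-style gate over the four input channels (magnitude plus three velocity-encoded phases). The sketch below is that generic pattern, not the authors' specific module; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    """SE-style attention over the 4 CMR input channels."""
    def __init__(self, ch=4, hidden=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, hidden), nn.ReLU(),
                                nn.Linear(hidden, ch), nn.Sigmoid())

    def forward(self, x):                         # x: (N, 4, H, W)
        w = self.fc(x.mean(dim=(2, 3)))           # one gate per channel
        return x * w[:, :, None, None]            # reweight magnitude/phase channels

x = torch.rand(2, 4, 32, 32)                      # magnitude + 3 phase channels
print(ChannelFusion()(x).shape)
```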
48. Cloud removal in remote sensing images using generative adversarial networks and SAR-to-optical image translation [PDF] 返回目录
Faramarz Naderi Darbaghshahi, Mohammad Reza Mohammadi, Mohsen Soryani
Abstract: Satellite images are often contaminated by clouds. Cloud removal has received much attention due to the wide range of satellite image applications. As the clouds thicken, the process of removing the clouds becomes more challenging. In such cases, using auxiliary images such as near-infrared or synthetic aperture radar (SAR) for reconstructing is common. In this study, we attempt to solve the problem using two generative adversarial networks (GANs). The first translates SAR images into optical images, and the second removes clouds using the translated images of prior GAN. Also, we propose dilated residual inception blocks (DRIBs) instead of vanilla U-net in the generator networks and use structural similarity index measure (SSIM) in addition to the L1 Loss function. Reducing the number of downsamplings and expanding receptive fields by dilated convolutions increase the quality of output images. We used the SEN1-2 dataset to train and test both GANs, and we made cloudy images by adding synthetic clouds to optical images. The restored images are evaluated with PSNR and SSIM. We compare the proposed method with state-of-the-art deep learning models and achieve more accurate results in both SAR-to-optical translation and cloud removal parts.
摘要:卫星图像经常被云污染。由于卫星图像应用广泛,去云已受到广泛关注。随着云层变厚,去云过程变得更具挑战性。在这种情况下,通常使用近红外或合成孔径雷达(SAR)等辅助图像进行重建。在这项研究中,我们尝试使用两个生成对抗网络(GAN)解决该问题:第一个将SAR图像转换为光学图像,第二个利用前一个GAN转换出的图像去除云。此外,我们提出在生成器网络中用扩张残差Inception块(DRIB)替代普通U-net,并在L1损失函数之外还使用结构相似性指数(SSIM)。通过扩张卷积减少下采样次数并扩大感受野,可以提高输出图像的质量。我们使用SEN1-2数据集来训练和测试这两个GAN,并通过在光学图像中添加合成云来制作多云图像。还原后的图像用PSNR和SSIM进行评估。我们将所提方法与最先进的深度学习模型进行了比较,在SAR到光学图像转换和去云两个部分均获得了更准确的结果。
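The evaluation step is straightforward to reproduce with scikit-image (assuming a recent version with `channel_axis`); the arrays below are synthetic stand-ins for a restored/reference image pair.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

ref = np.random.rand(64, 64, 3)                               # reference cloud-free image
out = np.clip(ref + 0.05 * np.random.randn(64, 64, 3), 0, 1)  # restored image
print("PSNR:", peak_signal_noise_ratio(ref, out, data_range=1.0))
print("SSIM:", structural_similarity(ref, out, channel_axis=2, data_range=1.0))
```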
49. Deep Learning of Cell Classification using Microscope Images of Intracellular Microtubule Networks [PDF] 返回目录
Aleksei Shpilman, Dmitry Boikiy, Marina Polyakova, Daniel Kudenko, Anton Burakov, Elena Nadezhdina
Abstract: Microtubule networks (MTs) are a component of a cell that may indicate the presence of various chemical compounds and can be used to recognize properties such as treatment resistance. Therefore, the classification of MT images is of great relevance for cell diagnostics. Human experts find it particularly difficult to recognize the levels of chemical compound exposure of a cell. Improving the accuracy with automated techniques would have a significant impact on cell therapy. In this paper we present the application of Deep Learning to MT image classification and evaluate it on a large MT image dataset of animal cells with three degrees of exposure to a chemical agent. The results demonstrate that the learned deep network performs on par or better at the corresponding cell classification task than human experts. Specifically, we show that the task of recognizing different levels of chemical agent exposure can be handled significantly better by the neural network than by human experts.
摘要:微管网络(MT)是细胞的一个组成部分,可以指示多种化合物的存在,并可用于识别诸如耐药性之类的特性。因此,MT图像的分类对于细胞诊断具有重要意义。人类专家发现,识别细胞所受化合物暴露的水平尤其困难。用自动化技术提高准确性将对细胞治疗产生重大影响。在本文中,我们介绍了深度学习在MT图像分类中的应用,并在一个包含三种化学剂暴露程度的大型动物细胞MT图像数据集上对其进行了评估。结果表明,所学习的深度网络在相应的细胞分类任务上的表现与人类专家相当甚至更好。具体而言,我们表明,在识别不同水平的化学剂暴露这一任务上,神经网络可以比人类专家处理得明显更好。
50. Prediction of Chronic Kidney Disease Using Deep Neural Network [PDF] 返回目录
Iliyas Ibrahim Iliyas, Isah Rambo Saidu, Ali Baba Dauda, Suleiman Tasiu
Abstract: Deep Neural Networks (DNNs) are becoming a focal point of Machine Learning research. Their applications are penetrating different fields, solving intricate and complex problems. DNNs are now applied in health image processing to detect various ailments such as cancer and diabetes. Another disease that threatens our health is kidney disease. This disease is becoming prevalent due to the substances and elements we ingest. Without at least one functioning kidney, death is imminent and inevitable within a few days. Ignoring kidney malfunction can cause chronic kidney disease, leading to death. Frequently, Chronic Kidney Disease (CKD) and its symptoms are mild and gradual, often going unnoticed for years and only realized late. Bade, a Local Government Area of Yobe State in Nigeria, has been a center of attention for medical practitioners due to the prevalence of CKD. Unfortunately, a technical approach to curbing the disease is yet to be attained. We obtained records of 400 patients with 10 attributes as our dataset from Bade General Hospital. We used a DNN model to predict the absence or presence of CKD in the patients. The model produced an accuracy of 98%. Furthermore, we identified and highlighted feature importance to provide a ranking of the features used in the prediction of CKD. The outcome revealed that two attributes, Creatinine and Bicarbonate, have the highest influence on the CKD prediction.
摘要:深度神经网络(DNN)正成为机器学习研究的焦点。它的应用正渗透到不同领域,解决错综复杂的问题。DNN目前已应用于健康图像处理,以检测癌症和糖尿病等多种疾病。另一种威胁我们健康的疾病是肾脏疾病。由于我们摄入的物质和元素,这种疾病正变得普遍。如果没有至少一个正常运作的肾脏,死亡将在几天之内不可避免地到来。忽视肾脏功能障碍会导致慢性肾脏病,进而导致死亡。慢性肾脏病(CKD)及其症状往往轻微而渐进,常常多年未被察觉,直到很晚才被发现。由于CKD的流行,尼日利亚约贝州的地方政府区Bade一直是医疗从业者关注的中心。不幸的是,遏制该疾病的技术手段尚未实现。我们从Bade总医院获得了400名患者、各含10个属性的记录作为数据集。我们使用DNN模型来预测患者是否患有CKD。该模型的准确率为98%。此外,我们还计算并展示了特征重要性,给出CKD预测中所用特征的排名。结果显示,肌酐和碳酸氢盐这两个属性对CKD预测的影响最大。
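For a dataset of this shape (400 records, 10 attributes), a small fully connected network is enough; here is a scikit-learn sketch on synthetic stand-in data. The layer sizes and split are our choices, not the paper's configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))            # stand-in for the 10 clinical attributes
y = rng.integers(0, 2, size=400)          # stand-in CKD / non-CKD labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```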
51. Unsupervised Spatial-spectral Network Learning for Hyperspectral Compressive Snapshot Reconstruction [PDF] 返回目录
Yubao Sun, Ying Yang, Qingshan Liu, Mohan Kankanhalli
Abstract: Hyperspectral compressive imaging takes advantage of compressive sensing theory to achieve coded aperture snapshot measurement without temporal scanning, and the entire three-dimensional spatial-spectral data is captured by a two-dimensional projection during a single integration period. Its core issue is how to reconstruct the underlying hyperspectral image using compressive sensing reconstruction algorithms. Due to the diversity in the spectral response characteristics and wavelength range of different spectral imaging devices, previous works are often inadequate to capture complex spectral variations or lack the adaptive capacity to new hyperspectral imagers. In order to address these issues, we propose an unsupervised spatial-spectral network to reconstruct hyperspectral images only from the compressive snapshot measurement. The proposed network acts as a conditional generative model conditioned on the snapshot measurement, and it exploits the spatial-spectral attention module to capture the joint spatial-spectral correlation of hyperspectral images. The network parameters are optimized to make sure that the network output can closely match the given snapshot measurement according to the imaging model, thus the proposed network can adapt to different imaging settings, which can inherently enhance the applicability of the network. Extensive experiments upon multiple datasets demonstrate that our network can achieve better reconstruction results than the state-of-the-art methods.
摘要:高光谱压缩成像利用压缩感知理论,无需时间扫描即可实现编码孔径快照测量,整个三维空间-光谱数据在单个积分周期内通过一次二维投影被捕获。其核心问题是如何使用压缩感知重建算法重建底层的高光谱图像。由于不同光谱成像设备的光谱响应特性和波长范围存在差异,以往的工作往往不足以捕获复杂的光谱变化,或缺乏对新型高光谱成像仪的适应能力。为了解决这些问题,我们提出了一种无监督的空间-光谱网络,仅从压缩快照测量中重建高光谱图像。所提网络充当以快照测量为条件的条件生成模型,并利用空间-光谱注意力模块捕获高光谱图像的空间-光谱联合相关性。网络参数经过优化,确保网络输出按照成像模型与给定的快照测量紧密匹配,因此所提网络可以适应不同的成像设置,从本质上增强了网络的适用性。在多个数据集上的大量实验表明,与最新方法相比,我们的网络可以获得更好的重建结果。
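The snapshot measurement model and the self-supervised data term the network minimizes can be written down directly. The toy below replaces the generative network with plain gradient descent on that term, just to make the objective concrete; the cube size and coded-aperture mask are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, L = 8, 8, 4                          # tiny spatial-spectral cube
mask = rng.integers(0, 2, (H, W, L))       # coded aperture, one pattern per band

def forward(cube):                         # snapshot: 2D projection of the 3D cube
    return (mask * cube).sum(axis=2)

y = forward(rng.random((H, W, L)))         # the single compressive measurement

x = np.zeros((H, W, L))                    # reconstruction variable
for _ in range(500):                       # minimize ||forward(x) - y||^2
    r = forward(x) - y
    x -= 0.05 * mask * r[..., None]        # gradient of the quadratic data term
print("measurement residual:", np.abs(forward(x) - y).mean())
```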
52. HDR Denoising and Deblurring by Learning Spatio-temporal Distortion Models [PDF] 返回目录
Uğur Çoğalan, Mojtaba Bemana, Karol Myszkowski, Hans-Peter Seidel, Tobias Ritschel
Abstract: We seek to reconstruct sharp and noise-free high-dynamic range (HDR) video from a dual-exposure sensor that records different low-dynamic range (LDR) information in different pixel columns: odd columns provide low-exposure, sharp, but noisy information; even columns complement this with less noisy, high-exposure, but motion-blurred data. Previous LDR work learns to deblur and denoise (DISTORTED->CLEAN), supervised by pairs of CLEAN and DISTORTED images. Regrettably, capturing DISTORTED sensor readings is time-consuming; as well, there is a lack of CLEAN HDR videos. We suggest a method to overcome those two limitations. First, we learn a different function instead: CLEAN->DISTORTED, which generates samples containing correlated pixel noise, row and column noise, and motion blur from a low number of CLEAN sensor readings. Second, as there is not enough CLEAN HDR video available, we devise a method to learn from LDR video instead. Our approach compares favorably to several strong baselines, and can boost existing methods when they are re-trained on our data. Combined with spatial and temporal super-resolution, it enables applications such as re-lighting with low noise or blur.
摘要:我们试图从双曝光传感器重建清晰无噪的高动态范围(HDR)视频,该传感器在不同像素列中记录不同的低动态范围(LDR)信息:奇数列可提供低曝光,清晰,但是嘈杂的信息;即使是列,也可以用较少的噪点,高曝光率但运动模糊的数据来补充这一点。先前的LDR工作是学习在成对的CLEAN和DISTORTED图像的监督下进行去模糊和去噪(DISTORTED-> CLEAN)。遗憾的是,获取失真的传感器读数非常耗时。同时,也缺少CLEAN HDR视频。我们建议一种方法来克服这两个限制。首先,我们改为学习一个不同的函数:CLEAN-> DISTORTED,该函数生成包含相关像素噪声,行和列噪声以及少量CLEAN传感器读数引起的运动模糊的样本。其次,由于没有足够的CLEAN HDR视频,我们设计了一种方法来代替LDR视频学习。我们的方法可以与几个强大的基准进行比较,并且可以在对我们的数据进行重新训练后增强现有方法。结合空间和时间超分辨率,它可以实现低噪声或模糊的重新照明等应用。
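The CLEAN->DISTORTED direction is easy to state concretely: starting from a clean frame, synthesize the column-interleaved dual-exposure reading by adding noise to the short-exposure columns and motion blur to the long-exposure columns. A sketch under assumed exposure ratio, noise level, and blur length (the paper additionally learns correlated pixel, row, and column noise rather than using fixed models):

    import numpy as np
    from scipy.ndimage import uniform_filter1d

    def distort(clean, ratio=8.0, read_noise=0.01, blur_len=9):
        short = clean / ratio + np.random.normal(0, read_noise, clean.shape)
        blurred = uniform_filter1d(clean, size=blur_len, axis=1)  # horizontal motion blur
        out = np.empty_like(clean)
        out[:, 0::2] = short[:, 0::2]    # odd columns (1-based): low exposure, sharp, noisy
        out[:, 1::2] = blurred[:, 1::2]  # even columns: high exposure, motion-blurred
        return out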
53. A Feasibility study for Deep learning based automated brain tumor segmentation using Magnetic Resonance Images [PDF] 返回目录
Shanaka Ramesh Gunasekara, HNTK Kaldera, Maheshi B. Dissanayake
Abstract: Deep learning algorithms have accounted for the rapid acceleration of artificial intelligence research in medical image analysis, interpretation, and segmentation, with many potential applications across various sub-disciplines of medicine. However, only a limited number of studies investigating these application scenarios have been deployed in the clinical sector to evaluate the real requirements and practical challenges of model deployment. In this research, a deep convolutional neural network (CNN) based classification network and a Faster R-CNN based localization network were developed for brain tumor MR image classification and tumor localization. A typical edge detection algorithm, Prewitt, was used for the tumor segmentation task, based on the output of the tumor localization. The overall performance of the proposed tumor segmentation architecture was analyzed using objective quality parameters including accuracy, Boundary Displacement Error (BDE), Dice score, and confidence interval. A subjective quality assessment of the model was conducted based on the Double Stimulus Impairment Scale (DSIS) protocol using input from medical experts. It was observed that the confidence level of our segmented output was in a similar range to that of the experts. The neurologists also rated the output of our model as highly accurate segmentation.
摘要:深度学习算法已使人工智能在医学图像分析,解释和分割方面的研究迅速发展,并在医学的各个学科中都有许多潜在的应用。但是,只有有限数量的研究这些应用场景的研究被部署到临床领域,以评估模型部署的实际需求和实际挑战。在这项研究中,开发了基于深度卷积神经网络(CNN)的分类网络和基于Faster RCNN的定位网络,用于脑肿瘤MR图像分类和肿瘤定位。基于肿瘤定位的输出,一种称为Prewitt的典型边缘检测算法用于肿瘤分割任务。使用客观质量参数(包括准确性,边界位移误差(BDE),Dice评分和置信区间)分析了建议的肿瘤分割结构的总体性能。基于双重刺激障碍量表(DSIS)协议,使用医学专业知识进行输入,对模型进行主观质量评估。据观察,我们分段输出的置信度与专家的置信度处于相似范围。此外,神经科医生对我们模型的输出结果进行了高度准确的细分。
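The segmentation stage is classical rather than learned: a Prewitt operator is run on the region proposed by the localizer. A minimal sketch, where the bounding-box format and the threshold are illustrative assumptions:

    import numpy as np
    from scipy import ndimage

    def segment_roi(mri_slice, box, thresh=0.2):
        x0, y0, x1, y1 = box                      # Faster R-CNN box, pixel coordinates
        roi = mri_slice[y0:y1, x0:x1].astype(float)
        gx = ndimage.prewitt(roi, axis=1)         # horizontal Prewitt gradient
        gy = ndimage.prewitt(roi, axis=0)         # vertical Prewitt gradient
        edges = np.hypot(gx, gy)                  # gradient magnitude
        return edges > thresh * edges.max()       # binary tumor-boundary mask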
54. This is not the Texture you are looking for! Introducing Novel Counterfactual Explanations for Non-Experts using Generative Adversarial Learning [PDF] 返回目录
Silvan Mertes, Tobias Huber, Katharina Weitz, Alexander Heimerl, Elisabeth André
Abstract: With the ongoing rise of machine learning, the need for methods that explain decisions made by artificial intelligence systems is becoming an increasingly important topic. Especially for image classification tasks, many state-of-the-art tools for explaining such classifiers rely on visual highlighting of important areas of the input data. In contrast, counterfactual explanation systems try to enable counterfactual reasoning by modifying the input image such that the classifier would have made a different prediction. By doing so, users of counterfactual explanation systems are equipped with a completely different kind of explanatory information. However, methods for generating realistic counterfactual explanations for image classifiers are still rare. In this work, we present a novel approach to generating such counterfactual image explanations based on adversarial image-to-image translation techniques. Additionally, we conduct a user study to evaluate our approach in a use case inspired by a healthcare scenario. Our results show that our approach leads to significantly better results regarding mental models, explanation satisfaction, trust, emotions, and self-efficacy than two state-of-the-art systems that work with saliency maps, namely LIME and LRP.
摘要:随着机器学习的不断发展,对解释人工智能系统决策的方法的需求正变得越来越重要。特别是对于图像分类任务,许多用于解释此类分类器的先进工具都依赖于对输入数据重要区域的视觉突出显示。相反,反事实解释系统试图通过以分类器将做出不同预测的方式修改输入图像来实现反事实推理。通过这样做,反事实解释系统的用户被配备了完全不同种类的解释信息。但是,为图像分类器生成逼真的反事实解释的方法仍然很少。在这项工作中,我们提出了一种新颖的方法来基于对抗性图像到图像的翻译技术来生成这种反事实的图像说明。此外,我们进行了用户研究,以根据医疗保健场景启发对用例进行评估。我们的结果表明,与两个使用显着性图的最先进的系统(LIME和LRP)相比,我们的方法在心理模型,解释满意度,信任,情感和自我效能方面产生了明显更好的结果。
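A plausible form of the training objective, sketched below: the generator G must change the frozen classifier f's decision while an adversarial critic D keeps the counterfactual realistic and a similarity term keeps it close to the input. The three-term decomposition and the weighting are assumptions about the method, not the paper's exact losses:

    import torch
    import torch.nn.functional as F

    def counterfactual_loss(G, D, f, x, target_class, lam_sim=10.0):
        x_cf = G(x)                                    # candidate counterfactual image
        adv = -D(x_cf).mean()                          # fool the critic: look realistic
        flip = F.cross_entropy(f(x_cf), target_class)  # force the classifier's new label
        sim = F.l1_loss(x_cf, x)                       # change as little as possible
        return adv + flip + lam_sim * sim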
55. Deep learning-based virtual refocusing of images using an engineered point-spread function [PDF] 返回目录
Xilin Yang, Luzhe Huang, Yilin Luo, Yichen Wu, Hongda Wang, Yair Rivenson, Aydogan Ozcan
Abstract: We present a virtual image refocusing method over an extended depth of field (DOF) enabled by cascaded neural networks and a double-helix point-spread function (DH-PSF). This network model, referred to as W-Net, is composed of two cascaded generator and discriminator network pairs. The first generator network learns to virtually refocus an input image onto a user-defined plane, while the second generator learns to perform a cross-modality image transformation, improving the lateral resolution of the output image. Using this W-Net model with DH-PSF engineering, we extend the DOF of a fluorescence microscope by ~20-fold. This approach can be applied to develop deep learning-enabled image reconstruction methods for localization microscopy techniques that utilize engineered PSFs to improve their imaging performance, including spatial resolution and volumetric imaging throughput.
摘要:我们提出了一种由级联神经网络和双螺旋点扩展函数(DH-PSF)启用的扩展景深(DOF)的虚拟图像重新聚焦方法。 这种网络模型称为W-Net,由两个级联的生成器网络和鉴别器网络对组成。 第一个生成器网络学习将输入图像虚拟地重新聚焦到用户定义的平面上,而第二个生成器网络学习执行跨模态图像转换,从而提高输出图像的横向分辨率。 使用带有DH-PSF工程的W-Net模型,我们将荧光显微镜的自由度扩展了约20倍。 此方法可用于开发具有深度学习功能的图像重建方法,用于定位显微技术,该方法利用工程化的PSF改善其成像性能,包括空间分辨率和体积成像吞吐量。
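The cascade is the key structural idea: one generator refocuses, the next enhances resolution. A shape-level sketch with tiny stand-in generators (the real W-Net pairs each generator with a discriminator and uses far deeper networks):

    import torch

    class TinyGen(torch.nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = torch.nn.Sequential(
                torch.nn.Conv2d(in_ch, 32, 3, padding=1), torch.nn.ReLU(),
                torch.nn.Conv2d(32, out_ch, 3, padding=1))
        def forward(self, x):
            return self.body(x)

    G1 = TinyGen(2, 1)   # image + a map encoding the user-defined refocus plane
    G2 = TinyGen(1, 1)   # cross-modality, resolution-enhancing translation

    img = torch.rand(1, 1, 128, 128)            # placeholder fluorescence image
    plane = torch.full_like(img, 0.3)           # target plane, encoded as a constant map
    refocused = G1(torch.cat([img, plane], 1))  # stage 1: virtual refocusing
    enhanced = G2(refocused)                    # stage 2: lateral-resolution boost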
56. Efficient and Visualizable Convolutional Neural Networks for COVID-19 Classification Using Chest CT [PDF] 返回目录
Aksh Garg, Sana Salehi, Marianna La Rocca, Rachael Garner, Dominique Duncan
Abstract: The novel 2019 coronavirus disease (COVID-19) has infected over 65 million people worldwide as of December 4, 2020, pushing the world to the brink of social and economic collapse. With cases rising rapidly, deep learning has emerged as a promising diagnosis technique. However, identifying the most accurate models to characterize COVID-19 patients is challenging because comparing results obtained with different types of data and acquisition processes is non-trivial. In this paper, we evaluated and compared 40 different convolutional neural network architectures for COVID-19 diagnosis, making this the first study to consider the EfficientNet family for COVID-19 diagnosis. EfficientNet-B5 is identified as the best model with an accuracy of 0.9931+/-0.0021, F1 score of 0.9931+/-0.0020, sensitivity of 0.9952+/-0.0020, and specificity of 0.9912+/-0.0048. Intermediate activation maps and Gradient-weighted Class Activation Mappings offer human-interpretable evidence of the model's perception of ground-glass opacities and consolidations, hinting towards a promising use case for artificial intelligence-assisted radiology tools.
摘要:截至2020年12月4日,新颖的2019年冠状病毒病(COVID-19)已感染全球6500万人,使世界濒临社会和经济崩溃的边缘。随着病例的迅速增加,深度学习已成为一种很有前途的诊断技术。但是,确定最准确的模型来表征COVID-19患者具有挑战性,因为比较使用不同类型的数据和采集过程获得的结果并非易事。在本文中,我们评估和比较了40种不同的卷积神经网络体系结构用于COVID-19诊断,这是第一个考虑EfficientNet系列用于COVID-19诊断的体系。 EfficientNet-B5被认为是最佳模型,其准确度为0.9931 +/- 0.0021,F1得分为0.9931 +/- 0.0020,灵敏度为0.9952 +/- 0.0020,特异性为0.9912 +/- 0.0048。中间激活图和梯度加权类激活图为模型对地面类浑浊和合并的感知提供了人类可解释的证据,这暗示了人工智能辅助放射学工具的前景广阔的用例。
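Grad-CAM, the visualization named above, is standard and easy to sketch: weight the last convolutional feature maps by their spatially averaged gradients with respect to the class score. The backbone and input below are placeholders, not the paper's trained model:

    import torch
    from torchvision.models import resnet18

    model = resnet18(num_classes=2).eval()      # stand-in for the COVID-19 classifier
    feats, grads = {}, {}
    layer = model.layer4                        # last convolutional block
    layer.register_forward_hook(lambda m, i, o: feats.update(v=o))
    layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

    x = torch.rand(1, 3, 224, 224)              # placeholder chest-CT slice
    model(x)[0, 1].backward()                   # gradient of the COVID-class logit

    w = grads["v"].mean(dim=(2, 3), keepdim=True)    # channel weights
    cam = torch.relu((w * feats["v"]).sum(dim=1))    # weighted sum of feature maps
    cam = cam / cam.max()                            # normalized class activation map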
57. Residual Matrix Product State for Machine Learning [PDF] 返回目录
Ye-Ming Meng, Jing Zhang, Peng Zhang, Chao Gao, Shi-Ju Ran
Abstract: Tensor networks (TN), which originate from quantum physics, show broad prospects in classical and quantum machine learning (ML). However, there still exists a considerable gap in accuracy between TN and sophisticated neural network (NN) models for classical ML. It remains unclear how far TN ML can be improved by, e.g., borrowing techniques from NNs. In this work, we propose the residual matrix product state (ResMPS) by combining the ideas of the matrix product state (MPS) and residual NNs. ResMPS can be treated as a network whose layers map the "hidden" features to the outputs (e.g., classifications), and whose variational parameters are functions of the features of the samples (e.g., pixels of images). This is essentially different from NNs, where the layers map the features feed-forwardly to the output. ResMPS can naturally incorporate non-linear activations and dropout layers, and it outperforms state-of-the-art TN models in efficiency, stability, and expressive power. Besides, ResMPS is interpretable from the perspective of polynomial expansion, where factorization and exponential machines naturally emerge. Our work contributes to connecting and hybridizing neural and tensor networks, which is crucial for further understanding their working mechanisms and improving both models' performance.
摘要:Tensor网络(TN)起源于量子物理学,在经典和量子机器学习(ML)中显示出广阔的前景。但是,TN和经典ML的复杂神经网络(NN)模型之间仍然存在相当大的精度差距。仍然难以捉摸,例如通过借鉴NN的技术可以将TN ML改进多远。在这项工作中,我们通过结合矩阵乘积状态(MPS)和残差NN的思想提出残差矩阵乘积状态(ResMPS)。 ResMPS可被视为网络,其中其层将“隐藏”特征映射到输出(例如分类),并且层的变化参数是样本特征(例如图像的像素)的函数。这与NN基本不同,后者的图层将要素前馈映射到输出。 ResMPS可以自然地与非线性激活和缺失层结合在一起,并且在效率,稳定性和表达能力方面均优于最新的TN模型。此外,ResMPS可以从多项式扩展的角度进行解释,因为自然会出现因式分解和指数机。我们的工作为神经网络和张量网络的连接和混合做出了贡献,这对于进一步了解工作机制并改善两种模型的性能至关重要。
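The residual idea can be written in a few lines: each layer adds to the hidden state a correction whose strength is modulated by one input feature, so the parameters act as functions of the sample's features. The sketch below is a simplified linear variant with assumed sizes, not the paper's full construction:

    import torch

    class ResMPSLayer(torch.nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.W = torch.nn.Parameter(torch.randn(dim, dim) * 0.01)
        def forward(self, h, x_t):
            # residual update gated by the scalar feature x_t (e.g., one pixel)
            return h + x_t[:, None] * (h @ self.W.T)

    dim, n_feat = 16, 28 * 28
    layers = torch.nn.ModuleList([ResMPSLayer(dim) for _ in range(n_feat)])
    head = torch.nn.Linear(dim, 10)             # assumed classification head

    x = torch.rand(4, n_feat)                   # batch of flattened images
    h = torch.ones(4, dim)                      # initial hidden state
    for t, layer in enumerate(layers):
        h = layer(h, x[:, t])
    logits = head(h)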
58. Objective Evaluation of Deep Uncertainty Predictions for COVID-19 Detection [PDF] 返回目录
Hamzeh Asgharnezhad, Afshar Shamsi, Roohallah Alizadehsani, Abbas Khosravi, Saeid Nahavandi, Zahra Alizadeh Sani, Dipti Srinivasan
Abstract: Deep neural networks (DNNs) have been widely applied for detecting COVID-19 in medical images. Existing studies mainly apply transfer learning and other data representation strategies to generate accurate point estimates. The generalization power of these networks is always questionable due to being developed using small datasets and failing to report their predictive confidence. Quantifying the uncertainties associated with DNN predictions is a prerequisite for their trusted deployment in medical settings. Here we apply and evaluate three uncertainty quantification techniques for COVID-19 detection using chest X-ray (CXR) images. The novel concept of the uncertainty confusion matrix is proposed, and new performance metrics for the objective evaluation of uncertainty estimates are introduced. Through comprehensive experiments, it is shown that networks pretrained on CXR images outperform networks pretrained on natural image datasets such as ImageNet. Qualitative and quantitative evaluations also reveal that the predictive uncertainty estimates are statistically higher for erroneous predictions than for correct predictions. Accordingly, uncertainty quantification methods are capable of flagging risky predictions with high uncertainty estimates. We also observe that ensemble methods more reliably capture uncertainties during inference.
摘要:深度神经网络(DNN)已被广泛应用于检测医学图像中的COVID-19。现有研究主要应用转移学习和其他数据表示策略来生成准确的点估计。这些网络的泛化能力总是值得怀疑的,因为它们是使用小型数据集开发的,并且无法报告其预测可信度。量化与DNN预测相关的不确定性是在医疗环境中可信部署的前提。在这里,我们应用和评估使用胸部X射线(CXR)图像进行COVID-19检测的三种不确定性量化技术。提出了不确定性混淆矩阵的新概念,并引入了用于不确定性估计的客观评估的新性能指标。通过全面的实验表明,与CXR图像有关的网络优于在自然图像数据集(如ImageNet)上预先训练的网络。定性和定量评估还显示,错误预测的预测不确定性估计值在统计上高于正确预测。因此,不确定性量化方法能够用高不确定性估计来标记风险预测。我们还观察到合奏方法在推断过程中更可靠地捕获了不确定性。
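The uncertainty confusion matrix can be sketched directly from its name: cross-tabulate prediction correctness against model certainty, then score how often the two agree. The threshold and the "uncertainty accuracy" summary below are assumed forms of the metrics the paper introduces:

    import numpy as np

    def uncertainty_confusion(y_true, y_pred, uncertainty, thresh=0.5):
        correct = y_pred == y_true
        certain = uncertainty < thresh
        TC = np.sum(correct & certain)      # true certainty
        TU = np.sum(~correct & ~certain)    # true uncertainty
        FC = np.sum(~correct & certain)     # false certainty: the risky case
        FU = np.sum(correct & ~certain)     # false uncertainty
        uacc = (TC + TU) / len(y_true)      # fraction where certainty matches correctness
        return dict(TC=TC, TU=TU, FC=FC, FU=FU, uncertainty_accuracy=uacc)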
59. Dual-encoder Bidirectional Generative Adversarial Networks for Anomaly Detection [PDF] 返回目录
Teguh Budianto, Tomohiro Nakai, Kazunori Imoto, Takahiro Takimoto, Kosuke Haruki
Abstract: Generative adversarial networks (GANs) have shown promise for various problems, including anomaly detection. When anomaly detection is performed using GAN models that learn only the features of normal data samples, data that are not similar to normal data are detected as abnormal samples. The present approach employs a dual encoder in a bidirectional GAN architecture that is trained simultaneously with a generator and a discriminator network. Through this learning mechanism, the proposed method aims to reduce the problem of bad cycle consistency, in which a bidirectional GAN might not be able to reproduce samples with a large difference between normal and abnormal samples. We assume that bad cycle consistency occurs when the method does not preserve enough information about the sample data. We show that our proposed method performs well in capturing the distribution of normal samples, thereby improving anomaly detection on GAN-based models. Experiments are reported in which our method is applied to publicly available datasets, including a brain magnetic resonance imaging anomaly detection system.
摘要:生成对抗网络(GAN)已显示出应对包括异常检测在内的各种问题的希望。使用仅学习正常数据样本特征的GAN模型执行异常检测时,会将与正常数据不相似的数据检测为异常样本。通过在双向GAN架构中采用双编码器来开发本方法,该双向编码器与生成器和鉴别器网络同时进行训练。通过学习机制,提出的方法旨在减少不良循环一致性的问题,其中双向GAN可能无法复制正常样本与异常样本之间差异较大的样本。我们假设当该方法没有保存足够的样本数据信息时,会出现不良的循环一致性。我们表明,我们提出的方法在捕获正常样本的分布方面表现良好,从而改善了基于GAN的模型的异常检测。据报道,实验将我们的方法应用于可公开获得的数据集,包括应用于脑磁共振成像异常检测系统。
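At test time, such bidirectional models typically score a sample by how badly it survives the encode-regenerate round trip. A sketch of an assumed scoring rule (the exact score, the weighting, and the use of discriminator features D_feat are not given in the abstract):

    import torch

    def anomaly_score(x, E, G, D_feat, lam=0.1):
        x_rec = G(E(x))                                          # encode, then regenerate
        rec = torch.mean(torch.abs(x - x_rec))                   # pixel reconstruction error
        feat = torch.mean(torch.abs(D_feat(x) - D_feat(x_rec)))  # feature-space mismatch
        return rec + lam * feat                                  # higher => more anomalous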
60. Towards an Automatic System for Extracting Planar Orientations from Software Generated Point Clouds [PDF] 返回目录
J. Kissi-Ameyaw, K. McIsaac, X. Wang, G. R. Osinski
Abstract: In geology, a key activity is the characterisation of geological structures (surface formation topology and rock units) using planar orientation measurements such as strike, dip, and dip direction. In general, these measurements are collected manually using basic equipment, usually a compass/clinometer and a backboard, and recorded on a map by hand. Various computing techniques and technologies, such as Lidar, have been utilised to automate this process and update the collection paradigm for these types of measurements. Techniques such as Structure from Motion (SfM) reconstruct scenes and objects by generating a point cloud from input images, with detailed reconstruction possible on the decimetre scale. SfM-type techniques provide advantages in cost and usability in more varied environmental conditions, while sacrificing the extreme levels of data fidelity. Here we present a data acquisition methodology and a machine learning-based software system, GeoStructure, developed to automate the collection of orientation measurements. Rather than deriving measurements using a method applied to the input images, such as the Hough Transform, this method takes measurements directly from the reconstructed point cloud surfaces. Point cloud noise is mitigated using a Mahalanobis distance implementation. Significant structure is characterised using a k-nearest neighbour region growing algorithm, and final surface orientations are quantified using the plane and normal direction cosines.
摘要:在地质学中,一项关键活动是使用诸如“走向”,“倾角”和“倾角方向”之类的平面方向测量来表征地质结构(地层拓扑和岩石单元)。通常,这些测量值是使用基本设备手动收集的;通常是手工记录在地图上的指南针/斜度计和篮板。为了使该过程自动化并更新这些类型的测量的收集范例,已经利用了诸如Lidar之类的各种计算技术。诸如运动构造(SfM)之类的技术通过根据输入图像生成点云来重建场景和对象,并可能以分米为单位进行详细的重建。 SfM类型的技术在更多样化的环境条件下在成本和可用性方面提供了优势,同时牺牲了数据保真度的极端水平。这里介绍了一种数据采集方法和一种基于机器学习的软件系统:GeoStructure,该系统开发用于自动进行方位测量。该方法不是使用应用于输入图像的方法(例如霍夫变换)来得出测量结果,而是直接从重构的点云表面进行测量。使用马氏距离实现可减轻点云噪声。重要结构的特征在于使用k最近邻区域增长算法,并使用平面和法向余弦来量化最终表面方向。
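The final quantification step is compact enough to sketch end to end: filter a candidate region by Mahalanobis distance, fit a plane by PCA/SVD, and convert the upward normal's direction cosines to dip and dip direction. The axis conventions (z up, y north), the 3-sigma cutoff, and the right-hand-rule strike are assumptions:

    import numpy as np

    def plane_orientation(points):
        pts = points - points.mean(axis=0)
        # Mahalanobis filter: drop points far from the local distribution
        d2 = np.einsum('ij,jk,ik->i', pts, np.linalg.inv(np.cov(pts.T)), pts)
        pts = pts[d2 < 9.0]                          # ~3-sigma cutoff (assumption)
        pts = pts - pts.mean(axis=0)                 # re-center the inliers
        # plane normal = direction of smallest variance (last right-singular vector)
        _, _, vt = np.linalg.svd(pts, full_matrices=False)
        n = vt[-1] if vt[-1, 2] >= 0 else -vt[-1]    # force upward-pointing normal
        dip = np.degrees(np.arccos(n[2]))            # plane's angle from horizontal
        dip_dir = np.degrees(np.arctan2(n[0], n[1])) % 360.0
        strike = (dip_dir - 90.0) % 360.0            # right-hand-rule strike
        return strike, dip, dip_dir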
61. Social NCE: Contrastive Learning of Socially-aware Motion Representations [PDF] 返回目录
Yuejiang Liu, Qi Yan, Alexandre Alahi
Abstract: Learning socially-aware motion representations is at the core of recent advances in human trajectory forecasting and robot navigation in crowded spaces. Yet existing methods often struggle to generalize to challenging scenarios and even output unacceptable solutions (e.g., collisions). In this work, we propose to address this issue via contrastive learning. Concretely, we introduce a social contrastive loss that encourages the encoded motion representation to preserve sufficient information for distinguishing a positive future event from a set of negative ones. We explicitly draw these negative samples based on our domain knowledge about socially unfavorable scenarios in the multi-agent context. Experimental results show that the proposed method consistently boosts the performance of previous trajectory forecasting, behavioral cloning, and reinforcement learning algorithms in various settings. Our method makes few assumptions about neural architecture designs, and hence can be used as a generic way to incorporate negative data augmentation into motion representation learning.
摘要:在拥挤空间中,人体轨迹预测和机器人导航的最新进展是学习具有社会意识的运动表示的核心。然而,现有方法通常难以将其推广到具有挑战性的场景,甚至输出不可接受的解决方案(例如,冲突)。在这项工作中,我们建议通过对比学习来解决这个问题。具体而言,我们引入了一种社会对比性损失,它鼓励编码的运动表示形式保留足够的信息,以区分积极的未来事件与消极的未来事件。我们基于我们在多主体环境中对社会不利场景的领域知识,明确得出了这些负面样本。实验结果表明,所提出的方法在各种情况下都能持续提高先前的轨迹预测,行为克隆和强化学习算法的性能。我们的方法几乎没有关于神经体系结构设计的假设,因此可以用作将否定数据增强合并到运动表示学习中的通用方法。
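The social contrastive loss plausibly takes the InfoNCE form: pull the motion representation toward an embedding of the true future event and push it away from embeddings of socially unfavorable ones (e.g., positions colliding with neighbors). The shapes, normalization, and temperature below are assumptions:

    import torch
    import torch.nn.functional as F

    def social_nce(query, pos, negs, temperature=0.1):
        # query: (B, D) motion representation; pos: (B, D) true future embedding
        # negs: (B, N, D) embeddings of unfavorable (e.g., colliding) futures
        q = F.normalize(query, dim=-1)
        pos_sim = (q * F.normalize(pos, dim=-1)).sum(-1, keepdim=True)
        neg_sim = torch.einsum('bd,bnd->bn', q, F.normalize(negs, dim=-1))
        logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
        labels = torch.zeros(q.size(0), dtype=torch.long)  # positive sits at index 0
        return F.cross_entropy(logits, labels)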
注:中文为机器翻译结果!封面为论文标题词云图!