Abstracts

1. An Advert Creation System for 3D Product Placements
Ivan Bacher, Hossein Javidnia, Soumyabrata Dev, Rahul Agrahari, Murhaf Hossari, Matthew Nicholson, Clare Conran, Jian Tang, Peng Song, David Corrigan, François Pitié
Abstract: Over the past decade, the evolution of video-sharing platforms has attracted a significant amount of investment in contextual advertising. The common contextual advertising platforms utilize the information provided by users to integrate 2D visual ads into videos. The existing platforms face many technical challenges such as ad integration with respect to occluding objects and 3D ad placement. This paper presents a Video Advertisement Placement & Integration (Adverts) framework, which is capable of perceiving the 3D geometry of the scene and camera motion to blend 3D virtual objects in videos and create the illusion of reality. The proposed framework contains several modules such as monocular depth estimation, object segmentation, background-foreground separation, alpha matting and camera tracking. Our experiments conducted using the Adverts framework indicate the significant potential of this system in contextual ad integration, and in pushing the limits of the advertising industry using mixed reality technologies.

2. Person Re-identification by analyzing Dynamic Variations in Gait Sequences
Sandesh Bharadwaj, Kunal Chanda
Abstract: Gait recognition is a biometric technology that identifies individuals in a video sequence by analysing their style of walking or limb movement. However, this identification is generally sensitive to appearance changes and conventional feature descriptors such as Gait Energy Image (GEI) lose some of the dynamic information in the gait sequence. Active Energy Image (AEI) focuses more on dynamic motion changes than GEI and is more suited to deal with appearance changes. We propose a new approach, which allows recognizing people by analysing the dynamic motion variations and identifying people without using a database of predicted changes. In the proposed method, the active energy image is calculated by averaging the difference frames of the silhouette sequence and divided into multiple segments. Affine moment invariants are computed as gait features for each section. Next, matching weights are calculated based on the similarity between extracted features and those in the database. Finally, the subject is identified by the weighted combination of similarities in all segments. The CASIA-B Gait Database is used as the principal dataset for the experimental analysis.
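The Active Energy Image step described in the abstract above (averaging the difference frames of a silhouette sequence, then dividing the result into segments) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes binary silhouettes stored as nested lists, takes the absolute value of each frame difference, and splits the image into horizontal bands. The function names `active_energy_image` and `split_into_segments` are hypothetical.

```python
def active_energy_image(silhouettes):
    """Average the absolute differences between consecutive silhouette frames.

    `silhouettes` is a list of equally sized 2D lists with values 0 or 1.
    Using the absolute difference is an assumption of this sketch.
    """
    if len(silhouettes) < 2:
        raise ValueError("need at least two frames to form a difference frame")
    height, width = len(silhouettes[0]), len(silhouettes[0][0])
    aei = [[0.0] * width for _ in range(height)]
    for prev, curr in zip(silhouettes, silhouettes[1:]):
        for y in range(height):
            for x in range(width):
                aei[y][x] += abs(curr[y][x] - prev[y][x])
    n_diffs = len(silhouettes) - 1
    return [[value / n_diffs for value in row] for row in aei]


def split_into_segments(image, n_segments):
    """Split the image rows into horizontal bands (assumed segmentation scheme)."""
    height = len(image)
    bounds = [round(i * height / n_segments) for i in range(n_segments + 1)]
    return [image[bounds[i]:bounds[i + 1]] for i in range(n_segments)]


frames = [
    [[0, 1], [0, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [0, 1]],
]
aei = active_energy_image(frames)
segments = split_into_segments(aei, 2)
```

Per-segment features (affine moment invariants in the abstract) would then be computed on each band; that step is omitted here.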

3. ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks
Rajat Saini, Nandan Kumar Jha, Bedanta Das, Sparsh Mittal, C. Krishna Mohan
Abstract: The capability of the self-attention mechanism to model long-range dependencies has catapulted its deployment in vision models. Unlike convolution operators, self-attention offers an infinite receptive field and enables compute-efficient modeling of global dependencies. However, the existing state-of-the-art attention mechanisms incur high compute and/or parameter overheads, and are hence unfit for compact convolutional neural networks (CNNs). In this work, we propose a simple yet effective "Ultra-Lightweight Subspace Attention Mechanism" (ULSAM), which infers different attention maps for each feature map subspace. We argue that learning separate attention maps for each feature subspace enables multi-scale and multi-frequency feature representation, which is more desirable for fine-grained image classification. Our method of subspace attention is orthogonal and complementary to the existing state-of-the-art attention mechanisms used in vision models. ULSAM is end-to-end trainable and can be deployed as a plug-and-play module in pre-existing compact CNNs. Notably, our work is the first attempt that uses a subspace attention mechanism to increase the efficiency of compact CNNs. To show the efficacy of ULSAM, we perform experiments with MobileNet-V1 and MobileNet-V2 as backbone architectures on ImageNet-1K and three fine-grained image classification datasets. We achieve $\approx$13% and $\approx$25% reduction in both the FLOPs and parameter counts of MobileNet-V2 with a 0.27% and more than 1% improvement in top-1 accuracy on the ImageNet-1K and fine-grained image classification datasets (respectively). Code and trained models are available at this https URL.

4. End-to-end training of deep kernel map networks for image classification
Mingyuan Jiu, Hichem Sahbi
Abstract: Deep kernel map networks have shown excellent performances in various classification problems including image annotation. Their general recipe consists in aggregating several layers of singular value decompositions (SVDs) -- that map data from input spaces into high dimensional spaces -- while preserving the similarity of the underlying kernels. However, the potential of these deep map networks has not been fully explored as the original setting of these networks focuses mainly on the approximation quality of their kernels and ignores their discrimination power. In this paper, we introduce a novel "end-to-end" design for deep kernel map learning that balances the approximation quality of kernels and their discrimination power. Our method proceeds in two steps; first, layerwise SVD is applied in order to build initial deep kernel map approximations and then an "end-to-end" supervised learning is employed to further enhance their discrimination power while maintaining their efficiency. Extensive experiments, conducted on the challenging ImageCLEF annotation benchmark, show the high efficiency and the out-performance of this two-step process with respect to different related methods.

5. Cross-Supervised Object Detection
Zitian Chen, Zhiqiang Shen, Jiahui Yu, Erik Learned-Miller
Abstract: After learning a new object category from image-level annotations (with no object bounding boxes), humans are remarkably good at precisely localizing those objects. However, building good object localizers (i.e., detectors) currently requires expensive instance-level annotations. While some work has been done on learning detectors from weakly labeled samples (with only class labels), these detectors do poorly at localization. In this work, we show how to build better object detectors from weakly labeled images of new categories by leveraging knowledge learned from fully labeled base categories. We call this novel learning paradigm cross-supervised object detection. We propose a unified framework that combines a detection head trained from instance-level annotations and a recognition head learned from image-level annotations, together with a spatial correlation module that bridges the gap between detection and recognition. These contributions enable us to better detect novel objects with image-level annotations in complex multi-object scenes such as the COCO dataset.

6. High Resolution Zero-Shot Domain Adaptation of Synthetically Rendered Face Images
Stephan J. Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton
Abstract: Generating photorealistic images of human faces at scale remains a prohibitively difficult task using computer graphics approaches. This is because these require the simulation of light to be photorealistic, which in turn requires physically accurate modelling of geometry, materials, and light sources, for both the head and the surrounding scene. Non-photorealistic renders however are increasingly easy to produce. In contrast to computer graphics approaches, generative models learned from more readily available 2D image data have been shown to produce samples of human faces that are hard to distinguish from real data. The process of learning usually corresponds to a loss of control over the shape and appearance of the generated images. For instance, even simple disentangling tasks such as modifying the hair independently of the face, which is trivial to accomplish in a computer graphics approach, remains an open research question. In this work, we propose an algorithm that matches a non-photorealistic, synthetically generated image to a latent vector of a pretrained StyleGAN2 model which, in turn, maps the vector to a photorealistic image of a person of the same pose, expression, hair, and lighting. In contrast to most previous work, we require no synthetic training data. To the best of our knowledge, this is the first algorithm of its kind to work at a resolution of 1K and represents a significant leap forward in visual realism.

7. SASO: Joint 3D Semantic-Instance Segmentation via Multi-scale Semantic Association and Salient Point Clustering Optimization
Jingang Tan, Lili Chen, Kangru Wang, Jingquan Peng, Jiamao Li, Xiaolin Zhang
Abstract: We propose a novel 3D point cloud segmentation framework named SASO, which jointly performs semantic and instance segmentation tasks. For the semantic segmentation task, inspired by the inherent correlation among objects in spatial context, we propose a Multi-scale Semantic Association (MSA) module to explore the constructive effects of the semantic context information. For the instance segmentation task, unlike previous works that utilize clustering only in the inference procedure, we propose a Salient Point Clustering Optimization (SPCO) module to introduce a clustering procedure into the training process and impel the network to focus on points that are difficult to distinguish. In addition, because of the inherent structures of indoor scenes, the imbalance problem of the category distribution is rarely considered but severely limits the performance of 3D scene perception. To address this issue, we introduce an adaptive Water Filling Sampling (WFS) algorithm to balance the category distribution of training data. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods on benchmark datasets in both semantic segmentation and instance segmentation tasks.

8. Suggestive Annotation of Brain Tumour Images with Gradient-guided Sampling
Chengliang Dai, Shuo Wang, Yuanhan Mo, Elsa Angelini, Yike Guo, Wenjia Bai
Abstract: Machine learning has been widely adopted for medical image analysis in recent years given its promising performance in image segmentation and classification tasks. As a data-driven science, the success of machine learning, in particular supervised learning, largely depends on the availability of manually annotated datasets. For medical imaging applications, such annotated datasets are not easy to acquire. It takes a substantial amount of time and resource to curate an annotated medical image set. In this paper, we propose an efficient annotation framework for brain tumour images that is able to suggest informative sample images for human experts to annotate. Our experiments show that training a segmentation model with only 19% suggestively annotated patient scans from BraTS 2019 dataset can achieve a comparable performance to training a model on the full dataset for whole tumour segmentation task. It demonstrates a promising way to save manual annotation cost and improve data efficiency in medical imaging applications.

9. Fast Multi-Level Foreground Estimation
Thomas Germer, Tobias Uelwer, Stefan Conrad, Stefan Harmeling
Abstract: Alpha matting aims to estimate the translucency of an object in a given image. The resulting alpha matte describes pixel-wise to what amount foreground and background colors contribute to the color of the composite image. While most methods in literature focus on estimating the alpha matte, the process of estimating the foreground colors given the input image and its alpha matte is often neglected, although foreground estimation is an essential part of many image editing workflows. In this work, we propose a novel method for foreground estimation given the alpha matte. We demonstrate that our fast multi-level approach yields results that are comparable with the state-of-the-art while outperforming those methods in computational runtime and memory usage.
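The abstract above says the alpha matte describes, pixel by pixel, how much the foreground and background colors contribute to the composite. That relationship is the standard compositing equation I = alpha * F + (1 - alpha) * B, which the short sketch below evaluates for a single pixel. It illustrates only the forward model and is not the paper's estimation method; the function name `composite_pixel` is hypothetical.

```python
def composite_pixel(alpha, foreground, background):
    """Blend one pixel with the compositing equation I = alpha*F + (1 - alpha)*B.

    `alpha` lies in [0, 1]; `foreground` and `background` are color tuples
    of equal length (for example RGB).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(
        alpha * f + (1.0 - alpha) * b
        for f, b in zip(foreground, background)
    )


# A pixel that is 25% foreground and 75% background.
pixel = composite_pixel(0.25, (200, 100, 0), (0, 100, 200))
```

Foreground estimation runs this equation in reverse: given I and alpha, recover F (and B). For an RGB pixel that is three equations in six unknowns, so some additional constraint is needed to pick a solution.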

10. Ensemble Transfer Learning for Emergency Landing Field Identification on Moderate Resource Heterogeneous Kubernetes Cluster
Andreas Klos, Marius Rosenbaum, Wolfram Schiffmann
Abstract: The full loss of thrust of an aircraft requires fast and reliable decisions by the pilot. If no published landing field is within reach, an emergency landing field must be selected. The choice of a suitable emergency landing field is a crucial task to avoid unnecessary damage to the aircraft and risk to the civil population as well as the crew and all passengers on board. Especially in case of instrument meteorological conditions it is indispensable to use a database of suitable emergency landing fields. Thus, based on publicly available digital orthographic photos and digital surface models, we created various datasets with different sample sizes to facilitate training and testing of neural networks. Each dataset consists of a set of data layers. The best compositions of these data layers as well as the best performing transfer learning models are selected. Subsequently, certain hyperparameters of the chosen models for each sample size are optimized with Bayesian and Bandit optimization. The hyperparameter tuning is performed with a self-made Kubernetes cluster. The models' outputs were investigated with respect to the input data by the utilization of layer-wise relevance propagation. With optimized models we created an ensemble model to improve the segmentation performance. Finally, an area around the airport of Arnsberg in North Rhine-Westphalia was segmented and emergency landing fields were identified, while the verification of the final approach's obstacle clearance is left unconsidered. These emergency landing fields are stored in a PostgreSQL database.

11. RPM-Net: Recurrent Prediction of Motion and Parts from Point Cloud
Zihao Yan, Ruizhen Hu, Xingguang Yan, Luanmin Chen, Oliver van Kaick, Hao Zhang, Hui Huang
Abstract: We introduce RPM-Net, a deep learning-based approach which simultaneously infers movable parts and hallucinates their motions from a single, un-segmented, and possibly partial, 3D point cloud shape. RPM-Net is a novel Recurrent Neural Network (RNN), composed of an encoder-decoder pair with interleaved Long Short-Term Memory (LSTM) components, which together predict a temporal sequence of pointwise displacements for the input point cloud. At the same time, the displacements allow the network to learn movable parts, resulting in a motion-based shape segmentation. Recursive applications of RPM-Net on the obtained parts can predict finer-level part motions, resulting in a hierarchical object segmentation. Furthermore, we develop a separate network to estimate part mobilities, e.g., per-part motion parameters, from the segmented motion sequence. Both networks learn deep predictive models from a training set that exemplifies a variety of mobilities for diverse objects. We show results of simultaneous motion and part predictions from synthetic and real scans of 3D objects exhibiting a variety of part mobilities, possibly involving multiple movable parts.

12. Domain Contrast for Domain Adaptive Object Detection
Feng Liu, Xiaoxong Zhang, Fang Wan, Xiangyang Ji, Qixiang Ye
Abstract: We present Domain Contrast (DC), a simple yet effective approach inspired by contrastive learning for training domain adaptive detectors. DC is deduced from the error bound minimization perspective of a transferred model, and is implemented with cross-domain contrast loss which is plug-and-play. By minimizing cross-domain contrast loss, DC guarantees the transferability of detectors while naturally alleviating the class imbalance issue in the target domain. DC can be applied at either image level or region level, consistently improving detectors' transferability and discriminability. Extensive experiments on commonly used benchmarks show that DC improves the baseline and state-of-the-art by significant margins, while demonstrating great potential for large domain divergence.

13. Designing and Learning Trainable Priors with Non-Cooperative Games
Bruno Lecouat, Jean Ponce, Julien Mairal
Abstract: We introduce a general framework for designing and learning neural networks whose forward passes can be interpreted as solving convex optimization problems, and whose architectures are derived from an optimization algorithm. We focus on non-cooperative convex games, solved by local agents represented by the nodes of a graph and interacting through regularization functions. This approach is appealing for solving imaging problems, as it allows the use of classical image priors within deep models that are trainable end to end. The priors used in this presentation include variants of total variation, Laplacian regularization, sparse coding on learned dictionaries, and non-local self similarities. Our models are parameter efficient and fully interpretable, and our experiments demonstrate their effectiveness on a large diversity of tasks ranging from image denoising and compressed sensing for fMRI to dense stereo matching.

14. AutoSNAP: Automatically Learning Neural Architectures for Instrument Pose Estimation
David Kügler, Marc Uecker, Arjan Kuijper, Anirban Mukhopadhyay
Abstract: Despite recent successes, the advances in Deep Learning have not yet been fully translated to Computer Assisted Intervention (CAI) problems such as pose estimation of surgical instruments. Currently, neural architectures for classification and segmentation tasks are adopted ignoring significant discrepancies between CAI and these tasks. We propose an automatic framework (AutoSNAP) for instrument pose estimation problems, which discovers and learns the architectures for neural networks. We introduce 1)~an efficient testing environment for pose estimation, 2)~a powerful architecture representation based on novel Symbolic Neural Architecture Patterns (SNAPs), and 3)~an optimization of the architecture using an efficient search scheme. Using AutoSNAP, we discover an improved architecture (SNAPNet) which outperforms both the hand-engineered i3PosNet and the state-of-the-art architecture search method DARTS.

15. An Automatic Reader of Identity Documents
Filippo Attivissimo, Nicola Giaquinto, Marco Scarpetta, Maurizio Spadavecchia
Abstract: Automatic reading and verification of identity documents is an appealing technology for today's service industry, since this task is still mostly performed manually, leading to a waste of economic and time resources. In this paper the prototype of a novel automatic reading system for identity documents is presented. The system has been designed to extract data from the main Italian identity documents using photographs of acceptable quality, like those usually required of online subscribers to various services. The document is first localized inside the photo, and then classified; finally, text recognition is executed. A synthetic dataset has been used, both for neural network training and for performance evaluation of the system. The synthetic dataset avoided privacy issues linked to the use of real photos of real documents, which will be used, instead, for future developments of the system.

16. Expandable YOLO: 3D Object Detection from RGB-D Images
Masahiro Takahashi, Alessandro Moro, Yonghoon Ji, Kazunori Umeda
Abstract: This paper aims at constructing a light-weight object detector that takes as input a depth image and a color image from a stereo camera. Specifically, by extending the network architecture of YOLOv3 to 3D in the middle, it is possible to output in the depth direction. In addition, Intersection over Union (IoU) in 3D space is introduced to confirm the accuracy of region extraction results. In the field of deep learning, object detectors that use distance information as input are actively studied for use in automated driving. However, conventional detectors have a large network structure, which impairs real-time performance. The effectiveness of the detector constructed as described above is verified using datasets. As a result of this experiment, the proposed model is able to output 3D bounding boxes and detect people whose bodies are partly hidden. Further, the processing speed of the model is 44.35 fps.
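The abstract above uses Intersection over Union in 3D space to check region extraction accuracy. A minimal sketch of that metric follows, under the assumption that boxes are axis-aligned and given as `(xmin, ymin, zmin, xmax, ymax, zmax)`; the abstract does not specify the box parameterization, and the function name `iou_3d` is hypothetical.

```python
def iou_3d(box_a, box_b):
    """Intersection over Union of two axis-aligned 3D boxes.

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax). Axis alignment is an
    assumption of this sketch, not something stated in the abstract.
    """
    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    # Overlap length along each axis, clamped at zero when the boxes are apart.
    overlap = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        overlap *= max(0.0, high - low)

    union = volume(box_a) + volume(box_b) - overlap
    return overlap / union if union > 0 else 0.0


# Two 2x2x2 cubes offset by one unit along every axis share a 1x1x1 cube.
score = iou_3d((0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3))
```

Here the intersection volume is 1 and the union is 8 + 8 - 1 = 15, so the score is 1/15.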

17. Few-Shot Anomaly Detection for Polyp Frames from Colonoscopy
Yu Tian, Gabriel Maicas, Leonardo Zorron Cheng Tao Pu, Rajvinder Singh, Johan W. Verjans, Gustavo Carneiro
Abstract: Anomaly detection methods generally target the learning of a normal image distribution (i.e., inliers showing healthy cases) and during testing, samples relatively far from the learned distribution are classified as anomalies (i.e., outliers showing disease cases). These approaches tend to be sensitive to outliers that lie relatively close to inliers (e.g., a colonoscopy image with a small polyp). In this paper, we address the inappropriate sensitivity to outliers by also learning from inliers. We propose a new few-shot anomaly detection method based on an encoder trained to maximise the mutual information between feature embeddings and normal images, followed by a few-shot score inference network, trained with a large set of inliers and a substantially smaller set of outliers. We evaluate our proposed method on the clinical problem of detecting frames containing polyps from colonoscopy video sequences, where the training set has 13350 normal images (i.e., without polyps) and less than 100 abnormal images (i.e., with polyps). The results of our proposed model on this data set reveal a state-of-the-art detection result, while the performance based on different number of anomaly samples is relatively stable after approximately 40 abnormal training images.

18. Text Detection on Roughly Placed Books by Leveraging a Learning-based Model Trained with Another Domain Data
Riku Anegawa, Masayoshi Aritsugi
Abstract: Text detection enables us to extract rich information from images. In this paper, we focus on how to generate bounding boxes that appropriately capture text areas on books, to help implement automatic text detection. We attempt not to improve a learning-based model by training it with a sufficient amount of data in the target domain, but to leverage a model that has already been trained with data from another domain. We develop algorithms that construct the bounding boxes by improving and leveraging the results of a learning-based method. Our algorithms can utilize different learning-based approaches to detect scene texts. Experimental evaluations demonstrate that our algorithms work well in various situations where books are roughly placed.

19. Ricci Curvature Based Volumetric Segmentation of the Auditory Ossicles
Na Lei, Jisui Huang, Yuxue Ren, Emil Saucan, Zhenchang Wang
Abstract: The auditory ossicles that are located in the middle ear are the smallest bones in the human body. Their damage will result in hearing loss. It is therefore important to be able to automatically diagnose ossicles' diseases based on Computed Tomography (CT) 3D imaging. However CT images usually include the whole head area, which is much larger than the bones of interest, thus the localization of the ossicles, followed by segmentation, both play a significant role in automatic diagnosis. The commonly employed local segmentation methods require manually selected initial points, which is a highly time consuming process. We therefore propose a completely automatic method to locate the ossicles which requires neither templates, nor manual labels. It relies solely on the connective properties of the auditory ossicles themselves, and their relationship with the surrounding tissue fluid. For the segmentation task, we define a novel energy function and obtain the shape of the ossicles from the 3D CT image by minimizing this new energy. Compared to the state-of-the-art methods which usually use the gradient operator and some normalization terms, we propose to add a Ricci curvature term to the commonly employed energy function. We compare our proposed method with the state-of-the-art methods and show that the performance of discrete Forman-Ricci curvature is superior to the others.
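For readers unfamiliar with the discrete Forman-Ricci curvature mentioned in the abstract above, the sketch below computes its simplest form: on an unweighted graph with no higher-dimensional faces, the curvature of an edge (u, v) reduces to 4 - deg(u) - deg(v). This is only an illustration of the quantity. The paper works on 3D CT volumes and adds a curvature term to an energy function, neither of which is reproduced here, and the function name `forman_ricci_curvature` is hypothetical.

```python
from collections import defaultdict


def forman_ricci_curvature(edges):
    """Forman-Ricci curvature of each edge in an unweighted, simple graph.

    Uses the reduced formula F(u, v) = 4 - deg(u) - deg(v), which holds when
    all weights are 1 and no faces of dimension two or higher are considered.
    `edges` is a list of (u, v) pairs with no duplicates.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}


# A path a-b-c and a star with three leaves around a hub.
path = forman_ricci_curvature([("a", "b"), ("b", "c")])
star = forman_ricci_curvature([("hub", "x"), ("hub", "y"), ("hub", "z")])
```

Edges attached to high-degree vertices get more negative values, which is why the measure is sensitive to how a structure connects to its surroundings.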

20. Unsupervised Discovery of Object Landmarks via Contrastive Learning
Zezhou Cheng, Jong-Chyi Su, Subhransu Maji
Abstract: Given a collection of images, humans are able to discover landmarks of the depicted objects by modeling the shared geometric structure across instances. This idea of geometric equivariance has been widely used for unsupervised discovery of object landmark representations. In this paper, we develop a simple and effective approach based on contrastive learning of invariant representations. We show that when a deep network is trained to be invariant to geometric and photometric transformations, representations from its intermediate layers are highly predictive of object landmarks. Furthermore, by stacking representations across layers in a hypercolumn their effectiveness can be improved. Our approach is motivated by the phenomenon of the gradual emergence of invariance in the representation hierarchy of a deep network. We also present a unified view of existing equivariant and invariant representation learning approaches through the lens of contrastive learning, shedding light on the nature of invariances learned. Experiments on standard benchmarks for landmark discovery, as well as a challenging one we propose, show that the proposed approach surpasses prior state-of-the-art.

21. Blind Image Deconvolution using Student's-t Prior with Overlapping Group Sparsity
In S. Jeon, Deokyoung Kang, Suk I. Yoo
Abstract: In this paper, we solve the blind image deconvolution problem, which is to remove blur from a signal degraded image without any knowledge of the blur kernel. Since the problem is ill-posed, an image prior plays a significant role in accurate blind deconvolution. Traditional image priors assume that coefficients in filtered domains are sparse. However, it is assumed here that there exist additional structures over the sparse coefficients. Accordingly, we propose a new problem formulation for blind image deconvolution, which utilizes the structural information by coupling a Student's-t image prior with overlapping group sparsity. The proposed method results in an effective blind deconvolution algorithm that outperforms other state-of-the-art algorithms.
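Summary: The paper addresses blind image deconvolution, that is, removing blur from a degraded image without any knowledge of the blur kernel. Because the problem is ill-posed, the image prior plays a significant role in accurate deconvolution. Traditional priors assume that coefficients in filtered domains are sparse; here, additional structure is assumed over the sparse coefficients. The authors propose a new formulation that exploits this structure by coupling a Student's-t image prior with overlapping group sparsity, yielding an effective blind deconvolution algorithm that outperforms other state-of-the-art algorithms.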

22. Pushing the Limit of Unsupervised Learning for Ultrasound Image Artifact Removal
Shujaat Khan, Jaeyoung Huh, Jong Chul Ye
Abstract: Ultrasound (US) imaging is a fast and non-invasive imaging modality which is widely used for real-time clinical imaging applications without concern about radiation hazards. Unfortunately, it often suffers from poor visual quality from various origins, such as speckle noise, blurring, multi-line acquisition (MLA), limited RF channels, a small number of view angles in the case of plane wave imaging, etc. Classical methods to deal with these problems include image-domain signal processing approaches using various adaptive filtering and model-based approaches. Recently, deep learning approaches have been successfully used in the ultrasound imaging field. However, one of the limitations of these approaches is that paired high-quality images for supervised training are difficult to obtain in many practical applications. In this paper, inspired by the recent theory of unsupervised learning using optimal transport driven cycleGAN (OT-cycleGAN), we investigate the applicability of unsupervised deep learning for US artifact removal problems without matched reference data. Experimental results for various tasks such as deconvolution, speckle removal, limited data artifact removal, etc. confirmed that our unsupervised learning method provides comparable results to supervised learning for many practical applications.
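Summary: Ultrasound (US) imaging is a fast, non-invasive modality widely used for real-time clinical imaging without radiation concerns, but its visual quality often suffers from speckle noise, blurring, multi-line acquisition (MLA), limited RF channels, and, for plane wave imaging, a small number of view angles. Classical remedies are image-domain signal processing methods based on adaptive filtering and models. Deep learning has recently been applied successfully, but paired high-quality images for supervised training are hard to obtain in practice. Inspired by the recent theory of unsupervised learning with optimal transport driven cycleGAN (OT-cycleGAN), the paper investigates unsupervised deep learning for US artifact removal without matched reference data. Experiments on deconvolution, speckle removal, and limited-data artifact removal confirm that the unsupervised method gives results comparable to supervised learning in many practical applications.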

23. Lesion Mask-based Simultaneous Synthesis of Anatomic and Molecular MR Images using a GAN
Pengfei Guo, Puyang Wang, Jinyuan Zhou, Vishal Patel, Shanshan Jiang
Abstract: Data-driven automatic approaches have demonstrated their great potential in resolving various clinical diagnostic dilemmas for patients with malignant gliomas in neuro-oncology with the help of conventional and advanced molecular MR images. However, the lack of sufficient annotated MRI data has vastly impeded the development of such automatic methods. Conventional data augmentation approaches, including flipping, scaling, rotation, and distortion are not capable of generating data with diverse image content. In this paper, we propose a generative adversarial network (GAN), which can simultaneously synthesize data from arbitrary manipulated lesion information on multiple anatomic and molecular MRI sequences, including T1-weighted (T1w), gadolinium enhanced T1w (Gd-T1w), T2-weighted (T2w), fluid-attenuated inversion recovery (FLAIR), and amide proton transfer-weighted (APTw). The proposed framework consists of a stretch-out up-sampling module, a brain atlas encoder, a segmentation consistency module, and multi-scale labelwise discriminators. Extensive experiments on real clinical data demonstrate that the proposed model can perform significantly better than the state-of-the-art synthesis methods.
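Summary: Data-driven automatic approaches, aided by conventional and advanced molecular MR images, have shown great potential for resolving clinical diagnostic dilemmas in patients with malignant gliomas, but the lack of sufficient annotated MRI data has greatly impeded their development. Conventional data augmentation (flipping, scaling, rotation, distortion) cannot generate data with diverse image content. The paper proposes a generative adversarial network (GAN) that simultaneously synthesizes data from arbitrarily manipulated lesion information on multiple anatomic and molecular MRI sequences: T1-weighted (T1w), gadolinium-enhanced T1w (Gd-T1w), T2-weighted (T2w), fluid-attenuated inversion recovery (FLAIR), and amide proton transfer-weighted (APTw). The framework consists of a stretch-out up-sampling module, a brain atlas encoder, a segmentation consistency module, and multi-scale label-wise discriminators. Extensive experiments on real clinical data show that the model performs significantly better than state-of-the-art synthesis methods.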

24. Meta Deformation Network: Meta Functionals for Shape Correspondence
Daohan Lu, Yi Fang
Abstract: We present a new technique named "Meta Deformation Network" for 3D shape matching via deformation, in which a deep neural network maps a reference shape onto the parameters of a second neural network whose task is to give the correspondence between a learned template and query shape via deformation. We categorize the second neural network as a meta-function, or a function generated by another function, as its parameters are dynamically given by the first network on a per-input basis. This leads to a straightforward overall architecture and faster execution speeds, without loss in the quality of the deformation of the template. We show in our experiments that Meta Deformation Network leads to improvements on the MPI-FAUST Inter Challenge over designs that utilized a conventional decoder design that has non-dynamic parameters.
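Summary: The paper presents "Meta Deformation Network", a new technique for 3D shape matching via deformation, in which a deep neural network maps a reference shape onto the parameters of a second neural network whose task is to give the correspondence between a learned template and a query shape via deformation. The second network is categorized as a meta-function, that is, a function generated by another function, because its parameters are supplied dynamically by the first network for each input. This yields a straightforward overall architecture and faster execution without loss in the quality of the template deformation. Experiments show improvements on the MPI-FAUST Inter Challenge over designs that use a conventional decoder with non-dynamic parameters.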

25. Deepfake Detection using Spatiotemporal Convolutional Networks
Oscar de Lima, Sean Franklin, Shreshtha Basu, Blake Karwoski, Annet George
Abstract: Better generative models and larger datasets have led to more realistic fake videos that can fool the human eye but produce temporal and spatial artifacts that deep learning approaches can detect. Most current Deepfake detection methods only use individual video frames and therefore fail to learn from temporal information. We created a benchmark of the performance of spatiotemporal convolutional methods using the Celeb-DF dataset. Our methods outperformed state-of-the-art frame-based detection methods. Code for our paper is publicly available at this https URL.
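Summary: Better generative models and larger datasets have led to more realistic fake videos that can fool the human eye but still produce temporal and spatial artifacts that deep learning approaches can detect. Most current Deepfake detection methods use only individual video frames and therefore fail to learn from temporal information. The authors benchmark spatiotemporal convolutional methods on the Celeb-DF dataset and find that they outperform state-of-the-art frame-based detection methods. The code for the paper is publicly available at this https URL.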

26. Unsupervised Video Decomposition using Spatio-temporal Iterative Inference
Polina Zablotskaia, Edoardo A. Dominici, Leonid Sigal, Andreas M. Lehrmann
Abstract: Unsupervised multi-object scene decomposition is a fast-emerging problem in representation learning. Despite significant progress in static scenes, such models are unable to leverage important dynamic cues present in video. We propose a novel spatio-temporal iterative inference framework that is powerful enough to jointly model complex multi-object representations and explicit temporal dependencies between latent variables across frames. This is achieved by leveraging 2D-LSTM, temporally conditioned inference and generation within the iterative amortized inference for posterior refinement. Our method improves the overall quality of decompositions, encodes information about the objects' dynamics, and can be used to predict trajectories of each object separately. Additionally, we show that our model has a high accuracy even without color information. We demonstrate the decomposition, segmentation, and prediction capabilities of our model and show that it outperforms the state-of-the-art on several benchmark datasets, one of which was curated for this work and will be made publicly available.
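Summary: Unsupervised multi-object scene decomposition is a fast-emerging problem in representation learning. Despite significant progress on static scenes, existing models cannot exploit the important dynamic cues present in video. The paper proposes a spatio-temporal iterative inference framework powerful enough to jointly model complex multi-object representations and explicit temporal dependencies between latent variables across frames. This is achieved by using a 2D-LSTM together with temporally conditioned inference and generation inside iterative amortized inference for posterior refinement. The method improves the overall quality of decompositions, encodes information about object dynamics, and can predict the trajectory of each object separately. The model is also highly accurate without color information. It outperforms the state of the art on several benchmark datasets, one of which was curated for this work and will be made publicly available.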

27. CognitiveCNN: Mimicking Human Cognitive Models to resolve Texture-Shape Bias
Satyam Mohla, Anshul Nasery, Biplab Banerjee, Subhasis Chaudhari
Abstract: Recent works demonstrate the texture bias in Convolutional Neural Networks (CNNs), conflicting with early works claiming that networks identify objects using shape. It is commonly believed that the cost function forces the network to take a greedy route to increase accuracy using texture, failing to explore any global statistics. We propose a novel intuitive architecture, namely CognitiveCNN, inspired by feature integration theory in psychology, to utilise human-interpretable features such as shape, texture, and edges to reconstruct and classify the image. We define two metrics, namely TIC and RIC, to quantify the importance of each stream using attention maps. We introduce a regulariser which ensures that the contribution of each feature is the same for any task as it is for reconstruction, and perform experiments to show the resulting boost in accuracy and robustness, besides imparting explainability. Lastly, we adapt these ideas to conventional CNNs and propose Augmented Cognitive CNN to achieve superior performance in object recognition.
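Summary: Recent work demonstrates a texture bias in Convolutional Neural Networks (CNNs), which conflicts with earlier claims that networks identify objects by shape. It is commonly believed that the cost function pushes the network along a greedy route that raises accuracy through texture without exploring global statistics. The paper proposes CognitiveCNN, an intuitive architecture inspired by feature integration theory in psychology, which uses human-interpretable features such as shape, texture, and edges to reconstruct and classify the image. Two metrics, TIC and RIC, quantify the importance of each stream using attention maps. A regulariser ensures that each feature contributes to any task as it does to reconstruction; experiments show the resulting gains in accuracy and robustness, along with explainability. Finally, the ideas are adapted to conventional CNNs in the form of Augmented Cognitive CNN, which achieves superior object recognition performance.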

28. Investigating and Exploiting Image Resolution for Transfer Learning-based Skin Lesion Classification
Amirreza Mahbod, Gerald Schaefer, Chunliang Wang, Rupert Ecker, Georg Dorffner, Isabella Ellinger
Abstract: Skin cancer is among the most common cancer types. Dermoscopic image analysis improves the diagnostic accuracy for detection of malignant melanoma and other pigmented skin lesions when compared to unaided visual inspection. Hence, computer-based methods to support medical experts in the diagnostic procedure are of great interest. Fine-tuning pre-trained convolutional neural networks (CNNs) has been shown to work well for skin lesion classification. Pre-trained CNNs are usually trained with natural images of a fixed image size which is typically significantly smaller than captured skin lesion images and consequently dermoscopic images are downsampled for fine-tuning. However, useful medical information may be lost during this transformation. In this paper, we explore the effect of input image size on skin lesion classification performance of fine-tuned CNNs. For this, we resize dermoscopic images to different resolutions, ranging from 64x64 to 768x768 pixels and investigate the resulting classification performance of three well-established CNNs, namely DenseNet-121, ResNet-18, and ResNet-50. Our results show that using very small images (of size 64x64 pixels) degrades the classification performance, while images of size 128x128 pixels and above support good performance with larger image sizes leading to slightly improved classification. We further propose a novel fusion approach based on a three-level ensemble strategy that exploits multiple fine-tuned networks trained with dermoscopic images at various sizes. When applied on the ISIC 2017 skin lesion classification challenge, our fusion approach yields an area under the receiver operating characteristic curve of 89.2% and 96.6% for melanoma classification and seborrheic keratosis classification, respectively, outperforming state-of-the-art algorithms.
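Summary: Skin cancer is among the most common cancer types. Compared with unaided visual inspection, dermoscopic image analysis improves diagnostic accuracy for malignant melanoma and other pigmented skin lesions, so computer-based methods that support medical experts are of great interest. Fine-tuning pre-trained convolutional neural networks (CNNs) works well for skin lesion classification, but these networks are usually trained on natural images of a fixed size that is much smaller than captured lesion images, so dermoscopic images are downsampled and useful medical information may be lost. The paper studies the effect of input size by resizing dermoscopic images to resolutions from 64x64 to 768x768 pixels and evaluating three well-established CNNs: DenseNet-121, ResNet-18, and ResNet-50. Very small images (64x64 pixels) degrade classification, while images of 128x128 pixels and above perform well, with larger sizes giving slightly better results. The authors further propose a fusion approach based on a three-level ensemble of multiple fine-tuned networks trained at various image sizes. On the ISIC 2017 skin lesion classification challenge, it achieves an area under the receiver operating characteristic curve of 89.2% for melanoma and 96.6% for seborrheic keratosis classification, outperforming state-of-the-art algorithms.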

29. Perspective Plane Program Induction from a Single Image
Yikai Li, Jiayuan Mao, Xiuming Zhang, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu
Abstract: We study the inverse graphics problem of inferring a holistic representation for natural images. Given an input image, our goal is to induce a neuro-symbolic, program-like representation that jointly models camera poses, object locations, and global scene structures. Such high-level, holistic scene representations further facilitate low-level image manipulation tasks such as inpainting. We formulate this problem as jointly finding the camera pose and scene structure that best describe the input image. The benefits of such joint inference are two-fold: scene regularity serves as a new cue for perspective correction, and in turn, correct perspective correction leads to a simplified scene structure, similar to how the correct shape leads to the most regular texture in shape from texture. Our proposed framework, Perspective Plane Program Induction (P3I), combines search-based and gradient-based algorithms to efficiently solve the problem. P3I outperforms a set of baselines on a collection of Internet images, across tasks including camera pose estimation, global structure inference, and down-stream image manipulation tasks.
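Summary: The paper studies the inverse graphics problem of inferring a holistic representation for natural images. Given an input image, the goal is to induce a neuro-symbolic, program-like representation that jointly models camera pose, object locations, and global scene structure. Such high-level, holistic scene representations in turn facilitate low-level image manipulation tasks such as inpainting. The problem is formulated as jointly finding the camera pose and scene structure that best describe the input image. The benefit of joint inference is two-fold: scene regularity serves as a new cue for perspective correction, and correct perspective correction leads to a simplified scene structure, much as the correct shape leads to the most regular texture in shape from texture. The proposed framework, Perspective Plane Program Induction (P3I), combines search-based and gradient-based algorithms to solve the problem efficiently. P3I outperforms a set of baselines on a collection of Internet images across camera pose estimation, global structure inference, and downstream image manipulation tasks.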

30. Learning Data Augmentation with Online Bilevel Optimization for Image Classification
Saypraseuth Mounsaveng, Issam Laradji, Ismail Ben Ayed, David Vazquez, Marco Pedersoli
Abstract: Data augmentation is a key practice in machine learning for improving generalization performance. However, finding the best data augmentation hyperparameters requires domain knowledge or a computationally demanding search. We address this issue by proposing an efficient approach to automatically train a network that learns an effective distribution of transformations to improve its generalization score. Using bilevel optimization, we directly optimize the data augmentation parameters using a validation set. This framework can be used as a general solution to learn the optimal data augmentation jointly with an end task model like a classifier. Results show that our joint training method produces an image classification accuracy that is comparable to or better than carefully hand-crafted data augmentation. Yet, it does not need an expensive external validation loop on the data augmentation hyperparameters.
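Summary: Data augmentation is a key practice in machine learning for improving generalization. However, finding the best augmentation hyperparameters requires domain knowledge or a computationally demanding search. The paper addresses this with an efficient approach that automatically trains a network to learn an effective distribution of transformations that improves its generalization score. Using bilevel optimization, the augmentation parameters are optimized directly on a validation set. The framework can serve as a general solution for learning the optimal data augmentation jointly with an end-task model such as a classifier. Results show that joint training yields image classification accuracy comparable to or better than carefully hand-crafted data augmentation, without needing an expensive external validation loop over the augmentation hyperparameters.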

31. Adaptive additive classification-based loss for deep metric learning
Istvan Fehervari, Ives Macedo
Abstract: Recent works have shown that deep metric learning algorithms can benefit from weak supervision from another input modality. This additional modality can be incorporated directly into the popular triplet-based loss function as distances. Also recently, classification loss and proxy-based metric learning have been observed to lead to faster convergence as well as better retrieval results, all the while without requiring complex and costly sampling strategies. In this paper we propose an extension to the existing adaptive margin for classification-based deep metric learning. Our extension introduces a separate margin for each negative proxy per sample. These margins are computed during training from precomputed distances of the classes in the other modality. Our results set a new state of the art on both the Amazon fashion retrieval dataset and the public DeepFashion dataset. This was observed with both fastText- and BERT-based embeddings for the additional textual modality. Our results were achieved with faster convergence and lower code complexity than the prior state of the art.
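Summary: Recent work shows that deep metric learning can benefit from weak supervision provided by another input modality, which can be incorporated directly into the popular triplet-based loss as distances. Classification losses and proxy-based metric learning have also been observed to converge faster and retrieve better, without complex and costly sampling strategies. The paper extends the existing adaptive margin for classification-based deep metric learning by introducing a separate margin for each negative proxy per sample. The margins are computed during training from precomputed distances between the classes in the other modality. The results set a new state of the art on the Amazon fashion retrieval dataset and on the public DeepFashion dataset, with both fastText- and BERT-based embeddings as the additional textual modality, and are obtained with faster convergence and lower code complexity than the prior state of the art.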
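
To make the idea of "a separate margin for each negative proxy" more concrete, here is a minimal sketch in plain Python. The abstract does not give the loss formula, so everything below is an assumption for illustration: the function name `margin_proxy_loss`, the choice to add each margin to the corresponding negative similarity before a softmax cross-entropy, and the `scale` factor. The margins are passed in directly; in the paper they are derived from class distances in the other modality.

```python
import math

def margin_proxy_loss(similarities, label, margins, scale=10.0):
    """Softmax cross-entropy over proxy similarities (hypothetical form).

    similarities[j]: similarity between the sample embedding and proxy j.
    label: index of the sample's own (positive) proxy.
    margins[j]: additive margin for negative proxy j; the entry at
        `label` is ignored. Assumed to come from precomputed class
        distances in the second modality.
    """
    logits = [
        scale * (s if j == label else s + margins[j])
        for j, s in enumerate(similarities)
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]
```

With all margins set to zero this reduces to an ordinary scaled softmax cross-entropy; a larger margin on a particular negative proxy makes that negative count as harder, so the sample must be pushed further from it to reach the same loss.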

32. Duodepth: Static Gesture Recognition Via Dual Depth Sensors
Ilya Chugunov, Avideh Zakhor
Abstract: Static gesture recognition is an effective non-verbal communication channel between a user and their devices; however, many modern methods are sensitive to the relative pose of the user's hands with respect to the capture device, as parts of the gesture can become occluded. We present two methodologies for gesture recognition via synchronized recording from two depth cameras to alleviate this occlusion problem. One is a more classic approach using iterative closest point registration to accurately fuse point clouds and a single PointNet architecture for classification, and the other is a dual PointNet architecture for classification without registration. On a manually collected dataset of 20,100 point clouds we show a 39.2% reduction in misclassification for the fused point cloud method, and 53.4% for the dual PointNet, when compared to a standard single camera pipeline.
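Summary: Static gesture recognition is an effective non-verbal communication channel between users and their devices, but many modern methods are sensitive to the pose of the user's hands relative to the capture device, because parts of the gesture can become occluded. The paper presents two methods that use synchronized recordings from two depth cameras to alleviate occlusion. The first is a more classic approach that fuses the point clouds with iterative closest point registration and classifies them with a single PointNet; the second is a dual PointNet architecture that classifies without registration. On a manually collected dataset of 20,100 point clouds, misclassification falls by 39.2% for the fused point cloud method and by 53.4% for the dual PointNet, compared with a standard single-camera pipeline.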

33. Fully Convolutional Open Set Segmentation
Hugo Oliveira, Caio Silva, Gabriel L. S. Machado, Keiller Nogueira, Jefersson A. dos Santos
Abstract: In semantic segmentation, knowing about all existing classes is essential to yield effective results with the majority of existing approaches. However, these methods, trained on a Closed Set of classes, fail when new classes are found in the test phase. This means that they are not suitable for Open Set scenarios, which are very common in real-world computer vision and remote sensing applications. In this paper, we discuss the limitations of Closed Set segmentation and propose two fully convolutional approaches to effectively address Open Set semantic segmentation: OpenFCN and OpenPCS. OpenFCN is based on the well-known OpenMax algorithm, configuring a new application of this approach in segmentation settings. OpenPCS is a fully novel approach based on feature-space from DNN activations that serve as features for computing PCA and multi-variate Gaussian likelihood in a lower dimensional space. Experiments were conducted on the well-known Vaihingen and Potsdam segmentation datasets. OpenFCN showed little-to-no improvement when compared to the simpler and much more time-efficient SoftMax thresholding, while being some orders of magnitude slower. OpenPCS achieved promising results in almost all experiments by surpassing both OpenFCN and SoftMax thresholding. OpenPCS is also a reasonable compromise between the runtime performances of the extremely fast SoftMax thresholding and the extremely slow OpenFCN, being able to run close to real-time. Experiments also indicate that OpenPCS is effective, robust and suitable for Open Set segmentation, being able to improve the recognition of unknown class pixels without reducing the accuracy on the known class pixels.
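Summary: Most semantic segmentation approaches need to know every class in advance, so methods trained on a Closed Set of classes fail when new classes appear at test time. They are therefore unsuitable for Open Set scenarios, which are common in real-world computer vision and remote sensing. The paper discusses the limitations of Closed Set segmentation and proposes two fully convolutional approaches to Open Set semantic segmentation, OpenFCN and OpenPCS. OpenFCN applies the well-known OpenMax algorithm to segmentation. OpenPCS is a novel approach that uses DNN activations as features for computing PCA and a multivariate Gaussian likelihood in a lower-dimensional space. On the Vaihingen and Potsdam segmentation datasets, OpenFCN shows little to no improvement over the simpler and far faster SoftMax thresholding while being orders of magnitude slower. OpenPCS surpasses both OpenFCN and SoftMax thresholding in almost all experiments and is a reasonable runtime compromise between the two, running close to real time. The experiments indicate that OpenPCS is effective and robust for Open Set segmentation, improving recognition of unknown-class pixels without reducing accuracy on known-class pixels.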
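
The SoftMax thresholding baseline mentioned in this entry can be sketched in a few lines of plain Python: a pixel keeps its most probable known class only if that probability clears a threshold, and is otherwise marked unknown. The names `UNKNOWN`, `open_set_label`, and `segment`, and the threshold value 0.7, are illustrative assumptions rather than details from the paper.

```python
import math

UNKNOWN = -1  # hypothetical label for pixels rejected as "unknown class"

def softmax(logits):
    """Numerically stable softmax over a list of per-class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def open_set_label(logits, threshold=0.7):
    """Return the arg-max known class, or UNKNOWN if its probability is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else UNKNOWN

def segment(logit_map, threshold=0.7):
    """Apply the rule independently to every pixel of an H x W map of logit lists."""
    return [[open_set_label(px, threshold) for px in row] for row in logit_map]
```

For example, `segment([[[5.0, 0.0, 0.0], [1.0, 1.0, 1.0]]])` keeps class 0 for the confident first pixel and rejects the second, whose three classes are equally likely. The rule costs one comparison per pixel on top of the network's forward pass, which is consistent with the abstract calling it extremely fast.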

34. SPSG: Self-Supervised Photometric Scene Generation from RGB-D Scans
Angela Dai, Yawar Siddiqui, Justus Thies, Julien Valentin, Matthias Nießner
Abstract: We present SPSG, a novel approach to generate high-quality, colored 3D models of scenes from RGB-D scan observations by learning to infer unobserved scene geometry and color in a self-supervised fashion. Our self-supervised approach learns to jointly inpaint geometry and color by correlating an incomplete RGB-D scan with a more complete version of that scan. Notably, rather than relying on 3D reconstruction losses to inform our 3D geometry and color reconstruction, we propose adversarial and perceptual losses operating on 2D renderings in order to achieve high-resolution, high-quality colored reconstructions of scenes. This exploits the high-resolution, self-consistent signal from individual raw RGB-D frames, in contrast to fused 3D reconstructions of the frames which exhibit inconsistencies from view-dependent effects, such as color balancing or pose inconsistencies. Thus, by informing our 3D scene generation directly through 2D signal, we produce high-quality colored reconstructions of 3D scenes, outperforming state of the art on both synthetic and real data.
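Summary: The paper presents SPSG, a novel approach for generating high-quality, colored 3D models of scenes from RGB-D scan observations by learning to infer unobserved scene geometry and color in a self-supervised fashion. The approach learns to jointly inpaint geometry and color by correlating an incomplete RGB-D scan with a more complete version of the same scan. Rather than relying on 3D reconstruction losses, it uses adversarial and perceptual losses on 2D renderings to obtain high-resolution, high-quality colored reconstructions. This exploits the high-resolution, self-consistent signal of individual raw RGB-D frames, in contrast to fused 3D reconstructions of the frames, which show inconsistencies from view-dependent effects such as color balancing or pose errors. Informing 3D scene generation directly through 2D signal produces high-quality colored reconstructions of 3D scenes that outperform the state of the art on both synthetic and real data.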

35. Determining Image similarity with Quasi-Euclidean Metric
Vibhor Singh, Vishesh Devgan, Ishu Anand
Abstract: Image similarity is a core concept in Image Analysis due to its extensive application in computer vision, image processing, and pattern recognition. The objective of our study is to evaluate Quasi-Euclidean metric as an image similarity measure and analyze how it fares against the existing standard ways like SSIM and Euclidean metric. In this paper, we analyzed the similarity between two images from our own novice dataset and assessed its performance against the Euclidean distance metric and SSIM. We also present experimental results along with evidence indicating that our proposed implementation when applied to our novice dataset, furnished different results than standard metrics in terms of effectiveness and accuracy. In some cases, our methodology projected remarkable performance and it is also interesting to note that our implementation proves to be a step ahead in recognizing similarity when compared to
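Summary: Image similarity is a core concept in image analysis because of its extensive use in computer vision, image processing, and pattern recognition. The study evaluates the Quasi-Euclidean metric as an image similarity measure and analyzes how it fares against existing standard measures such as SSIM and the Euclidean metric. The authors analyze the similarity between pairs of images from their own novice dataset and assess the metric's performance against the Euclidean distance metric and SSIM. They present experimental results and evidence that, on this dataset, the proposed implementation gives different results from the standard metrics in terms of effectiveness and accuracy. In some cases the methodology showed remarkable performance, and the implementation proves to be a step ahead in recognizing similarity when compared to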
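
For readers unfamiliar with the metric named in this entry, the sketch below shows the quasi-Euclidean distance between two 2D points as it is commonly defined for distance transforms: full weight on the larger axis displacement and a weight of sqrt(2) - 1 on the smaller one. Using this standard definition is an assumption, since the abstract does not state a formula, and it does not say how the point-wise distance is turned into a similarity score for whole images, so that step is not attempted here. The function names are illustrative.

```python
import math

def quasi_euclidean(p, q):
    """Quasi-Euclidean distance between 2D points p and q.

    Approximates the Euclidean distance with horizontal, vertical, and
    diagonal steps: the larger axis displacement counts fully, the
    smaller one is weighted by sqrt(2) - 1.
    """
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    k = math.sqrt(2) - 1
    return dx + k * dy if dx > dy else k * dx + dy

def euclidean(p, q):
    """Ordinary Euclidean distance, for comparison."""
    return math.hypot(p[0] - q[0], p[1] - q[1])
```

The two distances agree along the axes and along exact diagonals, and the quasi-Euclidean value is slightly larger elsewhere; for the displacement (3, 4) it is 4 + 3(sqrt(2) - 1), a little over the Euclidean value of 5.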

36. Computing Light Transport Gradients using the Adjoint Method
Jos Stam
Abstract: This paper proposes a new equation from continuous adjoint theory to compute the gradient of quantities governed by the Transport Theory of light. Unlike discrete gradients à la autograd, which work at the code level, we first formulate the continuous theory and then discretize it. The key insight of this paper is that computing gradients in Transport Theory is akin to computing the importance, a quantity adjoint to radiance that satisfies an adjoint equation. Importance tells us where to look for light that matters. This is one of the key insights of this paper. In fact, this mathematical journey started from a whimsical thought that these adjoints might be related. Computing gradients is therefore no more complicated than computing the importance field. This insight and the following paper hopefully will shed some light on this complicated problem and ease the implementations of gradient computations in existing path tracers.
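Summary: The paper proposes a new equation, derived from continuous adjoint theory, for computing the gradient of quantities governed by the Transport Theory of light. Unlike discrete gradients in the style of autograd, which work at the code level, the continuous theory is formulated first and then discretized. The key insight is that computing gradients in Transport Theory is akin to computing the importance, a quantity adjoint to radiance that satisfies an adjoint equation; importance indicates where to look for light that matters. The work began from a whimsical thought that these adjoints might be related. Computing gradients is therefore no more complicated than computing the importance field. The author hopes this insight will shed light on a complicated problem and ease the implementation of gradient computations in existing path tracers.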

37. A Loss Function for Generative Neural Networks Based on Watson's Perceptual Model
Steffen Czolbe, Oswin Krause, Ingemar Cox, Christian Igel
Abstract: Training Variational Autoencoders (VAEs) to generate realistic imagery requires a loss function that reflects human perception of image similarity. We propose such a loss function based on Watson's perceptual model, which computes a weighted distance in frequency space and accounts for luminance and contrast masking. We extend the model to color images, increase its robustness to translation by using the Fourier Transform, remove artifacts due to splitting the image into blocks, and make it differentiable. In experiments, VAEs trained with the new loss function generated realistic, high-quality image samples. Compared to using the Euclidean distance and the Structural Similarity Index, the images were less blurry; compared to deep neural network based losses, the new approach required fewer computational resources and generated images with fewer artifacts.
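Summary: Training Variational Autoencoders (VAEs) to generate realistic imagery requires a loss function that reflects human perception of image similarity. The paper proposes such a loss based on Watson's perceptual model, which computes a weighted distance in frequency space and accounts for luminance and contrast masking. The model is extended to color images, made more robust to translation by using the Fourier Transform, freed of the artifacts caused by splitting the image into blocks, and made differentiable. In experiments, VAEs trained with the new loss generated realistic, high-quality image samples. The images were less blurry than with the Euclidean distance or the Structural Similarity Index, and the approach needed fewer computational resources and produced fewer artifacts than losses based on deep neural networks.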

38. Object-Centric Learning with Slot Attention
Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, Thomas Kipf
Abstract: Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.
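Summary: Learning object-centric representations of complex scenes is a promising step towards efficient abstract reasoning from low-level perceptual features. Yet most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. The paper presents the Slot Attention module, an architectural component that interfaces with perceptual representations, such as the output of a convolutional neural network, and produces a set of task-dependent abstract representations called slots. The slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. The authors show empirically that Slot Attention can extract object-centric representations that generalize to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.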
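
The "competitive procedure over multiple rounds of attention" can be illustrated with a stripped-down sketch in plain Python. It keeps only the core competition: the softmax is taken over the slot axis, so slots compete for each input, and each slot then becomes the weighted mean of the inputs it won. As a loud simplification, the learned projections, normalization layers, and learned slot-update networks of the full module are omitted, and the slot is simply overwritten with the weighted mean; the function names are illustrative and `rounds` is assumed to be at least 1.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def slot_attention_round(inputs, slots, eps=1e-8):
    """One simplified attention round.

    inputs: N vectors of length D; slots: K vectors of length D.
    Returns (new_slots, attn), where attn[i][k] is input i's weight on slot k.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # Softmax over the slot axis: the slots compete for each input.
    attn = [softmax([scale * dot(x, s) for s in slots]) for x in inputs]
    new_slots = []
    for k in range(len(slots)):
        col = [row[k] + eps for row in attn]
        total = sum(col)
        # Simplification: the slot becomes the weighted mean of the inputs.
        new_slots.append([
            sum(w * x[d] for w, x in zip(col, inputs)) / total
            for d in range(dim)
        ])
    return new_slots, attn

def slot_attention(inputs, slots, rounds=3):
    """Run several rounds so that each slot specializes on a group of inputs."""
    for _ in range(rounds):
        slots, attn = slot_attention_round(inputs, slots)
    return slots, attn
```

Because the normalization runs over slots rather than over inputs, an input claimed strongly by one slot contributes little to the others, which is what lets different slots bind to different objects.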

39. SAR2SAR: a self-supervised despeckling algorithm for SAR images
Emanuele Dalsasso, Loïc Denis, Florence Tupin
Abstract: Speckle reduction is a key step in many remote sensing applications. By strongly affecting synthetic aperture radar (SAR) images, speckle makes them difficult to analyse. Due to the difficulty of modeling the spatial correlation of speckle, a deep learning algorithm with self-supervision is proposed in this paper: SAR2SAR. Multi-temporal time series are leveraged and the neural network learns to restore SAR images by only looking at noisy acquisitions. For this purpose, the recently proposed noise2noise framework has been employed. The strategy to adapt it to SAR despeckling is presented, based on a compensation of temporal changes and a loss function adapted to the statistics of speckle. A study with synthetic speckle noise is presented to compare the performances of the proposed method with other state-of-the-art filters. Then, results on real images are discussed, to show the potential of the proposed algorithm. The code is made available to allow testing and reproducible research in this field.
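Summary: Speckle reduction is a key step in many remote sensing applications, because speckle strongly affects synthetic aperture radar (SAR) images and makes them difficult to analyse. Since the spatial correlation of speckle is hard to model, the paper proposes SAR2SAR, a deep learning algorithm with self-supervision. It leverages multi-temporal time series so that the network learns to restore SAR images by looking only at noisy acquisitions, building on the recently proposed noise2noise framework. The adaptation to SAR despeckling relies on compensating temporal changes and on a loss function matched to the statistics of speckle. A study with synthetic speckle noise compares the method with other state-of-the-art filters, and results on real images show its potential. The code is made available for testing and reproducible research.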

40. An Interactive Data Visualization and Analytics Tool to Evaluate Mobility and Sociability Trends During COVID-19
Fan Zuo, Jingxing Wang, Jingqin Gao, Kaan Ozbay, Xuegang Jeff Ban, Yubin Shen, Hong Yang, Shri Iyer
Abstract: The COVID-19 outbreak has dramatically changed travel behavior in affected cities. The C2SMART research team has been investigating the impact of COVID-19 on mobility and sociability. New York City (NYC) and Seattle, two of the cities most affected by COVID-19 in the U.S., were included in our initial study. An all-in-one dashboard with data mining and cloud computing capabilities was developed for interactive data analytics and visualization to facilitate the understanding of the impact of the outbreak and corresponding policies, such as social distancing, on transportation systems. This platform is updated regularly and continues to evolve with the addition of new data, impact metrics, and visualizations to assist the public and decision-makers in making informed decisions. This paper presents the architecture of the COVID-related mobility data dashboard and preliminary mobility and sociability metrics for NYC and Seattle.
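Summary: The COVID-19 outbreak has dramatically changed travel behavior in affected cities. The C2SMART research team has been investigating its impact on mobility and sociability, with New York City (NYC) and Seattle, two of the most affected U.S. cities, included in the initial study. An all-in-one dashboard with data mining and cloud computing capabilities was developed for interactive data analytics and visualization, to help understand how the outbreak and policies such as social distancing affect transportation systems. The platform is updated regularly and continues to evolve with new data, impact metrics, and visualizations to help the public and decision-makers make informed decisions. The paper presents the architecture of the dashboard and preliminary mobility and sociability metrics for NYC and Seattle.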

41. Orthogonal Deep Models As Defense Against Black-Box Attacks
Mohammad A. A. K. Jalwana, Naveed Akhtar, Mohammed Bennamoun, Ajmal Mian
Abstract: Deep learning has demonstrated state-of-the-art performance for a variety of challenging computer vision tasks. On one hand, this has enabled deep visual models to pave the way for a plethora of critical applications like disease prognostics and smart surveillance. On the other, deep learning has also been found vulnerable to adversarial attacks, which calls for new techniques to defend deep models against these attacks. Among the attack algorithms, the black-box schemes are of serious practical concern since they only need publicly available knowledge of the targeted model. We carefully analyze the inherent weakness of deep models in black-box settings where the attacker may develop the attack using a model similar to the targeted model. Based on our analysis, we introduce a novel gradient regularization scheme that encourages the internal representation of a deep model to be orthogonal to another, even if the architectures of the two models are similar. Our unique constraint allows a model to concomitantly endeavour for higher accuracy while maintaining near orthogonal alignment of gradients with respect to a reference model. Detailed empirical study verifies that controlled misalignment of gradients under our orthogonality objective significantly boosts a model's robustness against transferable black-box adversarial attacks. In comparison to regular models, the orthogonal models are significantly more robust to a range of $l_p$ norm bounded perturbations. We verify the effectiveness of our technique on a variety of large-scale models.
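To make the idea of near-orthogonal gradients concrete, here is a minimal Python sketch. It assumes the regularizer is the squared cosine similarity between the (flattened) input gradients of the model being trained and those of a reference model; the abstract does not give the exact formulation, and the function names and the weight `lam` are hypothetical.

```python
import math

def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened, non-zero gradient vectors."""
    dot = sum(x * y for x, y in zip(g1, g2))
    norm1 = math.sqrt(sum(x * x for x in g1))
    norm2 = math.sqrt(sum(y * y for y in g2))
    return dot / (norm1 * norm2)

def orthogonality_penalty(grad_model, grad_reference):
    """Hypothetical regularizer: 0 for orthogonal gradients, 1 for (anti-)parallel ones."""
    return cosine_similarity(grad_model, grad_reference) ** 2

def total_loss(task_loss, grad_model, grad_reference, lam=1.0):
    """Task loss plus the weighted penalty; lam is an assumed trade-off weight."""
    return task_loss + lam * orthogonality_penalty(grad_model, grad_reference)
```

Under this assumption, a perturbation built from the reference model's gradient has little component along the defended model's gradient when the penalty is near zero, which is the intuition behind reduced transferability.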

42. Not all Failure Modes are Created Equal: Training Deep Neural Networks for Explicable (Mis)Classification
Alberto Olmo, Sailik Sengupta, Subbarao Kambhampati
Abstract: Deep Neural Networks are often brittle on image classification tasks and known to misclassify inputs. While these misclassifications may be inevitable, not all failure modes can be considered equal. Certain misclassifications (e.g., classifying an image of a dog as an airplane) can create surprise and result in the loss of human trust in the system. Even worse, certain errors (e.g., a person misclassified as a primate) can have societal impacts. Thus, in this work, we aim to reduce inexplicable errors. To address this challenge, we first discuss how to obtain the class-level semantics that capture the human's expectation ($M^h$) regarding which classes are semantically close vs. ones that are far away. We show that for datasets like CIFAR-10 and CIFAR-100, class-level semantics can be obtained by leveraging human subject studies (significantly less expensive than in existing works) and, whenever possible, by utilizing publicly available human-curated knowledge. Second, we propose the use of Weighted Loss Functions to penalize misclassifications by the weight of their inexplicability. Finally, we show that training (or even fine-tuning) existing classifiers with the two proposed methods leads to Deep Neural Networks that have (1) comparable top-1 accuracy, an important metric in operational contexts, (2) more explicable failure modes, and (3) significantly lower cost in terms of additional human labels compared to existing work.
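One way to read "penalizing misclassifications by the weight of their inexplicability" is sketched below in Python. This is an illustrative assumption, not the paper's loss: it adds to the usual cross-entropy the expected semantic distance between the true class and the predicted distribution. The distance matrix, the class names, and the weight `alpha` are all made up for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_loss(logits, true_class, distance, alpha=1.0):
    """Cross-entropy plus alpha times the expected semantic distance from the true class."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[true_class])
    penalty = sum(distance[true_class][j] * p for j, p in enumerate(probs))
    return cross_entropy + alpha * penalty

# Hypothetical semantic distances: dog and cat are close, airplane is far from both.
CLASSES = ["dog", "cat", "airplane"]
DISTANCE = [
    [0.0, 0.2, 1.0],
    [0.2, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

# The true class is "dog" in both cases; only the wrongly favored class differs.
loss_cat = weighted_loss([1.0, 3.0, 1.0], 0, DISTANCE)
loss_airplane = weighted_loss([1.0, 1.0, 3.0], 0, DISTANCE)
```

With these numbers the plain cross-entropy term is the same in both cases, so the higher value of `loss_airplane` comes entirely from the semantic penalty: mistaking a dog for an airplane costs more than mistaking it for a cat.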

43. A survey of loss functions for semantic segmentation
Shruti Jadon
Abstract: Image segmentation has been an active field of research, as it has the potential to fix loopholes in healthcare and help the masses. In the past 5 years, various papers have proposed different objective loss functions for different cases such as biased data, sparse segmentation, etc. In this paper, we summarize most of the well-known loss functions widely used in image segmentation and list the cases where their usage can help a model converge faster and better. Furthermore, we introduce a new log-cosh dice loss function and compare its performance on NBFS skull stripping with widely used loss functions. We show that certain loss functions perform well across all datasets and can be taken as a good choice for datasets with unknown distributions. The code is available at this https URL.
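The log-cosh dice loss mentioned above can be sketched in a few lines of Python. The abstract does not spell out the formula, so this assumes the loss is the logarithm of the hyperbolic cosine of the standard (smoothed) Dice loss; the `smooth` constant and the flat-list inputs are illustrative choices.

```python
import math

def dice_loss(pred, target, smooth=1.0):
    """Smoothed Dice loss for flat lists of predicted probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)

def log_cosh_dice_loss(pred, target, smooth=1.0):
    """Assumed form: log(cosh(Dice loss)), which grows roughly quadratically near zero."""
    return math.log(math.cosh(dice_loss(pred, target, smooth)))
```

Because log(cosh(x)) behaves like x^2 / 2 for small x and stays below x for positive x, this wrapper yields a smoother objective near a perfect overlap than the raw Dice loss.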

44. Point Proposal Network for Reconstructing 3D Particle Positions with Sub-Pixel Precision in Liquid Argon Time Projection Chambers
Laura Dominé, Kazuhiro Terao
Abstract: Liquid Argon Time Projection Chambers (LArTPC) are particle imaging detectors recording 2D or 3D images of numerous complex trajectories of charged particles. Identifying points of interest in these images, such as the starting and ending points of particle trajectories, is a crucial step in identifying and analyzing these particles, and it affects the inference of physics signals such as neutrino interactions. The Point Proposal Network is designed to discover specific points of interest, namely the starting and ending points of track-like particle trajectories such as muons and protons, and the starting points of electromagnetic shower-like particle trajectories such as electrons and gamma rays. The algorithm predicts their spatial location with sub-voxel precision and also determines the category of the identified points of interest. Using the PILArNet public LArTPC data sample as a benchmark, our algorithm successfully predicted 96.8%, 97.8%, and 98.1% of 3D points within voxel distances of 3, 10, and 20, respectively, of the provided true point locations. For the predicted 3D points within 3 voxels of the closest true point locations, the median distance is found to be 0.25 voxels, achieving sub-voxel precision. We report that the majority of predicted points that are more than 10 voxels away from the closest true point locations are legitimate mistakes, and our algorithm achieved high enough accuracy to identify issues associated with a small fraction of true point locations provided in the dataset. Further, using those predicted points, we demonstrate a set of simple algorithms to cluster 3D voxels into individual track-like particle trajectories with a clustering efficiency, purity, and Adjusted Rand Index of 83.2%, 96.7%, and 94.7%, respectively.
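The distance-based figures quoted above (fraction of points within a voxel threshold, median distance of the close matches) can be computed as in the following Python sketch. It assumes each predicted point is matched to its closest true point by Euclidean distance in voxel units; the abstract does not state the exact matching procedure, and the helper names are hypothetical.

```python
import math
import statistics

def closest_distances(predicted, true_points):
    """Distance from each predicted point to its nearest true point."""
    return [min(math.dist(p, t) for t in true_points) for p in predicted]

def fraction_within(distances, threshold):
    """Share of predicted points whose nearest true point is within the threshold."""
    return sum(d <= threshold for d in distances) / len(distances)

def median_within(distances, threshold):
    """Median distance over the predicted points that fall within the threshold."""
    return statistics.median(d for d in distances if d <= threshold)
```

Here `median_within` raises `statistics.StatisticsError` if no predicted point falls within the threshold, so a caller may want to check `fraction_within` first.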

45. Graph Optimal Transport for Cross-Domain Alignment
Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, Jingjing Liu
Abstract: Cross-domain alignment between two sets of entities (e.g., objects in an image, words in a sentence) is fundamental to both computer vision and natural language processing. Existing methods mainly focus on designing advanced attention mechanisms to simulate soft alignment, with no training signals to explicitly encourage alignment. The learned attention matrices are also dense and lack interpretability. We propose Graph Optimal Transport (GOT), a principled framework that germinates from recent advances in Optimal Transport (OT). In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities as a dynamically constructed graph. Two types of OT distances are considered: (i) Wasserstein distance (WD) for node (entity) matching; and (ii) Gromov-Wasserstein distance (GWD) for edge (structure) matching. Both WD and GWD can be incorporated into existing neural network models, effectively acting as a drop-in regularizer. The inferred transport plan also yields sparse and self-normalized alignment, enhancing the interpretability of the learned model. Experiments show consistent outperformance of GOT over baselines across a wide range of tasks, including image-text retrieval, visual question answering, image captioning, machine translation, and text summarization.
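As a rough illustration of the node-matching part, the Python sketch below computes an approximate transport plan between two small sets of entities from a pairwise cost matrix. It uses Sinkhorn iterations, a standard entropy-regularized OT solver, which may differ from the solver used in the paper; the regularization strength `eps`, the iteration count, and the function names are assumptions made for the example.

```python
import math

def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan between marginals a (rows) and b (columns)."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(plan, cost):
    """Approximate Wasserstein distance: total cost of moving mass along the plan."""
    return sum(p * c for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))
```

For a cost matrix such as `[[0, 1], [1, 0]]` with uniform marginals, almost all mass stays on the diagonal, which is the kind of sparse, self-normalized alignment the abstract refers to. Very small `eps` values can underflow the exponential, so a log-domain variant would be safer in practice.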

46. Cascaded Convolutional Neural Networks with Perceptual Loss for Low Dose CT Denoising
Sepehr Ataei, Dr. Javad Alirezaie, Dr. Paul Babyn
Abstract: Low dose CT denoising research aims to reduce the risks of radiation exposure to patients. Recently, researchers have used deep learning to denoise low dose CT images with promising results. However, approaches that use mean-squared-error (MSE) tend to over-smooth the image, resulting in loss of fine structural details in low contrast regions of the image. These regions are often crucial for diagnosis and must be preserved in order for low dose CT to be used effectively in practice. In this work, we use a cascade of two neural networks, the first of which aims to reconstruct normal dose CT from low dose CT by minimizing perceptual loss, and the second of which predicts the difference between the ground truth and the prediction from the perceptual loss network. We show that our method outperforms related works and more effectively reconstructs fine structural details in low contrast regions of the image.

47. Deep Q-Network-Driven Catheter Segmentation in 3D US by Hybrid Constrained Semi-Supervised Learning and Dual-UNet
Hongxu Yang, Caifeng Shan, Alexander F. Kolen, Peter H. N. de With
Abstract: Catheter segmentation in 3D ultrasound is important for computer-assisted cardiac intervention. However, a large number of labeled images is required to train a successful deep convolutional neural network (CNN) to segment the catheter, which is expensive and time-consuming. In this paper, we propose a novel catheter segmentation approach, which requires fewer annotations than the supervised learning method but nevertheless achieves better performance. Our scheme uses deep Q-learning as a pre-localization step, which avoids voxel-level annotation and can efficiently localize the target catheter. With the detected catheter, a patch-based Dual-UNet is applied to segment the catheter in 3D volumetric data. To train the Dual-UNet with limited labeled images and leverage the information in unlabeled images, we propose a novel semi-supervised scheme, which exploits unlabeled images based on hybrid constraints from predictions. Experiments show that the proposed scheme achieves higher performance than state-of-the-art semi-supervised methods and demonstrate that our method is able to learn from large-scale unlabeled images.

48. Can 3D Adversarial Logos Cloak Humans?
Tianlong Chen, Yi Wang, Jingyang Zhou, Sijia Liu, Shiyu Chang, Chandrajit Bajaj, Zhangyang Wang
Abstract: With the trend of adversarial attacks, researchers attempt to fool trained object detectors in 2D scenes. Among these attacks, an intriguing new form with potential real-world usage is to append adversarial patches (e.g., logos) to images. Nevertheless, much less is known about adversarial attacks from 3D rendering views, which is essential for the attack to be persistently strong in the physical world. This paper presents a new 3D adversarial logo attack: we construct an arbitrarily shaped logo from a 2D texture image and map this image into a 3D adversarial logo via a texture mapping called logo transformation. The resulting 3D adversarial logo is then viewed as an adversarial texture, enabling easy manipulation of its shape and position. This greatly extends the versatility of adversarial training for computer-graphics-synthesized imagery. Contrary to the traditional adversarial patch, this new form of attack is mapped into the 3D object world and back-propagates to the 2D image domain through differentiable rendering. In addition, and unlike existing adversarial patches, our new 3D adversarial logo is shown to fool state-of-the-art deep object detectors robustly under model rotations, taking one step further toward realistic attacks in the physical world. Our code is available at this https URL.

Note: The Chinese text is machine-translated.


[arXiv papers] Computer Vision and Pattern Recognition 2020-06-29
