Abstracts

1. V2VNet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction
Tsun-Hsuan Wang, Sivabalan Manivasagam, Ming Liang, Bin Yang, Wenyuan Zeng, James Tu, Raquel Urtasun
Abstract: In this paper, we explore the use of vehicle-to-vehicle (V2V) communication to improve the perception and motion forecasting performance of self-driving vehicles. By intelligently aggregating the information received from multiple nearby vehicles, we can observe the same scene from different viewpoints. This allows us to see through occlusions and detect actors at long range, where the observations are very sparse or non-existent. We also show that our approach of sending compressed deep feature map activations achieves high accuracy while satisfying communication bandwidth requirements.

2. Source Free Domain Adaptation with Image Translation
Yunzhong Hou, Liang Zheng
Abstract: Effort in releasing large-scale datasets may be compromised by privacy and intellectual property considerations. A feasible alternative is to release pre-trained models instead. While these models are strong on their original task (source domain), their performance might degrade significantly when deployed directly in a new environment (target domain), which might not contain labels for training under realistic settings. Domain adaptation (DA) is a known solution to the domain gap problem, but usually requires labeled source data. In this paper, we study the problem of source free domain adaptation (SFDA), whose distinctive feature is that the source domain only provides a pre-trained model, but no source data. Being source free adds significant challenges to DA, especially when considering that the target dataset is unlabeled. To solve the SFDA problem, we propose an image translation approach that transfers the style of target images to that of unseen source images. To this end, we align the batch-wise feature statistics of generated images with those stored in the batch normalization layers of the pre-trained model. Compared with directly classifying target images, higher accuracy is obtained with these style-transferred images using the pre-trained model. On several image classification datasets, we show that the above-mentioned improvements are consistent and statistically significant.
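The statistic-alignment idea in this abstract can be illustrated with a small sketch. The abstract does not give the exact loss, so the squared-distance form below, and the names `batch_stats` and `bn_alignment_loss`, are assumptions made for illustration only; features are represented as plain Python lists rather than tensors.

```python
def batch_stats(features):
    """Per-channel mean and (biased) variance over a batch.

    features: list of samples, each a list with one value per channel.
    """
    n = len(features)
    channels = len(features[0])
    means = [sum(f[c] for f in features) / n for c in range(channels)]
    variances = [
        sum((f[c] - means[c]) ** 2 for f in features) / n
        for c in range(channels)
    ]
    return means, variances


def bn_alignment_loss(features, running_mean, running_var):
    """Hypothetical loss: squared distance between the batch statistics of
    generated-image features and the statistics stored in a BN layer."""
    means, variances = batch_stats(features)
    mean_term = sum((m - rm) ** 2 for m, rm in zip(means, running_mean))
    var_term = sum((v - rv) ** 2 for v, rv in zip(variances, running_var))
    return mean_term + var_term
```

The loss is zero when the batch statistics already match the stored ones and grows as the generated images drift away from the source-domain statistics.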

3. Zero Shot Domain Generalization
Udit Maniyar, Joseph K J, Aniket Anand Deshmukh, Urun Dogan, Vineeth N Balasubramanian
Abstract: Standard supervised learning setting assumes that training data and test data come from the same distribution (domain). Domain generalization (DG) methods try to learn a model that when trained on data from multiple domains, would generalize to a new unseen domain. We extend DG to an even more challenging setting, where the label space of the unseen domain could also change. We introduce this problem as Zero-Shot Domain Generalization (to the best of our knowledge, the first such effort), where the model generalizes across new domains and also across new classes in those domains. We propose a simple strategy which effectively exploits semantic information of classes, to adapt existing DG methods to meet the demands of Zero-Shot Domain Generalization. We evaluate the proposed methods on CIFAR-10, CIFAR-100, F-MNIST and PACS datasets, establishing a strong baseline to foster interest in this new research direction.

4. Hey Human, If your Facial Emotions are Uncertain, You Should Use Bayesian Neural Networks!
Maryam Matin, Matias Valdenegro-Toro
Abstract: Facial emotion recognition is the task to classify human emotions in face images. It is a difficult task due to high aleatoric uncertainty and visual ambiguity. A large part of the literature aims to show progress by increasing accuracy on this task, but this ignores the inherent uncertainty and ambiguity in the task. In this paper we show that Bayesian Neural Networks, as approximated using MC-Dropout, MC-DropConnect, or an Ensemble, are able to model the aleatoric uncertainty in facial emotion recognition, and produce output probabilities that are closer to what a human expects. We also show that calibration metrics show strange behaviors for this task, due to the multiple classes that can be considered correct, which motivates future work. We believe our work will motivate other researchers to move away from Classical and into Bayesian Neural Networks.
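As a rough illustration of the MC-Dropout approximation mentioned in this abstract, the sketch below averages the softmax outputs of several stochastic forward passes. The tiny linear classifier, the dropout on inputs, and the names `stochastic_forward` and `mc_dropout_predict` are assumptions for illustration, not the authors' model.

```python
import math
import random


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(x, weights, drop_prob, rng):
    """One forward pass of a linear classifier with dropout kept active."""
    keep = 1.0 - drop_prob
    # Inverted dropout: surviving inputs are rescaled by 1 / keep.
    masked = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * xi for w, xi in zip(row, masked)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(x, weights, drop_prob=0.5, passes=100, seed=0):
    """Average class probabilities over several stochastic passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(x, weights, drop_prob, rng)
               for _ in range(passes)]
    classes = len(weights)
    return [sum(s[c] for s in samples) / passes for c in range(classes)]
```

The averaged probabilities tend to be less peaked than a single deterministic pass, which is the behavior the abstract associates with outputs closer to human expectations.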

5. Improving Emergency Response during Hurricane Season using Computer Vision
Marc Bosch, Christian Conroy, Benjamin Ortiz, Philip Bogden
Abstract: We have developed a framework for crisis response and management that incorporates the latest technologies in computer vision (CV), inland flood prediction, damage assessment and data visualization. The framework uses data collected before, during, and after the crisis to enable rapid and informed decision making during all phases of disaster response. Our computer-vision model analyzes spaceborne and airborne imagery to detect relevant features during and after a natural disaster and creates metadata that is transformed into actionable information through web-accessible mapping tools. In particular, we have designed an ensemble of models to identify features including water, roads, buildings, and vegetation from the imagery. We have investigated techniques to bootstrap and reduce dependency on large data annotation efforts by adding use of open source labels including OpenStreetMap and adding complementary data sources including Height Above Nearest Drainage (HAND) as a side channel to the network's input to encourage it to learn other features orthogonal to visual characteristics. Modeling efforts include modification of connected U-Nets for (1) semantic segmentation, (2) flood line detection, and (3) damage assessment. In particular for the case of damage assessment, we added a second encoder to U-Net so that it could learn pre-event and post-event image features simultaneously. Through this method, the network is able to learn the difference between the pre- and post-disaster images, and therefore more effectively classify the level of damage. We have validated our approaches using publicly available data from the National Oceanic and Atmospheric Administration (NOAA)'s Remote Sensing Division, which displays the city and street-level details as mosaic tile images, as well as data released as part of the xView2 challenge.

6. Spatial Temporal Transformer Network for Skeleton-based Action Recognition
Chiara Plizzari, Marco Cannici, Matteo Matteucci
Abstract: Skeleton-based Human Activity Recognition has attracted great interest in recent years, as skeleton data has been demonstrated to be robust to illumination changes, body scales, dynamic camera views and complex background. In particular, Spatial-Temporal Graph Convolutional Networks (ST-GCN) have been demonstrated to be effective in learning both spatial and temporal dependencies on non-Euclidean data such as skeleton graphs. Nevertheless, an effective encoding of the latent information underlying the 3D skeleton is still an open problem, especially how to extract effective information from joint motion patterns and their correlations. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) to model inter-frame correlations. The two are combined in a two-stream network, whose performance is evaluated on three large-scale datasets, NTU-RGB+D 60, NTU-RGB+D 120 and Kinetics Skeleton 400, outperforming the state-of-the-art on NTU-RGB+D w.r.t. models using the same input data consisting of joint information.
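To make the self-attention operator over joints more concrete, here is a minimal sketch of scaled dot-product attention within a single frame. It assumes identity query, key, and value projections and a single head, which is a simplification and not the ST-TR architecture; `joint_self_attention` is a hypothetical name.

```python
import math


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(joints):
    """Scaled dot-product self-attention across the joints of one frame.

    joints: list of feature vectors, one per skeleton joint.
    Queries, keys, and values are the raw features (identity projections).
    """
    dim = len(joints[0])
    attended = []
    for query in joints:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in joints
        ]
        weights = softmax(scores)
        # Each output is a convex combination of all joint features.
        attended.append([
            sum(w * value[d] for w, value in zip(weights, joints))
            for d in range(dim)
        ])
    return attended
```

Applying the same operator along the time axis, with one joint tracked across frames instead of many joints in one frame, gives the temporal counterpart described in the abstract.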

7. Rotation-Invariant Gait Identification with Quaternion Convolutional Neural Networks
Bowen Jing, Vinay Prabhu, Angela Gu, John Whaley
Abstract: A desirable property of accelerometric gait-based identification systems is robustness to new device orientations presented by users during testing but unseen during the training phase. However, traditional convolutional neural networks (CNNs) used in these systems compensate poorly for such transformations. In this paper, we target this problem by introducing Quaternion CNN, a network architecture which is intrinsically layer-wise equivariant and globally invariant under 3D rotations of an array of input vectors. We show empirically that this network indeed significantly outperforms a traditional CNN in a multi-user rotation-invariant gait classification setting. Lastly, we demonstrate how the kernels learned by this QCNN can also be visualized as basis-independent but origin- and chirality-dependent trajectory fragments in the Euclidean space, thus yielding a novel mode of feature visualization and extraction.

8. SoftPoolNet: Shape Descriptor for Point Cloud Completion and Classification
Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari
Abstract: Point clouds are often the default choice for many applications as they exhibit more flexibility and efficiency than volumetric data. Nevertheless, their unorganized nature -- points are stored in an unordered way -- makes them less suited to be processed by deep learning pipelines. In this paper, we propose a method for 3D object completion and classification based on point clouds. We introduce a new way of organizing the extracted features based on their activations, which we name soft pooling. For the decoder stage, we propose regional convolutions, a novel operator aimed at maximizing the global activation entropy. Furthermore, inspired by the local refining procedure in Point Completion Network (PCN), we also propose a patch-deforming operation to simulate deconvolutional operations for point clouds. This paper proves that our regional activation can be incorporated in many point cloud architectures like AtlasNet and PCN, leading to better performance for geometric completion. We evaluate our approach on different 3D tasks such as object completion and classification, achieving state-of-the-art accuracy.

9. AP-Loss for Accurate One-Stage Object Detection
Kean Chen, Weiyao Lin, Jianguo Li, John See, Ji Wang, Junni Zou
Abstract: One-stage object detectors are trained by optimizing classification-loss and localization-loss simultaneously, with the former suffering greatly from the extreme foreground-background class imbalance caused by the large number of anchors. This paper alleviates this issue by proposing a novel framework to replace the classification task in one-stage detectors with a ranking task, and adopting the Average-Precision loss (AP-loss) for the ranking problem. Due to its non-differentiability and non-convexity, the AP-loss cannot be optimized directly. For this purpose, we develop a novel optimization algorithm, which seamlessly combines the error-driven update scheme in perceptron learning and the backpropagation algorithm in deep networks. We provide in-depth analyses on the good convergence property and computational complexity of the proposed algorithm, both theoretically and empirically. Experimental results demonstrate notable improvement in addressing the imbalance issue in object detection over existing AP-based optimization algorithms. An improved state-of-the-art performance is achieved in one-stage detectors based on AP-loss over detectors using classification-losses on various standard benchmarks. The proposed framework is also highly versatile in accommodating different network architectures. Code is available at this https URL.
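For readers unfamiliar with the quantity being optimized, the sketch below computes average precision from a ranked list of anchor scores and defines the loss as one minus that value. This is only the plain, non-differentiable metric under the usual definition; it is not the paper's error-driven optimization algorithm, and the function names are illustrative.

```python
def average_precision(scores, labels):
    """Average precision of a ranking.

    scores: predicted scores, higher means more likely foreground.
    labels: 1 for foreground (positive), 0 for background.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


def ap_loss(scores, labels):
    """AP-loss as one minus average precision (zero for a perfect ranking)."""
    return 1.0 - average_precision(scores, labels)
```

Because the value depends only on the ordering of scores, adding many easy background anchors below all positives does not change it, which is why a ranking formulation is less sensitive to class imbalance.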

10. An Improved Dilated Convolutional Network for Herd Counting in Crowded Scenes
Soufien Hamrouni, Hakim Ghazzai, Hamid Menouar, Yahya Massoud
Abstract: Crowd management technologies that leverage computer vision are widespread in contemporary times. There exist many security-related applications of these methods, including, but not limited to: following the flow of an array of people and monitoring large gatherings. In this paper, we propose an accurate monitoring system composed of two concatenated convolutional deep learning architectures. The first part, called the Front-end, is responsible for converting bi-dimensional signals and delivering high-level features. The second part, called the Back-end, is a dilated Convolutional Neural Network (CNN) used to replace pooling layers. It is responsible for enlarging the receptive field of the whole network and converting the descriptors provided by the first network to a saliency map that will be utilized to estimate the number of people in highly congested images. We also propose to utilize a genetic algorithm in order to find an optimized dilation rate configuration in the back-end. The proposed model is shown to converge 30% faster than state-of-the-art approaches. It is also shown that it achieves 20% lower Mean Absolute Error (MAE) when applied to the Shanghai dataset.
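The role of the dilation rates can be seen from simple receptive-field arithmetic. The sketch below assumes a stack of stride-1 convolutions with no pooling, where each layer with kernel size k and dilation d adds (k - 1) * d to the receptive field; the layer configurations in the usage note are made up for illustration and are not taken from the paper.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (along one axis) of stacked stride-1 convolutions."""
    field = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        # A dilated kernel spans (kernel - 1) * dilation + 1 input positions.
        field += (kernel - 1) * dilation
    return field
```

For example, three 3x3 layers with dilation 1 cover 7 pixels per axis, while the same layers with dilations (2, 2, 2) cover 13, without any pooling or extra parameters. A search over dilation configurations, such as the genetic algorithm the abstract mentions, explores this trade-off.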

11. Self-Supervised Learning for Monocular Depth Estimation from Aerial Imagery
Max Hermann, Boitumelo Ruf, Martin Weinmann, Stefan Hinz
Abstract: Supervised learning based methods for monocular depth estimation usually require large amounts of extensively annotated training data. In the case of aerial imagery, this ground truth is particularly difficult to acquire. Therefore, in this paper, we present a method for self-supervised learning for monocular depth estimation from aerial imagery that does not require annotated training data. For this, we only use an image sequence from a single moving camera and learn to simultaneously estimate depth and pose information. By sharing the weights between pose and depth estimation, we achieve a relatively small model, which favors real-time application. We evaluate our approach on three diverse datasets and compare the results to conventional methods that estimate depth maps based on multi-view geometry. We achieve an accuracy δ1.25 of up to 93.5%. In addition, we have paid particular attention to the generalization of a trained model to unknown data and the self-improving capabilities of our approach. We conclude that, even though the results of monocular depth estimation are inferior to those achieved by conventional methods, they are well suited to provide a good initialization for methods that rely on image matching or to provide estimates in regions where image matching fails, e.g. occluded or texture-less regions.
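The abstract does not define its accuracy figure. Assuming it refers to the threshold accuracy commonly reported for depth estimation (the fraction of pixels whose ratio between predicted and true depth, taken in whichever direction is larger, stays below 1.25), a minimal sketch looks like this; `delta_accuracy` is a hypothetical name.

```python
def delta_accuracy(predicted, truth, threshold=1.25):
    """Fraction of pixels with max(pred / gt, gt / pred) below the threshold.

    predicted, truth: flat lists of strictly positive depth values.
    """
    ratios = [max(p / t, t / p) for p, t in zip(predicted, truth)]
    return sum(1 for r in ratios if r < threshold) / len(ratios)
```

Under this reading, a value of 93.5% would mean that 93.5% of evaluated pixels have a predicted depth within a factor of 1.25 of the reference depth.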

12. Multi-label Learning with Missing Values using Combined Facial Action Unit Datasets
Jaspar Pahl, Ines Rieger, Dominik Seuss
Abstract: Facial action units allow an objective, standardized description of facial micro movements which can be used to describe emotions in human faces. Annotating data for action units is an expensive and time-consuming task, which leads to a scarce data situation. By combining multiple datasets from different studies, the amount of training data for a machine learning algorithm can be increased in order to create robust models for automated, multi-label action unit detection. However, every study annotates different action units, leading to a tremendous amount of missing labels in a combined database. In this work, we examine this challenge and present our approach to create a combined database and an algorithm capable of learning in the presence of missing labels without inferring their values. Our approach shows competitive performance compared to recent competitions in action unit detection.
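One common way to learn with missing labels without inferring them is to leave unannotated entries out of the loss. The abstract does not say how the authors' algorithm works, so the masked binary cross-entropy below is an assumed illustration of the general idea, with `None` standing in for an action unit that a study did not annotate.

```python
import math


def masked_bce(probabilities, labels):
    """Mean binary cross-entropy over annotated labels only.

    probabilities: predicted probabilities, strictly between 0 and 1.
    labels: 1, 0, or None when the action unit was not annotated.
    """
    losses = []
    for p, y in zip(probabilities, labels):
        if y is None:
            continue  # a missing label contributes nothing to the loss
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses) if losses else 0.0
```

With this masking, a sample from a dataset that annotates only a few action units still provides a training signal for those units and none for the rest.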

13. Category-Level 3D Non-Rigid Registration from Single-View RGB Images
Diego Rodriguez, Florian Huber, Sven Behnke
Abstract: In this paper, we propose a novel approach to solve the 3D non-rigid registration problem from RGB images using Convolutional Neural Networks (CNNs). Our objective is to find a deformation field (typically used for transferring knowledge between instances, e.g., grasping skills) that warps a given 3D canonical model into a novel instance observed by a single-view RGB image. This is done by training a CNN that infers a deformation field for the visible parts of the canonical model and by employing a learned shape (latent) space for inferring the deformations of the occluded parts. As a result of the registration, the observed model is reconstructed. Because our method does not need depth information, it can register objects that are typically hard to perceive with RGB-D sensors, e.g. with transparent or shiny surfaces. Even without depth data, our approach outperforms the Coherent Point Drift (CPD) registration method for the evaluated object categories.

14. White blood cell classification
Na Dong, Meng-die Zhai, Jian-fang Chang, Chun-ho Wu
Abstract: This paper proposes a novel automatic classification framework for the recognition of five types of white blood cells. Segmenting complete white blood cells from blood smear images and extracting advantageous features from them remain challenging tasks in the classification of white blood cells. Therefore, we present an adaptive threshold segmentation method to deal with blood smear images with non-uniform color and uneven illumination, which is designed based on color space information and threshold segmentation. Subsequently, after successfully separating the white blood cell from the blood smear image, a large number of nonlinear features including geometrical, color and texture features are extracted. Nevertheless, redundant features can affect the classification speed and efficiency, and in view of that, a feature selection algorithm based on classification and regression trees (CART) is designed. Through in-depth analysis of the nonlinear relationship between features, the irrelevant and redundant features are successfully removed from the initial nonlinear features. Afterwards, the selected prominent features are fed into a particle swarm optimization support vector machine (PSO-SVM) classifier to recognize the types of the white blood cells. Finally, to evaluate the performance of the proposed white blood cell classification methodology, we build a white blood cell data set containing 500 blood smear images for experiments. By comparing with the ground truth obtained manually, the proposed segmentation method achieves an average of 95.98% and 97.57% Dice similarity for segmented nucleus and cell regions respectively. Furthermore, the proposed methodology achieves 99.76% classification accuracy, which well demonstrates its effectiveness.
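The Dice similarity used to score the segmentation can be computed as twice the overlap divided by the total size of the two masks. The sketch below uses flat binary lists as masks; the convention of returning 1.0 when both masks are empty is an assumption made here, not something the abstract states.

```python
def dice_similarity(mask_a, mask_b):
    """Dice coefficient 2 * |A and B| / (|A| + |B|) for two binary masks.

    mask_a, mask_b: flat lists of 0/1 values with equal length.
    """
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks are treated as a perfect match (assumed convention).
    return 2.0 * intersection / total if total else 1.0
```

A value of 1.0 means the predicted region coincides with the manual ground truth, and 0.0 means the two regions do not overlap at all.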

15. DeepGIN: Deep Generative Inpainting Network for Extreme Image Inpainting
Chu-Tak Li, Wan-Chi Siu, Zhi-Song Liu, Li-Wen Wang, Daniel Pak-Kong Lun
Abstract: The degree of difficulty in image inpainting depends on the types and sizes of the missing parts. Existing image inpainting approaches usually encounter difficulties in completing the missing parts in the wild with pleasing visual and contextual results as they are trained for either dealing with one specific type of missing patterns (mask) or unilaterally assuming the shapes and/or sizes of the masked areas. We propose a deep generative inpainting network, named DeepGIN, to handle various types of masked images. We design a Spatial Pyramid Dilation (SPD) ResNet block to enable the use of distant features for reconstruction. We also employ Multi-Scale Self-Attention (MSSA) mechanism and Back Projection (BP) technique to enhance our inpainting results. Our DeepGIN outperforms the state-of-the-art approaches generally, including two publicly available datasets (FFHQ and Oxford Buildings), both quantitatively and qualitatively. We also demonstrate that our model is capable of completing masked images in the wild.

16. Neutral Face Game Character Auto-Creation via PokerFace-GAN
Tianyang Shi, Zhengxia Zou, Xinhui Song, Zheng Song, Changjian Gu, Changjie Fan, Yi Yuan
Abstract: Game character customization is one of the core features of many recent Role-Playing Games (RPGs), where players can edit the appearance of their in-game characters according to their preferences. This paper studies the problem of automatically creating in-game characters from a single photo. In recent literature on this topic, neural networks are introduced to make the game engine differentiable, and self-supervised learning is used to predict facial customization parameters. However, in previous methods, the expression parameters and facial identity parameters are highly coupled with each other, making it difficult to model the intrinsic facial features of the character. Besides, the neural network based renderer used in previous methods is also difficult to extend to multi-view rendering cases. In this paper, considering the above problems, we propose a novel method named "PokerFace-GAN" for neutral face game character auto-creation. We first build a differentiable character renderer which is more flexible than the previous methods in multi-view rendering cases. We then take advantage of adversarial training to effectively disentangle the expression parameters from the identity parameters and thus generate player-preferred neutral face (expression-less) characters. Since all components of our method are differentiable, our method can be easily trained under a multi-task self-supervised learning paradigm. Experiment results show that our method can generate vivid neutral face game characters that are highly similar to the input photos. The effectiveness of our method is verified by comparison results and ablation studies.

17. Multi-organ Segmentation via Co-training Weight-averaged Models from Few-organ Datasets
Rui Huang, Yuanjie Zheng, Zhiqiang Hu, Shaoting Zhang, Hongsheng Li
Abstract: Multi-organ segmentation has extensive use in many clinical applications. To segment multiple organs of interest, it is generally quite difficult to collect full annotations of all the organs on the same images, as some medical centers might only annotate a portion of the organs due to their own clinical practice. In most scenarios, one might obtain annotations of a single or a few organs from one training set, and obtain annotations of the other organs from another set of training images. Existing approaches mostly train and deploy a single model for each subset of organs, which is memory intensive and also time inefficient. In this paper, we propose to co-train weight-averaged models for learning a unified multi-organ segmentation network from few-organ datasets. We collaboratively train two networks and let the coupled networks teach each other on un-annotated organs. To alleviate the noisy teaching supervisions between the networks, the weight-averaged models are adopted to produce more reliable soft labels. In addition, a novel region mask is utilized to selectively apply the consistent constraint on the un-annotated organ regions that require collaborative teaching, which further boosts the performance. Extensive experiments on three publicly available single-organ datasets LiTS, KiTS, Pancreas and manually-constructed single-organ datasets from MOBA show that our method can better utilize the few-organ datasets and achieves superior performance with less inference computational cost.
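The abstract does not specify how the weight-averaged models are formed. A common choice, assumed here purely for illustration, is an exponential moving average of the trained network's weights, updated after every optimization step; the function name and the default decay are hypothetical.

```python
def update_averaged_weights(averaged, current, decay=0.99):
    """Exponential moving average of model weights (assumed averaging rule).

    averaged: weights of the slowly changing, weight-averaged model.
    current: weights of the network after the latest training step.
    """
    return [decay * a + (1.0 - decay) * c for a, c in zip(averaged, current)]
```

Because the averaged weights change slowly, predictions made with them fluctuate less between steps, which fits the abstract's claim that they yield more reliable soft labels for the partner network.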

18. How to Train Your Robust Human Pose Estimator: Pay Attention to the Constraint Cue
Junjie Huang, Zheng Zhu, Guan Huang, Dalong Du
Abstract: Both the appearance cue and the constraint cue are important in human pose estimation. However, the widely used response map supervision has the tendency to overfit the appearance cue and overlook the constraint cue. In this paper, we propose occlusion augmentation with customized training schedules to tackle this dilemma. Specifically, we implicitly force the neural network to focus on the constraint cue by dropping appearance information with a keypoint-aware strategy. Besides, a two-step schedule is designed to deal with the information shortage in the early training process, which effectively exploits the potential of the proposed occlusion augmentation. In experiments, as a model-agnostic approach, occlusion augmentation consistently improves most SOTAs with different input sizes, frameworks, backbones, training and test sets. For HRNet with W32-256x192 and W48plus-384x288 configurations, occlusion augmentation obtains gains of 0.6 AP (75.6 to 76.2) and 0.7 AP (76.8 to 77.5) on the COCO test-dev set, respectively. HRNet-W48plus-384x288 equipped with extra training data and occlusion augmentation achieves 78.7 AP. Furthermore, the proposed occlusion augmentation makes a remarkable improvement on the more challenging CrowdPose dataset. The source code will be publicly available for further research at this https URL.

19. Fast and Robust Face-to-Parameter Translation for Game Character Auto-Creation
Tianyang Shi, Zhengxia Zou, Yi Yuan, Changjie Fan
Abstract: With the rapid development of Role-Playing Games (RPGs), players are now allowed to edit the facial appearance of their in-game characters with their preferences rather than using default templates. This paper proposes a game character auto-creation framework that generates in-game characters according to a player's input face photo. Different from the previous methods that are designed based on neural style transfer or monocular 3D face reconstruction, we re-formulate the character auto-creation process in a different point of view: by predicting a large set of physically meaningful facial parameters under a self-supervised learning paradigm. Instead of updating facial parameters iteratively at the input end of the renderer as suggested by previous methods, which are time-consuming, we introduce a facial parameter translator so that the creation can be done efficiently through a single forward propagation from the face embeddings to parameters, with a considerable 1000x computational speedup. Despite its high efficiency, the interactivity is preserved in our method where users are allowed to optionally fine-tune the facial parameters on our creation according to their needs. Our approach also shows better robustness than previous methods, especially for those photos with head-pose variance. Comparison results and ablation analysis on seven public face verification datasets suggest the effectiveness of our method.

20. Reversing the cycle: self-supervised deep stereo through enhanced monocular distillation
Filippo Aleotti, Fabio Tosi, Li Zhang, Matteo Poggi, Stefano Mattoccia
Abstract: In many fields, self-supervised learning solutions are rapidly evolving and closing the gap with supervised approaches. This is the case for depth estimation from either monocular or stereo input, with the latter often providing a valid source of self-supervision for the former. In contrast, to soften typical stereo artefacts, we propose a novel self-supervised paradigm reversing the link between the two. Specifically, in order to train deep stereo networks, we distill knowledge through a monocular completion network. This architecture exploits single-image clues and a few sparse points, sourced by traditional stereo algorithms, to estimate dense yet accurate disparity maps by means of a consensus mechanism over multiple estimations. We thoroughly evaluate the impact of different supervisory signals on popular stereo datasets, showing how stereo networks trained with our paradigm outperform existing self-supervised frameworks. Finally, our proposal achieves notable generalization capabilities when dealing with domain shift issues. Code available at this https URL

21. Generative Design by Reinforcement Learning: Maximizing Diversity of Topology Optimized Designs
Seowoo Jang, Namwoo Kang
Abstract: Generative design is a design exploration process in which a large number of structurally optimal designs are generated in parallel by diversifying parameters of the topology optimization while fulfilling certain constraints. Recently, data-driven generative design has gained much attention due to its integration with artificial intelligence (AI) technologies. When generating new designs through a generative approach, one of the important evaluation factors is diversity. In general, the problem definition of topology optimization is diversified by varying the force and boundary conditions, and the diversity of the generated designs is influenced by such parameter combinations. This study proposes a reinforcement learning (RL) based generative design process with reward functions maximizing the diversity of the designs. We formulate the generative design as a sequential problem of finding optimal parameter level values according to a given initial design. Proximal Policy Optimization (PPO) was applied as the learning framework, which is demonstrated in the case study of an automotive wheel design problem. This study also proposes the use of a deep neural network to instantly generate new designs without the topology optimization process, thus reducing the large computational burdens required by reinforcement learning. We show that RL-based generative design produces a large number of diverse designs within a short inference time by exploiting a GPU in a fully automated manner. This differs from the previous CPU-based approach, which takes much more processing time and involves human intervention.

22. WSRNet: Joint Spotting and Recognition of Handwritten Words
George Retsinas, Giorgos Sfikas, Petros Maragos
Abstract: In this work, we present a unified model that can handle both Keyword Spotting and Word Recognition with the same network architecture. The proposed network is comprised of a non-recurrent CTC branch and a Seq2Seq branch that is further augmented with an Autoencoding module. The related joint loss leads to a boost in recognition performance, while the Seq2Seq branch is used to create efficient word representations. We show how to further process these representations with binarization and a retraining scheme to provide compact and highly efficient descriptors, suitable for keyword spotting. Numerical results validate the usefulness of the proposed architecture, as our method outperforms the previous state-of-the-art in keyword spotting, and provides results in the ballpark of the leading methods for word recognition.

23. Spherical coordinates transformation pre-processing in Deep Convolution Neural Networks for brain tumor segmentation in MRI
Carlo Russo, Sidong Liu, Antonio Di Ieva
Abstract: Magnetic Resonance Imaging (MRI) is used in everyday clinical practice to assess brain tumors. Several automatic or semi-automatic segmentation algorithms have been introduced to segment brain tumors and achieve an expert-like accuracy. Deep Convolutional Neural Networks (DCNN) have recently shown very promising results; however, DCNN models are still far from achieving clinically meaningful results, mainly because of the lack of generalization of the models. DCNN models need large annotated datasets to achieve good performance. Models are often optimized on the domain dataset on which they have been trained, and then fail the task when the same model is applied to different datasets from different institutions. One of the reasons is the lack of data standardization to adjust for different models and MR machines. In this work, a 3D spherical coordinates transform during the pre-processing phase has been hypothesized to improve DCNN models' accuracy and to allow more generalizable results even when the model is trained on small and heterogeneous datasets and translated into different domains. Indeed, the spherical coordinate system avoids several standardization issues since it works independently of resolution and imaging settings. Both Cartesian and spherical volumes were evaluated in two DCNN models with the same network structure using the BraTS 2019 dataset. The model trained on spherical transform pre-processed inputs resulted in superior performance over the Cartesian-input trained model on predicting gliomas' segmentation on tumor core and enhancing tumor classes (increase of 0.011 and 0.014 respectively on the validation dataset), achieving a further improvement in accuracy by merging the two models together. Furthermore, the spherical transform is not resolution-dependent and achieves the same results at different input resolutions.
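As a rough illustration of the coordinate change described in the abstract above, the sketch below converts a voxel position from Cartesian to spherical coordinates and back. The choice of origin, the angle conventions, and both function names are assumptions made for this example; the abstract does not say how the authors resample MRI volumes, so this shows only the underlying mapping.

```python
import math

def cartesian_to_spherical(x, y, z, origin=(0.0, 0.0, 0.0)):
    """Map a point to (r, theta, phi) relative to `origin` (hypothetical helper).

    theta is the polar angle measured from the +z axis, in [0, pi];
    phi is the azimuth in the x-y plane, in (-pi, pi].
    """
    dx, dy, dz = x - origin[0], y - origin[1], z - origin[2]
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    # Clamp the ratio to guard against tiny floating-point overshoot.
    theta = math.acos(max(-1.0, min(1.0, dz / r))) if r > 0 else 0.0
    phi = math.atan2(dy, dx)
    return r, theta, phi

def spherical_to_cartesian(r, theta, phi, origin=(0.0, 0.0, 0.0)):
    """Inverse of cartesian_to_spherical under the same conventions."""
    x = origin[0] + r * math.sin(theta) * math.cos(phi)
    y = origin[1] + r * math.sin(theta) * math.sin(phi)
    z = origin[2] + r * math.cos(theta)
    return x, y, z
```

In this parameterization the two angles do not depend on voxel spacing, which is one way to read the abstract's remark that the spherical system works independently of resolution.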

24. Alpha Net: Adaptation with Composition in Classifier Space
Nadine Chang, Jayanth Koushik, Michael J. Tarr, Martial Hebert, Yu-Xiong Wang
Abstract: Deep learning classification models typically train poorly on classes with small numbers of examples. Motivated by the human ability to solve this task, models have been developed that transfer knowledge from classes with many examples to learn classes with few examples. Critically, the majority of these models transfer knowledge within model feature space. In this work, we demonstrate that transferring knowledge within classifier space is more effective and efficient. Specifically, by linearly combining strong nearest neighbor classifiers along with a weak classifier, we are able to compose a stronger classifier. Uniquely, our model can be implemented on top of any existing classification model that includes a classifier layer. We showcase the success of our approach in the task of long-tailed recognition, whereby the classes with few examples, otherwise known as the "tail" classes, suffer the most in performance and are the most challenging classes to learn. Using classifier-level knowledge transfer, we are able to drastically improve - by a margin as high as 12.6% - the state-of-the-art performance on the "tail" categories.
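The composition step mentioned in the abstract above (linearly combining nearest-neighbour strong classifiers with a weak one) can be sketched as follows. This is a minimal illustration under stated assumptions: classifiers are plain weight vectors, neighbours are chosen by Euclidean distance, and the mixing coefficients are passed in by hand. How the coefficients are actually obtained is not stated in the abstract, and the function names are hypothetical.

```python
import math

def nearest_classifiers(weak, strong_bank, k):
    """Return the k weight vectors in strong_bank closest to `weak`."""
    order = sorted(range(len(strong_bank)),
                   key=lambda i: math.dist(weak, strong_bank[i]))
    return [strong_bank[i] for i in order[:k]]

def compose_classifier(weak, neighbours, alphas):
    """Linear combination: alphas[0] * weak + sum_i alphas[i + 1] * neighbours[i]."""
    if len(alphas) != len(neighbours) + 1:
        raise ValueError("need one coefficient for the weak classifier plus one per neighbour")
    composed = [alphas[0] * w for w in weak]
    for alpha, neighbour in zip(alphas[1:], neighbours):
        composed = [c + alpha * v for c, v in zip(composed, neighbour)]
    return composed
```

Because the combination acts only on the final classifier weights, a sketch like this can sit on top of any model that ends in a classifier layer, which matches the claim in the abstract.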

25. Progressively Guided Alternate Refinement Network for RGB-D Salient Object Detection
Shuhan Chen, Yun Fu
Abstract: In this paper, we aim to develop an efficient and compact deep network for RGB-D salient object detection, where the depth image provides complementary information to boost performance in complex scenarios. Starting from a coarse initial prediction by a multi-scale residual block, we propose a progressively guided alternate refinement network to refine it. Instead of using an ImageNet pre-trained backbone network, we first construct a lightweight depth stream by learning from scratch, which can extract complementary features more efficiently with less redundancy. Then, different from the existing fusion-based methods, RGB and depth features are fed into the proposed guided residual (GR) blocks alternately to reduce their mutual degradation. By assigning progressive guidance in the stacked GR blocks within each side-output, the false detection and missing parts can be well remedied. Extensive experiments on seven benchmark datasets demonstrate that our model outperforms existing state-of-the-art approaches by a large margin, and also shows superiority in efficiency (71 FPS) and model size (64.9 MB).

26. Video Region Annotation with Sparse Bounding Boxes
Yuzheng Xu, Yang Wu, Nur Sabrina binti Zuraimi, Shohei Nobuhara, Ko Nishino
Abstract: Video analysis has been moving towards more detailed interpretation (e.g. segmentation) with encouraging progress. These tasks, however, increasingly rely on training data densely annotated in both space and time. Since such annotation is labour-intensive, little densely annotated video data with detailed region boundaries exists. This work aims to resolve this dilemma by learning to automatically generate region boundaries for all frames of a video from sparsely annotated bounding boxes of target regions. We achieve this with a Volumetric Graph Convolutional Network (VGCN), which learns to iteratively find keypoints on the region boundaries using the spatio-temporal volume of surrounding appearance and motion. The global optimization of VGCN makes it significantly stronger and lets it generalize better than existing solutions. Experimental results using two latest datasets (one real and one synthetic), including ablation studies, demonstrate the effectiveness and superiority of our method.

27. Oriented Object Detection in Aerial Images with Box Boundary-Aware Vectors
Jingru Yi, Pengxiang Wu, Bo Liu, Qiaoying Huang, Hui Qu, Dimitris Metaxas
Abstract: Oriented object detection in aerial images is a challenging task as the objects in aerial images are displayed in arbitrary directions and are usually densely packed. Current oriented object detection methods mainly rely on two-stage anchor-based detectors. However, the anchor-based detectors typically suffer from a severe imbalance issue between the positive and negative anchor boxes. To address this issue, in this work we extend the horizontal keypoint-based object detector to the oriented object detection task. In particular, we first detect the center keypoints of the objects, based on which we then regress the box boundary-aware vectors (BBAVectors) to capture the oriented bounding boxes. The box boundary-aware vectors are distributed in the four quadrants of a Cartesian coordinate system for all arbitrarily oriented objects. To relieve the difficulty of learning the vectors in the corner cases, we further classify the oriented bounding boxes into horizontal and rotational bounding boxes. In the experiment, we show that learning the box boundary-aware vectors is superior to directly predicting the width, height, and angle of an oriented bounding box, as adopted in the baseline method. Besides, the proposed method competes favorably with state-of-the-art methods. Code is available at this https URL.
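To make the box representation in the abstract above more concrete, the sketch below recovers the four corners of an oriented box from a centre point and four boundary vectors. It assumes each vector points from the centre to the midpoint of one box edge (top, right, bottom, left), so that a corner is the centre plus the two adjacent vectors. That reading, and the function name, are assumptions for illustration rather than the authors' exact decoding.

```python
def decode_oriented_box(centre, top, right, bottom, left):
    """Return corners (top-left, top-right, bottom-right, bottom-left).

    `centre` is an (x, y) point; the other arguments are (dx, dy) vectors
    from the centre to the midpoints of the corresponding box edges.
    """
    def corner(v1, v2):
        return (centre[0] + v1[0] + v2[0], centre[1] + v1[1] + v2[1])

    return [corner(top, left), corner(top, right),
            corner(bottom, right), corner(bottom, left)]
```

For an axis-aligned box the four vectors lie on the coordinate axes; rotating all four by the same angle yields the rotated box without regressing width, height, and angle directly.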

28. AutoPose: Searching Multi-Scale Branch Aggregation for Pose Estimation
Xinyu Gong, Wuyang Chen, Yifan Jiang, Ye Yuan, Xianming Liu, Qian Zhang, Yuan Li, Zhangyang Wang
Abstract: We present AutoPose, a novel neural architecture search (NAS) framework that is capable of automatically discovering multiple parallel branches of cross-scale connections towards accurate and high-resolution 2D human pose estimation. Recently, high-performance hand-crafted convolutional networks for pose estimation show growing demands on multi-scale fusion and high-resolution representations. However, current NAS works exhibit limited flexibility on scale searching: they predominantly adopt simplified search spaces of single-branch architectures. Such simplification limits the fusion of information at different scales and fails to maintain high-resolution representations. The presented AutoPose framework is able to search for multi-branch scales and network depth, in addition to the cell-level microstructure. Motivated by the search space, a novel bi-level optimization method is presented, where the network-level architecture is searched via reinforcement learning, and the cell-level search is conducted by the gradient-based method. Within 2.5 GPU days, AutoPose is able to find very competitive architectures on the MS COCO dataset that are also transferable to the MPII dataset. Our code is available at this https URL.

29. Adversarial Concurrent Training: Optimizing Robustness and Accuracy Trade-off of Deep Neural Networks
Elahe Arani, Fahad Sarfraz, Bahram Zonooz
Abstract: Adversarial training has been proven to be an effective technique for improving the adversarial robustness of models. However, there seems to be an inherent trade-off between optimizing the model for accuracy and robustness. To this end, we propose Adversarial Concurrent Training (ACT), which employs adversarial training in a collaborative learning framework whereby we train a robust model in conjunction with a natural model in a minimax game. ACT encourages the two models to align their feature space by using the task-specific decision boundaries and explore the input space more broadly. Furthermore, the natural model acts as a regularizer, enforcing priors on features that the robust model should learn. Our analyses on the behavior of the models show that ACT leads to a robust model with lower model complexity, higher information compression in the learned representations, and high posterior entropy solutions indicative of convergence to flatter minima. We demonstrate the effectiveness of the proposed approach across different datasets and network architectures. On ImageNet, ACT achieves 68.20% standard accuracy and 44.29% robustness accuracy under a 100-iteration untargeted attack, improving upon the standard adversarial training method's 65.70% standard accuracy and 42.36% robustness.

30. Time-Supervised Primary Object Segmentation
Yanchao Yang, Brian Lai, Stefano Soatto
Abstract: We describe an unsupervised method to detect and segment portions of live scenes that, at some point in time, are seen moving as a coherent whole, which we refer to as primary objects. Our method first segments motions by minimizing the mutual information between partitions of the image domain, which bootstraps a static object detection model that takes a single image as input. The two models are mutually reinforced within a feedback loop, enabling extrapolation to previously unseen classes of objects. Our method requires video for training, but can be used on either static images or videos at inference time. As the volume of our training sets grows, more and more objects are seen moving, thus turning our method into unsupervised (or time-supervised) training to segment primary objects. The resulting system outperforms the state-of-the-art in both video object segmentation and salient object detection benchmarks, even when compared to methods that use explicit manual annotation.

31. InstanceMotSeg: Real-time Instance Motion Segmentation for Autonomous Driving
Eslam Mohamed, Mahmoud Ewaisha, Mennatullah Siam, Hazem Rashed, Senthil Yogamani, Ahmad El-Sallab
Abstract: Moving object segmentation is a crucial task for autonomous vehicles as it can be used to segment objects in a class-agnostic manner based on their motion cues. It will enable the detection of objects unseen during training (e.g., moose or a construction truck) generically based on their motion. Although pixel-wise motion segmentation has been studied in the literature, it is not dealt with at instance level, which would help separate connected segments of moving objects, leading to better trajectory planning. In this paper, we propose a motion-based instance segmentation task and create a new annotated dataset based on KITTI, which will be released publicly. We make use of the YOLACT model to solve the instance motion segmentation task by feeding in flow and image as input and producing instance motion masks as output. We extend it to a multi-task model that learns semantic and motion instance segmentation in a computationally efficient manner. Our model is based on sharing a prototype generation network between the two tasks and learning separate prototype coefficients per task. To obtain real-time performance, we study different efficient encoders and obtain 39 fps on a Titan Xp GPU using MobileNetV2 with an improvement of 10% mAP relative to the baseline. A video demonstration of our work is available in this https URL.

32. Learning Disentangled Expression Representations from Facial Images
Marah Halawa, Manuel Wöllhaf, Eduardo Vellasques, Urko SánchezSanz, Olaf Hellwich
Abstract: Face images are subject to many different factors of variation, especially in unconstrained in-the-wild scenarios. For most tasks involving such images, e.g. expression recognition from video streams, having enough labeled data is prohibitively expensive. One common strategy to tackle such a problem is to learn disentangled representations for the different factors of variation of the observed data using adversarial learning. In this paper, we use a formulation of the adversarial loss to learn disentangled representations for face images. The model used facilitates learning on single-task datasets and improves the state-of-the-art in expression recognition with an accuracy of 60.53% on the AffectNet dataset, without using any additional data.

33. Is Face Recognition Sexist? No, Gendered Hairstyles and Biology Are
Vítor Albiero, Kevin W. Bowyer
Abstract: Recent news articles have accused face recognition of being "biased", "sexist" or "racist". There is consensus in the research literature that face recognition accuracy is lower for females, who often have both a higher false match rate and a higher false non-match rate. However, there is little published research aimed at identifying the cause of lower accuracy for females. For instance, the 2019 Face Recognition Vendor Test that documents lower female accuracy across a broad range of algorithms and datasets also lists "Analyze cause and effect" under the heading "What we did not do". We present the first experimental analysis to identify major causes of lower face recognition accuracy for females on datasets where previous research has observed this result. Controlling for equal amount of visible face in the test images reverses the apparent higher false non-match rate for females. Also, principal component analysis indicates that images of two different females are inherently more similar than of two different males, potentially accounting for a difference in false match rates.
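The two error rates named in the abstract above have standard definitions in biometric verification, which the following sketch computes for a fixed decision threshold. The score lists, the threshold, and the function name are illustrative assumptions and not data from the paper; higher scores are assumed to mean greater similarity.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_match_rate, false_non_match_rate) at `threshold`.

    genuine_scores: similarity scores for pairs of images of the same person.
    impostor_scores: similarity scores for pairs of images of different people.
    A pair is declared a match when its score is >= threshold.
    """
    false_matches = sum(1 for s in impostor_scores if s >= threshold)
    false_non_matches = sum(1 for s in genuine_scores if s < threshold)
    fmr = false_matches / len(impostor_scores)
    fnmr = false_non_matches / len(genuine_scores)
    return fmr, fnmr
```

Computing both rates separately per demographic group at the same threshold is the kind of comparison the abstract refers to when it says females often show both a higher false match rate and a higher false non-match rate.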

34. False Detection (Positives and Negatives) in Object Detection
Subrata Goswami
Abstract: Object detection is a very important function of visual perception systems. From the early days of classical object detection based on HOG to modern deep learning based detectors, object detection has improved in accuracy. Two-stage detectors usually have higher accuracy than single-stage ones. Both types of detectors use some form of quantization of the search space of rectangular regions of the image. There are far more quantized elements than true objects. The way these bounding boxes are filtered out possibly results in false positives and false negatives. This empirical study explores ways of reducing false positives and negatives with labelled data. In the process, it also discovered insufficient labelling in the Openimage 2019 Object Detection dataset.
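For readers unfamiliar with how false positives and false negatives are usually counted in detection, the sketch below matches predicted boxes to labelled boxes by intersection over union (IoU). The greedy highest-score-first matching and the 0.5 threshold are common conventions assumed here for illustration; the abstract does not state the protocol used in the study, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def count_detection_errors(predictions, ground_truth, iou_threshold=0.5):
    """Return (true_positives, false_positives, false_negatives).

    predictions: list of (score, box); ground_truth: list of boxes.
    Each labelled box can be matched by at most one prediction.
    """
    matched = set()
    tp = fp = 0
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_index, best_iou = None, iou_threshold
        for i, gt_box in enumerate(ground_truth):
            if i in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap >= best_iou:
                best_index, best_iou = i, overlap
        if best_index is None:
            fp += 1  # no unmatched label overlaps enough
        else:
            matched.add(best_index)
            tp += 1
    fn = len(ground_truth) - len(matched)  # labels no prediction claimed
    return tp, fp, fn
```

Note that under this counting, an object that is present in the image but missing from the labels turns a correct detection into a false positive, which is why insufficient labelling affects the measured error.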

35. A Self-supervised GAN for Unsupervised Few-shot Object Recognition
Khoi Nguyen, Sinisa Todorovic
Abstract: This paper addresses unsupervised few-shot object recognition, where all training images are unlabeled, and test images are divided into queries and a few labeled support images per object class of interest. The training and test images do not share object classes. We extend the vanilla GAN with two loss functions, both aimed at self-supervised learning. The first is a reconstruction loss that enforces the discriminator to reconstruct the probabilistically sampled latent code which has been used for generating the "fake" image. The second is a triplet loss that enforces the discriminator to output image encodings that are closer for more similar images. Evaluation, comparisons, and detailed ablation studies are done in the context of few-shot classification. Our approach significantly outperforms the state of the art on the Mini-Imagenet and Tiered-Imagenet datasets.
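The triplet loss mentioned in the abstract above is commonly written as max(0, d(anchor, positive) - d(anchor, negative) + margin). The sketch below implements that generic form on plain vectors. The Euclidean distance, the margin value, and the way triplets are chosen are assumptions for illustration, since the abstract does not specify them.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic hinge-style triplet loss on encoding vectors.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Minimizing this quantity pulls encodings of similar images together and pushes dissimilar ones apart, which is the behaviour the abstract asks of the discriminator's image encodings.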

36. Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis
Zhipeng Bao, Yu-Xiong Wang, Martial Hebert
Abstract: Generative modeling has recently shown great promise in computer vision, but its success is often limited to separate tasks. In this paper, motivated by multi-task learning of shareable feature representations, we consider a novel problem of learning a shared generative model across various tasks. We instantiate it on the illustrative dual-task of joint few-shot recognition and novel-view synthesis: given only one or few images of a novel object from arbitrary views with only category annotation, we aim to simultaneously learn an object classifier and generate images of the object from new viewpoints. To this end, we propose bowtie networks that jointly learn 3D geometric and semantic representations with feedback in the loop. Experimental evaluation on challenging fine-grained recognition datasets demonstrates that our synthesized images are realistic from multiple viewpoints and significantly improve recognition performance as ways of data augmentation, especially in the low-data regime. We further show that our approach is flexible and can be easily extended to incorporate other tasks, such as style guided synthesis.

37. Mesorasi: Architecture Support for Point Cloud Analytics via Delayed-Aggregation
Yu Feng, Boyuan Tian, Tiancheng Xu, Paul Whatmough, Yuhao Zhu
Abstract: Point cloud analytics is poised to become a key workload on battery-powered embedded and mobile platforms in a wide range of emerging application domains, such as autonomous driving, robotics, and augmented reality, where efficiency is paramount. This paper proposes Mesorasi, an algorithm-architecture co-designed system that simultaneously improves the performance and energy efficiency of point cloud analytics while retaining its accuracy. Our extensive characterizations of state-of-the-art point cloud algorithms show that, while structurally reminiscent of convolutional neural networks (CNNs), point cloud algorithms exhibit inherent compute and memory inefficiencies due to the unique characteristics of point cloud data. We propose delayed-aggregation, a new algorithmic primitive for building efficient point cloud algorithms. Delayed-aggregation hides the performance bottlenecks and reduces the compute and memory redundancies by exploiting the approximately distributive property of key operations in point cloud algorithms. Delayed-aggregation lets point cloud algorithms achieve 1.6x speedup and 51.1% energy reduction on a mobile GPU while retaining the accuracy (-0.9% loss to 1.2% gains). To maximize the algorithmic benefits, we propose minor extensions to contemporary CNN accelerators, which can be integrated into a mobile Systems-on-a-Chip (SoC) without modifying other SoC components. With additional hardware support, Mesorasi achieves up to 3.6x speedup.
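The "approximately distributive property" in the abstract above can be illustrated with a toy case. If the per-point feature transform is a linear map W, then transforming centre-normalized neighbours and max-pooling gives the same result as transforming the raw points once, max-pooling, and subtracting the transformed centre afterwards. The sketch below checks this for a single linear layer. It is an illustrative assumption about the idea, not the authors' pipeline: with non-linear layers the equality holds only approximately, as the abstract says.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def elementwise_max(vectors):
    """Max-pool a list of equal-length vectors, one output per dimension."""
    return [max(column) for column in zip(*vectors)]

def aggregate_then_transform(W, neighbours, centre):
    """Baseline order: normalize each neighbour to the centre, transform, pool."""
    return elementwise_max([matvec(W, subtract(p, centre)) for p in neighbours])

def delayed_aggregation(W, neighbours, centre):
    """Delayed order: transform raw points, pool, subtract the centre feature last."""
    features = [matvec(W, p) for p in neighbours]
    return subtract(elementwise_max(features), matvec(W, centre))
```

In the delayed order each raw point is transformed once and its feature can be reused by every neighbourhood that contains it, which is one way to see where the reduction in redundant compute could come from.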

38. Do Not Disturb Me: Person Re-identification Under the Interference of Other Pedestrians
Shizhen Zhao, Changxin Gao, Jun Zhang, Hao Cheng, Chuchu Han, Xinyang Jiang, Xiaowei Guo, Wei-Shi Zheng, Nong Sang, Xing Sun
Abstract: In the conventional person Re-ID setting, it is widely assumed that each cropped image contains a single person. However, in a crowded scene, off-the-shelf detectors may generate bounding boxes involving multiple people, in which a large proportion of background pedestrians or human occlusion exists. The representation extracted from such cropped images, which contain both the target and the interference pedestrians, might include distractive information. This will lead to wrong retrieval results. To address this problem, this paper presents a novel deep network termed Pedestrian-Interference Suppression Network (PISNet). PISNet leverages a Query-Guided Attention Block (QGAB) to enhance the feature of the target in the gallery, under the guidance of the query. Furthermore, the Guidance Reversed Attention Module and the Multi-Person Separation Loss help QGAB suppress the interference of other pedestrians. Our method is evaluated on two new pedestrian-interference datasets and the results show that the proposed method performs favorably against existing Re-ID methods.

39. Image Stylization for Robust Features
Iaroslav Melekhov, Gabriel J. Brostow, Juho Kannala, Daniyar Turmukhambetov
Abstract: Local features that are robust to both viewpoint and appearance changes are crucial for many computer vision tasks. In this work we investigate if photorealistic image stylization improves robustness of local features to not only day-night, but also weather and season variations. We show that image stylization in addition to color augmentation is a powerful method of learning robust features. We evaluate learned features on visual localization benchmarks, outperforming state of the art baseline models despite training without ground-truth 3D correspondences using synthetic homographies only. We use trained feature networks to compete in Long-Term Visual Localization and Map-based Localization for Autonomous Driving challenges achieving competitive scores.

40. Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding
Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Nicholas Jing Yuan
Abstract: Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to a given sentence. Currently, most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences. This challenging task requires capturing critical object relations to identify the queried target. However, existing approaches cannot distinguish notable objects and waste effort on ineffective relation modeling between unnecessary objects. Thus, we propose a novel object-aware multi-branch relation network for object-aware relation discovery. Concretely, we first devise multiple branches to develop object-aware region modeling, where each branch focuses on a crucial object mentioned in the sentence. We then propose multi-branch relation reasoning to capture critical object relationships between the main branch and the auxiliary branches. Moreover, we apply a diversity loss so that each branch attends only to its corresponding object, which boosts multi-branch learning. Extensive experiments show the effectiveness of our proposed method.

41. Visual stream connectivity predicts assessments of image quality
Elijah Bowen, Antonio Rodriguez, Damian Sowinski, Richard Granger
Abstract: Some biological mechanisms of early vision are comparatively well understood, but they have yet to be evaluated for their ability to accurately predict and explain human judgments of image similarity. From well-studied simple connectivity patterns in early vision, we derive a novel formalization of the psychophysics of similarity, showing the differential geometry that provides accurate and explanatory accounts of perceptual similarity judgments. These predictions then are further improved via simple regression on human behavioral reports, which in turn are used to construct more elaborate hypothesized neural connectivity patterns. Both approaches outperform standard successful measures of perceived image fidelity from the literature, as well as providing explanatory principles of similarity perception.

42. Neural Descent for Visual 3D Human Pose and Shape
Andrei Zanfir, Eduard Gabriel Bazavan, Mihai Zanfir, William T. Freeman, Rahul Sukthankar, Cristian Sminchisescu
Abstract: We present a deep neural network methodology to reconstruct the 3D pose and shape of people, given an input RGB image. We rely on a recently introduced, expressive full body statistical 3D human model, GHUM, trained end-to-end, and learn to reconstruct its pose and shape state in a self-supervised regime. Central to our methodology is a learning to learn and optimize approach, referred to as HUman Neural Descent (HUND), which avoids both second-order differentiation when training the model parameters, and expensive state gradient descent in order to accurately minimize a semantic differentiable rendering loss at test time. Instead, we rely on novel recurrent stages to update the pose and shape parameters such that not only are losses minimized effectively, but the process is meta-regularized in order to ensure end-progress. HUND's symmetry between training and testing makes it the first 3D human sensing architecture to natively support different operating regimes, including self-supervised ones. In diverse tests, we show that HUND achieves very competitive results on datasets like H3.6M and 3DPW, as well as good quality 3D reconstructions for complex imagery collected in-the-wild.

43. Geodesic Paths for Image Segmentation with Implicit Region-based Homogeneity Enhancement
Da Chen, Jian Zhu, Xinxin Zhang, Minglei Shu, Laurent D. Cohen
Abstract: Minimal paths are considered a powerful and efficient tool for boundary detection and image segmentation due to their global optimality and well-established numerical solutions such as the fast marching algorithm. In this paper, we introduce a flexible interactive image segmentation model based on the minimal geodesic framework in conjunction with region-based homogeneity enhancement. A key ingredient in our model is the construction of Finsler geodesic metrics, which are capable of integrating anisotropic and asymmetric edge features, region-based homogeneity and/or curvature regularization. This is done by exploiting an implicit method to incorporate the region-based homogeneity information into the metrics used. Moreover, we also introduce a way to build objective simple closed contours, each of which is treated as the concatenation of two disjoint open paths. Experimental results show that the proposed model outperforms state-of-the-art minimal-path-based image segmentation approaches.

44. Context-aware Feature Generation for Zero-shot Semantic Segmentation
Zhangxuan Gu, Siyuan Zhou, Li Niu, Zihan Zhao, Liqing Zhang
Abstract: Existing semantic segmentation models heavily rely on dense pixel-wise annotations. To reduce the annotation pressure, we focus on a challenging task named zero-shot semantic segmentation, which aims to segment unseen objects with zero annotations. This task can be accomplished by transferring knowledge across categories via semantic word embeddings. In this paper, we propose a novel context-aware feature generation method for zero-shot segmentation named CaGNet. In particular, with the observation that a pixel-wise feature highly depends on its contextual information, we insert a contextual module in a segmentation network to capture the pixel-wise contextual information, which guides the process of generating more diverse and context-aware features from semantic word embeddings. Our method achieves state-of-the-art results on three benchmark datasets for zero-shot segmentation. Codes are available at: this https URL.

45. DeVLBert: Learning Deconfounded Visio-Linguistic Representations
Shengyu Zhang, Tan Jiang, Tan Wang, Kun Kuang, Zhou Zhao, Jianke Zhu, Jin Yu, Hongxia Yang, Fei Wu
Abstract: In this paper, we investigate the problem of out-of-domain visio-linguistic pretraining, where the pretraining data distribution differs from that of the downstream data on which the pretrained model will be fine-tuned. Existing methods for this problem are purely likelihood-based, which leads to spurious correlations and hurts generalization ability when the model is transferred to out-of-domain downstream tasks. By spurious correlation, we mean that the conditional probability of one token (object or word) given another can be high (due to dataset biases) without a robust (causal) relationship between them. To mitigate such dataset biases, we propose a Deconfounded Visio-Linguistic Bert framework, abbreviated as DeVLBert, to perform intervention-based learning. We borrow the idea of the backdoor adjustment from the research field of causality and propose several neural-network based architectures for Bert-style out-of-domain pretraining. The quantitative results on three downstream tasks, Image Retrieval (IR), Zero-shot IR, and Visual Question Answering, show the effectiveness of DeVLBert in boosting generalization ability.
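For readers unfamiliar with the backdoor adjustment mentioned in the abstract above, the textbook form from causal inference is given here as a reminder. The symbols are assumptions chosen for illustration: $X$ is the conditioning token, $Y$ the predicted token, and $Z$ a confounder set satisfying the backdoor criterion. The abstract does not say how DeVLBert instantiates $Z$ or approximates the sum, so this is the general formula only, not the paper's architecture.

```latex
% Likelihood-based training fits the observational conditional:
P(Y \mid X) = \sum_{z} P(Y \mid X, z)\, P(z \mid X)

% The backdoor adjustment replaces P(z \mid X) with the prior P(z),
% which removes the dependence of the confounder on X:
P\bigl(Y \mid \mathrm{do}(X)\bigr) = \sum_{z} P(Y \mid X, z)\, P(z)
```

The two expressions differ only in the weighting of the confounder values, which is where a dataset bias can inflate $P(Y \mid X)$ without a causal link between the tokens.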

46. SPL-MLL: Selecting Predictable Landmarks for Multi-Label Learning
Junbing Li, Changqing Zhang, Pengfei Zhu, Baoyuan Wu, Lei Chen, Qinghua Hu
Abstract: Although significant progress has been achieved, multi-label classification is still challenging due to the complexity of correlations among different labels. Furthermore, modeling the relationships between the input and some (dull) classes further increases the difficulty of accurately predicting all possible labels. In this work, we propose to select a small subset of labels as landmarks that are easy to predict from the input (predictable) and can recover the other possible labels well (representative). Unlike existing methods, which separate landmark selection and landmark prediction in a two-step manner, the proposed algorithm, termed Selecting Predictable Landmarks for Multi-Label Learning (SPL-MLL), jointly conducts landmark selection, landmark prediction, and label recovery in a unified framework, to ensure both the representativeness and the predictability of the selected landmarks. We employ the Alternating Direction Method (ADM) to solve our problem. Empirical studies on real-world datasets show that our method achieves superior classification performance over other state-of-the-art methods.

47. Poet: Product-oriented Video Captioner for E-commerce
Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Jie Liu, Jingren Zhou, Hongxia Yang, Fei Wu
Abstract: In e-commerce, a growing number of user-generated videos are used for product promotion. Generating video descriptions that narrate the user-preferred product characteristics depicted in the video is vital for successful promotion. Traditional video captioning methods, which focus on routinely describing what exists and happens in a video, are not well suited to product-oriented video captioning. To address this problem, we propose a product-oriented video captioner framework, abbreviated as Poet. Poet first represents the videos as product-oriented spatial-temporal graphs. Then, based on the aspects of the video-associated product, we perform knowledge-enhanced spatial-temporal inference on those graphs to capture the dynamic change of fine-grained product-part characteristics. The knowledge leveraging module in Poet differs from the traditional design by performing knowledge filtering and dynamic memory modeling. We show that Poet achieves consistent performance improvement over previous methods in generation quality, product aspect capture, and lexical diversity. Experiments are performed on two product-oriented video captioning datasets, the buyer-generated fashion video dataset (BFVD) and the fan-generated fashion video dataset (FFVD), collected from Mobile Taobao. We will release the desensitized datasets to promote further investigation of both video captioning and general video analysis problems.

48. SMPLpix: Neural Avatars from 3D Human Models
Sergey Prokudin, Michael J. Black, Javier Romero
Abstract: Recent advances in deep generative models have led to an unprecedented level of realism for synthetically generated images of humans. However, one of the remaining fundamental limitations of these models is the limited ability to flexibly control the generative process, e.g., to change the camera and human pose while retaining the subject identity. At the same time, deformable human body models like SMPL and its successors provide full control over pose and shape, but rely on classic computer graphics pipelines for rendering. Such rendering pipelines require explicit mesh rasterization that (a) does not have the potential to fix artifacts or lack of realism in the original 3D geometry and (b) until recently was not fully incorporated into deep learning frameworks. In this work, we propose to bridge the gap between classic geometry-based rendering and the latest generative networks operating in pixel space by introducing a neural rasterizer, a trainable neural network module that directly "renders" a sparse set of 3D mesh vertices as photorealistic images, avoiding any hardwired logic in pixel colouring and occlusion reasoning. We train our model on a large corpus of human 3D models and corresponding real photos, and show the advantage over conventional differentiable renderers in terms of both the level of photorealism and rendering efficiency.

49. KutralNet: A Portable Deep Learning Model for Fire Recognition
Angel Ayala, Bruno Fernandes, Francisco Cruz, David Macêdo, Adriano L. I. Oliveira, Cleber Zanchettin
Abstract: Most automatic fire alarm systems detect the presence of fire through thermal, smoke, or flame sensors. One of the new approaches to the problem is the use of images to perform the detection. The image approach is promising since it does not need specific sensors and can be easily embedded in different devices. However, despite their high performance, the computational cost of the deep learning methods used is a challenge to their deployment in portable devices. In this work, we propose a new deep learning architecture that requires fewer floating-point operations (flops) for fire recognition. Additionally, we propose a portable approach for fire recognition that uses modern techniques such as the inverted residual block and depth-wise and octave convolutions to reduce the model's computational cost. The experiments show that our model keeps high accuracy while substantially reducing the number of parameters and flops. One of our models has 71% fewer parameters than FireNet, while still presenting competitive accuracy and AUROC performance. The proposed methods are evaluated on the FireNet and FiSmo datasets. The obtained results are promising for the implementation of the model in a mobile device, given the reduced number of flops and parameters.
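The parameter savings that depth-wise convolutions make possible can be illustrated with simple arithmetic. The sketch below compares a standard convolution with a depth-wise separable one (depth-wise followed by a 1x1 point-wise convolution), ignoring biases. The layer sizes are hypothetical and chosen only for illustration; they are not KutralNet's actual configuration, and the function names are not from the paper.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k convolution mapping c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise k x k convolution plus a 1 x 1 point-wise one (no bias)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)        # 73728 weights
sep = depthwise_separable_params(k, c_in, c_out)  # 8768 weights
print(f"standard={std}, separable={sep}, ratio={sep / std:.3f}")
```

For this one layer the separable form needs roughly an eighth of the weights; the saving for a whole network depends on how many layers are replaced.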

50. Detection of Gait Abnormalities caused by Neurological Disorders
Daksh Goyal, Koteswar Rao Jerripothula, Ankush Mittal
Abstract: In this paper, we leverage gait to potentially detect some of the important neurological disorders, namely Parkinson's disease, Diplegia, Hemiplegia, and Huntington's Chorea. Persons with these neurological disorders often have a very abnormal gait, which motivates us to target gait for their potential detection. Some of the abnormalities involve the circumduction of legs, forward-bending, involuntary movements, etc. To detect such abnormalities in gait, we develop gait features from the key-points of the human pose, namely shoulders, elbows, hips, knees, ankles, etc. To evaluate the effectiveness of our gait features in detecting the abnormalities related to these diseases, we build a synthetic video dataset of persons mimicking the gait of persons with such disorders, considering the difficulty in finding a sufficient number of people with these disorders. We name it the NeuroSynGait video dataset. Experiments demonstrated that our gait features were indeed successful in detecting these abnormalities.

51. Learning Flow-based Feature Warping for Face Frontalization with Illumination Inconsistent Supervision
Yuxiang Wei, Ming Liu, Haolin Wang, Ruifeng Zhu, Guosheng Hu, Wangmeng Zuo
Abstract: Despite recent advances in deep learning-based face frontalization methods, photo-realistic and illumination preserving frontal face synthesis is still challenging due to large pose and illumination discrepancy during training. We propose a novel Flow-based Feature Warping Model (FFWM) which can learn to synthesize photo-realistic and illumination preserving frontal images with illumination inconsistent supervision. Specifically, an Illumination Preserving Module (IPM) is proposed to learn illumination preserving image synthesis from illumination inconsistent image pairs. IPM includes two pathways which collaborate to ensure the synthesized frontal images are illumination preserving and with fine details. Moreover, a Warp Attention Module (WAM) is introduced to reduce the pose discrepancy in the feature level, and hence to synthesize frontal images more effectively and preserve more details of profile images. The attention mechanism in WAM helps reduce the artifacts caused by the displacements between the profile and the frontal images. Quantitative and qualitative experimental results show that our FFWM can synthesize photo-realistic and illumination preserving frontal images and performs favorably against the state-of-the-art results.

52. We Learn Better Road Pothole Detection: from Attention Aggregation to Adversarial Domain Adaptation
Rui Fan, Hengli Wang, Mohammud J. Bocus, Ming Liu
Abstract: Manual visual inspection performed by certified inspectors is still the main form of road pothole detection. This process is, however, not only tedious, time-consuming and costly, but also dangerous for the inspectors. Furthermore, the road pothole detection results are always subjective, because they depend entirely on the individual inspector's experience. Our recently introduced disparity (or inverse depth) transformation algorithm allows better discrimination between damaged and undamaged road areas, and it can be easily deployed in any semantic segmentation network for better road pothole detection results. To boost the performance, we propose a novel attention aggregation (AA) framework, which takes advantage of different types of attention modules. In addition, we develop an effective training set augmentation technique based on adversarial domain adaptation, where synthetic road RGB images and transformed road disparity (or inverse depth) images are generated to enhance the training of semantic segmentation networks. The experimental results demonstrate that, firstly, the transformed disparity (or inverse depth) images become more informative; secondly, AA-UNet and AA-RTFNet, our best performing implementations, respectively outperform all other state-of-the-art single-modal and data-fusion networks for road pothole detection; and finally, the training set augmentation technique based on adversarial domain adaptation not only improves the accuracy of the state-of-the-art semantic segmentation networks, but also accelerates their convergence.

53. Open source tools for management and archiving of digital microscopy data to allow integration with patient pathology and treatment information
Matloob Khushi, Georgina Edwards, Diego Alonso de Marcos, Jane E Carpenter, J Dinny Graham, Christine L Clarke
Abstract: Virtual microscopy includes digitisation of histology slides and the use of computer technologies for complex investigation of diseases such as cancer. However, automated image analysis, or website publishing of such digital images, is hampered by their large file sizes. We have developed two Java based open source tools: Snapshot Creator and NDPI-Splitter. Snapshot Creator converts a portion of a large digital slide into a JPEG image of the desired quality. The image is linked to the patient's clinical and treatment information in a customised open source cancer data management software (Caisis) in use at the Australian Breast Cancer Tissue Bank (ABCTB) and then published on the ABCTB website (this http URL) using Deep Zoom open source technology. Using the ABCTB online search engine, digital images can be searched by defining various criteria such as cancer type or biomarkers expressed. NDPI-Splitter splits a large image file into smaller sections of TIFF images so that they can be easily analysed by image analysis software such as Metamorph or Matlab. NDPI-Splitter also has the capacity to filter out empty images. Snapshot Creator and NDPI-Splitter are novel open source Java tools. They convert digital slides into files of smaller size for further processing. In conjunction with other open source tools such as Deep Zoom and Caisis, this suite of tools is used for the management and archiving of digital microscopy images, enabling digitised images to be explored and zoomed online. Our online image repository also has the capacity to be used as a teaching resource. These tools also enable large files to be sectioned for image analysis.

54. A novel approach to remove foreign objects from chest X-ray images
Hieu X. Le, Phuong D. Nguyen, Thang H. Nguyen, Khanh N.Q. Le, Thanh T. Nguyen
Abstract: We initially proposed a deep learning approach for inpainting foreign objects in smartphone-camera captured chest radiographs, utilizing the cheXphoto dataset. Foreign objects, which can significantly affect the quality of a computer-aided diagnostic prediction, are captured under various settings. In this paper, we use multiple methods to tackle both the removal of foreign objects from chest radiographs and the inpainting of the affected regions. Firstly, an object detection model is trained to separate the foreign objects from the given image. Subsequently, the binary mask of each object is extracted utilizing a segmentation model. Each pair of binary mask and extracted object is then used for inpainting. Finally, the inpainted regions are merged back into the original image, resulting in a clean output free of foreign objects. To conclude, we achieved state-of-the-art accuracy. The experimental results suggest possible applications of this method to the analysis of chest X-ray images.

55. Faster Person Re-Identification
Guan'an Wang, Shaogang Gong, Jian Cheng, Zengguang Hou
Abstract: Fast person re-identification (ReID) aims to search person images quickly and accurately. The main idea of recent fast ReID methods is the hashing algorithm, which learns compact binary codes and performs fast Hamming distance computation and counting sort. However, a very long code (e.g., 2048) is needed for high accuracy, which compromises search speed. In this work, we introduce a new solution for fast ReID by formulating a novel Coarse-to-Fine (CtF) hashing code search strategy, which uses short and long codes in a complementary way, achieving both faster speed and better accuracy. It uses shorter codes to coarsely rank broad matching similarities and longer codes to refine only a few top candidates for more accurate instance ReID. Specifically, we design an All-in-One (AiO) framework together with a Distance Threshold Optimization (DTO) algorithm. In AiO, we simultaneously learn and enhance multiple codes of different lengths in a single model. It learns multiple codes in a pyramid structure, and encourages shorter codes to mimic longer codes by self-distillation. DTO solves a complex threshold search problem by a simple optimization process, and the balance between accuracy and speed is easily controlled by a single parameter. It formulates the optimization target as an $F_{\beta}$ score that can be optimised by Gaussian cumulative distribution functions. Experimental results on 2 datasets show that our proposed method (CtF) is not only 8% more accurate but also $5\times$ faster than contemporary hashing ReID methods. Compared with non-hashing ReID methods, CtF is $50\times$ faster with comparable accuracy. Code is available at this https URL.
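The coarse-to-fine search idea in the abstract above can be sketched in a few lines of NumPy: short codes prune the gallery with a distance threshold, and long codes re-rank only the survivors. This is a toy illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the short code is simply a prefix of the long one (the paper learns codes of each length jointly), the threshold is fixed by hand rather than found by DTO, only two code lengths are used, and `argsort` stands in for counting sort.

```python
import numpy as np


def hamming(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Hamming distance between one binary code and every row of a code matrix."""
    return np.count_nonzero(gallery != query, axis=1)


def coarse_to_fine_search(q_short, q_long, g_short, g_long, threshold, top_k=5):
    """Prune with short codes, then re-rank the surviving candidates with long codes."""
    coarse = hamming(q_short, g_short)
    candidates = np.flatnonzero(coarse <= threshold)
    if candidates.size == 0:
        # Nothing passed the threshold: fall back to the single closest short code.
        candidates = np.array([int(np.argmin(coarse))])
    fine = hamming(q_long, g_long[candidates])
    order = np.argsort(fine, kind="stable")
    return candidates[order][:top_k]


# Toy gallery of 1000 random 2048-bit codes; the 32-bit "short" code is a prefix
# of the long one (an assumption made only to keep the example self-contained).
rng = np.random.default_rng(0)
g_long = rng.integers(0, 2, size=(1000, 2048), dtype=np.uint8)
g_short = g_long[:, :32]

# The query is gallery item 42 with its first three bits flipped.
q_long = g_long[42].copy()
q_long[:3] ^= 1
q_short = q_long[:32]

result = coarse_to_fine_search(q_short, q_long, g_short, g_long, threshold=8)
print(result[0])  # index of the best match after re-ranking
```

The speed-up comes from the second stage touching only the candidates that pass the short-code threshold, so the expensive long-code comparison runs on a small fraction of the gallery.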

56. Attack on Multi-Node Attention for Object Detection
Sizhe Chen, Fan He, Xiaolin Huang, Kun Zhang
Abstract: This paper focuses on highly transferable adversarial attacks on detection networks, which are crucial for life-critical systems such as autonomous driving and security surveillance. Detection networks are hard to attack in a black-box manner, because of their multiple-output property and their diversity across architectures. To pursue high attack transferability, one needs to find a common property shared by different models. The multi-node attention heat map obtained by our newly proposed method is such a property. Based on it, we design the ATTACk on multi-node attenTION for object detecTION (ATTACTION). ATTACTION achieves state-of-the-art transferability in numerical experiments. On MS COCO, the detection mAP for all 7 tested black-box architectures is halved and the performance of semantic segmentation is greatly affected. Given the great transferability of ATTACTION, we generate Adversarial Objects in COntext (AOCO), the first adversarial dataset on object detection networks, which could help designers quickly evaluate and improve the robustness of detection networks.

57. Cascaded channel pruning using hierarchical self-distillation
Roy Miles, Krystian Mikolajczyk
Abstract: In this paper, we propose an approach for filter-level pruning with hierarchical knowledge distillation based on the teacher, teaching-assistant, and student framework. Our method makes use of teaching assistants at intermediate pruning levels that share the same architecture and weights as the target student. We propose to prune each model independently using the gradient information from its corresponding teacher. By considering the relative sizes of each student-teacher pair, this formulation provides a natural trade-off between the capacity gap for knowledge distillation and the bias of the filter saliency updates. Our results show improvements in the attainable accuracy and model compression across the CIFAR10 and ImageNet classification tasks using the VGG16 and ResNet50 architectures. We provide an extensive evaluation that demonstrates the benefits of using a varying number of teaching assistant models at different sizes.

58. Cluster-level Feature Alignment for Person Re-identification
Qiuyu Chen, Wei Zhang, Jianping Fan
Abstract: Instance-level alignment is widely exploited for person re-identification, e.g., spatial alignment, latent semantic alignment and triplet alignment. This paper probes another feature alignment modality, namely cluster-level feature alignment across the whole dataset, where the model can see not only the sampled images in the local mini-batch but also the global feature distribution of the whole dataset through distilled anchors. Towards this aim, we propose an anchor loss and investigate many variants of cluster-level feature alignment, which consists of iterative aggregation and alignment from an overview of the dataset. Our extensive experiments demonstrate that our methods can provide consistent and significant performance improvement with small training effort after traditional training has saturated. In both theoretical and experimental aspects, our proposed methods can result in more stable and guided optimization towards better representation and generalization for well-aligned embedding.

59. A Deep Convolutional Neural Network for the Detection of Polyps in Colonoscopy Images
Tariq Rahim, Syed Ali Hassan, Soo Young Shin
Abstract: Computerized detection of colonic polyps remains an unsolved issue because of the wide variation in appearance, texture, color, and size, and the presence of multiple polyp-like imitators during colonoscopy. In this paper, we propose a deep convolutional neural network based model for the computerized detection of polyps within colonoscopy images. The proposed model comprises 16 convolutional layers with 2 fully connected layers, and a Softmax layer, where we implement a unique approach using different convolutional kernels within the same hidden layer for deeper feature extraction. We applied two different activation functions, Mish and the rectified linear unit (ReLU), for deeper propagation of information and self-regularized smooth non-monotonicity. Furthermore, we used a generalized intersection over union (GIoU), thus overcoming issues such as scale invariance, rotation, and shape. Data augmentation techniques such as photometric and geometric distortions are adapted to overcome the obstacles faced in polyp detection. Detailed benchmarked results are provided, showing better performance in terms of precision, sensitivity, F1-score, F2-score, and Dice coefficient, thus proving the efficacy of the proposed model.
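The abstract mentions a generalized intersection over union (GIoU) without giving details. As a rough illustration only, and not the authors' implementation, the sketch below computes GIoU for two axis-aligned boxes. The function name `giou` and the `(x1, y1, x2, y2)` box layout are assumptions made for this example, and both boxes are assumed to have positive area.

```python
def giou(box_a, box_b):
    """Generalized IoU for two axis-aligned boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration; assumes both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box that encloses both inputs.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    enclose = cw * ch

    # Penalize the part of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose
```

Identical boxes score 1, while disjoint boxes score below 0, so the measure still distinguishes near misses from distant ones when plain IoU would be 0 for both.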

60. Curriculum Learning for Recurrent Video Object Segmentation
Maria Gonzalez-i-Calabuig, Carles Ventura, Xavier Giró-i-Nieto
Abstract: Video object segmentation can be understood as a sequence-to-sequence task that can benefit from curriculum learning strategies for better and faster training of deep neural networks. This work explores different schedule sampling and frame skipping variations to significantly improve the performance of a recurrent architecture. Our results on the car class of the KITTI-MOTS challenge indicate that, surprisingly, an inverse schedule sampling is a better option than a classic forward one. They also indicate that progressive skipping of frames during training is beneficial, but only when training with the ground truth masks instead of the predicted ones. Source code and trained models are available at this http URL.

61. BroadFace: Looking at Tens of Thousands of People at Once for Face Recognition
Yonghyun Kim, Wonpyo Park, Jongju Shin
Abstract: Face recognition datasets contain an enormous number of identities and instances. However, conventional methods have difficulty reflecting the entire distribution of the datasets because a small mini-batch contains only a small portion of all identities. To overcome this difficulty, we propose a novel method called BroadFace, a learning process that comprehensively considers a massive set of identities. In BroadFace, a linear classifier learns optimal decision boundaries among identities from a large number of embedding vectors accumulated over past iterations. By referring to more instances at once, the optimality of the classifier is naturally increased over the entire dataset. Thus, the encoder is also globally optimized by referring to the weight matrix of the classifier. Moreover, we propose a novel compensation method to increase the number of referenced instances in the training stage. BroadFace can be easily applied to many existing methods to accelerate the learning process and obtain a significant improvement in accuracy without extra computational burden at the inference stage. We perform extensive ablation studies and experiments on various datasets to show the effectiveness of BroadFace, and also empirically prove the validity of our compensation method. BroadFace achieves state-of-the-art results with significant improvements on nine datasets in 1:1 face verification and 1:N face identification tasks, and is also effective in image retrieval.

62. ECG beats classification via online sparse dictionary and time pyramid matching
Nanyu Li, Yujuan Si, Duo Deng, Chunyu Yuan
Abstract: Recently, the Bag-of-Words (BOW) algorithm has provided efficient features and improved the accuracy of ECG classification systems. However, the BOW algorithm has two shortcomings: (1) it has large quantization errors and poor reconstruction performance; (2) it loses the heartbeat's time information and may provide confusing features for different kinds of heartbeats. Furthermore, ECG classification systems can be used for long-term monitoring and analysis of cardiovascular patients, which produces a huge amount of data, so an efficient compression algorithm is urgently needed. In view of the above problems, we use the wavelet feature to construct the sparse dictionary, which lowers the quantization error to a minimum. To reduce the complexity of our algorithm and adapt it to large-scale heartbeat processing, we combine Online Dictionary Learning with the Feature-sign algorithm to update the dictionary and coefficients. The coefficient matrix is used to represent ECG beats, which greatly reduces memory consumption and simultaneously solves the quantization error problem. Finally, we construct the pyramid to match the coefficients of each ECG beat, and obtain features that contain the beat time information by time stochastic pooling, which efficiently solves the problem of losing time information. The experimental results show that, on the one hand, the proposed algorithm has the advantage of high reconstruction performance for BOW, and this storage method offers high fidelity and low memory consumption; on the other hand, our algorithm yields the highest accuracy in ECG beat classification. The method is therefore more suitable for large-scale heartbeat data storage and classification.

63. Object Detection in the Context of Mobile Augmented Reality
Xiang Li, Yuan Tian, Fuyao Zhang, Shuxue Quan, Yi Xu
Abstract: In the past few years, numerous Deep Neural Network (DNN) models and frameworks have been developed to tackle the problem of real-time object detection from RGB images. Ordinary object detection approaches process information from the images only, and they are oblivious to the camera pose with regard to the environment and the scale of the environment. On the other hand, mobile Augmented Reality (AR) frameworks can continuously track a camera's pose within the scene and can estimate the correct scale of the environment by using Visual-Inertial Odometry (VIO). In this paper, we propose a novel approach that combines the geometric information from VIO with semantic information from object detectors to improve the performance of object detection on mobile devices. Our approach includes three components: (1) an image orientation correction method, (2) a scale-based filtering approach, and (3) an online semantic map. Each component takes advantage of the different characteristics of the VIO-based AR framework. We implemented the AR-enhanced features using ARCore and the SSD Mobilenet model on Android phones. To validate our approach, we manually labeled objects in image sequences taken from 12 room-scale AR sessions. The results show that our approach can improve on the accuracy of generic object detectors by 12% on our dataset.

64. Graph Edit Distance Reward: Learning to Edit Scene Graph
Lichang Chen, Guosheng Lin, Shijie Wang, Qingyao Wu
Abstract: The scene graph, as a vital tool to bridge the gap between the language domain and the image domain, has been widely adopted in cross-modality tasks such as VQA. In this paper, we propose a new method to edit the scene graph according to user instructions, a problem which has never been explored. Specifically, to learn to edit scene graphs according to the semantics given by text, we propose a Graph Edit Distance Reward, which is based on the Policy Gradient and Graph Matching algorithms, to optimize a neural symbolic model. In the context of text-editing image retrieval, we validate the effectiveness of our method on the CSS and CRIR datasets. CRIR is a new synthetic dataset that we generated and will publish soon for future use.

65. Neural Ray Surfaces for Self-Supervised Learning of Depth and Ego-motion
Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Wolfram Burgard, Greg Shakhnarovich, Adrien Gaidon
Abstract: Self-supervised learning has emerged as a powerful tool for depth and ego-motion estimation, leading to state-of-the-art results on benchmark datasets. However, one significant limitation shared by current methods is the assumption of a known parametric camera model -- usually the standard pinhole geometry -- leading to failure when applied to imaging systems that deviate significantly from this assumption (e.g., catadioptric cameras or underwater imaging). In this work, we show that self-supervision can be used to learn accurate depth and ego-motion estimation without prior knowledge of the camera model. Inspired by the geometric model of Grossberg and Nayar, we introduce Neural Ray Surfaces (NRS), convolutional networks that represent pixel-wise projection rays, approximating a wide range of cameras. NRS are fully differentiable and can be learned end-to-end from unlabeled raw videos. We demonstrate the use of NRS for self-supervised learning of visual odometry and depth estimation from raw videos obtained using a wide variety of camera systems, including pinhole, fisheye, and catadioptric.

66. Object Detection with a Unified Label Space from Multiple Datasets
Xiangyun Zhao, Samuel Schulter, Gaurav Sharma, Yi-Hsuan Tsai, Manmohan Chandraker, Ying Wu
Abstract: Given multiple datasets with different label spaces, the goal of this work is to train a single object detector predicting over the union of all the label spaces. The practical benefits of such an object detector are obvious and significant: application-relevant categories can be picked and merged from arbitrary existing datasets. However, naive merging of datasets is not possible in this case, due to inconsistent object annotations. Consider an object category like faces that is annotated in one dataset, but is not annotated in another dataset, although the object itself appears in the latter images. Some categories, like face here, would thus be considered foreground in one dataset, but background in another. To address this challenge, we design a framework which works with such partial annotations, and we exploit a pseudo labeling approach that we adapt for our specific case. We propose loss functions that carefully integrate partial but correct annotations with complementary but noisy pseudo labels. Evaluation in the proposed novel setting requires full annotation on the test set. We collect the required annotations and define a new challenging experimental setup for this task based on existing public datasets. We show improved performance compared to competitive baselines and appropriate adaptations of existing work.

67. Self-supervised Contrastive Video-Speech Representation Learning for Ultrasound
Jianbo Jiao, Yifan Cai, Mohammad Alsharid, Lior Drukker, Aris T. Papageorghiou, J. Alison Noble
Abstract: In medical imaging, manual annotations can be expensive to acquire and sometimes infeasible to access, making conventional deep learning-based models difficult to scale. As a result, it would be beneficial if useful representations could be derived from raw data without the need for manual annotations. In this paper, we propose to address the problem of self-supervised representation learning with multi-modal ultrasound video-speech raw data. For this case, we assume that there is a high correlation between the ultrasound video and the corresponding narrative speech audio of the sonographer. In order to learn meaningful representations, the model needs to identify such correlation and at the same time understand the underlying anatomical features. We designed a framework to model the correspondence between video and audio without any kind of human annotations. Within this framework, we introduce cross-modal contrastive learning and an affinity-aware self-paced learning scheme to enhance correlation modelling. Experimental evaluations on multi-modal fetal ultrasound video and audio show that the proposed approach is able to learn strong representations and transfers well to downstream tasks of standard plane detection and eye-gaze prediction.

68. Weakly supervised cross-domain alignment with optimal transport
Siyang Yuan, Ke Bai, Liqun Chen, Yizhe Zhang, Chenyang Tao, Chunyuan Li, Guoyin Wang, Ricardo Henao, Lawrence Carin
Abstract: Cross-domain alignment between image objects and text sequences is key to many visual-language tasks, and it poses a fundamental challenge to both computer vision and natural language processing. This paper investigates a novel approach for the identification and optimization of fine-grained semantic similarities between image and text entities, under a weakly-supervised setup, improving performance over state-of-the-art solutions. Our method builds upon recent advances in optimal transport (OT) to resolve the cross-domain matching problem in a principled manner. Formulated as a drop-in regularizer, the proposed OT solution can be efficiently computed and used in combination with other existing approaches. We present empirical evidence to demonstrate the effectiveness of our approach, showing how it enables simpler model architectures to outperform or be comparable with more sophisticated designs on a range of vision-language tasks.

69. Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
Bin Duan, Hao Tang, Wei Wang, Ziliang Zong, Guowei Yang, Yan Yan
Abstract: The major challenge in the audio-visual event localization task lies in how to fuse information from multiple modalities effectively. Recent works have shown that attention mechanisms are beneficial to the fusion process. In this paper, we propose a novel joint attention mechanism with multimodal fusion methods for audio-visual event localization. In particular, we present a concise yet valid architecture that effectively learns representations from multiple modalities in a joint manner. Initially, visual features are combined with auditory features and then turned into joint representations. Next, we make use of the joint representations to attend to visual features and auditory features, respectively. With the help of this joint co-attention, new visual and auditory features are produced, and thus both sets of features can benefit from each other. It is worth noting that the joint co-attention unit is recursive, meaning that it can be performed multiple times to obtain progressively better joint representations. Extensive experiments on the public AVE dataset have shown that the proposed method achieves significantly better results than the state-of-the-art methods.

70. Sketch-Guided Object Localization in Natural Images
Aditay Tripathi, Rajath R Dani, Anand Mishra, Anirban Chakraborty
Abstract: We introduce the novel problem of localizing all the instances of an object (seen or unseen during training) in a natural image via sketch query. We refer to this problem as sketch-guided object localization. This problem is distinctively different from the traditional sketch-based image retrieval task where the gallery set often contains images with only one object. The sketch-guided object localization proves to be more challenging when we consider the following: (i) the sketches used as queries are abstract representations with little information on the shape and salient attributes of the object, (ii) the sketches have significant variability as they are hand-drawn by a diverse set of untrained human subjects, and (iii) there exists a domain gap between sketch queries and target natural images as these are sampled from very different data distributions. To address the problem of sketch-guided object localization, we propose a novel cross-modal attention scheme that guides the region proposal network (RPN) to generate object proposals relevant to the sketch query. These object proposals are later scored against the query to obtain final localization. Our method is effective with as little as a single sketch query. Moreover, it also generalizes well to object categories not seen during training and is effective in localizing multiple object instances present in the image. Furthermore, we extend our framework to a multi-query setting using novel feature fusion and attention fusion strategies introduced in this paper. The localization performance is evaluated on publicly available object detection benchmarks, viz. MS-COCO and PASCAL-VOC, with sketch queries obtained from 'Quick, Draw!'. The proposed method significantly outperforms related baselines on both single-query and multi-query localization tasks.

71. AntiDote: Attention-based Dynamic Optimization for Neural Network Runtime Efficiency
Fuxun Yu, Chenchen Liu, Di Wang, Yanzhi Wang, Xiang Chen
Abstract: Convolutional Neural Networks (CNNs) have achieved great cognitive performance at the expense of considerable computation load. To relieve the computation load, many optimization works have been developed to reduce model redundancy by identifying and removing insignificant model components, such as weight sparsity and filter pruning. However, these works only evaluate model components' static significance with internal parameter information, ignoring their dynamic interaction with external inputs. With per-input feature activation, the model component significance can dynamically change, and thus the static methods can only achieve sub-optimal results. Therefore, we propose a dynamic CNN optimization framework in this work. Based on the neural network attention mechanism, we propose a comprehensive dynamic optimization framework including (1) testing-phase channel and column feature map pruning, as well as (2) training-phase optimization by targeted dropout. Such a dynamic optimization framework has several benefits: (1) it can accurately identify and aggressively remove per-input feature redundancy by considering the model-input interaction; (2) it can maximally remove the feature map redundancy in various dimensions thanks to the multi-dimension flexibility; (3) the training-testing co-optimization favors the dynamic pruning and helps maintain the model accuracy even with a very high feature pruning ratio. Extensive experiments show that our method could bring 37.4% to 54.5% FLOPs reduction with negligible accuracy drop on various test networks.

72. MatryODShka: Real-time 6DoF Video View Synthesis using Multi-Sphere Images
Benjamin Attal, Selena Ling, Aaron Gokaslan, Christian Richardt, James Tompkin
Abstract: We introduce a method to convert stereo 360° (omnidirectional stereo) imagery into a layered, multi-sphere image representation for six degree-of-freedom (6DoF) rendering. Stereo 360° imagery can be captured from multi-camera systems for virtual reality (VR), but lacks motion parallax and correct-in-all-directions disparity cues. Together, these can quickly lead to VR sickness when viewing content. One solution is to try and generate a format suitable for 6DoF rendering, such as by estimating depth. However, this raises questions as to how to handle disoccluded regions in dynamic scenes. Our approach is to simultaneously learn depth and disocclusions via a multi-sphere image representation, which can be rendered with correct 6DoF disparity and motion parallax in VR. This significantly improves comfort for the viewer, and can be inferred and rendered in real time on modern GPU hardware. Together, these move towards making VR video a more comfortable immersive medium.

73. Learning Gradient Fields for Shape Generation
Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, Bharath Hariharan
Abstract: In this work, we propose a novel technique to generate shapes from point cloud data. A point cloud can be viewed as samples from a distribution of 3D points whose density is concentrated near the surface of the shape. Point cloud generation thus amounts to moving randomly sampled points to high-density areas. We generate point clouds by performing stochastic gradient ascent on an unnormalized probability density, thereby moving sampled points toward the high-likelihood regions. Our model directly predicts the gradient of the log density field and can be trained with a simple objective adapted from score-based generative models. We show that our method can reach state-of-the-art performance for point cloud auto-encoding and generation, while also allowing for extraction of a high-quality implicit surface. Code is available at this https URL.
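The idea of moving randomly sampled points toward high-density regions by following the gradient of a log density can be sketched with a plain Langevin-style update. This is a toy illustration under stated assumptions, not the authors' model: the learned gradient field is replaced by the analytic score of a 2D Gaussian, and the names `score` and `langevin_sample`, the target center, and the step size are all hypothetical choices made for this example.

```python
import math
import random

def score(point, center=(1.0, -2.0), sigma=0.1):
    """Gradient of log N(center, sigma^2 I) at `point`.

    Toy stand-in (assumption) for a learned gradient field.
    """
    return tuple((c - p) / sigma**2 for p, c in zip(point, center))

def langevin_sample(n_points=500, n_steps=200, step=1e-3, seed=0):
    """Move random 2D points toward high-density regions with noisy gradient ascent."""
    rng = random.Random(seed)
    # Start from points scattered uniformly, far from the target density.
    points = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(n_points)]
    for _ in range(n_steps):
        moved = []
        for p in points:
            g = score(p)
            # Half a step along the score plus Gaussian noise of variance `step`.
            moved.append(tuple(
                x + 0.5 * step * gx + math.sqrt(step) * rng.gauss(0.0, 1.0)
                for x, gx in zip(p, g)
            ))
        points = moved
    return points
```

After enough steps the points cluster around the chosen center, with a spread close to `sigma`; with a learned field in place of `score`, the same loop would pull points toward whatever regions that field marks as likely.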

74. Siloed Federated Learning for Multi-Centric Histopathology Datasets
Mathieu Andreux, Jean Ogier du Terrail, Constance Beguier, Eric W. Tramel
Abstract: While federated learning is a promising approach for training deep learning models over distributed sensitive datasets, it presents new challenges for machine learning, especially when applied in the medical domain where multi-centric data heterogeneity is common. Building on previous domain adaptation works, this paper proposes a novel federated learning approach for deep learning architectures via the introduction of local-statistic batch normalization (BN) layers, resulting in collaboratively-trained, yet center-specific models. This strategy improves robustness to data heterogeneity while also reducing the potential for information leaks by not sharing the center-specific layer activation statistics. We benchmark the proposed method on the classification of tumorous histopathology image patches extracted from the Camelyon16 and Camelyon17 datasets. We show that our approach compares favorably to previous state-of-the-art methods, especially for transfer learning across datasets.
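One way to picture "collaboratively-trained, yet center-specific models" is a federated averaging step that skips the batch normalization statistics. The sketch below is a guess at the general shape of such a step, not the paper's algorithm: the function `federated_average`, the flat dict-of-lists parameter layout, and the choice to treat only entries named `running_mean` and `running_var` as local are assumptions made for this example.

```python
def federated_average(center_states, local_keys=("running_mean", "running_var")):
    """Average parameters across centers while keeping BN statistics local.

    `center_states` is a list of dicts mapping parameter names to lists of floats
    (a hypothetical layout chosen for illustration).
    """
    # Parameters whose names end with a local key are never shared.
    shared_names = [
        name for name in center_states[0]
        if not name.endswith(local_keys)
    ]
    n = len(center_states)
    averaged = {
        name: [sum(vals) / n for vals in zip(*(s[name] for s in center_states))]
        for name in shared_names
    }
    # Each center receives the shared average but keeps its own statistics.
    return [{**state, **averaged} for state in center_states]
```

For example, two centers with different `bn.running_mean` values end up with identical `conv.weight` entries but unchanged statistics, so the activation statistics never leave the center that computed them.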

75. First U-Net Layers Contain More Domain Specific Information Than The Last Ones
Boris Shirokikh, Ivan Zakazov, Alexey Chernyavskiy, Irina Fedulova, Mikhail Belyaev
Abstract: The appearance of MRI scans depends significantly on scanning protocols and, consequently, on the data-collection institution. These variations between clinical sites result in dramatic drops of CNN segmentation quality on unseen domains. Many of the recently proposed MRI domain adaptation methods operate on the last CNN layers to suppress domain shift. At the same time, the core manifestation of MRI variability is a considerable diversity of image intensities. We hypothesize that these differences can be eliminated by modifying the first layers rather than the last ones. To validate this simple idea, we conducted a set of experiments with brain MRI scans from six domains. Our results demonstrate that 1) domain shift may deteriorate the quality even for a simple brain extraction segmentation task (surface Dice Score drops from 0.85-0.89 to as low as 0.09); 2) fine-tuning of the first layers significantly outperforms fine-tuning of the last layers in almost all supervised domain adaptation setups. Moreover, fine-tuning of the first layers is a better strategy than fine-tuning of the whole network, if the amount of annotated data from the new domain is strictly limited.

76. Bayesian deep learning: a new era for 'big data' geostatistics?
Charlie Kirkwood, Theo Economou
Abstract: For geospatial modelling and mapping tasks, variants of kriging - the spatial interpolation technique developed by South African mining engineer Danie Krige - have long been regarded as the established geostatistical methods. However, kriging and its variants (such as regression kriging, in which auxiliary variables or derivatives of these are included as covariates) are relatively restrictive models and lack capabilities that have been afforded to us in the last decade by deep neural networks. Principal among these is feature learning - the ability to learn filters to recognise task-specific patterns in gridded data such as images. Here we demonstrate the power of feature learning in a geostatistical context, by showing how deep neural networks can automatically learn the complex relationships between point-sampled target variables and gridded auxiliary variables (such as those provided by remote sensing), and in doing so produce detailed maps of chosen target variables. At the same time, in order to cater for the needs of decision makers who require well-calibrated probabilities, we obtain uncertainty estimates via a Bayesian approximation known as Monte Carlo dropout. In our example, we produce a national-scale probabilistic geochemical map from point-sampled assay data, with auxiliary information provided by a terrain elevation grid. Unlike traditional geostatistical approaches, auxiliary variable grids are fed into our deep neural network raw. There is no need to provide terrain derivatives (e.g. slope angles, roughness, etc.) because the deep neural network is capable of learning these and arbitrarily more complex derivatives as necessary to maximise predictive performance. We hope our results will raise awareness of the suitability of Bayesian deep learning - and its feature learning capabilities - for large-scale geostatistical applications where uncertainty matters.
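Monte Carlo dropout, the Bayesian approximation named in the abstract, amounts to leaving dropout switched on at prediction time and summarizing many stochastic forward passes. The sketch below shows that mechanic on a toy linear model. It is an illustration only: the model, the weights, and the names `stochastic_forward` and `mc_dropout_predict` are assumptions for this example and have nothing to do with the authors' network.

```python
import random
import statistics

def stochastic_forward(x, weights, rng, drop_prob=0.5):
    """One forward pass of a toy linear model with dropout left ON (assumed model)."""
    keep = 1.0 - drop_prob
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:          # this input unit survives dropout
            total += xi * wi / keep      # inverted-dropout rescaling
    return total

def mc_dropout_predict(x, weights, n_samples=1000, seed=0):
    """Return the mean and standard deviation over repeated stochastic passes."""
    rng = random.Random(seed)
    draws = [stochastic_forward(x, weights, rng) for _ in range(n_samples)]
    return statistics.mean(draws), statistics.stdev(draws)
```

The mean plays the role of the prediction and the standard deviation gives a rough uncertainty estimate; for a map, the same two summaries would be computed per grid cell.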

77. Facial Recognition: A cross-national Survey on Public Acceptance, Privacy, and Discrimination [PDF] 返回目录
Léa Steinacker, Miriam Meckel, Genia Kostka, Damian Borth
Abstract: With rapid advances in machine learning (ML), more of this technology is being deployed into the real world, interacting with us and our environment. One of the most widely applied applications of ML is facial recognition, which runs on millions of devices. While some people find it useful, others perceive it as a threat when it is used by public authorities. This discrepancy and the lack of policy increase the uncertainty in the ML community about the future direction of facial recognition research and development. In this paper we present results from a cross-national survey about public acceptance, privacy, and discrimination in connection with the use of facial recognition technology (FRT) in public. This study provides insights into opinions on FRT in China, Germany, the United Kingdom (UK), and the United States (US), which can serve as input for policy makers and legal regulators.
摘要：在机器学习（ML）的快速发展，更多的这种技术被部署到现实世界与我们和我们的环境进行交互。每ml的应用最为广泛应用的是面部识别，因为它是上百万的设备上运行。虽然是对某些人有用，公共机构使用，当别人认为它是一个威胁。这种差异和缺乏政策增加了ML社会对面部识别的研究和未来的发展方向的不确定性。本文从跨国调查有关公众接受，隐私和公众使用的面部识别技术（FRT）的歧视，我们现在的结果。这项研究提供了关于向FRT认为来自中国，德国，英国（UK）和美国（US），它可以作为政策制定者和监管者的法律见解输入。

78. MLBF-Net: A Multi-Lead-Branch Fusion Network for Multi-Class Arrhythmia Classification Using 12-Lead ECG [PDF] 返回目录
Jing Zhang, Deng Liang, Aiping Liu, Min Gao, Xiang Chen, Xu Zhang, Xun Chen
Abstract: Automatic arrhythmia detection using 12-lead electrocardiogram (ECG) signals plays a critical role in the early prevention and diagnosis of cardiovascular diseases. In previous studies on automatic arrhythmia detection, most methods concatenated the 12 ECG leads into a matrix and then fed the matrix to a variety of feature extractors or deep neural networks to extract useful information. Under such frameworks, these methods were able to extract comprehensive features (known as integrity) of the 12-lead ECG, since the information from the individual leads interacts during training. However, the diverse lead-specific features (known as diversity) among the 12 leads were neglected, resulting in inadequate information learning for 12-lead ECG. To maximize the information learned from multi-lead ECG, the fusion of comprehensive features (integrity) with lead-specific features (diversity) should be taken into account. In this paper, we propose a novel Multi-Lead-Branch Fusion Network (MLBF-Net) architecture for arrhythmia classification that integrates multi-loss optimization to jointly learn the diversity and integrity of multi-lead ECG. MLBF-Net is composed of three components: 1) multiple lead-specific branches for learning the diversity of multi-lead ECG; 2) cross-lead feature fusion, which concatenates the output feature maps of all branches to learn the integrity of multi-lead ECG; 3) multi-loss co-optimization for all the individual branches and the concatenated network. We demonstrate MLBF-Net on the China Physiological Signal Challenge 2018, an open 12-lead ECG dataset. The experimental results show that MLBF-Net obtains an average $F_1$ score of 0.855, reaching the highest arrhythmia classification performance. The proposed method provides a promising solution for multi-lead ECG analysis from an information fusion perspective.
摘要：使用12导联心电图（ECG）信号的自动心律失常检测起着早期预防和心血管疾病的诊断中起关键作用。在自动心律失常检测先前的研究中，大多数方法级联ECG的12个导联成一个矩阵，然后输入矩阵的各种特征提取器或提取有用信息深层神经网络。在这样的框架，这些方法必须提取的综合功能12导联心电图的，因为彼此每个引线相互作用的训练期间的信息的能力（称为完整性）。然而，12个导联中的各种领先的特定功能（称为多样性）被忽视，造成信息的学习不足12导联心电图。为了最大限度地提高信息学习的多导联心电图，用诚信和铅特异功能的综合信息融合与多样性特点，应考虑在内。在本文中，我们通过多损耗优化整合，共同学习的多导联心电图的多样性和完整性提出心律失常分类的新型多铅科融合网络（MLBF-网）架构。 MLBF-Net的是由三个部分组成：用于学习的多导联ECG的多样性1）多个引线特定分支; 2）跨引线通过连接用于学习的多导联ECG的完整性所有分支的输出特征地图特征融合; 3）多损失协同优化所有各个分支和级联网络。我们证明了中国生理信号挑战赛2018是一个开放的12导联心电图数据集我们MLBF-Net的。实验结果表明，MLBF-网获得的平均$ F_1 $得分0.855，达到历史最高心律不齐分类性能。所提出的方法提供了一种从信息融合透视多导联ECG分析有希望的解决方案。

79. Edge Network-Assisted Real-Time Object Detection Framework for Autonomous Driving [PDF] 返回目录
Seung Wook Kim, Keunsoo Ko, Haneul Ko, Victor C. M. Leung
Abstract: Autonomous vehicles (AVs) can achieve the desired results within a short duration by offloading even tasks that require high computational power (e.g., object detection (OD)) to edge clouds. However, even when edge clouds are exploited, real-time OD cannot always be guaranteed because of dynamic channel quality. To mitigate this problem, we propose an edge network-assisted real-time OD framework (EODF). In the EODF, AVs extract the regions of interest (RoIs) of the captured image when the channel quality is not sufficiently good to support real-time OD. Then, AVs compress the image data on the basis of the RoIs and transmit the compressed image to the edge cloud. In so doing, real-time OD can be achieved owing to the reduced transmission latency. To verify the feasibility of our framework, we evaluate the probability that the results of OD are not received within the inter-frame duration (i.e., the outage probability) and their accuracy. From the evaluation, we demonstrate that the proposed EODF provides the results to AVs in real time and achieves satisfactory accuracy.
摘要：自主车辆（AVS）可通过卸载甚至需要高的计算能力（例如，对象检测（OD））到边缘云任务实现很短的时间内所期望的结果。然而，尽管边缘云被利用，实时OD不能总是得到保证，由于动态信道质量。为了缓解这一问题，我们提出了一种边缘网络辅助实时OD框架〜（EODF）。在一个EODF，AVS提取的所捕获的图像的利益〜（投资回报）的区域时，信道质量是不用于支持实时OD足够好的。然后，AVS压缩ROI的基础上对图像数据和发送被压缩的一个到边缘云。这样，实时OD可由于降低了传输时延来实现。为了验证我们的框架的可行性，我们评估认为OD的结果没有帧间时间（即中断概率）和它们的准确度内接收的概率。从评测中，我们表明，该EODF将结果提供实时AVS和达到满意的精度。
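
The outage probability defined in the abstract above (the chance that a detection result is not received within the inter-frame duration) can be estimated empirically from latency samples. The sketch below is only an illustration: the latency model (transmission time plus a fixed edge compute time), the image sizes, the channel rates, and the function names are assumptions, not values or methods from the paper.

```python
def end_to_end_latency(image_bits, rate_bps, edge_compute_s):
    """Assumed latency model: upload time plus a fixed edge compute time."""
    return image_bits / rate_bps + edge_compute_s


def outage_probability(latencies_s, frame_interval_s):
    """Fraction of frames whose result arrives after the inter-frame duration."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    late = sum(1 for t in latencies_s if t > frame_interval_s)
    return late / len(latencies_s)


# Hypothetical numbers: 30 fps camera, four channel states, 10 ms edge compute.
FRAME_INTERVAL_S = 1.0 / 30.0
RATES_BPS = [10e6, 20e6, 50e6, 100e6]
RAW_BITS = 2.0e6          # uncompressed frame (assumed)
ROI_BITS = 0.4e6          # RoI-based compressed frame (assumed)

raw = [end_to_end_latency(RAW_BITS, r, 0.010) for r in RATES_BPS]
roi = [end_to_end_latency(ROI_BITS, r, 0.010) for r in RATES_BPS]

print("outage, raw frames:", outage_probability(raw, FRAME_INTERVAL_S))
print("outage, RoI-compressed frames:", outage_probability(roi, FRAME_INTERVAL_S))
```

Under these made-up numbers, shrinking the payload lowers the fraction of late frames, which is the qualitative effect the abstract attributes to RoI-based compression.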

80. Towards Cardiac Intervention Assistance: Hardware-aware Neural Architecture Exploration for Real-Time 3D Cardiac Cine MRI Segmentation [PDF] 返回目录
Dewen Zeng, Weiwen Jiang, Tianchen Wang, Xiaowei Xu, Haiyun Yuan, Meiping Huang, Jian Zhuang, Jingtong Hu, Yiyu Shi
Abstract: Real-time cardiac magnetic resonance imaging (MRI) plays an increasingly important role in guiding various cardiac interventions. In order to provide better visual assistance, the cine MRI frames need to be segmented on-the-fly to avoid noticeable visual lag. In addition, considering reliability and patient data privacy, the computation is preferably done on local hardware. State-of-the-art MRI segmentation methods mostly focus on accuracy only, and can hardly be adopted for real-time application or on local hardware. In this work, we present the first hardware-aware multi-scale neural architecture search (NAS) framework for real-time 3D cardiac cine MRI segmentation. The proposed framework incorporates a latency regularization term into the loss function to handle real-time constraints, with the consideration of underlying hardware. In addition, the formulation is fully differentiable with respect to the architecture parameters, so that stochastic gradient descent (SGD) can be used for optimization to reduce the computation cost while maintaining optimization quality. Experimental results on ACDC MICCAI 2017 dataset demonstrate that our hardware-aware multi-scale NAS framework can reduce the latency by up to 3.5 times and satisfy the real-time constraints, while still achieving competitive segmentation accuracy, compared with the state-of-the-art NAS segmentation framework.
摘要：实时心脏磁共振成像（MRI）在指导各种心脏介入越来越重要的作用。为了提供更好的视觉协助下，电影MRI框架需要在即时被分割，以避免醒目的视觉滞后。此外，考虑到可靠性和患者数据隐私，计算优选在本地硬件完成的。国家的最先进的MRI分割方法大多只注重准确性，也很难实时的应用程序或在本地硬件采用。在这项工作中，我们提出了第一个硬件识别多尺度的神经结构搜索（NAS）的实时3D心脏电影MRI分割的框架。拟议的框架集成了延时调整项进入损失函数来处理实时约束，在考虑底层硬件。此外，所述制剂是相对于所述结构参数完全微分，以便随机梯度下降（SGD）可用于优化降低计算成本，同时保持优化质量。在ACDC MICCAI 2017年数据集的实验结果表明，我们的硬件识别多尺度NAS架构可以最多延迟降低到3.5倍，满足实时约束，同时还实现有竞争力的分割精度，比国家的最-ART NAS分割框架。
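
One common way to make a latency penalty differentiable with respect to architecture parameters is to take the softmax-weighted expected latency over candidate operations and add it to the task loss. The sketch below shows that idea in plain Python; it is an assumption about the general form only, since the abstract does not spell out the paper's exact regularization term. The names `expected_latency` and `regularized_loss`, the per-operation latencies, and the weight `lam` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(arch_logits, op_latencies_ms):
    """Softmax-weighted latency of the candidate operations (smooth in the logits)."""
    probs = softmax(arch_logits)
    return sum(p * t for p, t in zip(probs, op_latencies_ms))


def regularized_loss(seg_loss, arch_logits, op_latencies_ms, lam=0.01):
    """Task loss plus a latency penalty weighted by lam (assumed form)."""
    return seg_loss + lam * expected_latency(arch_logits, op_latencies_ms)


# Hypothetical measured latencies (ms) for three candidate operations.
LATENCIES_MS = [2.0, 5.0, 11.0]

print(regularized_loss(0.30, [0.0, 0.0, 0.0], LATENCIES_MS))   # uniform choice
print(regularized_loss(0.30, [4.0, 0.0, 0.0], LATENCIES_MS))   # favours the fast op
```

Because the penalty is a smooth function of the architecture logits, a gradient-based optimizer can trade segmentation loss against latency, which matches the abstract's remark that SGD can be used for the search.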

81. Training CNN Classifiers for Semantic Segmentation using Partially Annotated Images: with Application on Human Thigh and Calf MRI [PDF] 返回目录
Chun Kit Wong, Stephanie Marchesseau, Maria Kalimeri, Tiang Siew Yap, Serena S. H. Teo, Lingaraj Krishna, Alfredo Franco-Obregón, Stacey K. H. Tay, Chin Meng Khoo, Philip T. H. Lee, Melvin K. S. Leow, John J. Totman, Mary C. Stephenson
Abstract: Objective: Medical image datasets with pixel-level labels tend to have a limited number of organ or tissue label classes annotated, even when the images have wide anatomical coverage. With supervised learning, multiple classifiers are usually needed given these partially annotated datasets. In this work, we propose a set of strategies to train one single classifier in segmenting all label classes that are heterogeneously annotated across multiple datasets without moving into semi-supervised learning. Methods: Masks were first created from each label image through a process we termed presence masking. Three presence masking modes were evaluated, differing mainly in weightage assigned to the annotated and unannotated classes. These masks were then applied to the loss function during training to remove the influence of unannotated classes. Results: Evaluation against publicly available CT datasets shows that presence masking is a viable method for training class-generic classifiers. Our class-generic classifier can perform as well as multiple class-specific classifiers combined, while the training duration is similar to that required for one class-specific classifier. Furthermore, the class-generic classifier can outperform the class-specific classifiers when trained on smaller datasets. Finally, consistent results are observed from evaluations against human thigh and calf MRI datasets collected in-house. Conclusion: The evaluation outcomes show that presence masking is capable of significantly improving both training and inference efficiency across imaging modalities and anatomical regions. Improved performance may even be observed on small datasets. Significance: Presence masking strategies can reduce the computational resources and costs involved in manual medical image annotations. All codes are publicly available at this https URL.
摘要：目的：医学影像与像素级标签的数据集往往有注释的器官或组织的标签类别的数量有限，即使图像有广泛的解剖覆盖。随着监督学习，多分类，通常需要考虑到这些部分标注的数据集。在这项工作中，我们提出了一套策略来训练单一分类中分割跨多个数据集多相注释不动，成半监督学习的所有标签类。方法：面罩首先从每个标签图像通过我们称为存在掩蔽的工艺创建。三种存在掩蔽模式进行评价，主要不同在分配给该注释和未注释类的权重。然后，这些面具被训练，除去未注释阶级的影响过程中施加于损失函数。结果：评估对可公开获得的CT数据显示，存在掩蔽了培训班，通用分类器可行的方法。我们班，通用分类器执行，以及多类特定分类相结合，而训练时间是类似于对一类特定的分类要求。此外，在较小的数据集的培训上课的时候，通用的分类可以跑赢类特定的分类。最后，一致的结果是从对人大腿和小腿MRI数据集的评估内部收集观察。结论：评价结果表明，存在掩蔽能够显著改善整个成像模态和解剖区域训练和推理效率。改进的性能甚至可以在小样本中观察到。意义：存在屏蔽策略可以减少计算资源和参与手动医学图像注释成本。所有的代码是公开的，在此HTTPS URL。
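
The core of presence masking, as the abstract above describes it, is that classes not annotated in a given dataset should not contribute to the loss. The sketch below shows the simplest version of that idea: a per-class binary cross-entropy in which unannotated classes are skipped. The abstract mentions three masking modes with different weightings; this binary mask, the function name `presence_masked_loss`, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import math


def presence_masked_loss(probs, labels, annotated, eps=1e-7):
    """Mean binary cross-entropy over annotated classes only.

    probs[p][c]  : predicted probability of class c at pixel p
    labels[p][c] : 0/1 target for class c at pixel p
    annotated[c] : True if class c is annotated in the source dataset
    """
    total, count = 0.0, 0
    for p_row, y_row in zip(probs, labels):
        for c, (p, y) in enumerate(zip(p_row, y_row)):
            if not annotated[c]:
                continue  # unannotated class: no influence on the loss
            p = min(max(p, eps), 1.0 - eps)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count if count else 0.0


# Two pixels, three classes; the third class is not annotated in this dataset.
labels = [[1, 0, 0], [0, 1, 0]]
annotated = [True, True, False]
probs_a = [[0.9, 0.2, 0.50], [0.1, 0.7, 0.50]]
probs_b = [[0.9, 0.2, 0.99], [0.1, 0.7, 0.01]]  # differs only in the masked class

print(presence_masked_loss(probs_a, labels, annotated))
print(presence_masked_loss(probs_b, labels, annotated))  # same value
```

The two printed losses match because the predictions differ only in the masked class, so a missing annotation never penalises the classifier.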

82. Spontaneous preterm birth prediction using convolutional neural networks [PDF] 返回目录
Tomasz Włodarczyk, Szymon Płotka, Przemysław Rokita, Nicole Sochacki-Wójcicka, Jakub Wójcicki, Michał Lipa, Tomasz Trzciński
Abstract: An estimated 15 million babies are born too early every year. Approximately 1 million children die each year due to complications of preterm birth (PTB). Many survivors face a lifetime of disability, including learning disabilities and visual and hearing problems. Although manual analysis of ultrasound (US) images is still prevalent, it is prone to errors due to its subjective component and complex variations in the shape and position of organs across patients. In this work, we introduce a conceptually simple convolutional neural network (CNN) trained to segment prenatal ultrasound images and to perform classification for the purpose of preterm birth detection. Our method efficiently segments different types of cervixes in transvaginal ultrasound images while simultaneously predicting a preterm birth based on extracted image features without human oversight. We employed three popular network models for the cervix segmentation task: U-Net, Fully Convolutional Network, and Deeplabv3. Based on the results obtained and model efficiency, we decided to extend U-Net by adding a parallel branch for the classification task. The proposed model is trained and evaluated on a dataset consisting of 354 2D transvaginal ultrasound images and achieves a segmentation accuracy with a mean Jaccard coefficient index of 0.923 $\pm$ 0.081 and a classification sensitivity of 0.677 $\pm$ 0.042 with a 3.49% false positive rate. Our method obtained better results in the prediction of preterm birth based on transvaginal ultrasound images compared to state-of-the-art methods.
摘要：据估计，每年有太早1500万个婴儿出生。大约有100万儿童每年死于早产（PTB）的并发症。许多幸存者面临终身残疾，包括学习障碍和视觉和听觉上的问题。尽管超声图像（US）的人工分析仍然是普遍的，这是容易出错，由于其主观成分和在横跨患者器官的形状和位置复杂的变化。在这项工作中，我们介绍了训练分割产前超声图像和早产检测的目的进行分类任务概念简单卷积神经网络（CNN）。我们的方法有效地在经阴道超声图像段不同类型的子宫颈的同时基于无需人工监督提取的图像特征预测早产。我们采用三种流行的网络模型：U-Net的，完全卷积网络，并为Deeplabv3宫颈分割任务。基于传导的结果和模型的效率，我们决定通过增加并联支路进行分类的任务延长掌中。该模型被训练并在由354个2D阴道超声图像的数据集进行评估，并取得了分割精度与0.923 $ \下午$ 0.081的平均的Jaccard系数索引和0.677 $ \下午$ 0.042的用3.49分类灵敏度\ ％的假阳性率。我们的方法相比，国家的最先进的方法，早产基于经阴道超声图像中的预测获得更好的结果。
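
For readers unfamiliar with the metrics quoted in the abstract above, the sketch below computes a Jaccard index for binary masks and the sensitivity and false positive rate for binary labels. The definitions are the standard ones; the function names and the toy inputs are illustrative and are not taken from the paper.

```python
def jaccard_index(pred, target):
    """Intersection over union of two flat binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0  # two empty masks agree perfectly


def sensitivity_and_fpr(pred_labels, true_labels):
    """Return (sensitivity, false positive rate) for binary labels."""
    tp = sum(1 for p, t in zip(pred_labels, true_labels) if p and t)
    fn = sum(1 for p, t in zip(pred_labels, true_labels) if not p and t)
    fp = sum(1 for p, t in zip(pred_labels, true_labels) if p and not t)
    tn = sum(1 for p, t in zip(pred_labels, true_labels) if not p and not t)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return sensitivity, fpr


# Toy flat mask / label vectors.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]

print(jaccard_index(pred, target))
print(sensitivity_and_fpr(pred, target))
```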

83. RevPHiSeg: A Memory-Efficient Neural Network for Uncertainty Quantification in Medical Image Segmentation [PDF] 返回目录
Marc Gantenbein, Ertunc Erdil, Ender Konukoglu
Abstract: Quantifying segmentation uncertainty has become an important issue in medical image analysis due to the inherent ambiguity of anatomical structures and their pathologies. Recently, neural network-based uncertainty quantification methods have been successfully applied to various problems. One of the main limitations of the existing techniques is the high memory requirement during training, which limits their application to processing smaller fields of view (FOVs) and/or using shallower architectures. In this paper, we investigate the effect of using reversible blocks for building memory-efficient neural network architectures for quantification of segmentation uncertainty. The reversible architecture saves memory by exactly computing the activations from the outputs of the subsequent layers during backpropagation instead of storing the activations for each layer. We incorporate the reversible blocks into a recently proposed architecture called PHiSeg, which was developed for uncertainty quantification in medical image segmentation. The reversible architecture, RevPHiSeg, allows training neural networks for quantifying segmentation uncertainty on GPUs with limited memory and processing larger FOVs. We perform experiments on the LIDC-IDRI dataset and an in-house prostate dataset, and present comparisons with PHiSeg. The results demonstrate that RevPHiSeg consumes ~30% less memory compared to PHiSeg while achieving very similar segmentation accuracy.
摘要：量化分割的不确定性已经成为医学图像分析中的重要问题，由于解剖结构及病变的内在不确定性。近年来，基于神经网络的不确定性量化方法已成功地应用于各种问题。一个现有技术的主要限制是在训练期间高内存要求;这限制了它们的应用程序，以处理场的视图的情况下（视场）和/或使用较浅的架构。在本文中，我们研究了使用可逆块构建记忆效神经网络结构进行分割的不确定性量化的效果。可逆架构通过反向传播期间恰好计算从后续层的输出激活，而不是存储每个层中的激活实现存储器节省。我们结合可逆块到最近提出的架构，称为PHiSeg是在医学图像分割的不确定性量化发展。可逆架构，RevPHiSeg，允许训练神经网络用于在GPU上定量分割的不确定性具有有限存储器和处理更大的视场。我们在LIDC-IDRI数据集和内部前列腺数据集，并与PHiSeg目前比较进行实验。结果表明，RevPHiSeg消耗〜30相比更少％内存PHiSeg同时实现非常相似的分割精度。
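
The memory saving described in the abstract above rests on blocks whose inputs can be recomputed exactly from their outputs. The sketch below uses the additive coupling familiar from reversible residual networks as an assumed example of such a block; the abstract does not state which coupling RevPHiSeg uses, and the sub-functions `f` and `g` here are arbitrary stand-ins for learned layers.

```python
import math


def f(v):
    """Stand-in for a learned sub-network (hypothetical)."""
    return [math.tanh(0.5 * a) for a in v]


def g(v):
    """Stand-in for a second learned sub-network (hypothetical)."""
    return [0.3 * a * a for a in v]


def rev_forward(x1, x2):
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2):
    """Recover the inputs from the outputs, so they need not be stored."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


x1, x2 = [0.2, -1.0, 3.5], [1.5, 0.0, -0.4]
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
print(max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2)))  # reconstruction error
```

During backpropagation, a framework built on such blocks can call the inverse to regenerate activations layer by layer instead of keeping them all in memory, trading some extra computation for a smaller footprint.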

84. Deep Learning Predicts Cardiovascular Disease Risks from Lung Cancer Screening Low Dose Computed Tomography [PDF] 返回目录
Hanqing Chao, Hongming Shan, Fatemeh Homayounieh, Ramandeep Singh, Ruhani Doda Khera, Hengtao Guo, Timothy Su, Ge Wang, Mannudeep K. Kalra, Pingkun Yan
Abstract: The high risk population of cardiovascular disease (CVD) is simultaneously at high risk of lung cancer. Given the dominance of low dose computed tomography (LDCT) for lung cancer screening, the feasibility of extracting information on CVD from the same LDCT scan would add major value to patients at no additional radiation dose. However, with strong noise in LDCT images and without electrocardiogram (ECG) gating, CVD risk analysis from LDCT is highly challenging. Here we present an innovative deep learning model to address this challenge. Our deep model was trained with 30,286 LDCT volumes and achieved the state-of-the-art performance (area under the curve (AUC) of 0.869) on 2,085 National Lung Cancer Screening Trial (NLST) subjects, and effectively identified patients with high CVD mortality risks (AUC of 0.768). Our deep model was further calibrated against the clinical gold standard CVD risk scores from ECG-gated dedicated cardiac CT, including coronary artery calcification (CAC) score, CAD-RADS score and MESA 10-year CHD risk score from an independent dataset of 106 subjects. In this validation study, our model achieved AUC of 0.942, 0.809 and 0.817 for CAC, CAD-RADS and MESA scores, respectively. Our deep learning model has the potential to convert LDCT for lung cancer screening into dual-screening quantitative tool for CVD risk estimation.
摘要：心血管疾病（CVD）的高危人群同时是肺癌的高危人群。鉴于肺癌筛查低剂量计算机断层扫描（LDCT）的主导地位，从同一LDCT扫描上CVD提取信息的可行性将在不增加额外的辐射剂量大值添加到患者。然而，随着LDCT图像和没有心电图（ECG）门控强噪声，从LDCT CVD风险分析是高度挑战。在这里，我们提出了一个创新的深度学习模式来应对这一挑战。我们深厚的模型与30286个LDCT运动量训练和2085全国肺癌（曲线的0.869（AUC）下面积）达到国家的最先进的性能筛查试验（NLST）科目，并有效地识别患者的高CVD死亡风险（0.768的AUC）。我们的深模型对来自ECG门控专用心脏CT临床黄金标准CVD风险分数，包括冠状动脉钙化（CAC）得分，CAD-RADS评分和MESA 10年冠心病风险评分从106名受试者的独立数据集进一步校准。在此验证研究中，我们的模型分别达到AUC 0.942，0.809和0.817的CAC，CAD-RADS和MESA分数。我们深厚的学习模式必须转换LDCT肺癌筛查成心血管疾病的风险估计双定量筛选工具的潜力。

85. Prediction of Homicides in Urban Centers: A Machine Learning Approach [PDF] 返回目录
José Ribeiro, Lair Meneses, Denis Costa, Wando Miranda, Ronnie Alves
Abstract: Research aimed at developing computational models capable of predicting the occurrence of crimes, analyzing the contexts of crimes, extracting profiles of individuals linked to crimes, and analyzing crimes over time has been gaining prominence in the computing community. This is due to the social impact and the complex origin of the data, which make the problem an interesting computational challenge. This research presents a computational model for the prediction of homicides, based on tabular data on crimes registered in the city of Belém, Pará, Brazil. Statistical tests were performed with 8 different classification methods; Random Forest, Logistic Regression, and Neural Network presented the best results, with AUC ~ 0.8. These results are considered a baseline for the proposed problem.
摘要：相关研究已经在计算社区，旨在开发能够预测犯罪的发生，分析犯罪的背景下，提取链接到罪行的个人的配置文件，并根据时间分析犯罪的计算模型已经脱颖而出。这是由于社会影响，也是数据的复杂的起源，从而展示自己作为一个有趣的计算挑战。这项研究礼物杀人罪的预测，计算模型基于城市贝伦的注册罪的表格数据 - 帕拉州，巴西。统计测试使用8种不同的分类方法进行，无论是随机森林，Logistic回归和神经网络呈现最佳效果，AUC〜0.8。结果视为所提出的问题的基准。

86. Automated Detection of Congenital Heart Disease in Fetal Ultrasound Screening [PDF] 返回目录
Jeremy Tan, Anselm Au, Qingjie Meng, Sandy FinesilverSmith, John Simpson, Daniel Rueckert, Reza Razavi, Thomas Day, David Lloyd, Bernhard Kainz
Abstract: Prenatal screening with ultrasound can lower neonatal mortality significantly for selected cardiac abnormalities. However, the need for human expertise, coupled with the high volume of screening cases, limits the practically achievable detection rates. In this paper we discuss the potential for deep learning techniques to aid in the detection of congenital heart disease (CHD) in fetal ultrasound. We propose a pipeline for automated data curation and classification. During both training and inference, we exploit an auxiliary view classification task to bias features toward relevant cardiac structures. This bias helps to improve F1-scores from 0.72 and 0.77 to 0.87 and 0.85 for healthy and CHD classes, respectively.
摘要：产前超声筛查可以显著降低新生儿MOR-tality选定心脏异常。然而，人类needfor专门知识，再加上高容量筛选的情况下，限制了实际可实现检测率。在本文中，我们对深学习技术discussthe潜力，在检测胎儿超声CON-生殖器心脏疾病（CHD）的帮助。我们提出了一个pipelinefor自动数据策展和分类。在这两次训练andinference，我们利用辅助视图分类任务偏置featurestoward相关的心脏结构。这种倾向有助于F1-分数提高0.72和0.77〜0.87和0.85的健康和冠心病classesrespectively。

87. Wavelet Denoising and Attention-based RNN-ARIMA Model to Predict Forex Price [PDF] 返回目录
Zhiwen Zeng, Matloob Khushi
Abstract: Every change of trend in the forex market presents a great opportunity as well as a risk for investors. Accurate forecasting of forex prices is a crucial element in any effective hedging or speculation strategy. However, the complex nature of the forex market makes the prediction problem challenging, which has prompted extensive research from various academic disciplines. In this paper, a novel approach that integrates wavelet denoising, an Attention-based Recurrent Neural Network (ARNN), and an Autoregressive Integrated Moving Average (ARIMA) model is proposed. The wavelet transform removes noise from the time series to stabilize the data structure. The ARNN model captures the robust and non-linear relationships in the sequence, and ARIMA fits the linear correlation of the sequential information well. By hybridizing the three models, the methodology is capable of modelling dynamic systems such as the forex market. In our experiments on USD/JPY five-minute data, the hybrid approach outperforms the baseline methods. The root-mean-squared error (RMSE) of the hybrid approach was 1.65, with a directional accuracy of ~76%.
摘要：在外汇市场呈现出趋势的变化每一个很好的机会，以及投资者的风险。外汇价格的准确预测是任何有效的套期保值或投机策略的一个关键要素。然而，外汇市场的复杂性，使得预测问题的挑战，这促使来自不同学科广泛的研究。在本文中，一种新的方法，它集成了小波去噪，基于注意机制的回归神经网络（ARNN），和ARIMA模型（ARIMA）提出。小波变换去除噪声从时间序列，以稳定的数据结构。 ARNN模型捕获序列中的健壮和非线性关系和ARIMA能很好地适合的顺序信息的线性相关性。通过这三种模式的杂交，该方法能够模拟动态系统，如外汇市场。我们对美元/日元五分钟的实验数据优于基准方法。混合方法的根均方误差（RMSE）被发现是1.65具有〜76％的定向精度。

88. Automated Detection of Cortical Lesions in Multiple Sclerosis Patients with 7T MRI [PDF] 返回目录
Francesco La Rosa, Erin S Beck, Ahmed Abdulkadir, Jean-Philippe Thiran, Daniel S Reich, Pascal Sati, Meritxell Bach Cuadra
Abstract: The automated detection of cortical lesions (CLs) in patients with multiple sclerosis (MS) is a challenging task that, despite its clinical relevance, has received very little attention. Accurate detection of the small and scarce lesions requires specialized sequences and high or ultra-high field MRI. For supervised training based on multimodal structural MRI at 7T, two experts generated ground truth segmentation masks of 60 patients with a total of 2014 CLs. We implemented a simplified 3D U-Net with three resolution levels (3D U-Net-). By increasing the complexity of the task (adding brain tissue segmentation), while randomly dropping input channels during training, we improved the performance compared to the baseline. Considering a minimum lesion size of 0.75 $\mu$L, we achieved a lesion-wise cortical lesion detection rate of 67% and a false positive rate of 42%. However, 393 (24%) of the lesions reported as false positives were confirmed post hoc as potential or definite lesions by an expert. This indicates the potential of the proposed method to support experts in the tedious process of manual CL segmentation.
摘要：皮质病变（CLS）患者的多发性硬化症（MS）的自动化检测是一项艰巨的任务，尽管它的临床意义，已收到很少关注。小而稀少的病变的精确检测需要专门的序列和高或超高场MRI。对于基于多模态结构MRI在7T指导训练，产生了两位专家的60例患者的实测分割掩码2014年的CLS。我们实现了一个简单的3D掌中有三个分辨率级（3D U型NET-）。通过增加任务（增加脑组织分割）的复杂性，同时在训练中随机丢弃输入通道，我们比较基准提高了性能。考虑到0.75 {\亩} L的最小损伤尺寸，我们实现了67％的病变明智皮质病变检测率和42％的假阳性率。然而，393（24％）病变的报告为误报是事后确认为潜在的或明确病变的专家。这表明所提出的方法，以支持专家CL手动分割的繁琐过程的潜力。

89. Model Patching: Closing the Subgroup Performance Gap with Data Augmentation [PDF] 返回目录
Karan Goel, Albert Gu, Yixuan Li, Christopher Ré
Abstract: Classifiers in machine learning are often brittle when deployed. Particularly concerning are models with inconsistent performance on specific subgroups of a class, e.g., exhibiting disparities in skin cancer classification in the presence or absence of a spurious bandage. To mitigate these performance differences, we introduce model patching, a two-stage framework for improving robustness that encourages the model to be invariant to subgroup differences, and focus on class information shared by subgroups. Model patching first models subgroup features within a class and learns semantic transformations between them, and then trains a classifier with data augmentations that deliberately manipulate subgroup features. We instantiate model patching with CAMEL, which (1) uses a CycleGAN to learn the intra-class, inter-subgroup augmentations, and (2) balances subgroup performance using a theoretically-motivated subgroup consistency regularizer, accompanied by a new robust objective. We demonstrate CAMEL's effectiveness on 3 benchmark datasets, with reductions in robust error of up to 33% relative to the best baseline. Lastly, CAMEL successfully patches a model that fails due to spurious features on a real-world skin cancer dataset.
摘要：在部署时量词使用中的机器学习往往易碎。特别是关于是模型与上一类的特定亚类，例如性能不一致，表现出寄生绷带的存在或不存在于皮肤癌分类的差距。为了减轻这些性能上的差异，我们引入模型补丁，以提高坚固性，鼓励该模型是不变的亚组差异，以及专注于通过分组共享类信息的两级架构。型号修补首款车型分组功能类中，得知他们之间的语义转换，然后训练数据扩充故意操纵分组功能分类。我们实例化CAMEL，其中（1）使用CycleGAN学习类内，子群间增扩，以及（2）余额亚组性能使用理论上动机亚组一致性正则化，伴随着一个新的强大的目标模型补丁。我们证明骆驼的3个标准数据集的有效性，在长达相对最好的基线33％的健壮的错误减少。最后，CAMEL成功修补由于对现实世界的皮肤癌的数据集杂散特性失败的典范。

90. Single image dehazing for a variety of haze scenarios using back projected pyramid network [PDF] 返回目录
Ayush Singh, Ajay Bhave, Dilip K. Prasad
Abstract: Learning to dehaze single hazy images, especially using a small training dataset, is quite challenging. We propose a novel generative adversarial network architecture for this problem, namely the back projected pyramid network (BPPNet), which gives good performance for a variety of challenging haze conditions, including dense haze and inhomogeneous haze. Our architecture incorporates learning of multiple levels of complexity while retaining spatial context through iterative blocks of UNets, and structural information at multiple scales through a novel pyramidal convolution block. Together, these blocks form the generator and are amenable to learning through back projection. We have shown that our network can be trained without over-fitting using as few as 20 image pairs of hazy and non-hazy images. We report state-of-the-art performance on the NTIRE 2018 homogeneous haze datasets for indoor and outdoor images, the NTIRE 2019 denseHaze dataset, and the NTIRE 2020 non-homogeneous haze dataset.
摘要：学习dehaze单朦胧的影像，特别是在使用小训练数据集是相当具有挑战性的。我们提出了一个新颖的生成对抗性的网络架构对于这个问题，即反向投影金字塔网络（BPPNet），给出了各种挑战霾天气，包括密集的阴霾和雾不均匀性能好。我们的架构结合了复杂的多层次的学习，同时通过一种新颖的锥体卷积块保持通过UNets和多尺度的结构信息的迭代块空间上下文。这些块一起为发电机和经得起通过背投学习。我们已经表明，我们的网络可以在不使用朦胧和非朦胧的图像只要不到20图像对过度拟合训练。我们报告对NTIRE 2018均匀混浊的数据集用于室内和室外的影像，NTIRE 2019 denseHaze数据集，NTIRE 2020非均质阴霾数据集文艺演出的状态。

91. Evolving Deep Convolutional Neural Networks for Hyperspectral Image Denoising [PDF] 返回目录
Yuqiao Liu, Yanan Sun, Bing Xue, Mengjie Zhang
Abstract: Hyperspectral images (HSIs) are susceptible to various noise factors leading to the loss of information, and the noise restricts the subsequent HSIs object detection and classification tasks. In recent years, learning-based methods have demonstrated their superior strengths in denoising the HSIs. Unfortunately, most of the methods are manually designed based on the extensive expertise that is not necessarily available to the users interested. In this paper, we propose a novel algorithm to automatically build an optimal Convolutional Neural Network (CNN) to effectively denoise HSIs. Particularly, the proposed algorithm focuses on the architectures and the initialization of the connection weights of the CNN. The experiments of the proposed algorithm have been well-designed and compared against the state-of-the-art peer competitors, and the experimental results demonstrate the competitive performance of the proposed algorithm in terms of the different evaluation metrics, visual assessments, and the computational complexity.
摘要：高光谱图像（HSIS）易受导致的信息丢失各种噪声因素，和噪声限制后续HSIS目标检测和分类任务。近年来，基于学习的方法已经证明了它们在去噪HSIS优越的优势。不幸的是，大多数的方法基础上，丰富的专业知识不一定是提供给有兴趣的用户手动设计。在本文中，我们提出了一种新的算法来自动生成最佳的卷积神经网络（CNN），有效降噪HSIS。特别是，该算法侧重于结构和CNN的连接权值的初始化。所提出的算法的实验已经精心设计和对状态的最先进的对等竞争对手相比，实验结果证明所提出的算法在不同的评价指标，视觉评估方面的竞争性能，并且计算复杂度。

92. Dehaze-GLCGAN: Unpaired Single Image De-hazing via Adversarial Training [PDF] 返回目录
Zahra Anvari, Vassilis Athitsos
Abstract: Single image de-hazing is a challenging problem, and it is far from solved. Most current solutions require paired image datasets that include both hazy images and their corresponding haze-free ground-truth images. However, in reality, lighting conditions and other factors can produce a range of haze-free images that can serve as ground truth for a hazy image, and a single ground truth image cannot capture that range. This limits the scalability and practicality of paired image datasets in real-world applications. In this paper, we focus on unpaired single image de-hazing, and we do not rely on a ground truth image or a physical scattering model. We reduce the image de-hazing problem to an image-to-image translation problem and propose a dehazing Global-Local Cycle-consistent Generative Adversarial Network (Dehaze-GLCGAN). The generator network of Dehaze-GLCGAN combines an encoder-decoder architecture with residual blocks to better recover the haze-free scene. We also employ a global-local discriminator structure to deal with spatially varying haze. Through an ablation study, we demonstrate the effectiveness of different factors in the performance of the proposed network. Our extensive experiments on three benchmark datasets show that our network outperforms previous work in terms of PSNR and SSIM while being trained on a smaller amount of data than other methods.
摘要：单图像去欺侮是一个具有挑战性的问题，它是远远解决。目前大多数解决方案需要对图像数据集，其中包括既朦胧的图像和其相应的无混浊的地面实况图像。然而，在现实中，照明条件和其它因素可产生一系列无混浊的图像，可以作为基础事实为朦胧的图像的，并且单个地面实况图像不能捕捉该范围。这限制了可扩展性和对图像数据集的实用性在现实世界的应用。在本文中，我们专注于单未成图像去欺侮，我们不依赖地面实况图像或物理散射模型上。我们减少图像除雾问题的图像 - 图像转换问题，并提出了一个除雾全局 - 局部周期一致剖成对抗性网络（Dehaze-GLCGAN）。 Dehaze-GLCGAN的发电机网络结合与残余块的编码器 - 解码器的体系结构，以便更好地回收雾度自由场景。我们还聘请了全球和当地鉴别结构，以应对空间变化的阴霾。通过消融研究中，我们表现出不同的因素在提出的网络性能的有效性。我们在三个标准数据集大量的实验表明，我们的网络性能优于PSNR和SSIM方面的前期工作，而在与其它方法相比更小的数据量正在接受培训。

Note: The Chinese text is machine-translated. The cover image is a word cloud of the paper titles.

[arXiv papers] Computer Vision and Pattern Recognition 2020-08-18
