Contents
1. MichiGAN: Multi-Input-Conditioned Hair Image Generation for Portrait Editing [PDF] Abstract
2. Unsupervised Monocular Depth Learning in Dynamic Scenes [PDF] Abstract
3. What's in a Loss Function for Image Classification? [PDF] Abstract
4. Emotion Understanding in Videos Through Body, Context, and Visual-Semantic Embedding Loss [PDF] Abstract
5. Automatic Counting and Identification of Train Wagons Based on Computer Vision and Deep Learning [PDF] Abstract
6. All-Weather Object Recognition Using Radar and Infrared Sensing [PDF] Abstract
7. 3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations [PDF] Abstract
8. Exploring Dynamic Context for Multi-path Trajectory Prediction [PDF] Abstract
9. Experimental design for MRI by greedy policy search [PDF] Abstract
10. HOI Analysis: Integrating and Decomposing Human-Object Interaction [PDF] Abstract
11. Statistical Analysis of Signal-Dependent Noise: Application in Blind Localization of Image Splicing Forgery [PDF] Abstract
12. End-to-end Animal Image Matting [PDF] Abstract
13. Small Noisy and Perspective Face Detection using Deformable Symmetric Gabor Wavelet Network [PDF] Abstract
14. Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation [PDF] Abstract
15. PyraPose: Feature Pyramids for Fast and Accurate Object Pose Estimation under Domain Shift [PDF] Abstract
16. An Unsupervised Approach towards Varying Human Skin Tone Using Generative Adversarial Networks [PDF] Abstract
17. Correspondence Matrices are Underrated [PDF] Abstract
18. LIFI: Towards Linguistically Informed Frame Interpolation [PDF] Abstract
19. Volumetric Medical Image Segmentation: A 3D Deep Coarse-to-fine Framework and Its Adversarial Examples [PDF] Abstract
20. CNN based Multistage Gated Average Fusion (MGAF) for Human Action Recognition Using Depth and Inertial Sensors [PDF] Abstract
23. Can the state of relevant neurons in a deep neural networks serve as indicators for detecting adversarial attacks? [PDF] Abstract
26. A Comprehensive Comparison of End-to-End Approaches for Handwritten Digit String Recognition [PDF] Abstract
27. Perception Matters: Exploring Imperceptible and Transferable Anti-forensics for GAN-generated Fake Face Imagery Detection [PDF] Abstract
28. Development and Evaluation of a Deep Neural Network for Histologic Classification of Renal Cell Carcinoma on Biopsy and Surgical Resection Slides [PDF] Abstract
32. Automatic Myocardial Infarction Evaluation from Delayed-Enhancement Cardiac MRI using Deep Convolutional Networks [PDF] Abstract
36. CT-CAPS: Feature Extraction-based Automated Framework for COVID-19 Disease Identification from Chest CT Scans using Capsule Networks [PDF] Abstract
37. COVID-FACT: A Fully-Automated Capsule Network-based Framework for Identification of COVID-19 Cases from Chest CT scans [PDF] Abstract

Abstracts
1. MichiGAN: Multi-Input-Conditioned Hair Image Generation for Portrait Editing [PDF] Back to Contents
Zhentao Tan, Menglei Chai, Dongdong Chen, Jing Liao, Qi Chu, Lu Yuan, Sergey Tulyakov, Nenghai Yu
Abstract: Despite the recent success of face image generation with GANs, conditional hair editing remains challenging due to the under-explored complexity of its geometry and appearance. In this paper, we present MichiGAN (Multi-Input-Conditioned Hair Image GAN), a novel conditional image generation method for interactive portrait hair manipulation. To provide user control over every major hair visual factor, we explicitly disentangle hair into four orthogonal attributes, including shape, structure, appearance, and background. For each of them, we design a corresponding condition module to represent, process, and convert user inputs, and modulate the image generation pipeline in ways that respect the natures of different visual attributes. All these condition modules are integrated with the backbone generator to form the final end-to-end network, which allows fully-conditioned hair generation from multiple user inputs. Upon it, we also build an interactive portrait hair editing system that enables straightforward manipulation of hair by projecting intuitive and high-level user inputs such as painted masks, guiding strokes, or reference photos to well-defined condition representations. Through extensive experiments and evaluations, we demonstrate the superiority of our method regarding both result quality and user controllability. The code is available at this https URL.
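As a rough illustration of the per-attribute condition modules described above, the sketch below modulates generator features with a spatial condition map (a SPADE-style modulation; the channel sizes and module structure are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class ConditionModulation(nn.Module):
    """Toy condition module: a condition map (e.g., a hair structure map)
    predicts per-pixel scale and shift that modulate generator features.
    Channel sizes are hypothetical."""
    def __init__(self, feat_ch=64, cond_ch=3):
        super().__init__()
        self.to_gamma = nn.Conv2d(cond_ch, feat_ch, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(cond_ch, feat_ch, kernel_size=3, padding=1)

    def forward(self, feat, cond):
        # Spatially varying affine modulation of the backbone features.
        return feat * (1 + self.to_gamma(cond)) + self.to_beta(cond)

feat = torch.randn(1, 64, 32, 32)   # backbone generator features
cond = torch.randn(1, 3, 32, 32)    # one of the four condition inputs
out = ConditionModulation()(feat, cond)
```

Each of the four attributes (shape, structure, appearance, background) would get its own such module, which is how multiple user inputs can steer the same backbone.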
2. Unsupervised Monocular Depth Learning in Dynamic Scenes [PDF] Back to Contents
Hanhan Li, Ariel Gordon, Hang Zhao, Vincent Casser, Anelia Angelova
Abstract: We present a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects relative to the scene, with monocular photometric consistency being the sole source of supervision. We show that this apparently heavily underdetermined problem can be regularized by imposing the following prior knowledge about 3D translation fields: they are sparse, since most of the scene is static, and they tend to be constant for rigid moving objects. We show that this regularization alone is sufficient to train monocular depth prediction models that exceed the accuracy achieved in prior work for dynamic scenes, including methods that require semantic input. Code is at this https URL.
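A minimal numpy sketch of the two priors the abstract imposes on the translation field, assuming a per-pixel (H, W, 3) field; the paper's exact regularizers and weights may differ:

```python
import numpy as np

def translation_field_regularizer(t_field, alpha=1.0, beta=1.0):
    """Sketch of the two priors: (1) sparsity, since most of the scene is
    static (L1 penalty); (2) near-constancy on rigid movers (penalize
    spatial gradients). t_field has shape (H, W, 3)."""
    sparsity = np.abs(t_field).mean()
    smoothness = (np.abs(np.diff(t_field, axis=0)).mean()
                  + np.abs(np.diff(t_field, axis=1)).mean())
    return alpha * sparsity + beta * smoothness

loss = translation_field_regularizer(np.random.randn(128, 416, 3) * 0.01)
```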
3. What's in a Loss Function for Image Classification? [PDF] Back to Contents
Simon Kornblith, Honglak Lee, Ting Chen, Mohammad Norouzi
Abstract: It is common to use the softmax cross-entropy loss to train neural networks on classification datasets where a single class label is assigned to each example. However, it has been shown that modifying softmax cross-entropy with label smoothing or regularizers such as dropout can lead to higher performance. This paper studies a variety of loss functions and output layer regularization strategies on image classification tasks. We observe meaningful differences in model predictions, accuracy, calibration, and out-of-distribution robustness for networks trained with different objectives. However, differences in hidden representations of networks trained with different objectives are restricted to the last few layers; representational similarity reveals no differences among network layers that are not close to the output. We show that all objectives that improve over vanilla softmax loss produce greater class separation in the penultimate layer of the network, which potentially accounts for improved performance on the original task, but results in features that transfer worse to other tasks.
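For reference, here is vanilla softmax cross-entropy together with the label-smoothing variant the paper compares against (numpy sketch; spreading the smoothing mass over the non-target classes is one common convention):

```python
import numpy as np

def softmax_cross_entropy(logits, label, smoothing=0.0):
    """Cross-entropy against a (possibly smoothed) one-hot target."""
    n = logits.shape[-1]
    target = np.full(n, smoothing / (n - 1)) if smoothing else np.zeros(n)
    target[label] = 1.0 - smoothing
    # Numerically stable log-softmax.
    m = logits.max()
    log_probs = logits - m - np.log(np.exp(logits - m).sum())
    return -(target * log_probs).sum()

print(softmax_cross_entropy(np.array([2.0, 0.5, -1.0]), label=0))
print(softmax_cross_entropy(np.array([2.0, 0.5, -1.0]), label=0, smoothing=0.1))
```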
4. Emotion Understanding in Videos Through Body, Context, and Visual-Semantic Embedding Loss [PDF] Back to Contents
Panagiotis Paraskevas Filntisis, Niki Efthymiou, Gerasimos Potamianos, Petros Maragos
Abstract: We present our winning submission to the First International Workshop on Bodily Expressed Emotion Understanding (BEEU) challenge. Based on recent literature on the effect of context/environment on emotion, as well as visual representations with semantic meaning using word embeddings, we extend the framework of Temporal Segment Network to accommodate these. Our method is verified on the validation set of the Body Language Dataset (BoLD) and achieves 0.26235 Emotion Recognition Score on the test set, surpassing the previous best result of 0.2530.
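A sketch of the visual-semantic embedding idea mentioned above: pull the clip's visual feature toward the word embedding of its emotion label. The cosine formulation is an assumption; the paper's exact loss may differ.

```python
import numpy as np

def visual_semantic_embedding_loss(visual_feat, word_embeddings, label):
    """Cosine distance between a visual feature and the word embedding
    of the ground-truth emotion category."""
    v = visual_feat / np.linalg.norm(visual_feat)
    w = word_embeddings[label] / np.linalg.norm(word_embeddings[label])
    return 1.0 - float(v @ w)

emb = {"happiness": np.random.randn(300), "anger": np.random.randn(300)}
print(visual_semantic_embedding_loss(np.random.randn(300), emb, "happiness"))
```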
5. Automatic Counting and Identification of Train Wagons Based on Computer Vision and Deep Learning [PDF] Back to Contents
Rayson Laroca, Alessander Cidral Boslooper, David Menotti
Abstract: In this work, we present a robust and efficient solution for counting and identifying train wagons using computer vision and deep learning. The proposed solution is cost-effective and can easily replace solutions based on radiofrequency identification (RFID), which are known to have high installation and maintenance costs. According to our experiments, our two-stage methodology achieves impressive results on real-world scenarios, i.e., 100% accuracy in the counting stage and 99.7% recognition rate in the identification one. Moreover, the system is able to automatically reject some of the train wagons successfully counted, as they have damaged identification codes. The results achieved were surprising considering that the proposed system requires low processing power (i.e., it can run in low-end setups) and that we used a relatively small number of images to train our Convolutional Neural Network (CNN) for character recognition. The proposed method is registered, under number BR512020000808-9, with the National Institute of Industrial Property (Brazil).
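The two-stage methodology can be pictured as below; detect_wagons and recognize_code are hypothetical stand-ins for the paper's detection and CNN character-recognition models:

```python
def process_train_video(frames, detect_wagons, recognize_code):
    """Stage 1 counts wagons; stage 2 reads each wagon's identification
    code and rejects wagons whose codes are damaged (unreadable)."""
    count, accepted, rejected = 0, [], 0
    for frame in frames:
        # Assume the detector reports each wagon exactly once as it
        # passes a reference line in the frame.
        for box in detect_wagons(frame):
            count += 1
            code = recognize_code(frame, box)  # None when the code is damaged
            if code is None:
                rejected += 1
            else:
                accepted.append(code)
    return count, accepted, rejected
```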
6. All-Weather Object Recognition Using Radar and Infrared Sensing [PDF] Back to Contents
Marcel Sheeny
Abstract: Autonomous cars are an emergent technology which has the capacity to change human lives. The current sensor systems which are most capable of perception are based on optical sensors. For example, deep neural networks show outstanding results in recognising objects when used to process data from cameras and Light Detection And Ranging (LiDAR) sensors. However these sensors perform poorly under adverse weather conditions such as rain, fog, and snow due to the sensor wavelengths. This thesis explores new sensing developments based on long wave polarised infrared (IR) imagery and imaging radar to recognise objects. First, we developed a methodology based on Stokes parameters using polarised infrared data to recognise vehicles using deep neural networks. Second, we explored the potential of using only the power spectrum captured by low-THz radar sensors to perform object recognition in a controlled scenario. This latter work is based on a data-driven approach together with the development of a data augmentation method based on attenuation, range and speckle noise. Last, we created a new large-scale dataset in the "wild" with many different weather scenarios (sunny, overcast, night, fog, rain and snow) showing radar robustness to detect vehicles in adverse weather. High resolution radar and polarised IR imagery, combined with a deep learning approach, are shown as a potential alternative to current automotive sensing systems based on visible spectrum optical technology as they are more robust in severe weather and adverse light conditions.
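The Stokes-parameter representation mentioned above is standard; given intensities behind linear polarizers at four angles, it is computed as follows (the thesis's exact preprocessing may differ):

```python
import numpy as np

def linear_stokes(i0, i45, i90, i135):
    """Linear Stokes parameters from intensities at polarizer angles
    0, 45, 90 and 135 degrees, plus degree and angle of linear
    polarization, which make useful input channels for a CNN."""
    s0 = i0 + i90                  # total intensity
    s1 = i0 - i90                  # horizontal vs. vertical component
    s2 = i45 - i135                # diagonal components
    dolp = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-8)
    aolp = 0.5 * np.arctan2(s2, s1)
    return s0, s1, s2, dolp, aolp
```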
7. 3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations [PDF] Back to Contents
Mihir Prabhudesai, Shamit Lal, Hsiao-Yu Fish Tung, Adam W. Harley, Shubhankar Potdar, Katerina Fragkiadaki
Abstract: We propose a system that learns to detect objects and infer their 3D poses in RGB-D images. Many existing systems can identify objects and infer 3D poses, but they heavily rely on human labels and 3D annotations. The challenge here is to achieve this without relying on strong supervision signals. To address this challenge, we propose a model that maps RGB-D images to a set of 3D visual feature maps in a differentiable fully-convolutional manner, supervised by predicting views. The 3D feature maps correspond to a featurization of the 3D world scene depicted in the images. The object 3D feature representations are invariant to camera viewpoint changes or zooms, which means feature matching can identify similar objects under different camera viewpoints. We can compare the 3D feature maps of two objects by searching alignment across scales and 3D rotations, and, as a result of the operation, we can estimate pose and scale changes without the need for 3D pose annotations. We cluster object feature maps into a set of 3D prototypes that represent familiar objects in canonical scales and orientations. We then parse images by inferring the prototype identity and 3D pose for each detected object. We compare our method to numerous baselines that do not learn 3D feature visual representations or do not attempt to correspond features across scenes, and outperform them by a large margin in the tasks of object retrieval and object pose estimation. Thanks to the 3D nature of the object-centric feature maps, the visual similarity cues are invariant to 3D pose changes or small scale changes, which gives our method an advantage over 2D and 1D methods.
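A toy version of the alignment search described above, restricted to yaw rotations of a 3D feature map (the paper also searches scales and full 3D rotations; the inner-product score is an assumption):

```python
import numpy as np
from scipy.ndimage import rotate

def best_yaw_alignment(feat_a, feat_b, angles=range(0, 360, 30)):
    """Rotate feat_b about the vertical axis and keep the angle with
    the highest inner-product similarity to feat_a."""
    scores = []
    for ang in angles:
        rotated = rotate(feat_b, ang, axes=(0, 2), reshape=False, order=1)
        scores.append((feat_a * rotated).sum())
    return list(angles)[int(np.argmax(scores))]

a = np.random.rand(16, 16, 16)
b = rotate(a, 60, axes=(0, 2), reshape=False, order=1)
print(best_yaw_alignment(a, b))  # expect roughly 300, i.e. rotating back
```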
8. Exploring Dynamic Context for Multi-path Trajectory Prediction [PDF] Back to Contents
Hao Cheng, Wentong Liao, Xuejiao Tang, Michael Ying Yang, Monika Sester, Bodo Rosenhahn
Abstract: To accurately predict future positions of different agents in traffic scenarios is crucial for safely deploying intelligent autonomous systems in the real-world environment. However, it remains a challenge due to the behavior of a target agent being affected by other agents dynamically, and there being more than one socially possible path the agent could take. In this paper, we propose a novel framework, named Dynamic Context Encoder Network (DCENet). In our framework, first, the spatial context between agents is explored by using self-attention architectures. Then, two LSTM encoders are trained to learn temporal context between steps by taking the observed trajectories and the extracted dynamic spatial context as input, respectively. The spatial-temporal context is encoded into a latent space using a Conditional Variational Auto-Encoder (CVAE) module. Finally, a set of future trajectories for each agent is predicted conditioned on the learned spatial-temporal context by sampling from the latent space, repeatedly. DCENet is evaluated on the largest and most challenging trajectory forecasting benchmark Trajnet and reports a new state-of-the-art performance. It also demonstrates superior performance evaluated on the benchmark InD for mixed traffic at intersections. A series of ablation studies are conducted to validate the effectiveness of each proposed module. Our code is available at git@github.com:tanjatang/DCENet.git.
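The multi-path prediction step can be sketched as sampling the CVAE latent repeatedly and decoding each sample into a trajectory; dimensions and the decoder architecture below are hypothetical, not DCENet's actual design:

```python
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    """Decode an encoded spatial-temporal context plus a latent sample
    into a future trajectory of (x, y) positions."""
    def __init__(self, ctx_dim=64, z_dim=16, horizon=12):
        super().__init__()
        self.horizon = horizon
        self.mlp = nn.Sequential(
            nn.Linear(ctx_dim + z_dim, 128), nn.ReLU(),
            nn.Linear(128, horizon * 2))

    def forward(self, ctx, z):
        return self.mlp(torch.cat([ctx, z], dim=-1)).view(-1, self.horizon, 2)

decoder = TrajectoryDecoder()
ctx = torch.randn(1, 64)  # encoded observation + dynamic spatial context
paths = [decoder(ctx, torch.randn(1, 16)) for _ in range(20)]  # 20 futures
```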
9. Experimental design for MRI by greedy policy search [PDF] Back to Contents
Tim Bakker, Herke van Hoof, Max Welling
Abstract: In today's clinical practice, magnetic resonance imaging (MRI) is routinely accelerated through subsampling of the associated Fourier domain. Currently, the construction of these subsampling strategies - known as experimental design - relies primarily on heuristics. We propose to learn experimental design strategies for accelerated MRI with policy gradient methods. Unexpectedly, our experiments show that a simple greedy approximation of the objective leads to solutions nearly on-par with the more general non-greedy approach. We offer a partial explanation for this phenomenon rooted in greater variance in the non-greedy objective's gradient estimates, and experimentally verify that this variance hampers non-greedy models in adapting their policies to individual MR images. We empirically show that this adaptivity is key to improving subsampling designs.
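The greedy approximation amounts to building the subsampling mask one measurement at a time, always taking the candidate with the largest immediate gain. Below is a model-free sketch of that principle only; score_fn is a hypothetical reconstruction-quality oracle, whereas the paper learns an adaptive policy with policy gradients:

```python
import numpy as np

def greedy_subsampling_mask(score_fn, n_lines, budget):
    """Greedily select k-space lines: at each step, add the line whose
    inclusion yields the best reconstruction score."""
    mask = np.zeros(n_lines, dtype=bool)
    for _ in range(budget):
        candidates = np.flatnonzero(~mask)
        gains = []
        for c in candidates:
            trial = mask.copy()
            trial[c] = True
            gains.append(score_fn(trial))
        mask[candidates[int(np.argmax(gains))]] = True
    return mask

# Toy score: prefer low-frequency (center) lines.
score = lambda m: -np.abs(np.flatnonzero(m) - 64).sum()
print(np.flatnonzero(greedy_subsampling_mask(score, 128, 8)))
```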
10. HOI Analysis: Integrating and Decomposing Human-Object Interaction [PDF] Back to Contents
Yong-Lu Li, Xinpeng Liu, Xiaoqian Wu, Yizhuo Li, Cewu Lu
Abstract: Human-Object Interaction (HOI) consists of human, object and implicit interaction/verb. Different from previous methods that directly map pixels to HOI semantics, we propose a novel perspective for HOI learning in an analytical manner. In analogy to Harmonic Analysis, whose goal is to study how to represent the signals with the superposition of basic waves, we propose the HOI Analysis. We argue that coherent HOI can be decomposed into isolated human and object. Meanwhile, isolated human and object can also be integrated into coherent HOI again. Moreover, transformations between human-object pairs with the same HOI can also be easier approached with integration and decomposition. As a result, the implicit verb will be represented in the transformation function space. In light of this, we propose an Integration-Decomposition Network (IDN) to implement the above transformations and achieve state-of-the-art performance on widely-used HOI detection benchmarks. Code is available at this https URL.
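A toy rendering of the integration-decomposition idea: fuse isolated human and object features into a joint HOI feature and reconstruct them back, so the implicit verb lives in the transformation. Layer sizes are assumptions; the real IDN is considerably more elaborate.

```python
import torch
import torch.nn as nn

class IDNBlock(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.integrate = nn.Linear(2 * dim, dim)   # (human, object) -> HOI
        self.decompose = nn.Linear(dim, 2 * dim)   # HOI -> (human, object)

    def forward(self, human, obj):
        hoi = self.integrate(torch.cat([human, obj], dim=-1))
        human_rec, obj_rec = self.decompose(hoi).chunk(2, dim=-1)
        return hoi, human_rec, obj_rec

block = IDNBlock()
hoi, h_rec, o_rec = block(torch.randn(4, 256), torch.randn(4, 256))
# Training would penalize reconstruction error so that integration and
# decomposition stay consistent with each other.
```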
11. Statistical Analysis of Signal-Dependent Noise: Application in Blind Localization of Image Splicing Forgery [PDF] Back to Contents
Mian Zou, Heng Yao, Chuang Qin, Xinpeng Zhang
Abstract: Visual noise is often regarded as a disturbance in image quality, whereas it can also provide a crucial clue for image-based forensic tasks. Conventionally, noise is assumed to comprise an additive Gaussian model to be estimated and then used to reveal anomalies. However, for real sensor noise, it should be modeled as signal-dependent noise (SDN). In this work, we apply SDN to splicing forgery localization tasks. Through statistical analysis of the SDN model, we assume that noise can be modeled as a Gaussian approximation for a certain brightness and propose a likelihood model for a noise level function. By building a maximum a posterior Markov random field (MAP-MRF) framework, we exploit the likelihood of noise to reveal the alien region of spliced objects, with a probability combination refinement strategy. To ensure a completely blind detection, an iterative alternating method is adopted to estimate the MRF parameters. Experimental results demonstrate that our method is effective and provides a comparative localization performance.
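To make the signal-dependent noise model concrete: under the common Poissonian-Gaussian assumption, noise variance grows linearly with brightness, Var = a*I + b. The sketch below fits such a noise level function from a clean/noisy pair for illustration only; the paper estimates it blindly, without a clean reference.

```python
import numpy as np

def fit_noise_level_function(clean, noisy, n_bins=32, min_px=50):
    """Bin pixels by brightness, measure residual variance per bin,
    and fit Var = a * I + b by least squares."""
    resid = (noisy - clean).ravel()
    intensity = clean.ravel()
    edges = np.linspace(intensity.min(), intensity.max(), n_bins + 1)
    idx = np.digitize(intensity, edges) - 1
    means, variances = [], []
    for b in range(n_bins):
        sel = idx == b
        if sel.sum() >= min_px:
            means.append(intensity[sel].mean())
            variances.append(resid[sel].var())
    a, b = np.polyfit(means, variances, 1)
    return a, b

clean = np.random.rand(256, 256)
noisy = clean + np.random.randn(256, 256) * np.sqrt(0.01 * clean + 0.001)
print(fit_noise_level_function(clean, noisy))  # roughly (0.01, 0.001)
```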
12. End-to-end Animal Image Matting [PDF] Back to Contents
Jizhizi Li, Jing Zhang, Stephen J. Maybank, Dacheng Tao
Abstract: Extracting accurate foreground animals from natural animal images benefits many downstream applications such as film production and augmented reality. However, the various appearance and furry characteristics of animals challenge existing matting methods, which usually require extra user inputs such as trimap or scribbles. To resolve these problems, we study the distinct roles of semantics and details for image matting and decompose the task into two parallel sub-tasks: high-level semantic segmentation and low-level details matting. Specifically, we propose a novel Glance and Focus Matting network (GFM), which employs a shared encoder and two separate decoders to learn both tasks in a collaborative manner for end-to-end animal image matting. Besides, we establish a novel Animal Matting dataset (AM-2k) containing 2,000 high-resolution natural animal images from 20 categories along with manually labeled alpha mattes. Furthermore, we investigate the domain gap issue between composite images and natural images systematically by conducting comprehensive analyses of various discrepancies between foreground and background images. We find that a carefully designed composition route RSSN that aims to reduce the discrepancies can lead to a better model with remarkable generalization ability. Comprehensive empirical studies on AM-2k demonstrate that GFM outperforms state-of-the-art methods and effectively reduces the generalization error.
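The collaboration between the two decoders can be caricatured as follows: trust the semantic (glance) prediction where it is confident and the detail (focus) matte in the uncertain transition band. Thresholds are illustrative; GFM's actual fusion is learned.

```python
import numpy as np

def glance_focus_fusion(semantic, detail, lo=0.1, hi=0.9):
    """semantic: high-level foreground probability; detail: low-level
    alpha matte. Both are (H, W) arrays in [0, 1]."""
    alpha = semantic.copy()
    transition = (semantic > lo) & (semantic < hi)  # uncertain band
    alpha[transition] = detail[transition]
    return alpha
```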
13. Small Noisy and Perspective Face Detection using Deformable Symmetric Gabor Wavelet Network [PDF] Back to Contents
Sherzod Salokhiddinov, Seungkyu Lee
Abstract: Face detection and tracking in low resolution images is not a trivial task due to the limitation in the appearance features available for face characterization. Moreover, facial expression gives additional distortion on this small and noisy face. In this paper, we propose a deformable symmetric Gabor wavelet network face model for face detection in low resolution images. Our model optimizes the rotation, translation, dilation, perspective and partial deformation amount of the face model with symmetry constraints. Symmetry constraints help our model to be more robust to noise and distortion. Experimental results on our low resolution face image dataset and videos show promising face detection and tracking results under various challenging conditions.
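The building block of a Gabor wavelet network is the 2D Gabor kernel; a standard textbook parameterization is below (the paper's symmetric, deformable variant adds constraints on top of this primitive):

```python
import numpy as np

def gabor_kernel(size, theta, sigma, wavelength, psi=0.0, gamma=1.0):
    """Real part of a 2D Gabor wavelet: a Gaussian envelope times an
    oriented sinusoid. `size` is assumed odd."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength + psi)

kernel = gabor_kernel(size=31, theta=np.pi / 4, sigma=5.0, wavelength=10.0)
```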
14. Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation [PDF] Back to Contents
Yangxin Wu, Gengwei Zhang, Hang Xu, Xiaodan Liang, Liang Lin
Abstract: Panoptic segmentation is posed as a new popular test-bed for the state-of-the-art holistic scene understanding methods with the requirement of simultaneously segmenting both foreground things and background stuff. The state-of-the-art panoptic segmentation network exhibits high structural complexity in different network components, i.e. backbone, proposal-based foreground branch, segmentation-based background branch, and feature fusion module across branches, which heavily relies on expert knowledge and tedious trials. In this work, we propose an efficient, cooperative and highly automated framework to simultaneously search for all main components including backbone, segmentation branches, and feature fusion module in a unified panoptic segmentation pipeline based on the prevailing one-shot Network Architecture Search (NAS) paradigm. Notably, we extend the common single-task NAS into the multi-component scenario by taking the advantage of the newly proposed intra-modular search space and problem-oriented inter-modular search space, which helps us to obtain an optimal network architecture that not only performs well in both instance segmentation and semantic segmentation tasks but also be aware of the reciprocal relations between foreground things and background stuff classes. To relieve the vast computation burden incurred by applying NAS to complicated network architectures, we present a novel path-priority greedy search policy to find a robust, transferrable architecture with significantly reduced searching overhead. Our searched architecture, namely Auto-Panoptic, achieves the new state-of-the-art on the challenging COCO and ADE20K benchmarks. Moreover, extensive experiments are conducted to demonstrate the effectiveness of path-priority policy and transferability of Auto-Panoptic across different datasets. Codes and models are available at: this https URL.
15. PyraPose: Feature Pyramids for Fast and Accurate Object Pose Estimation under Domain Shift [PDF] Back to Contents
Stefan Thalhammer, Markus Leitner, Timothy Patten, Markus Vincze
Abstract: Object pose estimation enables robots to understand and interact with their environments. Training with synthetic data is necessary in order to adapt to novel situations. Unfortunately, pose estimation under domain shift, i.e., training on synthetic data and testing in the real world, is challenging. Deep learning-based approaches currently perform best when using encoder-decoder networks but typically do not generalize to new scenarios with different scene characteristics. We argue that patch-based approaches, instead of encoder-decoder networks, are more suited for synthetic-to-real transfer because local to global object information is better represented. To that end, we present a novel approach based on a specialized feature pyramid network to compute multi-scale features for creating pose hypotheses on different feature map resolutions in parallel. Our single-shot pose estimation approach is evaluated on multiple standard datasets and outperforms the state of the art by up to 35%. We also perform grasping experiments in the real world to demonstrate the advantage of using synthetic data to generalize to novel environments.
16. An Unsupervised Approach towards Varying Human Skin Tone Using Generative Adversarial Networks [PDF] Back to Contents
Debapriya Roy, Diganta Mukherjee, Bhabatosh Chanda
Abstract: With the increasing popularity of augmented and virtual reality, retailers are now focusing more on customer satisfaction to increase the amount of sales. Although augmented reality is not a new concept, it has gained much needed attention over the past few years. Our present work is targeted towards this direction and may be used to enhance user experience in various virtual and augmented reality based applications. We propose a model to change the skin tone of a person. Given any input image of a person or a group of persons with some value indicating the desired change of skin color towards fairness or darkness, this method can change the skin tone of the persons in the image. This is an unsupervised method and also unconstrained in terms of pose, illumination, number of persons in the image, etc. The goal of this work is to reduce the time and effort which is generally required for changing the skin tone using existing applications (e.g., Photoshop) by professionals or novices. To establish the efficacy of this method we have compared our result with that of some popular photo editors and also with the results of some existing benchmark methods related to human attribute manipulation. Rigorous experiments on different datasets show the effectiveness of this method in terms of synthesizing perceptually convincing outputs.
17. Correspondence Matrices are Underrated [PDF] Back to Contents
Tejas Zodage, Rahul Chakwate, Vinit Sarode, Rangaprasad Arun Srivatsan, Howie Choset
Abstract: Point-cloud registration (PCR) is an important task in various applications such as robotic manipulation, augmented and virtual reality, SLAM, etc. PCR is an optimization problem involving minimization over two different types of interdependent variables: transformation parameters and point-to-point correspondences. Recent developments in deep-learning have produced computationally fast approaches for PCR. The loss functions that are optimized in these networks are based on the error in the transformation parameters. We hypothesize that these methods would perform significantly better if they calculated their loss function using correspondence error instead of only using error in transformation parameters. We define correspondence error as a metric based on incorrectly matched point pairs. We provide a fundamental explanation for why this is the case and test our hypothesis by modifying existing methods to use correspondence-based loss instead of transformation-based loss. These experiments show that the modified networks converge faster and register more accurately even at larger misalignment when compared to the original networks.
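The paper's hypothesis is easy to state in code: measure error on the transformed points (a correspondence-style loss) rather than on the transformation parameters. A numpy sketch, not the paper's exact formulation:

```python
import numpy as np

def registration_losses(points, R_pred, t_pred, R_gt, t_gt):
    """points: (N, 3) source cloud. Returns (correspondence-based,
    transformation-based) errors for the same prediction."""
    pred = points @ R_pred.T + t_pred
    gt = points @ R_gt.T + t_gt
    corr_loss = np.linalg.norm(pred - gt, axis=1).mean()   # on points
    param_loss = (np.linalg.norm(R_pred - R_gt)            # on parameters
                  + np.linalg.norm(t_pred - t_gt))
    return corr_loss, param_loss
```

One intuition for the reported benefit: the correspondence loss weights rotation error by how far points lie from the origin, so its gradients reflect the geometric consequences of a misestimate.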
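The correspondence error the proposed loss is built on can be sketched as the fraction of predicted matches that disagree with the ground-truth nearest-neighbour matches under the true rigid transform; the matching convention and argument names here are our assumptions:

```python
import numpy as np

def correspondence_error(src, tgt, pred_match, R_gt, t_gt):
    """Fraction of incorrectly matched point pairs. src, tgt: (N, 3) and
    (M, 3) point clouds; pred_match[i] is the index in tgt that the network
    matched to src[i]; (R_gt, t_gt) is the ground-truth rigid transform.
    A sketch under assumed conventions."""
    src_gt = src @ R_gt.T + t_gt                   # where src truly lands
    d = np.linalg.norm(src_gt[:, None, :] - tgt[None, :, :], axis=-1)
    gt_match = d.argmin(axis=1)                    # true nearest neighbour
    return float(np.mean(pred_match != gt_match))  # in [0, 1]
```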
18. LIFI: Towards Linguistically Informed Frame Interpolation [PDF] 返回目录
Aradhya Neeraj Mathur, Devansh Batra, Yaman Kumar, Rajiv Ratn Shah, Roger Zimmermann, Amanda Stent
Abstract: In this work, we explore a new problem of frame interpolation for speech videos. Such content today forms a major form of online communication. We try to solve this problem by using several deep learning video generation algorithms to generate the missing frames. We also provide examples where computer vision models, despite showing high performance on conventional non-linguistic metrics, fail to produce faithful interpolations of speech. With this motivation, we provide a new set of linguistically-informed metrics specifically targeted at the problem of speech video interpolation. We also release several datasets to test the speech understanding of computer vision video generation models.
19. Volumetric Medical Image Segmentation: A 3D Deep Coarse-to-fine Framework and Its Adversarial Examples [PDF] 返回目录
Yingwei Li, Zhuotun Zhu, Yuyin Zhou, Yingda Xia, Wei Shen, Elliot K. Fishman, Alan L. Yuille
Abstract: Although deep neural networks have been a dominant method for many 2D vision tasks, it is still challenging to apply them to 3D tasks such as medical image segmentation, due to the limited amount of annotated 3D data and limited computational resources. In this chapter, by rethinking the strategy of applying 3D Convolutional Neural Networks to segment medical images, we propose a novel 3D-based coarse-to-fine framework to efficiently tackle these challenges. The proposed 3D-based framework outperforms its 2D counterparts by a large margin since it can leverage the rich spatial information along all three axes. We further analyze the threat of adversarial attacks on the proposed framework and show how to defend against the attack. We conduct experiments on three datasets, the NIH pancreas dataset, the JHMI pancreas dataset and the JHMI pathological cyst dataset, where the first two and the last one contain healthy and pathological pancreases respectively, and achieve the current state-of-the-art in terms of Dice-Sorensen Coefficient (DSC) on all of them. In particular, on the NIH pancreas segmentation dataset, we outperform the previous best by an average of over $2\%$, and the worst case is improved by $7\%$ to reach almost $70\%$, which indicates the reliability of our framework in clinical applications.
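For reference, the evaluation metric quoted above is the Dice-Sorensen Coefficient, DSC = 2|A ∩ B| / (|A| + |B|); a minimal sketch for binary segmentation masks:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice-Sorensen Coefficient for two binary masks of equal shape
    (1 = organ, 0 = background). The eps term guards against empty masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))
```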
20. CNN based Multistage Gated Average Fusion (MGAF) for Human Action Recognition Using Depth and Inertial Sensors [PDF] 返回目录
Zeeshan Ahmad, Naimul Khan
Abstract: Convolutional Neural Networks (CNNs) provide leverage to extract and fuse features from all layers of their architecture. However, extracting and fusing intermediate features from different layers of a CNN remains uninvestigated for Human Action Recognition (HAR) using depth and inertial sensors. To get the maximum benefit of accessing all the CNN's layers, in this paper we propose a novel Multistage Gated Average Fusion (MGAF) network which extracts and fuses features from all layers of the CNN using our novel and computationally efficient Gated Average Fusion (GAF) network, a decisive integral element of MGAF. At the input of the proposed MGAF, we transform the depth and inertial sensor data into images called sequential front view images (SFI) and signal images (SI), respectively. The SFI are formed from the front-view information generated by the depth data. A CNN is employed to extract feature maps from both input modalities. The GAF network fuses the extracted features effectively while preserving the dimensionality of the fused features. The proposed MGAF network is structurally extensible and can be unfolded to more than two modalities. Experiments on three publicly available multimodal HAR datasets demonstrate that the proposed MGAF outperforms previous state-of-the-art fusion methods for depth-inertial HAR in terms of recognition accuracy while being computationally much more efficient. We increase the accuracy by an average of 1.5 percent while reducing the computational cost by approximately 50 percent over the previous state of the art.
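A hedged sketch of what a gated average fusion block could look like: a learned per-location gate convexly combines two same-shaped modality feature maps, preserving their dimensionality as the abstract requires. The 1x1-convolution gate is our assumption, not the paper's exact design:

```python
import torch
import torch.nn as nn

class GatedAverageFusion(nn.Module):
    """Gated average fusion sketch: a sigmoid gate computed from the
    concatenated inputs convexly mixes two feature maps, so the fused
    output keeps their channel/spatial dimensionality."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_depth: torch.Tensor, feat_inertial: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([feat_depth, feat_inertial], dim=1))
        return g * feat_depth + (1.0 - g) * feat_inertial

# fused = GatedAverageFusion(64)(a, b)  # a, b: (N, 64, H, W)
```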
21. SMOT: Single-Shot Multi Object Tracking [PDF] 返回目录
Wei Li, Yuanjun Xiong, Shuo Yang, Siqi Deng, Wei Xia
Abstract: We present the single-shot multi-object tracker (SMOT), a new tracking framework that converts any single-shot detector (SSD) model into an online multiple object tracker, emphasizing simultaneous detection and tracking of object paths. Contrary to existing tracking-by-detection approaches, which suffer from errors made by the object detectors, SMOT adopts the recently proposed scheme of tracking by re-detection. We combine this scheme with SSD detectors by proposing a novel tracking anchor assignment module. With this design, SMOT is able to generate tracklets with a constant per-frame runtime. A lightweight linkage algorithm is then used for online tracklet linking. On three benchmarks of object tracking, Hannah, Music Videos, and MOT17, the proposed SMOT achieves state-of-the-art performance.
22. Loss-rescaling VQA: Revisiting Language Prior Problem from a Class-imbalance View [PDF] 返回目录
Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Qi Tian
Abstract: Recent studies have pointed out that many well-developed Visual Question Answering (VQA) models are heavily affected by the language prior problem, which refers to making predictions based on the co-occurrence pattern between textual questions and answers instead of reasoning over visual contents. To tackle it, most existing methods focus on enhancing visual feature learning to reduce this superficial textual shortcut's influence on VQA model decisions. However, limited effort has been devoted to providing an explicit interpretation of its inherent cause. The research community thus lacks good guidance for moving forward in a purposeful way, resulting in perplexity over model construction when trying to overcome this non-trivial problem. In this paper, we propose to interpret the language prior problem in VQA from a class-imbalance view. Concretely, we design a novel interpretation scheme whereby the loss on mis-predicted frequent and sparse answers of the same question type is distinctly exhibited during the late training phase. It explicitly reveals why a VQA model tends to produce a frequent yet obviously wrong answer to a given question whose right answer is sparse in the training set. Based upon this observation, we further develop a novel loss re-scaling approach that assigns a different weight to each answer, based on training data statistics, for computing the final loss. We apply our approach to three baselines, and the experimental results on two VQA-CP benchmark datasets evidently demonstrate its effectiveness. In addition, we also justify the validity of the class-imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
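The loss re-scaling idea can be sketched as a frequency-weighted cross-entropy; inverse answer frequency is one plausible instantiation of "weights based on training data statistics" (the paper's exact scheme may differ):

```python
import torch
import torch.nn.functional as F

def rescaled_vqa_loss(logits, answers, answer_freq):
    """Loss re-scaling sketch: weight each answer inversely to its training
    frequency so sparse answers are not drowned out by frequent ones.
    logits: (batch, num_answers); answers: (batch,) class indices;
    answer_freq: (num_answers,) counts from the training set."""
    weights = 1.0 / answer_freq.clamp(min=1).float()
    weights = weights / weights.sum() * len(weights)   # normalize to mean 1
    return F.cross_entropy(logits, answers, weight=weights)
```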
23. Can the state of relevant neurons in a deep neural networks serve as indicators for detecting adversarial attacks? [PDF] 返回目录
Roger Granda, Tinne Tuytelaars, Jose Oramas
Abstract: We present a method for adversarial attack detection based on the inspection of a sparse set of neurons. We follow the hypothesis that adversarial attacks introduce imperceptible perturbations in the input and that these perturbations change the state of neurons relevant to the concepts modelled by the attacked model. Therefore, monitoring the status of these neurons would enable the detection of adversarial attacks. Focusing on the image classification task, our method identifies neurons that are relevant to the classes predicted by the model. A deeper qualitative inspection of this sparse set of neurons indicates that their state changes in the presence of adversarial samples. Moreover, quantitative results from our empirical evaluation indicate that our method is capable of recognizing adversarial samples, produced by state-of-the-art attack methods, with accuracy comparable to that of state-of-the-art detectors.
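The monitoring step can be sketched with a forward hook that records the states of a chosen sparse set of neurons; a detector fit on clean data (e.g., a threshold or small classifier) would then flag anomalous states. The layer choice and index convention are assumptions:

```python
import torch

def record_neuron_states(model, layer, unit_idx, x):
    """Record the activations of a sparse set of 'relevant' neurons
    (positions `unit_idx` in the flattened output of `layer`) for a batch
    of inputs x. A sketch of the monitoring idea, not the paper's code."""
    states = {}
    def hook(_module, _inputs, out):
        states["z"] = out.flatten(1)[:, unit_idx].detach()
    h = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(x)
    h.remove()
    return states["z"]   # shape: (batch, len(unit_idx))
```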
24. PAL : Pretext-based Active Learning [PDF] 返回目录
Shubhang Bhatnagar, Darshan Tank, Sachin Goyal, Amit Sethi
Abstract: When obtaining labels is expensive, the requirement of a large labeled training data set for deep learning can be mitigated by active learning. Active learning refers to the development of algorithms that judiciously pick limited subsets of unlabeled samples to be sent for labeling by an oracle. We propose an intuitive active learning technique that, in addition to the task neural network (e.g., for classification), uses an auxiliary self-supervised neural network that assesses the utility of an unlabeled sample for inclusion in the labeled set. Our core idea is that the difficulty an auxiliary network trained on labeled samples has in solving a self-supervision task on an unlabeled sample represents the utility of obtaining that sample's label. Specifically, we assume that an unlabeled image on which the precision of predicting a randomly applied geometric transform is low must be out of the distribution represented by the current set of labeled images. These images will therefore maximize the relative information gain when labeled by the oracle. We also demonstrate that augmenting the auxiliary network with task-specific training further improves the results. We demonstrate strong performance on a range of widely used datasets and establish a new state of the art for active learning. We also make our code publicly available to encourage further research.
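A sketch of the acquisition rule: score each unlabeled image by the auxiliary network's loss at predicting a randomly applied geometric transform, then send the highest-loss images to the oracle. Rotation prediction is a common pretext task and our assumption here:

```python
import torch
import torch.nn.functional as F

def pretext_scores(aux_net, images):
    """Score unlabeled images (N, C, H, W) by how badly an auxiliary
    self-supervised network predicts a random rotation (one of four).
    A high loss suggests the image lies outside the labeled distribution
    and is worth sending to the oracle. A sketch of the idea."""
    k = torch.randint(0, 4, (images.size(0),))           # rotation labels
    rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                           for img, r in zip(images, k)])
    with torch.no_grad():
        logits = aux_net(rotated)                        # (N, 4)
    return F.cross_entropy(logits, k, reduction="none")  # per-sample score

# query_idx = pretext_scores(aux, pool).topk(budget).indices
```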
25. Detecting small polyps using a Dynamic SSD-GAN [PDF] 返回目录
Daniel C. Ohrenstein, Patrick Brandao, Daniel Toth, Laurence Lovat, Danail Stoyanov, Peter Mountney
Abstract: Endoscopic examinations are used to inspect the throat, stomach and bowel for polyps which could develop into cancer. Machine learning systems can be trained to process colonoscopy images and detect polyps. However, these systems tend to perform poorly on objects which appear visually small in the images. It is shown here that combining the single-shot detector as a region proposal network with an adversarially-trained generator to upsample small region proposals can significantly improve the detection of visually-small polyps. The Dynamic SSD-GAN pipeline introduced in this paper achieved a 12% increase in sensitivity on visually-small polyps compared to a conventional FCN baseline.
26. A Comprehensive Comparison of End-to-End Approaches for Handwritten Digit String Recognition [PDF] 返回目录
Andre G. Hochuli, Alceu S. Britto Jr, David A. Saji, Jose M. Saavedra, Robert Sabourin, Luiz S. Oliveira
Abstract: Over the last decades, most approaches proposed for handwritten digit string recognition (HDSR) have resorted to digit segmentation, which is dominated by heuristics, thereby imposing substantial constraints on the final performance. Few of them have been based on segmentation-free strategies, where each pixel column has a potential cut location. Recently, segmentation-free strategies have added another perspective to the problem, leading to promising results. However, these strategies still show some limitations when dealing with a large number of touching digits. To bridge the resulting gap, in this paper we hypothesize that a string of digits can be approached as a sequence of objects. We thus evaluate different end-to-end approaches to solve the HDSR problem, particularly in two verticals: those based on object detection (e.g., Yolo and RetinaNet) and those based on sequence-to-sequence representation (CRNN). The main contribution of this work lies in its provision of a comprehensive comparison, with a critical analysis of the above-mentioned strategies, on five benchmarks commonly used to assess HDSR, including the challenging Touching Pair dataset, NIST SD19, and two real-world datasets (CAR and CVL) proposed for the ICFHR 2014 competition on HDSR. Our results show that the Yolo model compares favorably against segmentation-free models, with the advantage of a shorter pipeline that minimizes the presence of heuristics-based models. It achieved a 97%, 96%, and 84% recognition rate on the NIST-SD19, CAR, and CVL datasets, respectively.
27. Perception Matters: Exploring Imperceptible and Transferable Anti-forensics for GAN-generated Fake Face Imagery Detection [PDF] 返回目录
Yongwei Wang, Xin Ding, Li Ding, Rabab Ward, Z. Jane Wang
Abstract: Recently, generative adversarial networks (GANs) have become able to generate photo-realistic fake facial images that are perceptually indistinguishable from real face photos, promoting research on fake face detection. Though fake face forensics can achieve high detection accuracy, its anti-forensic counterparts are less investigated. Here we explore more \textit{imperceptible} and \textit{transferable} anti-forensics for fake face imagery detection based on adversarial attacks. Since facial and background regions are often smooth, even a small perturbation can cause noticeable perceptual impairment in fake face images. This makes existing adversarial attacks ineffective as anti-forensic methods. Our perturbation analysis reveals the intuitive reason for this perceptual degradation when directly applying existing attacks. We then propose a novel adversarial attack method, better suited to image anti-forensics, in a transformed color domain that takes visual perception into account. Simple yet effective, the proposed method can fool both deep-learning and non-deep-learning based forensic detectors, achieving a higher attack success rate and significantly improved visual quality. Specifically, when adversaries consider imperceptibility as a constraint, the proposed anti-forensic method can improve the average attack success rate by around 30\% on fake face images over two baseline attacks. \textit{More imperceptible} and \textit{more transferable}, the proposed method raises new security concerns for fake face imagery detection. We have released our code for public use, and hopefully the proposed method can be further explored in related forensic applications as an anti-forensic benchmark.
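A hedged illustration of a perception-aware, color-domain attack: an FGSM-style perturbation projected onto the luma-free (chroma) subspace, so that brightness, where the eye is most sensitive, is untouched to first order. This is a stand-in for the idea, not the paper's exact transformed-domain method:

```python
import torch

Y = torch.tensor([0.299, 0.587, 0.114])   # ITU-R BT.601 luma weights

def chroma_fgsm(model, loss_fn, x, label, eps=4 / 255):
    """Craft a gradient-sign perturbation on x (N, 3, H, W in [0, 1]),
    then remove its per-pixel luminance component so only chroma changes.
    A sketch; the paper's color transform and optimizer may differ."""
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), label).backward()
    delta = eps * x.grad.sign()
    y = Y.view(1, 3, 1, 1).to(delta)
    # Project delta onto the subspace orthogonal to the luma direction.
    delta = delta - y * (delta * y).sum(1, keepdim=True) / (y * y).sum()
    return (x.detach() + delta).clamp(0, 1)
```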
28. Development and Evaluation of a Deep Neural Network for Histologic Classification of Renal Cell Carcinoma on Biopsy and Surgical Resection Slides [PDF] 返回目录
Mengdan Zhu, Bing Ren, Ryland Richards, Matthew Suriawinata, Naofumi Tomita, Saeed Hassanpour
Abstract: Renal cell carcinoma (RCC) is the most common renal cancer in adults. The histopathologic classification of RCC is essential for the diagnosis, prognosis, and management of patients. Recognition and classification of the complex histologic patterns of RCC on biopsy and surgical resection slides under a microscope remains a heavily specialized, error-prone, and time-consuming task for pathologists. In this study, we developed a deep neural network model that can accurately classify digitized surgical resection slides and biopsy slides into five related classes: clear cell RCC, papillary RCC, chromophobe RCC, renal oncocytoma, and normal. In addition to the whole-slide classification pipeline, we visualized the identified indicative regions and features on slides by reprocessing patch-level classification results, to ensure the explainability of our diagnostic model. We evaluated our model on independent test sets of 78 surgical resection whole slides and 79 biopsy slides from our tertiary medical institution, and 69 randomly selected surgical resection slides from The Cancer Genome Atlas (TCGA) database. The average area under the curve (AUC) of our classifier on the internal resection slides, internal biopsy slides, and external TCGA slides is 0.98, 0.98 and 0.99, respectively. Our results suggest high generalizability of our approach across different data sources and specimen types. More importantly, our model has the potential to assist pathologists by (1) automatically pre-screening slides to reduce false-negative cases, (2) highlighting regions of importance on digitized slides to accelerate diagnosis, and (3) providing objective and accurate diagnosis as a second opinion.
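The patch-to-slide reprocessing can be illustrated by aggregating patch-level softmax outputs into a single slide label; simple averaging is our assumption for the aggregation rule, which the abstract does not specify:

```python
import numpy as np

CLASSES = ["clear cell RCC", "papillary RCC", "chromophobe RCC",
           "renal oncocytoma", "normal"]

def slide_prediction(patch_probs: np.ndarray):
    """Aggregate patch-level softmax outputs (num_patches, 5) into one
    whole-slide prediction by averaging. The per-class patch map can also
    be rendered back onto the slide for explainability. A sketch."""
    slide_probs = patch_probs.mean(axis=0)
    return CLASSES[int(slide_probs.argmax())], slide_probs
```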
29. Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents [PDF] 返回目录
Gregory Yauney, Jack Hessel, David Mimno
Abstract: Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations. Such granular annotation is rare, expensive, and unavailable in most domain-specific contexts. In contrast, unlabeled multi-image, multi-sentence documents are abundant. Can lexical grounding be learned from such documents, even though they have significant lexical and visual overlap? Working with a case study dataset of real estate listings, we demonstrate the challenge of distinguishing highly correlated grounded terms, such as "kitchen" and "bedroom", and introduce metrics to assess this document similarity. We present a simple unsupervised clustering-based method that increases precision and recall beyond object detection and image tagging baselines when evaluated on labeled subsets of the dataset. The proposed method is particularly effective for local contextual meanings of a word, for example associating "granite" with countertops in the real estate dataset and with rocky landscapes in a Wikipedia dataset.
30. DeepWay: a Deep Learning Estimator for Unmanned Ground Vehicle Global Path Planning [PDF] 返回目录
Vittorio Mazzia, Francesco Salvetti, Diego Aghi, Marcello Chiaberge
Abstract: Agriculture 3.0 and 4.0 have gradually introduced service robotics and automation into several agricultural processes, mostly improving crop quality and seasonal yield. Row-based crops are the perfect setting to test and deploy smart machines capable of monitoring and managing the harvest. In this context, global path planning is essential for either ground or aerial vehicles, and it is the starting point for every type of mission plan. Nevertheless, little attention has so far been given to this problem by the research community, and global path planning automation is still far from being solved. In order to generate a viable path for an autonomous machine, the presented research proposes a feature-learning, fully convolutional model capable of estimating waypoints given an occupancy grid map. In particular, we apply the proposed data-driven methodology to the specific case of row-based crops, with the general objective of generating a global path able to cover the extension of the crop completely. Extensive experimentation with a custom-made synthetic dataset and real satellite-derived images of different scenarios has proved the effectiveness of our methodology and demonstrated the feasibility of an end-to-end, completely autonomous global path planner.
31. Learning Vision-based Reactive Policies for Obstacle Avoidance [PDF] 返回目录
Elie Aljalbout, Ji Chen, Konstantin Ritt, Maximilian Ulmer, Sami Haddadin
Abstract: In this paper, we address the problem of vision-based obstacle avoidance for robotic manipulators. This topic poses challenges for both perception and motion generation. While most work in the field aims at improving one of those aspects, we provide a unified framework for approaching this problem. The main goal of this framework is to connect perception and motion by identifying the relationship between the visual input and the corresponding motion representation. To this end, we propose a method for learning reactive obstacle avoidance policies. We evaluate our method on goal-reaching tasks for single and multiple obstacles scenarios. We show the ability of the proposed method to efficiently learn stable obstacle avoidance strategies at a high success rate, while maintaining closed-loop responsiveness required for critical applications like human-robot interaction.
32. Automatic Myocardial Infarction Evaluation from Delayed-Enhancement Cardiac MRI using Deep Convolutional Networks [PDF] 返回目录
Kibrom Berihu Girum, Youssef Skandarani, Raabid Hussain, Alexis Bozorg Grayeli, Gilles Créhange, Alain Lalande
Abstract: In this paper, we propose a new deep learning framework for automatic myocardial infarction evaluation from clinical information and delayed-enhancement MRI (DE-MRI). The proposed framework addresses two tasks. The first task is automatic detection of the myocardial contours, the infarcted area, the no-reflow area, and the left ventricular cavity from a short-axis DE-MRI series. It employs two segmentation neural networks. The first network is used to segment the anatomical structures such as the myocardium and left ventricular cavity. The second network is used to segment the pathological areas such as myocardial infarction, myocardial no-reflow, and normal myocardial regions. The segmented myocardium region from the first network is further used to refine the second network's pathological segmentation results. The second task is to automatically classify a given case as normal or pathological from clinical information, with or without DE-MRI. A cascaded support vector machine (SVM) is employed to classify a given case from its associated clinical information. The segmented pathological areas from DE-MRI are also used for the classification task. We evaluated our method on the 2020 EMIDEC MICCAI challenge dataset. It yielded an average Dice index of 0.93 and 0.84, respectively, for the left ventricular cavity and the myocardium. Classification using only clinical information yielded 80% accuracy over five-fold cross-validation. Using the DE-MRI, our method can classify the cases with 93.3% accuracy. These experimental results reveal that the proposed method can automatically evaluate myocardial infarction.
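One plausible reading of the cascade refinement (our assumption, not the paper's stated rule) is to suppress pathology probabilities outside the first network's myocardium mask before taking the final argmax:

```python
import numpy as np

def refine_pathology(pathology_probs: np.ndarray, myocardium_mask: np.ndarray):
    """Sketch: zero out pathology class probabilities outside the predicted
    myocardium, letting background win there. pathology_probs: (C, H, W)
    softmax maps, with channel 0 assumed to be background (an assumption);
    myocardium_mask: (H, W) binary. The paper's exact rule may differ."""
    refined = pathology_probs * myocardium_mask[None, :, :]
    refined[0] = np.maximum(pathology_probs[0], 1.0 - myocardium_mask)
    return refined.argmax(axis=0)   # (H, W) refined label map
```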
33. Fusion-Catalyzed Pruning for Optimizing Deep Learning on Intelligent Edge Devices [PDF] 返回目录
Guangli Li, Xiu Ma, Xueying Wang, Lei Liu, Jingling Xue, Xiaobing Feng
Abstract: The increasing computational cost of deep neural network models limits the applicability of intelligent applications on resource-constrained edge devices. While a number of neural network pruning methods have been proposed to compress the models, prevailing approaches focus only on parametric operators (e.g., convolution), which may miss optimization opportunities. In this paper, we present a novel fusion-catalyzed pruning approach, called FuPruner, which simultaneously optimizes the parametric and non-parametric operators for accelerating neural networks. We introduce an aggressive fusion method to equivalently transform a model, which extends the optimization space of pruning and enables non-parametric operators to be pruned in a similar manner to parametric operators, and a dynamic filter pruning method is applied to decrease the computational cost of models while preserving the accuracy requirement. Moreover, FuPruner provides configurable optimization options for controlling fusion and pruning, allowing much more flexible performance-accuracy trade-offs to be made. Evaluation with state-of-the-art residual neural networks on five representative intelligent edge platforms (Jetson TX2, Jetson Nano, Edge TPU, NCS, and NCS2) demonstrates the effectiveness of our approach, which can accelerate the inference of models on the CIFAR-10 and ImageNet datasets.
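As background for the pruning half, a generic magnitude-based filter-pruning step (FuPruner's dynamic variant is more elaborate): rank a convolution's output filters by the L1 norm of their weights and keep the strongest fraction, producing a thinner layer:

```python
import torch
import torch.nn as nn

def prune_filters_by_l1(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Generic filter-pruning sketch, not FuPruner's exact algorithm.
    Groups/dilation are ignored; downstream layers must be re-shaped to
    accept the reduced channel count."""
    w = conv.weight.data                                   # (out, in, kH, kW)
    scores = w.abs().sum(dim=(1, 2, 3))                    # L1 norm per filter
    keep = scores.topk(max(1, int(keep_ratio * w.size(0)))).indices
    pruned = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = w[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned
```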
34. Bayesian Optimization Meets Laplace Approximation for Robotic Introspection [PDF] 返回目录
Matthias Humt, Jongseok Lee, Rudolph Triebel
Abstract: In robotics, deep learning (DL) methods are used more and more widely, but their general inability to provide reliable confidence estimates will ultimately lead to fragile and unreliable systems. This impedes the potential deployments of DL methods for long-term autonomy. Therefore, in this paper we introduce a scalable Laplace Approximation (LA) technique to make Deep Neural Networks (DNNs) more introspective, i.e. to enable them to provide accurate assessments of their failure probability for unseen test data. In particular, we propose a novel Bayesian Optimization (BO) algorithm to mitigate their tendency of under-fitting the true weight posterior, so that both the calibration and the accuracy of the predictions can be simultaneously optimized. We demonstrate empirically that the proposed BO approach requires fewer iterations for this when compared to random search, and we show that the proposed framework can be scaled up to large datasets and architectures.
摘要:在机器人学中,深度学习(DL)方法的应用越来越广泛,但它们普遍无法提供可靠的置信度估计,最终会导致脆弱且不可靠的系统。这阻碍了DL方法在长期自主系统中的潜在部署。因此,本文引入一种可扩展的拉普拉斯近似(LA)技术,使深度神经网络(DNN)更具内省性,即使其能够针对未见过的测试数据准确评估自身的失败概率。特别地,我们提出了一种新颖的贝叶斯优化(BO)算法,以缓解其对真实权重后验欠拟合的倾向,从而可以同时优化预测的校准性与准确性。实验表明,与随机搜索相比,所提出的BO方法所需的迭代次数更少,并且所提出的框架可以扩展到大型数据集和网络架构。
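As a concrete illustration of the Laplace idea, the sketch below (assumed setting: a last-layer Gaussian likelihood with a diagonal Hessian, in PyTorch) computes a diagonal Laplace posterior variance; the prior precision `prior_prec` is exactly the kind of hyperparameter the proposed BO procedure would tune to trade off calibration against accuracy. This is an illustrative simplification, not the authors' implementation.

```python
import torch

def diag_laplace_posterior_var(features, prior_prec):
    """Diagonal Laplace posterior variance for a linear head with a Gaussian
    likelihood: Hessian = X^T X + prior_prec * I (only the diagonal is kept)."""
    hess_diag = (features ** 2).sum(dim=0)   # diagonal of X^T X
    return 1.0 / (hess_diag + prior_prec)

def predictive_variance(x, post_var):
    # Var[f(x)] = x diag(post_var) x^T for a linear predictor f(x) = w.x
    return (x ** 2 * post_var).sum(dim=-1)

feats = torch.randn(100, 16)                 # penultimate-layer activations
post_var = diag_laplace_posterior_var(feats, prior_prec=1.0)
print(predictive_variance(torch.randn(5, 16), post_var))
```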
35. Classifying Malware Images with Convolutional Neural Network Models [PDF] 返回目录
Ahmed Bensaoud, Nawaf Abudawaood, Jugal Kalita
Abstract: Due to increasing threats from malicious software (malware) in both number and complexity, researchers have developed approaches to automatic detection and classification of malware, instead of manually analyzing malware files in a time-consuming effort. At the same time, malware authors have developed techniques to evade the signature-based detection used by antivirus companies. Most recently, deep learning has been applied to malware classification to address this issue. In this paper, we use several convolutional neural network (CNN) models for static malware classification. In particular, we use six deep learning models, three of which are past winners of the ImageNet Large-Scale Visual Recognition Challenge. The other three models are CNN-SVM, GRU-SVM and MLP-SVM, which enhance neural models with support vector machines (SVM). We perform experiments using the Malimg dataset, which contains malware images converted from Portable Executable malware binaries and is divided into 25 malware families. Comparisons show that the Inception V3 model achieves a test accuracy of 99.24%, better than the 98.52% accuracy achieved by the current state-of-the-art system, the M-CNN model.
摘要:由于恶意软件在数量和复杂性上的威胁日益增加,研究人员开发了自动检测和分类恶意软件的方法,以取代耗时的人工分析。与此同时,恶意软件作者也发展出了规避防病毒公司所用基于签名检测技术的手段。最近,深度学习被用于恶意软件分类以解决这一问题。本文使用多种卷积神经网络(CNN)模型进行静态恶意软件分类。特别地,我们使用了六种深度学习模型,其中三种是ImageNet大规模视觉识别挑战赛的历届冠军模型,另外三种是CNN-SVM、GRU-SVM和MLP-SVM,即用支持向量机(SVM)增强的神经模型。我们在Malimg数据集上进行实验,该数据集包含由PE(Portable Executable)恶意软件二进制文件转换而来的恶意软件图像,共分为25个恶意软件家族。对比结果表明,Inception V3模型达到了99.24%的测试准确率,优于当前最先进系统(M-CNN模型)所达到的98.52%。
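A generic sketch of the CNN-plus-SVM pattern the paper's CNN-SVM variant follows: freeze a pretrained CNN, use its penultimate activations as features, and fit a linear SVM on top. The backbone, input size, and data here are placeholder assumptions, not the paper's setup.

```python
# Hedged sketch of a CNN-feature + SVM pipeline; the paper's exact
# architectures and training setup differ. Inputs are stand-in tensors.
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()      # expose 512-d penultimate features
backbone.eval()

@torch.no_grad()
def extract(images):                   # images: (N, 3, 224, 224) float tensor
    return backbone(images).numpy()

images = torch.randn(32, 3, 224, 224)  # stand-in for Malimg malware images
labels = torch.randint(0, 25, (32,)).numpy()  # 25 malware families

clf = LinearSVC().fit(extract(images), labels)
print(clf.predict(extract(images[:4])))
```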
36. CT-CAPS: Feature Extraction-based Automated Framework for COVID-19 Disease Identification from Chest CT Scans using Capsule Networks [PDF] 返回目录
Shahin Heidarian, Parnian Afshar, Arash Mohammadi, Moezedin Javad Rafiee, Anastasia Oikonomou, Konstantinos N. Plataniotis, Farnoosh Naderkhani
Abstract: The global outbreak of the novel coronavirus (COVID-19) disease has drastically impacted the world and led to one of the most challenging crises across the globe since World War II. The early diagnosis and isolation of COVID-19 positive cases are considered crucial steps towards preventing the spread of the disease and flattening the epidemic curve. Chest Computed Tomography (CT) is a highly sensitive, rapid, and accurate diagnostic technique that can complement the Reverse Transcription Polymerase Chain Reaction (RT-PCR) test. Recently, deep learning-based models, mostly based on Convolutional Neural Networks (CNNs), have shown promising diagnostic results. CNNs, however, are incapable of capturing spatial relations between image instances and require large datasets. Capsule Networks, on the other hand, can capture spatial relations, require smaller datasets, and have considerably fewer parameters. In this paper, a Capsule Network framework, referred to as "CT-CAPS", is presented to automatically extract distinctive features of chest CT scans. These features, which are extracted from the layer before the final capsule layer, are then leveraged to differentiate COVID-19 from non-COVID cases. Experiments on our in-house dataset of 307 patients show state-of-the-art performance with an accuracy of 90.8%, a sensitivity of 94.5%, and a specificity of 86.0%.
摘要:新型冠状病毒(COVID-19)疾病的全球爆发对世界造成了巨大冲击,并导致了自第二次世界大战以来全球最具挑战性的危机之一。COVID-19阳性病例的早期诊断与隔离被认为是防止疾病传播、拉平疫情曲线的关键步骤。胸部计算机断层扫描(CT)是一种高灵敏度、快速且准确的诊断技术,可作为逆转录聚合酶链反应(RT-PCR)检测的补充。最近,基于深度学习的模型(主要基于卷积神经网络,CNN)已展现出有前景的诊断结果。然而,CNN无法捕获图像实例之间的空间关系,且需要大型数据集。另一方面,胶囊网络能够捕获空间关系,所需数据集更小,参数也少得多。本文提出了一种称为"CT-CAPS"的胶囊网络框架,用于自动提取胸部CT扫描的判别性特征。这些特征提取自最终胶囊层之前的一层,随后被用于区分COVID-19与非COVID病例。在我们包含307名患者的内部数据集上的实验表明,该方法达到了最先进的性能:准确率90.8%,灵敏度94.5%,特异性86.0%。
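The capsule "squash" nonlinearity and the feature-extraction step described above can be sketched as follows; the full CT-CAPS architecture (convolutional stem, multiple capsule layers) is not reproduced, and the tensor shapes are illustrative.

```python
# Sketch of the standard capsule "squash" nonlinearity and of using the
# layer before the final capsule layer as a feature vector, as CT-CAPS does.
import torch

def squash(s, dim=-1, eps=1e-8):
    """v = (|s|^2 / (1 + |s|^2)) * s / |s| -- keeps direction, bounds norm."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

caps = squash(torch.randn(4, 32, 8))   # batch of 4, 32 capsules of dim 8
features = caps.flatten(1)             # (4, 256): features that would feed
print(features.shape)                  # a lightweight COVID/non-COVID head
```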
37. COVID-FACT: A Fully-Automated Capsule Network-based Framework for Identification of COVID-19 Cases from Chest CT scans [PDF] 返回目录
Shahin Heidarian, Parnian Afshar, Nastaran Enshaei, Farnoosh Naderkhani, Anastasia Oikonomou, S. Farokh Atashzar, Faranak Babaki Fard, Kaveh Samimi, Konstantinos N. Plataniotis, Arash Mohammadi, Moezedin Javad Rafiee
Abstract: The newly discovered Coronavirus Disease 2019 (COVID-19) has been spreading globally and has caused hundreds of thousands of deaths around the world since its first emergence in late 2019. Computed tomography (CT) scans have shown distinctive features and higher sensitivity compared to other diagnostic tests, in particular the current gold standard, i.e., the Reverse Transcription Polymerase Chain Reaction (RT-PCR) test. Current deep learning-based algorithms are mainly developed based on Convolutional Neural Networks (CNNs) to identify COVID-19 pneumonia cases. CNNs, however, require extensive data augmentation and large datasets to identify detailed spatial relations between image instances. Furthermore, existing algorithms utilizing CT scans either extend slice-level predictions to patient-level ones using a simple thresholding mechanism or rely on a sophisticated infection segmentation to identify the disease. In this paper, we propose a two-stage fully-automated CT-based framework for identification of COVID-19 positive cases, referred to as "COVID-FACT". COVID-FACT utilizes Capsule Networks as its main building blocks and is, therefore, capable of capturing spatial information. In particular, to make the proposed COVID-FACT independent of sophisticated segmentation of the area of infection, slices demonstrating infection are detected in the first stage, and the second stage is responsible for classifying patients into COVID and non-COVID cases. COVID-FACT detects slices with infection and identifies positive COVID-19 cases using an in-house CT scan dataset containing COVID-19, community-acquired pneumonia, and normal cases. Based on our experiments, COVID-FACT achieves an accuracy of 90.82%, a sensitivity of 94.55%, a specificity of 86.04%, and an Area Under the Curve (AUC) of 0.98, while depending on far less supervision and annotation than its counterparts.
摘要:自2019年底首次出现以来,新发现的新型冠状病毒疾病2019(COVID-19)已在全球蔓延并导致数十万人死亡。与其他诊断检测手段(特别是当前的金标准,即逆转录聚合酶链反应(RT-PCR)检测)相比,计算机断层扫描(CT)显示出独特的影像特征和更高的灵敏度。当前基于深度学习的算法主要基于卷积神经网络(CNN)来识别COVID-19肺炎病例。然而,CNN需要大量的数据增强和大型数据集来识别图像实例之间的细致空间关系。此外,现有利用CT扫描的算法,要么通过简单的阈值机制将切片级预测扩展为患者级预测,要么依赖复杂的感染区域分割来识别疾病。本文提出了一种两阶段全自动的基于CT的框架,用于识别COVID-19阳性病例,称为"COVID-FACT"。COVID-FACT以胶囊网络作为主要构建模块,因此能够捕获空间信息。特别地,为了使COVID-FACT不依赖于对感染区域的复杂分割,第一阶段检测出显示感染的切片,第二阶段负责将患者分类为COVID与非COVID病例。COVID-FACT检测含有感染的切片,并在一个包含COVID-19、社区获得性肺炎和正常病例的内部CT扫描数据集上识别COVID-19阳性病例。基于我们的实验,COVID-FACT达到了90.82%的准确率、94.55%的灵敏度、86.04%的特异性以及0.98的曲线下面积(AUC),同时与同类方法相比所需的监督和标注要少得多。
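The two-stage decision logic can be sketched as below, with `stage_one` and `stage_two` standing in for the trained capsule networks; the slice threshold and the mean aggregation are assumptions for illustration, not the paper's exact rules.

```python
# Hedged sketch of a two-stage slice-then-patient pipeline in the spirit of
# COVID-FACT. `stage_one` and `stage_two` are hypothetical trained models.
import torch

def classify_patient(ct_volume, stage_one, stage_two, slice_thresh=0.5):
    """ct_volume: (num_slices, 1, H, W) tensor of one patient's CT scan."""
    with torch.no_grad():
        infection_prob = stage_one(ct_volume).squeeze(-1)   # per-slice score
        candidates = ct_volume[infection_prob > slice_thresh]
        if len(candidates) == 0:
            return "non-COVID"                              # nothing flagged
        covid_prob = stage_two(candidates).mean()           # aggregate slices
    return "COVID" if covid_prob > 0.5 else "non-COVID"
```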
38. FLANNEL: Focal Loss Based Neural Network Ensemble for COVID-19 Detection [PDF] 返回目录
Zhi Qiao, Austin Bae, Lucas M. Glass, Cao Xiao, Jimeng Sun
Abstract: We test the possibility of differentiating chest X-ray images of COVID-19 patients from those of patients with other pneumonias and healthy individuals using deep neural networks. We construct the X-ray imaging data from two publicly available sources, which include 5508 chest X-ray images across 2874 patients with four classes: normal, bacterial pneumonia, non-COVID-19 viral pneumonia, and COVID-19. To identify COVID-19, we propose a Focal Loss Based Neural Ensemble Network (FLANNEL), a flexible module that ensembles several convolutional neural network (CNN) models and fuses them with a focal loss for accurate COVID-19 detection on class-imbalanced data. FLANNEL consistently outperforms baseline models on the COVID-19 identification task in all metrics. Compared with the best baseline, FLANNEL shows a higher macro-F1 score with a 6% relative increase on the COVID-19 identification task, where it achieves 0.7833 (0.07) in precision, 0.8609 (0.03) in recall, and 0.8168 (0.03) in F1 score.
摘要:本文旨在检验使用深度神经网络从胸部X射线图像中区分COVID-19与其他肺炎及健康个体的可能性。我们从两个公开来源构建了X射线影像数据,其中包括来自2874名患者的5508张胸部X射线图像,共四个类别:正常、细菌性肺炎、非COVID-19病毒性肺炎和COVID-19。为了识别COVID-19,我们提出了一种基于焦点损失(focal loss)的神经网络集成模型(FLANNEL),这是一个灵活的模块,用于集成多个卷积神经网络(CNN)模型,并结合焦点损失以在类别不平衡数据上实现准确的COVID-19检测。FLANNEL在所有指标上均持续优于COVID-19识别任务的基线模型。与最佳基线相比,FLANNEL在COVID-19识别任务上的宏平均F1分数相对提高了6%,达到精确率0.7833(0.07)、召回率0.8609(0.03)和F1分数0.8168(0.03)。
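The focal loss that FLANNEL fuses into its ensemble is standard (Lin et al., 2017) and can be written compactly; the ensemble-weighting module itself is not shown here.

```python
# Standard multi-class focal loss, the building block FLANNEL's name refers
# to; FLANNEL's learner-weighting and fusion logic are not reproduced.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """logits: (N, C); targets: (N,) int64 class indices."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()

logits = torch.randn(8, 4)             # 4 classes: normal, bacterial,
targets = torch.randint(0, 4, (8,))    # viral (non-COVID), COVID-19
print(focal_loss(logits, targets))
```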
39. PIINET: A 360-degree Panoramic Image Inpainting Network Using a Cube Map [PDF] 返回目录
Seo Woo Han, Doug Young Suh
Abstract: Inpainting has been continuously studied in the field of computer vision. As artificial intelligence technology developed, deep learning was introduced into inpainting research, helping to improve performance. Currently, the input target of deep learning-based inpainting algorithms has progressed from a single image to video. However, deep learning-based inpainting for panoramic images has not been actively studied. We propose a 360-degree panoramic image inpainting method using generative adversarial networks (GANs). The proposed network takes a 360-degree equirectangular-format panoramic image as input, converts it into a cube map format, which has relatively little distortion, and uses it for training. Since the cube map format is used, the correlation of the six faces of the cube map should be considered. Therefore, all faces of the cube map are used as input for the whole discriminative network, and each face of the cube map is used as input for the slice discriminative network to determine the authenticity of the generated image. The proposed network performed qualitatively better than existing single-image inpainting algorithms and baseline algorithms.
摘要:图像修复(inpainting)在计算机视觉领域一直受到持续研究。随着人工智能技术的发展,深度学习技术被引入图像修复研究,有助于提升性能。目前,基于深度学习的图像修复算法的输入对象已从单张图像扩展到视频。然而,针对全景图像的基于深度学习的修复技术尚未得到充分研究。我们提出了一种使用生成对抗网络(GAN)的360度全景图像修复方法。所提出的网络以360度等距柱状投影(equirectangular)格式的全景图像为输入,将其转换为畸变相对较小的立方体贴图(cube map)格式用于训练。由于使用了立方体贴图格式,需要考虑立方体贴图六个面之间的相关性。因此,立方体贴图的所有面被用作整体判别网络的输入,而每个面被用作切片判别网络的输入,以判断生成图像的真实性。所提出的网络在定性上优于现有的单图像修复算法和基线算法。
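The equirectangular-to-cube-map conversion the method depends on reduces to casting a ray per face pixel and sampling the panorama at the ray's latitude and longitude. A minimal NumPy sketch for a single (front) face with nearest-neighbor sampling, as an illustration rather than a production converter:

```python
# Minimal equirectangular-to-cubemap sketch for the +z (front) face only;
# real converters handle all six faces and interpolate between pixels.
import numpy as np

def front_face(equi, face_size):
    """equi: (H, W, 3) equirectangular image -> (face_size, face_size, 3)."""
    h, w, _ = equi.shape
    # Rays through the +z face: x, y in [-1, 1], z = 1.
    u = np.linspace(-1, 1, face_size)
    x, y = np.meshgrid(u, u)
    z = np.ones_like(x)
    lon = np.arctan2(x, z)                        # [-pi, pi]
    lat = np.arctan2(y, np.sqrt(x**2 + z**2))     # [-pi/2, pi/2]
    # Map angles to equirectangular pixel coordinates.
    px = ((lon / np.pi + 1) / 2 * (w - 1)).astype(int)
    py = ((lat / (np.pi / 2) + 1) / 2 * (h - 1)).astype(int)
    return equi[py, px]

print(front_face(np.zeros((512, 1024, 3)), 256).shape)  # (256, 256, 3)
```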
40. AutoAtlas: Neural Network for 3D Unsupervised Partitioning and Representation Learning [PDF] 返回目录
K. Aditya Mohan, Alan D. Kaplan
Abstract: We present a novel neural network architecture called AutoAtlas for fully unsupervised partitioning and representation learning of 3D brain Magnetic Resonance Imaging (MRI) volumes. AutoAtlas consists of two neural network components: one that performs multi-label partitioning based on local texture in the volume and a second that compresses the information contained within each partition. We train both of these components simultaneously by optimizing a loss function that is designed to promote accurate reconstruction of each partition, while encouraging spatially smooth and contiguous partitioning, and discouraging relatively small partitions. We show that the partitions adapt to the subject specific structural variations of brain tissue while consistently appearing at similar spatial locations across subjects. AutoAtlas also produces very low dimensional features that represent local texture of each partition. We demonstrate prediction of metadata associated with each subject using the derived feature representations and compare the results to prediction using features derived from FreeSurfer anatomical parcellation. Since our features are intrinsically linked to distinct partitions, we can then map values of interest, such as partition-specific feature importance scores onto the brain for visualization.
摘要:我们提出了一种称为AutoAtlas的新颖神经网络架构,用于对3D脑部磁共振成像(MRI)体数据进行完全无监督的分区与表征学习。AutoAtlas由两个神经网络组件组成:一个基于体数据中的局部纹理执行多标签分区,另一个压缩每个分区内包含的信息。我们通过优化一个损失函数来同时训练这两个组件,该损失函数旨在促进每个分区的准确重建,同时鼓励空间上平滑且连续的分区,并抑制过小的分区。我们表明,这些分区能够适应不同受试者脑组织的个体化结构差异,同时在不同受试者间始终出现在相似的空间位置。AutoAtlas还生成表示每个分区局部纹理的极低维特征。我们演示了使用所得特征表示来预测与每位受试者相关联的元数据,并将结果与使用FreeSurfer解剖分区所得特征的预测结果进行比较。由于我们的特征与各个分区有内在关联,我们可以将感兴趣的数值(例如分区特定的特征重要性分数)映射到大脑上进行可视化。
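A hedged sketch of how the three stated objectives (accurate per-partition reconstruction, spatially smooth partitions, no tiny partitions) could be combined into one loss; the paper's exact terms and weights are not reproduced, so treat every term below as an assumption about the general structure.

```python
# Illustrative composite loss in the spirit of AutoAtlas, not the authors'
# formulation. recons: (K, D, H, W) per-partition reconstructions;
# probs: (K, D, H, W) softmax partition memberships; volume: (D, H, W).
import torch
import torch.nn.functional as F

def autoatlas_style_loss(recons, probs, volume, w_smooth=0.1, w_size=0.01):
    # 1) Reconstruct the volume as a membership-weighted sum of partitions.
    recon = (probs * recons).sum(dim=0)
    rec_loss = F.mse_loss(recon, volume)
    # 2) Encourage spatially smooth labels: penalize membership gradients.
    smooth = (probs[:, 1:] - probs[:, :-1]).abs().mean() \
           + (probs[:, :, 1:] - probs[:, :, :-1]).abs().mean() \
           + (probs[:, :, :, 1:] - probs[:, :, :, :-1]).abs().mean()
    # 3) Discourage near-empty partitions: penalize tiny average occupancy.
    occupancy = probs.mean(dim=(1, 2, 3))               # (K,)
    size_pen = (1.0 / (occupancy + 1e-4)).mean()
    return rec_loss + w_smooth * smooth + w_size * size_pen
```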
41. Human versus Machine Attention in Deep Reinforcement Learning Tasks [PDF] 返回目录
Ruohan Zhang, Bo Liu, Yifeng Zhu, Sihang Guo, Mary Hayhoe, Dana Ballard, Peter Stone
Abstract: Deep reinforcement learning (RL) algorithms are powerful tools for solving visuomotor decision tasks. However, the trained models are often difficult to interpret, because they are represented as end-to-end deep neural networks. In this paper, we shed light on the inner workings of such trained models by analyzing the pixels that they attend to during task execution, and comparing them with the pixels attended to by humans executing the same tasks. To this end, we investigate the following two questions that, to the best of our knowledge, have not been previously studied. 1) How similar are the visual features learned by RL agents and humans when performing the same task? and, 2) How do similarities and differences in these learned features correlate with RL agents' performance on these tasks? Specifically, we compare the saliency maps of RL agents against visual attention models of human experts when learning to play Atari games. Further, we analyze how hyperparameters of the deep RL algorithm affect the learned features and saliency maps of the trained agents. The insights provided by our results have the potential to inform novel algorithms for the purpose of closing the performance gap between human experts and deep RL agents.
摘要:深度强化学习(RL)算法是解决视觉运动决策任务的强大工具。然而,训练得到的模型往往难以解释,因为它们被表示为端到端的深度神经网络。在本文中,我们通过分析模型在任务执行期间关注的像素,并将其与执行相同任务的人类所关注的像素进行比较,来揭示此类已训练模型的内部工作机制。为此,我们研究了以下两个据我们所知此前未被研究过的问题:1)在执行相同任务时,RL智能体与人类学习到的视觉特征有多相似?2)这些学习到的特征的异同与RL智能体在这些任务上的表现有何关联?具体而言,我们在学习玩Atari游戏的场景下,将RL智能体的显著性图与人类专家的视觉注意力模型进行比较。此外,我们分析了深度RL算法的超参数如何影响已训练智能体的学习特征和显著性图。我们的结果所提供的见解有望启发新的算法,以缩小人类专家与深度RL智能体之间的性能差距。
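One simple way to quantify the agent-versus-human comparison described above is the Pearson correlation between the two attention maps; correlation is a common saliency-comparison metric and is shown here only as an illustration, not necessarily the paper's exact metric.

```python
# Pearson correlation between an agent saliency map and a human gaze map,
# both given as nonnegative (H, W) intensity arrays.
import numpy as np

def saliency_correlation(agent_map, human_map):
    a = agent_map.ravel().astype(float)
    h = human_map.ravel().astype(float)
    a = (a - a.mean()) / (a.std() + 1e-8)
    h = (h - h.mean()) / (h.std() + 1e-8)
    return float((a * h).mean())           # Pearson r in [-1, 1]

print(saliency_correlation(np.random.rand(84, 84), np.random.rand(84, 84)))
```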
42. Multi-agent Trajectory Prediction with Fuzzy Query Attention [PDF] 返回目录
Nitin Kamra, Hao Zhu, Dweep Trivedi, Ming Zhang, Yan Liu
Abstract: Trajectory prediction for scenes with multiple agents and entities is a challenging problem in numerous domains such as traffic prediction, pedestrian tracking and path planning. We present a general architecture to address this challenge which models the crucial inductive biases of motion, namely, inertia, relative motion, intents and interactions. Specifically, we propose a relational model to flexibly model interactions between agents in diverse environments. Since it is well-known that human decision making is fuzzy by nature, at the core of our model lies a novel attention mechanism which models interactions by making continuous-valued (fuzzy) decisions and learning the corresponding responses. Our architecture demonstrates significant performance gains over existing state-of-the-art predictive models in diverse domains such as human crowd trajectories, US freeway traffic, NBA sports data and physics datasets. We also present ablations and augmentations to understand the decision-making process and the source of gains in our model.
摘要:对包含多个智能体和实体的场景进行轨迹预测,是交通预测、行人跟踪和路径规划等众多领域中的一个具有挑战性的问题。我们提出了一种通用架构来应对这一挑战,该架构对运动的关键归纳偏置进行建模,即惯性、相对运动、意图和交互。具体而言,我们提出了一个关系模型,以灵活地建模不同环境中智能体之间的交互。由于众所周知人类决策本质上是模糊的,我们模型的核心是一种新颖的注意力机制,它通过做出连续值(模糊)决策并学习相应的响应来建模交互。在人群轨迹、美国高速公路交通、NBA体育数据和物理数据集等多个领域,我们的架构相比现有最先进的预测模型展示了显著的性能提升。我们还通过消融与增强实验来理解模型的决策过程及性能提升的来源。
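The continuous-valued (fuzzy) decision idea can be sketched as a gate that blends a learned "yes" response with a learned "no" response per agent pair; the dimensions and projections below are simplified assumptions rather than the paper's full module.

```python
# Hedged sketch of fuzzy query attention: sigmoid query-key decisions in
# (0, 1) interpolate between two learned responses. Not the paper's code.
import torch
import torch.nn as nn

class FuzzyQueryAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.yes = nn.Linear(dim, dim)   # response if the decision is "yes"
        self.no = nn.Linear(dim, dim)    # response if the decision is "no"

    def forward(self, x_i, x_j):
        """x_i: (N, dim) receiving agents; x_j: (N, dim) sending agents."""
        d = torch.sigmoid((self.q(x_i) * self.k(x_j)).sum(-1, keepdim=True))
        return d * self.yes(x_j) + (1.0 - d) * self.no(x_j)

fqa = FuzzyQueryAttention(16)
print(fqa(torch.randn(5, 16), torch.randn(5, 16)).shape)  # (5, 16)
```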
43. Ink Marker Segmentation in Histopathology Images Using Deep Learning [PDF] 返回目录
Danial Maleki, Mehdi Afshari, Morteza Babaie, H.R. Tizhoosh
Abstract: Due to the recent advancements in machine vision, digital pathology has gained significant attention. Histopathology images are distinctly rich in visual information. The tissue glass slide images are utilized for disease diagnosis. Researchers study many methods to process histopathology images and facilitate fast and reliable diagnosis; therefore, the availability of high-quality slides becomes paramount. The quality of the images can be negatively affected when the glass slides are ink-marked by pathologists to delineate regions of interest. As an example, in one of the largest public histopathology datasets, The Cancer Genome Atlas (TCGA), approximately $12\%$ of the digitized slides are affected by manual delineations through ink markings. To process these open-access slide images and other repositories for the design and validation of new methods, an algorithm to detect the marked regions of the images is essential to avoid confusing tissue pixels with ink-colored pixels in computer methods. In this study, we propose to segment the ink-marked areas of pathology patches through a deep network. A dataset of $4,305$ patches from $79$ whole slide images was created and different networks were trained. Finally, the results showed that an FPN model with EfficientNet-B3 as the backbone was the superior configuration, with an F1 score of $94.53\%$.
摘要:由于机器视觉近年来的进步,数字病理学受到了广泛关注。组织病理学图像蕴含着极为丰富的视觉信息。组织玻片图像被用于疾病诊断。研究人员研究了许多处理组织病理学图像的方法,以实现快速可靠的诊断;因此,高质量玻片的可获得性变得至关重要。当病理学家用墨水在玻片上做标记以勾画感兴趣区域时,图像质量可能会受到负面影响。例如,在最大的公共组织病理学数据集之一,即癌症基因组图谱(TCGA)中,约12%的数字化玻片受到墨水标记的人工勾画的影响。为了处理这些开放获取的玻片图像及其他数据库以设计和验证新方法,检测图像中被标记区域的算法必不可少,以避免计算机方法将组织像素与墨水颜色的像素相混淆。在这项研究中,我们提出通过深度网络对病理图像块中的墨水标记区域进行分割。我们构建了一个数据集,包含来自79张全玻片图像的4,305个图像块,并训练了不同的网络。最终结果显示,以EfficientNet-B3为骨干网络的FPN模型是最优配置,F1分数达到94.53%。
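One way to reproduce the reported best configuration, an FPN decoder over an EfficientNet-B3 encoder, is via the segmentation_models_pytorch package (assuming that package is acceptable; the authors' loss, augmentation, and training details are not shown):

```python
# Sketch of the reported FPN + EfficientNet-B3 configuration using the
# segmentation_models_pytorch package; training specifics are assumptions.
import torch
import segmentation_models_pytorch as smp

model = smp.FPN(
    encoder_name="efficientnet-b3",   # backbone reported as best in the paper
    encoder_weights="imagenet",       # ImageNet-pretrained encoder
    in_channels=3,                    # RGB pathology patches
    classes=1,                        # binary mask: ink marker vs. rest
)

patch = torch.randn(1, 3, 512, 512)   # one pathology patch (size assumed)
mask = torch.sigmoid(model(patch))    # per-pixel ink-marker probability
print(mask.shape)                     # torch.Size([1, 1, 512, 512])
```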
注:中文为机器翻译结果!封面为论文标题词云图!