Abstracts

1. 4D Visualization of Dynamic Events from Unconstrained Multi-View Videos
Aayush Bansal, Minh Vo, Yaser Sheikh, Deva Ramanan, Srinivasa Narasimhan
Abstract: We present a data-driven approach for 4D space-time visualization of dynamic events from videos captured by multiple hand-held cameras. Key to our approach is the use of self-supervised neural networks specific to the scene to compose static and dynamic aspects of an event. Though captured from discrete viewpoints, this model enables us to move around the space-time of the event continuously. This model allows us to create virtual cameras that facilitate: (1) freezing the time and exploring views; (2) freezing a view and moving through time; and (3) simultaneously changing both time and view. We can also edit the videos and reveal objects occluded in a given view if they are visible in any of the other views. We validate our approach on challenging in-the-wild events captured using up to 15 mobile cameras.

2. Improve bone age assessment by learning from anatomical local regions
Dong Wang, Kexin Zhang, Jia Ding, Liwei Wang
Abstract: Skeletal bone age assessment (BAA), an essential imaging examination, aims to evaluate the biological and structural maturation of human bones. In clinical practice, the Tanner and Whitehouse (TW2) method is widely used by radiologists to perform BAA. The TW2 method splits the hand into regions of interest (ROIs) and analyzes each anatomical ROI separately to estimate the bone age. Because it analyzes local information, the TW2 method gives accurate results in practice. Following the spirit of TW2, we propose a novel model called Anatomical Local-Aware Network (ALA-Net) for automatic bone age assessment. In ALA-Net, an anatomical local extraction module is introduced to learn the hand structure and extract local information. Moreover, we design an anatomical patch training strategy to provide extra regularization during the training process. Our model can detect the anatomical ROIs and estimate bone age jointly in an end-to-end manner. The experimental results show that ALA-Net achieves a new state-of-the-art single-model performance of 3.91 mean absolute error (MAE) on the publicly available RSNA dataset. Since the design of our model is consistent with the well-recognized TW2 method, it is interpretable and reliable for clinical usage.
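The metric reported above, mean absolute error (MAE), is the average absolute difference between predicted and reference values. A minimal sketch in Python follows; the function name and the sample numbers are illustrative assumptions and are not taken from the paper.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between two equal-length sequences."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(predicted)


# Hypothetical bone-age predictions and reference values (made-up numbers).
predicted = [120.0, 96.0, 150.0]
reference = [124.0, 90.0, 150.0]
print(mean_absolute_error(predicted, reference))
```

A lower value means the predictions sit closer to the reference ages on average; the metric does not distinguish over-estimates from under-estimates.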

3. Center3D: Center-based Monocular 3D Object Detection with Joint Depth Understanding
Yunlei Tang, Sebastian Dorn, Chiragkumar Savani
Abstract: Localizing objects in 3D space and understanding their associated 3D properties is challenging given only monocular RGB images. The situation is compounded by the loss of depth information during perspective projection. We present Center3D, a one-stage anchor-free approach, to efficiently estimate 3D location and depth using only monocular RGB images. By exploiting the difference between 2D and 3D centers, we are able to estimate depth consistently. Center3D uses a combination of classification and regression to understand the hidden depth information more robustly than either method alone. Our method employs two joint approaches: (1) LID, a classification-dominated approach with sequential Linear Increasing Discretization; and (2) DepJoint, a regression-dominated approach with multiple Eigen's transformations for depth estimation. Evaluated on the KITTI dataset for moderate objects, Center3D improved the AP in BEV from $29.7\%$ to $42.8\%$, and the AP in 3D from $18.6\%$ to $39.1\%$. Compared with state-of-the-art detectors, Center3D achieves the best speed-accuracy trade-off in real-time monocular object detection.
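To make the idea of a linear-increasing discretization concrete, the sketch below builds depth bins whose widths grow linearly with the bin index, so nearby depths get finer bins than distant ones. This follows one commonly used formulation; the abstract does not give the paper's exact definition, and the function names and the parameters `d_min`, `d_max`, and `num_bins` are assumptions made for illustration.

```python
def lid_bin_edges(d_min, d_max, num_bins):
    """Return num_bins + 1 depth edges whose bin widths grow linearly."""
    if num_bins < 1 or d_max <= d_min:
        raise ValueError("need num_bins >= 1 and d_max > d_min")
    # Edge i sits at d_min + scale * i * (i + 1), so the width of bin i
    # is scale * 2 * (i + 1), which increases linearly with i.
    scale = (d_max - d_min) / (num_bins * (num_bins + 1))
    return [d_min + scale * i * (i + 1) for i in range(num_bins + 1)]


def depth_to_bin(depth, edges):
    """Index of the bin containing depth, clamped to the valid range."""
    for i in range(len(edges) - 1):
        if depth < edges[i + 1]:
            return i
    return len(edges) - 2


edges = lid_bin_edges(0.0, 60.0, 5)
print(edges)
print(depth_to_bin(10.0, edges))
```

In a classification-plus-regression scheme of the kind the abstract describes, the bin index would be the classification target and a residual within the bin could be regressed; that split is an assumption here, not a detail confirmed by the abstract.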

4. AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing Label Features from Multi-Modal Embeddings
Pratik Mazumder, Pravendra Singh, Kranti Kumar Parida, Vinay P. Namboodiri
Abstract: In this paper, we address the problem of generalized zero-shot learning in a multi-modal setting, where novel classes of audio/video appear during testing that were not seen during training. We demonstrate that projecting the audio and video embeddings to the class label text feature space allows us to use the semantic relatedness of text embeddings as a means for zero-shot learning. Importantly, our multi-modal zero-shot learning approach works even if a modality is missing at test time. Our approach makes use of a cross-modal decoder, which enforces the constraint that the class label text features can be reconstructed from the audio and video embeddings of data points in order to perform better on the multi-modal zero-shot learning task. We further minimize the gap between the audio and video embedding distributions using a KL-divergence loss. We test our approach on zero-shot classification and retrieval tasks, and it performs better than other models in the presence of a single modality as well as in the presence of multiple modalities.
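The KL-divergence term mentioned above measures how far one distribution is from another. The abstract does not say how the audio and video embeddings are turned into distributions, so the sketch below assumes, purely for illustration, that each embedding is normalized with a softmax; the function names and the sample vectors are hypothetical.

```python
import math


def softmax(logits):
    """Normalize a list of real numbers into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical audio and video embeddings for the same data point.
audio = softmax([2.0, 0.0, -1.0])
video = softmax([1.5, 0.5, -1.0])
print(kl_divergence(audio, video))
```

The divergence is zero when the two distributions coincide and positive otherwise, and it is not symmetric in its arguments.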

5. Weakly Supervised Vessel Segmentation in X-ray Angiograms by Self-Paced Learning from Noisy Labels with Suggestive Annotation
Jingyang Zhang, Guotai Wang, Hongzhi Xie, Shuyang Zhang, Ning Huang, Shaoting Zhang, Lixu Gu
Abstract: The segmentation of coronary arteries in X-ray angiograms by convolutional neural networks (CNNs) is promising yet limited by the requirement of precisely annotating all pixels in a large number of training images, which is extremely labor-intensive, especially for complex coronary trees. To alleviate the burden on the annotator, we propose a novel weakly supervised training framework that learns from noisy pseudo labels generated by automatic vessel enhancement, rather than accurate labels obtained by fully manual annotation. A typical self-paced learning scheme makes the training process robust against label noise but is challenged by the systematic biases in pseudo labels, which leads to decreased performance of CNNs at test time. To solve this problem, we propose an annotation-refining self-paced learning framework (AR-SPL) to correct the potential errors using suggestive annotation. An elaborate model-vesselness uncertainty estimation is also proposed to minimize the annotation cost of suggestive annotation, based not only on the CNNs in training but also on the geometric features of coronary arteries derived directly from raw data. Experiments show that our proposed framework achieves 1) accuracy comparable to fully supervised learning, which also significantly outperforms other weakly supervised learning frameworks; 2) a largely reduced annotation cost, i.e., 75.18% of annotation time is saved, and only 3.46% of image regions need to be annotated; and 3) an efficient intervention process, leading to superior performance with even fewer manual interactions.

6. GSTO: Gated Scale-Transfer Operation for Multi-Scale Feature Learning in Pixel Labeling
Zhuoying Wang, Yongtao Wang, Zhi Tang, Yangyan Li, Ying Chen, Haibin Ling, Weisi Lin
Abstract: Existing CNN-based methods for pixel labeling depend heavily on multi-scale features to meet the requirements of both semantic comprehension and detail preservation. State-of-the-art pixel labeling neural networks widely exploit conventional scale-transfer operations, i.e., up-sampling and down-sampling, to learn multi-scale features. In this work, we find that these operations lead to scale-confused features and suboptimal performance because they are spatially invariant and directly transmit all feature information across scales without spatial selection. To address this issue, we propose the Gated Scale-Transfer Operation (GSTO) to properly transmit spatially filtered features to another scale. Specifically, GSTO can work either with or without extra supervision. Unsupervised GSTO is learned from the feature itself, while the supervised one is guided by the supervised probability matrix. Both forms of GSTO are lightweight and plug-and-play, and can be flexibly integrated into networks or modules for learning better multi-scale features. In particular, by plugging GSTO into HRNet, we get a more powerful backbone (namely GSTO-HRNet) for pixel labeling, and it achieves new state-of-the-art results on the COCO benchmark for human pose estimation and other benchmarks for semantic segmentation including Cityscapes, LIP and Pascal Context, with negligible extra computational cost. Moreover, experimental results demonstrate that GSTO can also significantly boost the performance of multi-scale feature aggregation modules like PPM and ASPP. Code will be made available at this https URL.

7. NDD20: A large-scale few-shot dolphin dataset for coarse and fine-grained categorisation
Cameron Trotter, Georgia Atkinson, Matt Sharpe, Kirsten Richardson, A. Stephen McGough, Nick Wright, Ben Burville, Per Berggren
Abstract: We introduce the Northumberland Dolphin Dataset 2020 (NDD20), a challenging image dataset annotated for both coarse and fine-grained instance segmentation and categorisation. This dataset, the first release of the NDD, was created in response to the rapid expansion of computer vision into conservation research and the production of field-deployable systems suited to extreme environmental conditions -- an area with few open source datasets. NDD20 contains a large collection of above- and below-water images of two different dolphin species for traditional coarse and fine-grained segmentation. All data contained in NDD20 was obtained via manual collection in the North Sea around the Northumberland coastline, UK. We present experiments using a standard deep learning network architecture trained on NDD20 and report baseline results.

8. Tackling the Problem of Large Deformations in Deep Learning Based Medical Image Registration Using Displacement Embeddings
Lasse Hansen, Mattias P. Heinrich
Abstract: Although deep learning based medical image registration is starting to show promising advances, it often still falls behind conventional frameworks in terms of registration accuracy. This is especially true for applications where large deformations exist, such as registration of interpatient abdominal MRI or inhale-to-exhale CT lung registration. Most current works use U-Net-like architectures to predict dense displacement fields from the input images in different supervised and unsupervised settings. We believe that the U-Net architecture itself limits, to some extent, the ability to predict large deformations (even when using multilevel strategies) and therefore propose a novel approach, where the input images are mapped into a displacement space and final registrations are reconstructed from this embedding. Experiments on inhale-to-exhale CT lung registration demonstrate the ability of our architecture to predict large deformations in a single forward path through our network (leading to errors below 2 mm).

9. Joint Learning of Vessel Segmentation and Artery/Vein Classification with Post-processing
Liangzhi Li, Manisha Verma, Yuta Nakashima, Ryo Kawasaki, Hajime Nagahara
Abstract: Retinal imaging serves as a valuable tool for diagnosis of various diseases. However, reading retinal images is a difficult and time-consuming task even for experienced specialists. The fundamental step towards automated retinal image analysis is vessel segmentation and artery/vein classification, which provide various information on potential disorders. To improve the performance of the existing automated methods for retinal image analysis, we propose a two-step vessel classification. We adopt a UNet-based model, SeqNet, to accurately segment vessels from the background and predict the vessel type. Our model performs segmentation and classification sequentially, which alleviates the problem of label distribution bias and facilitates training. To further refine classification results, we post-process them using the structural information among vessels to propagate highly confident predictions to surrounding vessels. Our experiments show that our method improves AUC to 0.98 for segmentation and accuracy to 0.92 for classification on the DRIVE dataset.

10. AutoSweep: Recovering 3D Editable Objects from a Single Photograph
Xin Chen, Yuwei Li, Xi Luo, Tianjia Shao, Jingyi Yu, Kun Zhou, Youyi Zheng
Abstract: This paper presents a fully automatic framework for extracting editable 3D objects directly from a single photograph. Unlike previous methods, which recover depth maps, point clouds, or mesh surfaces, we aim to recover 3D objects with semantic parts that can be directly edited. We base our work on the assumption that most human-made objects are composed of parts and that these parts can be well represented by generalized primitives. Our work makes an attempt towards recovering two types of primitive-shaped objects, namely, generalized cuboids and generalized cylinders. To this end, we build a novel instance-aware segmentation network for accurate part separation. Our GeoNet outputs a set of smooth part-level masks labeled as profiles and bodies. Then, in a key stage, we simultaneously identify profile-body relations and recover 3D parts by sweeping the recognized profile along its body contour, and jointly optimize the geometry to align with the recovered masks. Qualitative and quantitative experiments show that our algorithm can recover high-quality 3D models and outperforms existing methods in both instance segmentation and 3D reconstruction. The dataset and code of AutoSweep are available at https://chenxin.tech/AutoSweep.html.

11. Iteratively Optimized Patch Label Inference Network for Automatic Pavement Disease Detection
Wenhao Tang, Qiming Zhao, Sheng Huang, Ren Li, Luwen Huangfu
Abstract: We present a novel deep learning framework named Iteratively Optimized Patch Label Inference Network (IOPLIN) for automatically detecting various pavement diseases, not limited to specific ones such as cracks and potholes. IOPLIN can be iteratively trained with only the image label by using the Expectation-Maximization Inspired Patch Label Distillation (EMIPLD) strategy, and accomplishes this task well by inferring the labels of patches from the pavement images. IOPLIN enjoys many desirable properties over state-of-the-art single-branch CNN models such as GoogLeNet and EfficientNet. It is able to handle images of any resolution and makes full use of image information, particularly for high-resolution images. Moreover, it can roughly localize the pavement distress without using any prior localization information in the training phase. In order to better evaluate the effectiveness of our method in practice, we construct a large-scale Bituminous Pavement Disease Detection dataset named CQU-BPDD, which consists of 60059 high-resolution pavement images acquired from different areas at different times. Extensive results on this dataset demonstrate the superiority of IOPLIN over state-of-the-art image classification approaches in automatic pavement disease detection.

12. Accelerating Neural Network Inference by Overflow Aware Quantization
Hongwei Xie, Shuo Zhang, Huanghao Ding, Yafei Song, Baitao Shao, Conggang Hu, Ling Cai, Mingyang Li
Abstract: The inherently heavy computation of deep neural networks prevents their widespread application. A widely used method for accelerating model inference is quantization, which replaces the input operands of a network with fixed-point values. The majority of the computation cost then lies in integer matrix multiply-accumulate operations. In practice, a high-bit accumulator leads to partially wasted computation, while a low-bit one typically suffers from numerical overflow. To address this problem, we propose an overflow aware quantization method that uses a trainable adaptive fixed-point representation to optimize the number of bits for each input tensor while prohibiting numeric overflow during the computation. With the proposed method, we are able to fully utilize the computing power to minimize the quantization loss and obtain optimized inference performance. To verify the effectiveness of our method, we conduct image classification, object detection, and semantic segmentation tasks on the ImageNet, Pascal VOC, and COCO datasets, respectively. Experimental results demonstrate that the proposed method can achieve performance comparable to state-of-the-art quantization methods while accelerating the inference process by about 2 times.
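The trade-off between accumulator width and overflow can be illustrated with a worst-case bound. The sketch below is not the paper's trainable method; it only computes, under the assumption of signed two's-complement operands, how wide an accumulator must be so that a sum of products can never overflow, and how many products a given accumulator can safely hold. All function names are hypothetical.

```python
def min_accumulator_bits(bits_a, bits_w, num_terms):
    """Smallest signed accumulator width that cannot overflow when summing
    num_terms products of signed bits_a-bit and bits_w-bit integers."""
    # The largest-magnitude product is (-2^(a-1)) * (-2^(w-1)), which is positive.
    worst = num_terms * (1 << (bits_a - 1)) * (1 << (bits_w - 1))
    # A signed n-bit integer holds values up to 2^(n-1) - 1.
    return worst.bit_length() + 1


def max_safe_terms(bits_a, bits_w, acc_bits):
    """Largest number of products guaranteed to fit in a signed
    acc_bits-bit accumulator, in the worst case."""
    acc_max = (1 << (acc_bits - 1)) - 1
    worst_product = (1 << (bits_a - 1)) * (1 << (bits_w - 1))
    return acc_max // worst_product


print(min_accumulator_bits(8, 8, 2))
print(max_safe_terms(4, 4, 16))
```

The bound shows why lowering the operand bit width lets a narrow accumulator absorb many more products, which is the kind of headroom an overflow-aware scheme can trade against quantization error.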

13. Poly-YOLO: higher speed, more precise detection and instance segmentation for YOLOv3
Petr Hurtik, Vojtech Molek, Jan Hula, Marek Vajgl, Pavel Vlasanek, Tomas Nejezchleba
Abstract: We present a new version of YOLO, called Poly-YOLO, with better performance and extended with instance segmentation. Poly-YOLO builds on the original ideas of YOLOv3 and removes two of its weaknesses: a large number of rewritten labels and an inefficient distribution of anchors. Poly-YOLO reduces these issues by aggregating features from a light SE-Darknet-53 backbone with a hypercolumn technique, using stairstep upsampling, and produces a single-scale output with high resolution. In comparison with YOLOv3, Poly-YOLO has only 60\% of its trainable parameters but improves mAP by a relative 40\%. We also present Poly-YOLO lite with fewer parameters and a lower output resolution. It has the same precision as YOLOv3, but it is three times smaller and twice as fast, and is thus suitable for embedded devices. Finally, Poly-YOLO performs instance segmentation using bounding polygons. The network is trained to detect size-independent polygons defined on a polar grid. The vertices of each polygon are predicted together with their confidence, and therefore Poly-YOLO produces polygons with a varying number of vertices.

14. Zoom in to the details of human-centric videos
Guanghan Li, Yaping Zhao, Mengqi Ji, Xiaoyun Yuan, Lu Fang
Abstract: Presenting high-resolution (HR) human appearance is always critical for human-centric videos. However, current imaging equipment can hardly capture HR details all the time. Existing super-resolution algorithms barely mitigate the problem because they consider only universal and low-level priors of image patches. In contrast, our algorithm is biased towards human body super-resolution by taking advantage of a high-level prior defined by HR human appearance. First, a motion analysis module extracts inherent motion patterns from the HR reference video to refine the pose estimation of the low-resolution (LR) sequence. Furthermore, a human body reconstruction module maps the HR texture in the reference frames onto a 3D mesh model. Consequently, super-resolved HR human sequences are generated conditioned on the original LR videos as well as a few HR reference frames. Experiments on an existing dataset and real-world data captured by hybrid cameras show that our approach generates superior visual quality of the human body compared with the traditional method.

15. Arbitrary Style Transfer via Multi-Adaptation Network
Yingying Deng, Fan Tang, Weiming Dong, Wen Sun, Feiyue Huang, Changsheng Xu
Abstract: Arbitrary style transfer is a significant topic with both research value and application prospects. Given a content image and a referenced style painting, a desired style transfer would render the content image with the color tone and vivid stroke patterns of the style painting while simultaneously maintaining the detailed content structure information. Commonly, style transfer approaches first learn content and style representations of the content and style references and then generate the stylized images guided by these representations. In this paper, we propose the multi-adaptation network, which involves two Self-Adaptation (SA) modules and one Co-Adaptation (CA) module: the SA modules adaptively disentangle the content and style representations, i.e., the content SA module uses position-wise self-attention to enhance content representation and the style SA module uses channel-wise self-attention to enhance style representation; the CA module rearranges the distribution of style representation according to the content representation distribution by calculating the local similarity between the disentangled content and style features in a non-local fashion. Moreover, a new disentanglement loss function enables our network to extract main style patterns to adapt to various content images and extract exact content features to adapt to various style images. Various qualitative and quantitative experiments demonstrate that the proposed multi-adaptation network leads to better results than the state-of-the-art style transfer methods.

16. Concurrent Segmentation and Object Detection CNNs for Aircraft Detection and Identification in Satellite Images
Damien Grosgeorge, Maxime Arbelot, Alex Goupilleau, Tugdual Ceillier, Renaud Allioux
Abstract: Detecting and identifying objects in satellite images is a very challenging task: objects of interest are often very small, and features can be difficult to recognize even in very high resolution imagery. For most applications, this translates into a trade-off between recall and precision. We present here a dedicated method to detect and identify aircraft, combining two very different convolutional neural networks (CNNs): a segmentation model, based on a modified U-net architecture, and a detection model, based on the RetinaNet architecture. The results we present show that this combination significantly outperforms each individual model, drastically reducing the false negative rate.

17. Extrapolative-Interpolative Cycle-Consistency Learning for Video Frame Extrapolation
Sangjin Lee, Hyeongmin Lee, Taeoh Kim, Sangyoun Lee
Abstract: Video frame extrapolation is the task of predicting future frames given past frames. Unlike previous studies, which have usually focused on the design of modules or the construction of networks, we propose a novel Extrapolative-Interpolative Cycle (EIC) loss that uses a pre-trained frame interpolation module to improve extrapolation performance. Cycle-consistency loss has been used for stable prediction between two function spaces in many visual tasks. We formulate this cycle-consistency using two mapping functions: frame extrapolation and interpolation. Since it is easier to predict intermediate frames than future frames in terms of object occlusion and motion uncertainty, the interpolation module can give an effective guidance signal for training the extrapolation function. EIC loss can be applied to any existing extrapolation algorithm and guarantees consistent prediction for both near-future and far-future frames. Experimental results show that simply adding EIC loss to the existing baseline increases extrapolation performance on both the UCF101 and KITTI datasets.
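One plausible reading of the cycle described above is: extrapolate the next frame from two past frames, interpolate between the earlier past frame and the extrapolated frame, and penalize the gap to the known middle frame. The abstract does not state the exact loss, so the sketch below is an assumption throughout; frames are flat lists of pixel values, and both toy models and all function names are hypothetical stand-ins for learned networks.

```python
def linear_extrapolate(prev_frame, curr_frame):
    """Toy extrapolator assuming constant motion: next = 2 * curr - prev."""
    return [2 * c - p for p, c in zip(prev_frame, curr_frame)]


def copy_last(prev_frame, curr_frame):
    """Naive extrapolator that repeats the most recent frame."""
    return list(curr_frame)


def midpoint_interpolate(frame_a, frame_b):
    """Toy stand-in for a pre-trained interpolator: per-pixel midpoint."""
    return [(a + b) / 2 for a, b in zip(frame_a, frame_b)]


def cycle_loss(prev_frame, curr_frame, extrapolator, interpolator):
    """Mean absolute gap between curr_frame and the frame interpolated
    from prev_frame and the extrapolated next frame."""
    predicted_next = extrapolator(prev_frame, curr_frame)
    reconstructed = interpolator(prev_frame, predicted_next)
    gaps = [abs(r - c) for r, c in zip(reconstructed, curr_frame)]
    return sum(gaps) / len(gaps)


prev_frame, curr_frame = [0.0, 1.0], [1.0, 3.0]
print(cycle_loss(prev_frame, curr_frame, linear_extrapolate, midpoint_interpolate))
print(cycle_loss(prev_frame, curr_frame, copy_last, midpoint_interpolate))
```

In this toy setting the constant-motion extrapolator is exactly consistent with the midpoint interpolator, while the copy-last extrapolator is penalized; in training, such a term would be added to whatever loss the extrapolation baseline already uses.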

18. TIME: Text and Image Mutual-Translation Adversarial Networks
Bingchen Liu, Kunpeng Song, Yizhe Zhu, Gerard de Melo, Ahmed Elgammal
Abstract: Focusing on text-to-image (T2I) generation, we propose Text and Image Mutual-Translation Adversarial Networks (TIME), a lightweight but effective model that jointly learns a T2I generator $G$ and an image captioning discriminator $D$ under the Generative Adversarial Network framework. While previous methods tackle the T2I problem as a uni-directional task and use pre-trained language models to enforce the image-text consistency, TIME requires neither extra modules nor pre-training. We show that the performance of $G$ can be boosted substantially by training it jointly with $D$ as a language model. Specifically, we adopt Transformers to model the cross-modal connections between the image features and word embeddings, and design a hinged and annealing conditional loss that dynamically balances the adversarial learning. In our experiments, TIME establishes the new state-of-the-art Inception Score of 4.88 on the CUB dataset, and shows competitive performance on MS-COCO on both text-to-image and image captioning tasks.

19. Learning to segment from misaligned and partial labels
Simone Fobi, Terence Conlon, Jayant Taneja, Vijay Modi
Abstract: To extract information at scale, researchers increasingly apply semantic segmentation techniques to remotely-sensed imagery. While fully-supervised learning enables accurate pixel-wise segmentation, compiling the exhaustive datasets required is often prohibitively expensive. As a result, many non-urban settings lack the ground-truth needed for accurate segmentation. Existing open source infrastructure data for these regions can be inexact and non-exhaustive. Open source infrastructure annotations like OpenStreetMap (OSM) are representative of this issue: while OSM labels provide global insights into road and building footprints, noisy and partial annotations limit the performance of segmentation algorithms that learn from them. In this paper, we present a novel and generalizable two-stage framework that enables improved pixel-wise image segmentation given misaligned and missing annotations. First, we introduce the Alignment Correction Network to rectify incorrectly registered open source labels. Next, we demonstrate a segmentation model -- the Pointer Segmentation Network -- that uses corrected labels to predict infrastructure footprints despite missing annotations. We test sequential performance on the AIRS dataset, achieving a mean intersection-over-union score of 0.79; more importantly, model performance remains stable as we decrease the fraction of annotations present. We demonstrate the transferability of our method to lower quality data by applying the Alignment Correction Network to OSM labels to correct building footprints; we also demonstrate the accuracy of the Pointer Segmentation Network in predicting cropland boundaries in California from medium resolution data. Overall, our methodology is robust for multiple applications with varied amounts of training data present, thus offering a method to extract reliable information from noisy, partial data.
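The intersection-over-union (IoU) score quoted above compares a predicted mask with a reference mask by dividing the number of pixels they share by the number of pixels either one covers. The sketch below assumes binary masks given as flat sequences of 0/1 and averages IoU over images; whether the paper averages over images or over classes is not stated in the abstract, and the function names are hypothetical.

```python
def iou(pred_mask, true_mask):
    """IoU of two binary masks given as flat sequences of 0/1."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    # Two empty masks agree perfectly, so define their IoU as 1.
    return inter / union if union else 1.0


def mean_iou(pred_masks, true_masks):
    """Average IoU over paired masks."""
    scores = [iou(p, t) for p, t in zip(pred_masks, true_masks)]
    return sum(scores) / len(scores)


# Made-up 4-pixel masks for illustration.
print(iou([1, 1, 0, 0], [1, 0, 1, 0]))
print(mean_iou([[1, 1], [1, 0]], [[1, 1], [0, 1]]))
```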
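
Note: the mean intersection-over-union (IoU) score quoted in this abstract can be illustrated with a minimal sketch. It assumes binary masks stored as nested lists of 0/1 and averages per-image scores; the paper may average differently, and the function names `iou` and `mean_iou` are hypothetical.

```python
def iou(pred, target):
    """Intersection-over-union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds, targets):
    """Average the per-image IoU over a list of mask pairs."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
print(iou(pred, target))
```

Two pixels are shared and four are covered by either mask, so the example pair overlaps by one half.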

20. Generative Adversarial Networks (GANs): An Overview of Theoretical Model, Evaluation Metrics, and Recent Developments
Pegah Salehi, Abdolah Chalechale, Maryam Taghizadeh
Abstract: One of the most significant challenges in statistical signal processing and machine learning is how to obtain a generative model that can produce samples of large-scale data distributions, such as images and speech. Generative Adversarial Network (GAN) is an effective method to address this problem. GANs provide an appropriate way to learn deep representations without widespread use of labeled training data. This approach has attracted the attention of many researchers in computer vision since it can generate a large amount of data without precise modeling of the probability density function (PDF). In GANs, the generative model is estimated via a competitive process where the generator and discriminator networks are trained simultaneously. The generator learns to generate plausible data, and the discriminator learns to distinguish fake data created by the generator from real data samples. Given the rapid growth of GANs over the last few years and their application in various fields, it is necessary to investigate these networks accurately. In this paper, after introducing the main concepts and the theory of GAN, two new deep generative models are compared, and the evaluation metrics utilized in the literature and the challenges of GANs are explained. Moreover, the most remarkable GAN architectures are categorized and discussed. Finally, the essential applications in computer vision are examined.
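
Note: the competitive process between generator and discriminator described in this abstract can be sketched through the two losses involved. This is a minimal illustration that assumes the discriminator outputs probabilities in (0, 1) and that the generator uses the common non-saturating loss; it is not taken from the survey, and the function names are hypothetical.

```python
import math

EPS = 1e-12  # assumed clamp to keep log() finite


def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples labelled 1, generated samples 0."""
    real_term = -sum(math.log(_clamp(p)) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - _clamp(p)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) towards 1."""
    return -sum(math.log(_clamp(p)) for p in d_fake) / len(d_fake)


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]), generator_loss([0.5, 0.5]))
```

When the discriminator outputs 0.5 for every sample, its loss equals 2·ln 2 and the generator loss equals ln 2, which is the balance point of the original minimax game.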

21. False Positive Removal for 3D Vehicle Detection with Penetrated Point Classifier
Sungmin Woo, Sangwon Hwang, Woojin Kim, Junhyeop Lee, Dogyoon Lee, Sangyoun Lee
Abstract: Recently, researchers have been leveraging LiDAR point clouds for higher accuracy in 3D vehicle detection. Most state-of-the-art methods are deep learning based, but are easily affected by the number of points generated on the object. This vulnerability leads to numerous false positive boxes at high recall positions, where objects are occasionally predicted with few points. To address the issue, we introduce Penetrated Point Classifier (PPC) based on the underlying property of LiDAR that points cannot be generated behind vehicles. It determines whether a point exists behind the vehicle of the predicted box, and if it does, the box is classified as a false positive. Our straightforward yet unprecedented approach is evaluated on the KITTI dataset and improves the performance of PointRCNN, one of the state-of-the-art methods. The experimental results show that precision at the highest recall position increases by 15.46 percentage points and 14.63 percentage points on the moderate and hard difficulties of the car class, respectively.

22. SSM-Net for Plants Disease Identification in Low Data Regime
Shruti Jadon
Abstract: Plant disease detection is a necessary step in increasing agricultural production. Due to the difficulty of disease detection, farmers spray every form of pesticide on their crops to save them, in turn causing harm to crop growth and food standards. Deep learning can help a lot in detecting such diseases. However, it is highly inconvenient to collect a large amount of data on all forms of disease of a specific species of plant. In this paper, we propose a new metrics-based few-shot learning SSM net architecture which consists of stacked siamese and matching network components to solve the problem of disease detection in low data regimes. We showcase that using the SSM net (stacked siamese matching) method, we were able to achieve better decision boundaries and accuracy of 94.3%, an increase of ~5% from using the traditional transfer learning approach (VGG16 and Xception net) and 3% from using original matching networks. Furthermore, we were able to attain an F1 score of 0.90 using SSM Net, an improvement from 0.30 using transfer learning and 0.80 using original matching networks.

23. PAI-Conv: Permutable Anisotropic Convolutional Networks for Learning on Point Clouds
Zhongpai Gao, Guangtao Zhai, Junchi Yan, Xiaokang Yang
Abstract: Demand for efficient representation learning on point clouds is increasing in many 3D computer vision applications. The recent success of convolutional neural networks (CNNs) for image analysis suggests the value of adapting insight from CNN to point clouds. However, unlike images that are Euclidean structured, point clouds are irregular since each point's neighbors vary from one to another. Various point neural networks have been developed using isotropic filters or applying weighting matrices to overcome the structure inconsistency on point clouds. However, isotropic filters or weighting matrices limit the representation power. In this paper, we propose a permutable anisotropic convolutional operation (PAI-Conv) that calculates soft-permutation matrices for each point according to a set of evenly distributed kernel points on a sphere's surface and performs shared anisotropic filters as CNN does. PAI-Conv is physically meaningful and can efficiently cooperate with random point sampling method. Comprehensive experiments demonstrate that PAI-Conv produces competitive results in classification and semantic segmentation tasks compared to state-of-the-art methods.

24. Robust Trajectory Forecasting for Multiple Intelligent Agents in Dynamic Scene
Yanliang Zhu, Dongchun Ren, Mingyu Fan, Deheng Qian, Xin Li, Huaxia Xia
Abstract: Trajectory forecasting, or trajectory prediction, of multiple interacting agents in dynamic scenes, is an important problem for many applications, such as robotic systems and autonomous driving. The problem is a great challenge because of the complex interactions among the agents and their interactions with the surrounding scenes. In this paper, we present a novel method for the robust trajectory forecasting of multiple intelligent agents in dynamic scenes. The proposed method consists of three major interrelated components: an interaction net for global spatiotemporal interactive feature extraction, an environment net for decoding dynamic scenes (i.e., the surrounding road topology of an agent), and a prediction net that combines the spatiotemporal feature, the scene feature, the past trajectories of agents and some random noise for the robust trajectory prediction of agents. Experiments on pedestrian-walking and vehicle-pedestrian heterogeneous datasets demonstrate that the proposed method outperforms the state-of-the-art prediction methods in terms of prediction accuracy.

25. Efficient Pig Counting in Crowds with Keypoints Tracking and Spatial-aware Temporal Response Filtering
Guang Chen, Shiwen Shen, Longyin Wen, Si Luo, Liefeng Bo
Abstract: Pig counting is a crucial task for large-scale pig farming and is usually done visually by humans, a process that is very time-consuming and error-prone. Few studies in the literature have developed automated pig counting methods. Existing methods focus only on pig counting from a single image, and their accuracy is challenged by several factors, including pig movements, occlusion and overlapping. In particular, the field of view of a single image is very limited and cannot meet the requirements of pig counting for large pig grouping houses. To that end, we present a real-time automated system for counting pigs in crowds using only one monocular fisheye camera with an inspection robot. Our system produces accurate results that surpass human counting. Our pipeline begins with a novel bottom-up pig detection algorithm to avoid false negatives due to overlapping, occlusion and deformation of pigs. A deep convolutional neural network (CNN) is designed to detect keypoints of pig body parts and associate the keypoints to identify individual pigs. After that, an efficient on-line tracking method is used to associate pigs across video frames. Finally, a novel spatial-aware temporal response filtering (STRF) method is proposed to predict the pig counts; it effectively suppresses false positives caused by pig or camera movements or tracking failures. The whole pipeline has been deployed on an edge computing device and has demonstrated its effectiveness.

26. Towards Mesh Saliency Detection in 6 Degrees of Freedom
Xiaoying Ding, Zhenzhong Chen
Abstract: Traditional 3D mesh saliency detection algorithms and corresponding databases were proposed under several constraints such as providing limited viewing directions and not taking the subject's movement into consideration. In this work, a novel 6DoF mesh saliency database is developed which provides both the subject's 6DoF data and eye-movement data. Different from traditional databases, subjects in the experiment are allowed to move freely to observe 3D meshes in a virtual reality environment. Based on the database, we first analyze the inter-observer variation and the influence of viewing direction towards subject's visual attention, then we provide further investigations about the subject's visual attention bias during observation. Furthermore, we propose a 6DoF mesh saliency detection algorithm based on the uniqueness measure and the bias preference. To evaluate the proposed approach, we also design an evaluation metric accordingly which takes the 6DoF information into consideration, and extend some state-of-the-art 3D saliency detection methods to make comparisons. The experimental results demonstrate the superior performance of our approach for 6DoF mesh saliency detection, in addition to providing benchmarks for the presented 6DoF mesh saliency database. The database and the corresponding algorithms will be made publicly available for research purposes.

27. TRIE: End-to-End Text Reading and Information Extraction for Document Understanding
Peng Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Jing Lu, Liang Qiao, Yi Niu, Fei Wu
Abstract: Since real-world ubiquitous documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks, (1) text reading for detecting and recognizing texts in the images and (2) information extraction for analyzing and extracting key elements from previously extracted plain text. However, they mainly focus on improving information extraction task, while neglecting the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network, where the two tasks can reinforce each other. Specifically, the multimodal visual and textual features of text reading are fused for information extraction and in turn, the semantics in information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed layout to variable layout, from structured text to semi-structured text), our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.

28. SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition
Chengwei Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Yi Niu, Fei Wu, Futai Zou
Abstract: Arbitrary text appearance poses a great challenge in scene text recognition tasks. Existing works mostly handle the problem in terms of shape distortion, including perspective distortions, line curvature or other style variations. Therefore, methods based on spatial transformers have been extensively studied. However, chromatic difficulties in complex scenes have not received much attention. In this work, we introduce a new learnable geometric-unrelated module, the Structure-Preserving Inner Offset Network (SPIN), which allows the color manipulation of source data within the network. This differentiable module can be inserted before any recognition architecture to ease the downstream tasks, giving neural networks the ability to actively transform input intensity rather than the existing spatial rectification. It can also serve as a complementary module to known spatial transformations and work in both independent and collaborative ways with them. Extensive experiments show that the use of SPIN results in a significant improvement on multiple text recognition benchmarks compared to the state of the art.

29. Object-QA: Towards High Reliable Object Quality Assessment
Jing Lu, Baorui Zou, Zhanzhan Cheng, Shiliang Pu, Shuigeng Zhou, Yi Niu, Fei Wu
Abstract: In object recognition applications, object images usually appear with different quality levels. Practically, it is very important to indicate object image qualities for better application performance, e.g. filtering out low-quality object image frames to maintain robust video object recognition results and speed up inference. However, no previous work explicitly addresses the problem. In this paper, we define the problem of object quality assessment for the first time and propose an effective approach named Object-QA to assess highly reliable quality scores for object images. Concretely, Object-QA first employs a well-designed relative quality assessing module that learns the intra-class-level quality scores by referring to the difference between object images and their estimated templates. Then an absolute quality assessing module is designed to generate the final quality scores by aligning the quality score distributions across classes. Besides, Object-QA can be implemented with only object-level annotations, and is also easily deployed to a variety of object recognition tasks. To the best of our knowledge, this is the first work to put forward the definition of this problem and conduct quantitative evaluations. Validations on 5 different datasets show that Object-QA can not only assess highly reliable quality scores in accordance with human cognition, but also improve application performance.

30. Evolutionary NAS with Gene Expression Programming of Cellular Encoding
Clifford Broni-Bediako, Yuki Murata, Luiz Henrique Mormille, Masayasu Atsumi
Abstract: The renaissance of neural architecture search (NAS) has seen classical methods such as genetic algorithms (GA) and genetic programming (GP) being exploited for convolutional neural network (CNN) architectures. While recent work has achieved promising performance on visual perception tasks, the direct encoding scheme of both GA and GP has functional complexity deficiency and does not scale well on large architectures like CNN. To address this, we present a new generative encoding scheme -- symbolic linear generative encoding (SLGE) -- a simple, yet powerful scheme which embeds local graph transformations in chromosomes of linear fixed-length string to develop CNN architectures of variant shapes and sizes via the evolutionary process of gene expression programming. In experiments, the effectiveness of SLGE is shown in discovering architectures that improve the performance of the state-of-the-art handcrafted CNN architectures on CIFAR-10 and CIFAR-100 image classification tasks, and that achieve a classification error rate competitive with existing NAS methods while using fewer GPU resources.

31. Road Segmentation on low resolution Lidar point clouds for autonomous vehicles
Leonardo Gigli, B Ravi Kiran, Thomas Paul, Andres Serna, Nagarjuna Vemuri, Beatriz Marcotegui, Santiago Velasco-Forero
Abstract: Point cloud datasets for perception tasks in the context of autonomous driving often rely on high resolution 64-layer Light Detection and Ranging (LIDAR) scanners. They are expensive to deploy on real-world autonomous driving sensor architectures, which usually employ 16/32 layer LIDARs. We evaluate the effect of subsampling image based representations of dense point clouds on the accuracy of the road segmentation task. In our experiments the low resolution 16/32 layer LIDAR point clouds are simulated by subsampling the original 64 layer data, for subsequent transformation into a feature map in the Bird-Eye-View (BEV) and SphericalView (SV) representations of the point cloud. We introduce the usage of the local normal vector with the LIDAR's spherical coordinates as an input channel to existing LoDNN architectures. We demonstrate that this local normal feature in conjunction with classical features not only improves performance for binary road segmentation on full resolution point clouds, but also reduces the negative impact on accuracy when subsampling dense point clouds, compared to the usage of classical features alone. We assess our method with several experiments on two datasets: the KITTI Road-segmentation benchmark and the recently released Semantic KITTI dataset.
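
Note: two ingredients mentioned in this abstract, spherical coordinates of LIDAR points and layer subsampling, can be sketched as follows. The sketch assumes points are already grouped by laser ring and that lower-resolution scans are simulated by keeping every k-th ring; the paper's exact subsampling scheme is not stated in the abstract, and both function names are hypothetical.

```python
import math


def cartesian_to_spherical(x, y, z):
    """Return (range, azimuth, elevation) in radians for a sensor-centred point."""
    r = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)
    elevation = math.asin(z / r) if r > 0 else 0.0
    return r, azimuth, elevation


def subsample_rings(points_by_ring, keep_every):
    """Keep every `keep_every`-th ring, e.g. 64 -> 32 (2) or 64 -> 16 (4)."""
    return points_by_ring[::keep_every]


rings = [[(1.0, 0.0, 0.01 * i)] for i in range(64)]  # one toy point per ring
print(len(subsample_rings(rings, 2)), len(subsample_rings(rings, 4)))
print(cartesian_to_spherical(1.0, 1.0, 0.0))
```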

32. Multi-task deep learning for image segmentation using recursive approximation tasks
Rihuan Ke, Aurélie Bugeau, Nicolas Papadakis, Mark Kirkland, Peter Schuetz, Carola-Bibiane Schönlieb
Abstract: Fully supervised deep neural networks for segmentation usually require a massive amount of pixel-level labels which are manually expensive to create. In this work, we develop a multi-task learning method to relax this constraint. We regard the segmentation problem as a sequence of approximation subproblems that are recursively defined and in increasing levels of approximation accuracy. The subproblems are handled by a framework that consists of 1) a segmentation task that learns from pixel-level ground truth segmentation masks of a small fraction of the images, 2) a recursive approximation task that conducts partial object regions learning and data-driven mask evolution starting from partial masks of each object instance, and 3) other problem oriented auxiliary tasks that are trained with sparse annotations and promote the learning of dedicated features. Most training images are only labeled by (rough) partial masks, which do not contain exact object boundaries, rather than by their full segmentation masks. During the training phase, the approximation task learns the statistics of these partial masks, and the partial regions are recursively increased towards object boundaries aided by the learned information from the segmentation task in a fully data-driven fashion. The network is trained on an extremely small amount of precisely segmented images and a large set of coarse labels. Annotations can thus be obtained in a cheap way. We demonstrate the efficiency of our approach in three applications with microscopy images and ultrasound images.

33. Pay Attention to What You Read: Non-recurrent Handwritten Text-Line Recognition
Lei Kang, Pau Riba, Marçal Rusiñol, Alicia Fornés, Mauricio Villegas
Abstract: The advent of recurrent neural networks for handwriting recognition marked an important milestone reaching impressive recognition accuracies despite the great variability that we observe across different writing styles. Sequential architectures are a perfect fit to model text lines, not only because of the inherent temporal aspect of text, but also to learn probability distributions over sequences of characters and words. However, using such recurrent paradigms comes at a cost at training stage, since their sequential pipelines prevent parallelization. In this work, we introduce a non-recurrent approach to recognize handwritten text by the use of transformer models. We propose a novel method that bypasses any recurrence. By using multi-head self-attention layers both at the visual and textual stages, we are able to tackle character recognition as well as to learn language-related dependencies of the character sequences to be decoded. Our model is unconstrained to any predefined vocabulary, being able to recognize out-of-vocabulary words, i.e. words that do not appear in the training vocabulary. We significantly advance over prior art and demonstrate that satisfactory recognition accuracies are yielded even in few-shot learning scenarios.
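
Note: the self-attention layers mentioned in this abstract build on scaled dot-product attention. The sketch below shows a single head in plain Python, without the learned projections or multiple heads a transformer would use; it is a generic illustration rather than the authors' model, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Identical keys give uniform weights, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[2.0], [4.0]]))
```

Because every position attends to every other position in one step, the computation has no sequential dependency across the text line, which is what allows the parallel training the abstract contrasts with recurrent models.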

34. ALBA: Reinforcement Learning for Video Object Segmentation
Shreyank N Gowda, Panagiotis Eustratiadis, Timothy Hospedales, Laura Sevilla-Lara
Abstract: We consider the challenging problem of zero-shot video object segmentation (VOS). That is, segmenting and tracking multiple moving objects within a video fully automatically, without any manual initialization. We treat this as a grouping problem by exploiting object proposals and making a joint inference about grouping over both space and time. We propose a network architecture for tractably performing proposal selection and joint grouping. Crucially, we then show how to train this network with reinforcement learning so that it learns to perform the optimal non-myopic sequence of grouping decisions to segment the whole video. Unlike standard supervised techniques, this also enables us to directly optimize for the non-differentiable overlap-based metrics used to evaluate VOS. We show that the proposed method, which we call ALBA, outperforms the previous state-of-the-art on three benchmarks: DAVIS 2017 [2], FBMS [20] and YouTube-VOS [27].

35. How to do Physics-based Learning
Michael Kellman, Michael Lustig, Laura Waller
Abstract: The goal of this tutorial is to explain step-by-step how to implement physics-based learning for the rapid prototyping of a computational imaging system. We provide a basic overview of physics-based learning, the construction of a physics-based network, and its reduction to practice. Specifically, we advocate exploiting the auto-differentiation functionality twice, once to build a physics-based network and again to perform physics-based learning. Thus, the user need only implement the forward model process for their system, speeding up prototyping time. We provide an open-source PyTorch implementation of a physics-based network and training procedure for a generic sparse recovery problem.
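
Note: a physics-based network for sparse recovery is commonly built by unrolling a fixed number of iterations of a solver such as ISTA. The sketch below assumes a linear forward model y = A x and an L1 penalty, and it writes the gradient by hand so that it runs with the standard library only, whereas the tutorial advocates obtaining it by auto-differentiation. It is not the authors' PyTorch code, and all names are hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied element-wise."""
    return [max(abs(vi) - t, 0.0) * (1 if vi > 0 else -1) for vi in v]


def unrolled_ista(A, y, step, lam, n_layers):
    """Run n_layers ISTA iterations; each iteration plays the role of one layer."""
    At = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(n_layers):
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        grad = matvec(At, residual)  # gradient of 0.5 * ||A x - y||^2
        x = soft_threshold([xi - step * g for xi, g in zip(x, grad)], step * lam)
    return x


identity = [[1.0, 0.0], [0.0, 1.0]]
print(unrolled_ista(identity, [3.0, 0.2], step=1.0, lam=0.5, n_layers=5))
```

In a learned version, quantities such as `step` and `lam` would become trainable per-layer parameters; here they are fixed so the behaviour is easy to check by hand.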

36. Kernel methods library for pattern analysis and machine learning in python
Pradeep Reddy Raamana
Abstract: Kernel methods have proven to be powerful techniques for pattern analysis and machine learning (ML) in a variety of domains. However, many of their original or advanced implementations remain in Matlab. With the incredible rise and adoption of Python in the ML and data science world, there is a clear need for a well-defined library that enables not only the use of popular kernels, but also allows easy definition of customized kernels to fine-tune them for diverse applications. The kernelmethods library fills that important void in the python ML ecosystem in a domain-agnostic fashion, allowing the sample data type to be anything from numerical, categorical, graphs or a combination of them. In addition, this library provides a number of well-defined classes to make various kernel-based operations efficient (for large scale datasets), modular (for ease of domain adaptation), and inter-operable (across different ecosystems). The library is available at this https URL.
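
Note: the idea of combining popular kernels with customized ones for non-numerical data can be illustrated generically. The sketch below does not use the kernelmethods library or reproduce its API; `rbf_kernel`, `overlap_kernel` and `kernel_matrix` are hypothetical helper names written with the standard library only.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel on numerical feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def overlap_kernel(x, y):
    """Custom kernel for categorical features: fraction of matching attributes."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


def kernel_matrix(samples, kernel):
    """Evaluate a kernel function on all pairs of samples."""
    return [[kernel(a, b) for b in samples] for a in samples]


numeric = [[0.0, 0.0], [1.0, 0.0]]
categorical = [("red", "small"), ("red", "large")]
print(kernel_matrix(numeric, rbf_kernel))
print(kernel_matrix(categorical, overlap_kernel))
```

Passing the kernel as a plain function keeps the matrix computation independent of the sample data type, which mirrors the domain-agnostic goal stated in the abstract.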

37. Gram filtering and sinogram interpolation for pixel-basis in parallel-beam X-ray CT reconstruction
Ziyu Shu, Alireza Entezari
Abstract: The key aspect of parallel-beam X-ray CT is forward and back projection, but its computational burden continues to be an obstacle for applications. We propose a method to improve the performance of related algorithms by calculating the Gram filter exactly and interpolating the sinogram signal optimally. In addition, the detector blur effect can be included in our model efficiently. The improvements in speed and quality for back projection and iterative reconstruction are shown in our experiments on both analytical phantoms and real CT images.

38. Segmentation Loss Odyssey
Jun Ma
Abstract: Loss functions are one of the crucial ingredients in deep learning-based medical image segmentation methods. Many loss functions have been proposed in existing literature, but are studied separately or only investigated with few other losses. In this paper, we present a systematic taxonomy to sort existing loss functions into four meaningful categories. This helps to reveal links and fundamental similarities between them. Moreover, we explore the relationship between the traditional region-based and the more recent boundary-based loss functions. The PyTorch implementations of these loss functions are publicly available at this https URL.
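
Note: a widely used example of a region-based segmentation loss is the soft Dice loss. The sketch below assumes flattened per-pixel foreground probabilities and a binary target, with a small smoothing constant; it is a generic illustration and not the paper's PyTorch implementation, and the function name is hypothetical.

```python
def soft_dice_loss(probs, target, eps=1e-6):
    """1 - Dice coefficient between predicted probabilities and a 0/1 target."""
    intersection = sum(p * t for p, t in zip(probs, target))
    denominator = sum(probs) + sum(target)
    # eps (an assumed smoothing term) avoids division by zero on empty masks.
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


print(soft_dice_loss([1.0, 0.0, 1.0], [1, 0, 1]))  # perfect overlap
print(soft_dice_loss([1.0, 0.0], [0, 1]))          # no overlap
```

The loss approaches 0 for a perfect prediction and 1 when prediction and target do not overlap, regardless of how many background pixels the image contains.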

39. A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Edison Marrese-Taylor, Cristian Rodriguez-Opazo, Jorge A. Balazs, Stephen Gould, Yutaka Matsuo
Abstract: Despite the recent advances in opinion mining for written reviews, few works have tackled the problem on other sources of reviews. In light of this issue, we propose a multi-modal approach for mining fine-grained opinions from video reviews that is able to determine the aspects of the item under review that are being discussed and the sentiment orientation towards them. Our approach works at the sentence level without the need for time annotations and uses features derived from the audio, video and language transcriptions of its contents. We evaluate our approach on two datasets and show that leveraging the video and audio modalities consistently provides increased performance over text-only baselines, providing evidence these extra modalities are key in better understanding video reviews.

40. Data-Driven Continuum Dynamics via Transport-Teleport Duality
Jong-Hoon Ahn
Abstract: In recent years, machine learning methods have been widely used to study physical systems that are challenging to solve with governing equations. However, most learning architectures do not inherently incorporate conservation laws in the form of continuity equations, and they require dense data to learn the continuum dynamics of conserved quantities. In this study, we propose a mathematical framework for machine learning of transport phenomena. Through the derived involution, the continuity equation becomes a pointwise operation for disappearance and reappearance of a quantity with zero velocity. By modeling the process with sparse observations, we can determine and predict the dynamics of a physical system. The approach does not require the explicit use of governing equations and only depends on observation data.

41. Earballs: Neural Transmodal Translation
Andrew Port, Chelhwon Kim, Mitesh Patel
Abstract: As is expressed in the adage "a picture is worth a thousand words", when using spoken language to communicate visual information, brevity can be a challenge. This work describes a novel technique for leveraging machine learned feature embeddings to translate visual (and other types of) information into a perceptual audio domain, allowing users to perceive this information using only their aural faculty. To be clear, the goal of this work is to propose a mechanism for providing an information preserving mapping that users can learn to use to see (or perceive other information) using their auditory system. The system uses a pretrained image embedding network to extract visual features and embed them in a compact subset of Euclidean space -- this converts the images into feature vectors whose $L^2$ distances can be used as a meaningful measure of similarity. A generative adversarial network is then used to find a distance preserving map from this metric space of feature vectors into the metric space defined by a target audio dataset and a mel-frequency cepstrum-based psychoacoustic distance metric. We demonstrate this technique by translating images of faces into human speech-like audio. The GAN successfully found a metric preserving mapping, and in human subject tests, users were able to successfully classify images of faces using only the audio output by our model.

42. An Entropy Based Outlier Score and its Application to Novelty Detection for Road Infrastructure Images
Jonas Wurst, Alberto Flores Fernández, Michael Botsch, Wolfgang Utschick
Abstract: A novel unsupervised outlier score, which can be embedded into graph based dimensionality reduction techniques, is presented in this work. The score uses the directed nearest neighbor graphs of those techniques. Hence, the same measure of similarity that is used to project the data into lower dimensions, is also utilized to determine the outlier score. The outlier score is realized through a weighted normalized entropy of the similarities. This score is applied to road infrastructure images. The aim is to identify newly observed infrastructures given a pre-collected base dataset. Detecting unknown scenarios is a key for accelerated validation of autonomous vehicles. The results show the high potential of the proposed technique. To validate the generalization capabilities of the outlier score, it is additionally applied to various real world datasets. The overall average performance in identifying outliers using the proposed methods is higher compared to state-of-the-art methods. In order to generate the infrastructure images, an openDRIVE parsing and plotting tool for Matlab is developed as part of this work. This tool and the implementation of the entropy based outlier score in combination with Uniform Manifold Approximation and Projection are made publicly available.
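
The abstract above does not give the formula, so the following is only a minimal sketch of one ingredient it names: a normalized entropy over a point's nearest-neighbor similarities. The function name `normalized_entropy` is hypothetical, and the weighting and the mapping from entropy to the final outlier score are deliberately left out because they are not specified here.

```python
import math


def normalized_entropy(similarities):
    """Shannon entropy of normalized similarities, scaled to [0, 1].

    `similarities` holds non-negative similarity values from one point to
    its k nearest neighbors (an assumed input format, not from the paper).
    """
    k = len(similarities)
    total = sum(similarities)
    if k < 2 or total <= 0:
        raise ValueError("need at least two neighbors and a positive total")
    # Turn similarities into a probability distribution over neighbors.
    probs = [s / total for s in similarities]
    # Terms with p == 0 contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Dividing by log(k) maps the maximum (uniform) entropy to 1.
    return entropy / math.log(k)


print(normalized_entropy([1.0, 1.0, 1.0, 1.0]))  # uniform similarities
print(normalized_entropy([1.0, 0.0, 0.0]))       # all mass on one neighbor
```

Uniform similarities give the maximum value and a single dominant neighbor gives the minimum; which end indicates an outlier depends on the paper's definition.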

43. Co-Heterogeneous and Adaptive Segmentation from Multi-Source and Multi-Phase CT Imaging Data: A Study on Pathological Liver and Lesion Segmentation
Ashwin Raju, Chi-Tung Cheng, Yunakai Huo, Jinzheng Cai, Junzhou Huang, Jing Xiao, Le Lu, ChienHuang Liao, Adam P Harrison
Abstract: In medical imaging, organ/pathology segmentation models trained on current publicly available and fully-annotated datasets usually do not represent well the heterogeneous modalities, phases, pathologies, and clinical scenarios encountered in real environments. On the other hand, there are tremendous amounts of unlabelled patient imaging scans stored by many modern clinical centers. In this work, we present a novel segmentation strategy, co-heterogeneous and adaptive segmentation (CHASe), which only requires a small labeled cohort of single phase imaging data to adapt to any unlabeled cohort of heterogeneous multi-phase data with possibly new clinical scenarios and pathologies. To do this, we propose a versatile framework that fuses appearance based semi-supervision, mask based adversarial domain adaptation, and pseudo-labeling. We also introduce co-heterogeneous training, which is a novel integration of co-training and hetero modality learning. We have evaluated CHASe using a clinically comprehensive and challenging dataset of multi-phase computed tomography (CT) imaging studies (1147 patients and 4577 3D volumes). Compared to previous state-of-the-art baselines, CHASe can further improve pathological liver mask Dice-Sorensen coefficients by ranges of $4.2\% \sim 9.4\%$, depending on the phase combinations: e.g., from $84.6\%$ to $94.0\%$ on non-contrast CTs.
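
For reference, the Dice-Sorensen coefficient reported above can be sketched as follows for two flat binary masks. The function name `dice_coefficient` and the convention for two empty masks are assumptions for illustration, not details taken from the CHASe paper.

```python
def dice_coefficient(pred, target):
    """Dice-Sorensen coefficient 2|A∩B| / (|A| + |B|) for flat 0/1 masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # For 0/1 masks the element-wise product counts overlapping foreground.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(dice_coefficient(pred, target))  # roughly 0.667 (2 shared of 3 + 3)
```

A score of 1 means the predicted and reference masks coincide; 0 means they share no foreground voxels.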

44. On Mutual Information in Contrastive Learning for Visual Representations
Mike Wu, Chengxu Zhuang, Milan Mosse, Daniel Yamins, Noah Goodman
Abstract: In recent years, several unsupervised, "contrastive" learning algorithms in vision have been shown to learn representations that perform remarkably well on transfer tasks. We show that this family of algorithms maximizes a lower bound on the mutual information between two or more "views" of an image; typical views come from a composition of image augmentations. Our bound generalizes the InfoNCE objective to support negative sampling from a restricted region of "difficult" contrasts. We find that the choices of (1) negative samples and (2) "views" are critical to the success of contrastive learning, the former of which is largely unexplored. The mutual information reformulation also simplifies and stabilizes previous learning objectives. In practice, our new objectives yield representations that outperform those learned with previous approaches for transfer to classification, bounding box detection, instance segmentation, and keypoint detection. The mutual information framework provides a unifying and rigorous comparison of approaches to contrastive learning and uncovers the choices that impact representation learning.
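
As background for the InfoNCE objective mentioned above, here is a minimal sketch of the standard single-anchor form of the loss, not the generalized bound proposed in the paper. The function name `info_nce`, the temperature value, and the use of precomputed similarity scores are assumptions for illustration.

```python
import math


def info_nce(pos_score, neg_scores, temperature=0.1):
    """Standard InfoNCE loss for one anchor.

    pos_score:  similarity between the anchor and its positive view.
    neg_scores: similarities between the anchor and negative samples.
    Returns -log( exp(s_pos / t) / sum_i exp(s_i / t) ).
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


# Negatives that score close to the positive give a larger loss.
print(info_nce(0.9, [0.8, 0.7]))
print(info_nce(0.9, [0.1, 0.0]))
```

In this form, restricting negatives to "difficult" contrasts only changes which scores are passed in as `neg_scores`; how that restricted region is chosen is the paper's contribution and is not reproduced here.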

45. Microstructure and Water Absorption of Ancient Concrete from Pompeii: An Integrated Synchrotron Microtomography and Neutron Radiography Characterization
Ke Xu, Anton S. Tremsin, Jiaqi Li, Daniela M. Ushizima, Catherine A. Davy, Amine Bouterf, Ying Tsun Su, Milena Marroccoli, Anna Maria Mauro, Massimo Osanna, Antonio Telesca, Paulo J. M. Monteiro
Abstract: There is renewed interest in using advanced techniques to characterize ancient Roman concrete. In the present work, samples were drilled from the "Hospitium" in Pompeii and were analyzed by synchrotron microtomography (uCT) and neutron radiography to study how the microstructure, including the presence of induced cracks, affects their water absorption. The water distribution and absorptivity were quantified by neutron radiography. The 3D crack propagation, pore size distribution and orientation, tortuosity, and connectivity were analyzed from uCT results using advanced imaging methods. The concrete characterization also included classical methods (e.g., differential thermal-thermogravimetric, X-ray diffractometry, and scanning electron microscopy). Ductile fracture patterns were observed once cracks were introduced. When compared to Portland cement mortar/concrete, Pompeii samples had relatively high porosity, low connectivity, and similar coefficient of capillary penetration. In addition, the permeability was predicted from models based on percolation theory and the pore structure data to evaluate the fluid transport properties.

46. Benchmarking Differentially Private Residual Networks for Medical Imagery
Sahib Singh, Harshvardhan Sikka
Abstract: Hospitals and other medical institutions often have vast amounts of medical data which can provide significant value when utilized to advance research. However, this data is often sensitive in nature, and as such is not readily available for use in a research setting, often due to privacy concerns. In this paper, we measure the performance of a deep neural network on differentially private image datasets pertaining to Pneumonia. We analyze the trade-off between the model's accuracy and the scale of perturbation among the images. Knowing how the model's accuracy varies among various perturbation levels in differentially private medical images is useful in these contexts. This work is contextually significant given the coronavirus pandemic, as Pneumonia has become an even greater concern as a potentially deadly complication of infection with COVID-19.

47. Prediction of Thrombectomy Functional Outcomes using Multimodal Data
Zeynel A. Samak, Philip Clatworthy, Majid Mirmehdi
Abstract: Recent randomised clinical trials have shown that patients with ischaemic stroke (due to occlusion of a large intracranial blood vessel) benefit from endovascular thrombectomy. However, predicting outcome of treatment in an individual patient remains a challenge. We propose a novel deep learning approach to directly exploit multimodal data (clinical metadata information, imaging data, and imaging biomarkers extracted from images) to estimate the success of endovascular treatment. We incorporate an attention mechanism in our architecture to model global feature inter-dependencies, both channel-wise and spatially. We perform comparative experiments using unimodal and multimodal data, to predict functional outcome (modified Rankin Scale score, mRS) and achieve 0.75 AUC for dichotomised mRS scores and 0.35 classification accuracy for individual mRS scores.

48. Instance Explainable Temporal Network For Multivariate Timeseries
Naveen Madiraju, Homa Karimabadi
Abstract: Although deep networks have been widely adopted, one of their shortcomings has been their black-box nature. One particularly difficult problem in machine learning is multivariate time series (MVTS) classification. MVTS data arise in many applications and are becoming ever more pervasive due to the explosive growth of sensors and IoT devices. Here, we propose a novel network (IETNet) that identifies the important channels in the classification decision for each instance of inference. This feature also enables identification and removal of non-predictive variables, which would otherwise lead to an overfit and/or inaccurate model. IETNet is an end-to-end network that combines temporal feature extraction, variable selection, and joint variable interaction into a single learning framework. IETNet utilizes 1D convolutions for temporal features and a novel channel gate layer for variable-class assignment, using an attention layer to perform cross-channel reasoning and the classification objective. To gain insight into the learned temporal features and channels, we extract a region-of-interest attention map along both time and channels. The viability of this network is demonstrated on multivariate time series data from N-body simulations and spacecraft sensor data.

49. Kernel Self-Attention in Deep Multiple Instance Learning
Dawid Rymarczyk, Jacek Tabor, Bartosz Zieliński
Abstract: Multiple Instance Learning (MIL) is a form of weakly supervised learning, which assumes that there is only one label provided for the entire bag of instances. As such, it appears in many problems of medical image analysis, such as whole-slide image classification of biopsies. Most recently, MIL was also applied to deep architectures by introducing the aggregation operator, which focuses on crucial instances of a bag. In this paper, we enrich this idea with the self-attention mechanism to take into account dependencies across the instances. We conduct several experiments and show that our method with various types of kernels increases the accuracy, especially in the case of non-standard MIL assumptions. This is of importance for real-world medical problems, which usually satisfy presence-based or threshold-based assumptions.

50. Learning to rank music tracks using triplet loss
Laure Prétet, Gaël Richard, Geoffroy Peeters
Abstract: Most music streaming services rely on automatic recommendation algorithms to exploit their large music catalogs. These algorithms aim at retrieving a ranked list of music tracks based on their similarity with a target music track. In this work, we propose a method for direct recommendation based on the audio content without explicitly tagging the music tracks. To that aim, we propose several strategies to perform triplet mining from ranked lists. We train a Convolutional Neural Network to learn the similarity via triplet loss. These different strategies are compared and validated on a large-scale experiment against an auto-tagging based approach. The results obtained highlight the efficiency of our system, especially when associated with an Auto-pooling layer.
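
To make the triplet-loss idea above concrete, here is a minimal sketch assuming Euclidean distance between embedding vectors and one simple way of mining triplets from a ranked list. The names `triplet_loss` and `mine_triplets`, the margin value, and the top-k split are hypothetical; the paper proposes several mining strategies that are not reproduced here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pushing the positive closer to the anchor than the negative."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def mine_triplets(target, ranked_tracks, top_k):
    """Assumed strategy: top-k ranked tracks are positives, the rest negatives."""
    positives = ranked_tracks[:top_k]
    negatives = ranked_tracks[top_k:]
    return [(target, p, n) for p in positives for n in negatives]


# The positive is already much closer than the negative, so the loss is zero.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
# Swapping the roles violates the margin and gives a positive loss.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))
print(mine_triplets("target", ["a", "b", "c"], top_k=1))
```

Some implementations use squared distances instead; the hinge structure stays the same.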

51. Gaze-based Autism Detection for Adolescents and Young Adults using Prosaic Videos
Karan Ahuja, Abhishek Bose, Mohit Jain, Kuntal Dey, Anil Joshi, Krishnaveni Achary, Blessin Varkey, Chris Harrison, Mayank Goel
Abstract: Autism often remains undiagnosed in adolescents and adults. Prior research has indicated that an autistic individual often shows atypical fixation and gaze patterns. In this short paper, we demonstrate that by monitoring a user's gaze as they watch commonplace (i.e., not specialized, structured or coded) video, we can identify individuals with autism spectrum disorder. We recruited 35 autistic and 25 non-autistic individuals, and captured their gaze using an off-the-shelf eye tracker connected to a laptop. Within 15 seconds, our approach was 92.5% accurate at identifying individuals with an autism diagnosis. We envision such automatic detection being applied during e.g., the consumption of web media, which could allow for passive screening and adaptation of user interfaces.

52. End-to-end Optimized Video Compression with MV-Residual Prediction
XiangJi Wu, Ziwen Zhang, Jie Feng, Lei Zhou, Junmin Wu
Abstract: We present an end-to-end trainable framework for P-frame compression in this paper. A joint motion vector (MV) and residual prediction network, MV-Residual, is designed to extract the ensembled features of motion representations and residual information by treating the two successive frames as inputs. The prior probability of the latent representations is modeled by a hyperprior autoencoder and trained jointly with the MV-Residual network. Specifically, the spatially-displaced convolution is applied for video frame prediction, in which a motion kernel for each pixel is learned to generate a predicted pixel by applying the kernel at a displaced location in the source image. Finally, novel rate allocation and post-processing strategies are used to produce the final compressed bits, considering the bits constraint of the challenge. The experimental results on the validation set show that the proposed optimized framework can generate the highest MS-SSIM for the P-frame compression competition.



[arXiv Papers] Computer Vision and Pattern Recognition 2020-05-28

Contents

Abstracts