Abstracts

1. HoliCity: A City-Scale Data Platform for Learning Holistic 3D Structures
Yichao Zhou, Jingwei Huang, Xili Dai, Linjie Luo, Zhili Chen, Yi Ma
Abstract: We present HoliCity, a city-scale 3D dataset with rich structural information. Currently, this dataset has 6,300 real-world panoramas of resolution $13312 \times 6656$ that are accurately aligned with the CAD model of downtown London with an area of more than 20 km$^2$, in which the median reprojection error of the alignment of an average image is less than half a degree. This dataset aims to be an all-in-one data platform for research of learning abstracted high-level holistic 3D structures that can be derived from city CAD models, e.g., corners, lines, wireframes, planes, and cuboids, with the ultimate goal of supporting real-world applications including city-scale reconstruction, localization, mapping, and augmented reality. The accurate alignment of the 3D CAD models and panoramas also benefits low-level 3D vision tasks such as surface normal estimation, as the surface normal extracted from previous LiDAR-based datasets is often noisy. We conduct experiments to demonstrate the applications of HoliCity, such as predicting surface segmentation, normal maps, depth maps, and vanishing points, as well as test the generalizability of methods trained on HoliCity and other related datasets. HoliCity is available at this https URL.

2. Physics-Based Dexterous Manipulations with Estimated Hand Poses and Residual Reinforcement Learning
Guillermo Garcia-Hernando, Edward Johns, Tae-Kyun Kim
Abstract: Dexterous manipulation of objects in virtual environments with our bare hands, by using only a depth sensor and a state-of-the-art 3D hand pose estimator (HPE), is challenging. While virtual environments are ruled by physics, e.g. object weights and surface frictions, the absence of force feedback makes the task challenging, as even slight inaccuracies on finger tips or contact points from HPE may make the interactions fail. Prior arts simply generate contact forces in the direction of the fingers' closures, when finger joints penetrate virtual objects. Although useful for simple grasping scenarios, they cannot be applied to dexterous manipulations such as in-hand manipulation. Existing reinforcement learning (RL) and imitation learning (IL) approaches train agents that learn skills by using task-specific rewards, without considering any online user input. In this work, we propose to learn a model that maps noisy input hand poses to target virtual poses, which introduces the needed contacts to accomplish the tasks on a physics simulator. The agent is trained in a residual setting by using a model-free hybrid RL+IL approach. A 3D hand pose estimation reward is introduced leading to an improvement on HPE accuracy when the physics-guided corrected target poses are remapped to the input space. As the model corrects HPE errors by applying minor but crucial joint displacements for contacts, this helps to keep the generated motion visually close to the user input. Since HPE sequences performing successful virtual interactions do not exist, a data generation scheme to train and evaluate the system is proposed. We test our framework in two applications that use hand pose estimates for dexterous manipulations: hand-object interactions in VR and hand-object motion reconstruction in-the-wild.

3. Multi-Level Temporal Pyramid Network for Action Detection
Xiang Wang, Changxin Gao, Shiwei Zhang, Nong Sang
Abstract: One-stage frameworks have been widely applied to temporal action detection, but they still struggle because action instances span a wide range of durations. The reason is that these one-stage detectors, e.g., the Single Shot Multi-Box Detector (SSD), extract temporal features by applying only a single-level layer for each head, which is not discriminative enough to perform classification and regression. In this paper, we propose a Multi-Level Temporal Pyramid Network (MLTPN) to improve the discrimination of the features. Specifically, we first fuse the features from multiple layers with different temporal resolutions to encode multi-layer temporal information. We then apply a multi-level feature pyramid architecture to the features to enhance their discriminative abilities. Finally, we design a simple yet effective feature fusion module to fuse the multi-level multi-scale features. In this way, the proposed MLTPN can learn rich and discriminative features for action instances of different durations. We evaluate MLTPN on two challenging datasets, THUMOS'14 and Activitynet v1.3. The experimental results show that MLTPN obtains competitive performance on Activitynet v1.3 and significantly outperforms the state-of-the-art approaches on THUMOS'14.

4. Convolutional neural network based deep-learning architecture for intraprostatic tumour contouring on PSMA PET images in patients with primary prostate cancer
Dejan Kostyszyn, Tobias Fechter, Nico Bartl, Anca L. Grosu, Christian Gratzke, August Sigle, Michael Mix, Juri Ruf, Thomas F. Fassbender, Selina Kiefer, Alisa S. Bettermann, Nils H. Nicolay, Simon Spohn, Maria U. Kramer, Peter Bronsert, Hongqian Guo, Xuefeng Qiu, Feng Wang, Christoph Henkenberens, Rudolf A. Werner, Dimos Baltas, Philipp T. Meyer, Thorsten Derlin, Mengxia Chen, Constantinos Zamboglou
Abstract: Accurate delineation of the intraprostatic gross tumour volume (GTV) is a prerequisite for treatment approaches in patients with primary prostate cancer (PCa). Prostate-specific membrane antigen positron emission tomography (PSMA-PET) may outperform MRI in GTV detection. However, visual GTV delineation underlies interobserver heterogeneity and is time consuming. The aim of this study was to develop a convolutional neural network (CNN) for automated segmentation of intraprostatic tumour (GTV-CNN) in PSMA-PET. Methods: The CNN (3D U-Net) was trained on [68Ga]PSMA-PET images of 152 patients from two different institutions and the training labels were generated manually using a validated technique. The CNN was tested on two independent internal (cohort 1: [68Ga]PSMA-PET, n=18 and cohort 2: [18F]PSMA-PET, n=19) and one external (cohort 3: [68Ga]PSMA-PET, n=20) test-datasets. Accordance between manual contours and GTV-CNN was assessed with Dice-Sørensen coefficient (DSC). Sensitivity and specificity were calculated for the two internal test-datasets by using whole-mount histology. Results: Median DSCs for cohorts 1-3 were 0.84 (range: 0.32-0.95), 0.81 (range: 0.28-0.93) and 0.83 (range: 0.32-0.93), respectively. Sensitivities and specificities for GTV-CNN were comparable with manual expert contours: 0.98 and 0.76 (cohort 1) and 1 and 0.57 (cohort 2), respectively. Computation time was around 6 seconds for a standard dataset. Conclusion: The application of a CNN for automated contouring of intraprostatic GTV in [68Ga]PSMA- and [18F]PSMA-PET images resulted in a high concordance with expert contours and in high sensitivities and specificities in comparison with histology reference. This robust, accurate and fast technique may be implemented for treatment concepts in primary PCa. The trained model and the study's source code are available in an open source repository.
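The Dice-Sørensen coefficient used above to compare manual contours with GTV-CNN is defined as twice the overlap of two segmentations divided by the sum of their sizes. A minimal sketch follows; representing each contour as a set of voxel coordinates, the function name, and the convention of returning 1.0 for two empty contours are assumptions made here for illustration, not details from the study.

```python
def dice_coefficient(voxels_a, voxels_b):
    """Dice-Sørensen coefficient between two sets of voxel coordinates."""
    a, b = set(voxels_a), set(voxels_b)
    if not a and not b:
        # Assumed convention: two empty segmentations agree perfectly.
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


# Toy contours (hypothetical data): three voxels each, two shared.
manual = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
predicted = {(0, 0, 0), (0, 1, 1), (1, 1, 1)}
score = dice_coefficient(manual, predicted)  # 2 * 2 / (3 + 3)
```

A value of 1 means the two contours coincide exactly and 0 means they share no voxel.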

5. A Study on Visual Perception of Light Field Content
Ailbhe Gill, Emin Zerman, Cagri Ozcinar, Aljosa Smolic
Abstract: The effective design of visual computing systems depends heavily on the anticipation of visual attention, or saliency. While visual attention is well investigated for conventional 2D images and video, it is nevertheless a very active research area for emerging immersive media. In particular, visual attention of light fields (light rays of a scene captured by a grid of cameras or micro lenses) has only recently become a focus of research. As they may be rendered and consumed in various ways, a primary challenge that arises is the definition of what visual perception of light field content should be. In this work, we present a visual attention study on light field content. We conducted perception experiments displaying them to users in various ways and collected corresponding visual attention data. Our analysis highlights characteristics of user behaviour in light field imaging applications. The light field data set and attention data are provided with this paper.

6. Decomposition of Longitudinal Deformations via Beltrami Descriptors
Ho Law, Lok Ming Lui, Chun Yin Siu
Abstract: We present a mathematical model to decompose a longitudinal deformation into normal and abnormal components. The goal is to detect and extract subtle quivers from periodic motions in a video sequence. It has important applications in medical image analysis. To achieve this goal, we consider a representation of the longitudinal deformation, called the Beltrami descriptor, based on quasiconformal theories. The Beltrami descriptor is a complex-valued matrix. Each longitudinal deformation is associated to a Beltrami descriptor and vice versa. To decompose the longitudinal deformation, we propose to carry out the low rank and sparse decomposition of the Beltrami descriptor. The low rank component corresponds to the periodic motions, whereas the sparse part corresponds to the abnormal motions of a longitudinal deformation. Experiments have been carried out on both synthetic and real video sequences. Results demonstrate the efficacy of our proposed model to decompose a longitudinal deformation into regular and irregular components.
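A low-rank plus sparse decomposition of a matrix is commonly posed as the convex program below (principal component pursuit). The abstract does not state the exact objective the authors use, so this is only an illustrative formulation; the symbols $B$, $L$, $S$ and $\lambda$ are assumptions introduced here.

$$\min_{L,\,S}\ \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad B = L + S$$

Here $B$ stands for the complex-valued Beltrami descriptor matrix, $L$ for the low-rank part associated with the periodic motions, $S$ for the sparse part associated with the abnormal motions, $\|\cdot\|_*$ for the nuclear norm, $\|\cdot\|_1$ for the entrywise $\ell_1$ norm, and $\lambda > 0$ for a weight that balances the two terms.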

7. Revisiting Mid-Level Patterns for Distant-Domain Few-Shot Recognition
Yixiong Zou, Shanghang Zhang, José M. F. Moura, JianPeng Yu, Yonghong Tian
Abstract: Cross-domain few-shot learning (FSL) is proposed recently to transfer knowledge from general-domain known classes (e.g., ImageNet) to novel classes in other domains, and recognize novel classes with only few training samples. In this paper, we go further to define a more challenging scenario that transfers knowledge from general-domain known classes to novel classes in distant domains which are significantly different from the general domain, e.g., medical data. To solve this challenging problem, we propose to exploit mid-level features, which are more transferable, yet under-explored in recent main-stream FSL works. To boost the discriminability of mid-level features, we propose a residual-prediction task for the training on known classes. In this task, we view the current training sample as a sample from a pseudo-novel class, so as to provide simulated novel-class data. However, this simulated data is from the same domain as known classes, and shares high-level patterns with other known classes. Therefore, we then use high-level patterns from other known classes to represent it and remove this high-level representation from the simulated data, outputting a residual term containing discriminative information of it that could not be represented by high-level patterns from other known classes. Then, mid-level features from multiple mid-layers are dynamically weighted to predict this residual term, which encourages the mid-level features to be discriminative. Notably, our method can be applied to both the regular in-domain FSL setting by emphasizing high-level & transformed mid-level features and the distant-domain FSL setting by emphasizing mid-level features. Experiments under both settings on six public datasets (including two challenging medical datasets) validate the rationale of the proposed method, demonstrating state-of-the-art performance on both settings.

8. Associative Partial Domain Adaptation
Youngeun Kim, Sungeun Hong, Seunghan Yang, Sungil Kang, Yunho Jeon, Jiwon Kim
Abstract: Partial Adaptation (PDA) addresses a practical scenario in which the target domain contains only a subset of classes in the source domain. While PDA should take into account both class-level and sample-level to mitigate negative transfer, current approaches mostly rely on only one of them. In this paper, we propose a novel approach to fully exploit multi-level associations that can arise in PDA. Our Associative Partial Domain Adaptation (APDA) utilizes intra-domain association to actively select out non-trivial anomaly samples in each source-private class that sample-level weighting cannot handle. Additionally, our method considers inter-domain association to encourage positive transfer by mapping between nearby target samples and source samples with high label-commonness. For this, we exploit feature propagation in a proposed label space consisting of source ground-truth labels and target probabilistic labels. We further propose a geometric guidance loss based on the label commonness of each source class to encourage positive transfer. Our APDA consistently achieves state-of-the-art performance across public datasets.

9. Cascade Graph Neural Networks for RGB-D Salient Object Detection
Ao Luo, Xin Li, Fan Yang, Zhicheng Jiao, Hong Cheng, Siwei Lyu
Abstract: In this paper, we study the problem of salient object detection (SOD) for RGB-D images using both color and depth information. A major technical challenge in performing salient object detection from RGB-D images is how to fully leverage the two complementary data sources. Current works either simply distill prior knowledge from the corresponding depth map for handling the RGB image or blindly fuse color and geometric information to generate coarse depth-aware representations, hindering the performance of RGB-D saliency this http URL this work, we introduce Cascade Graph Neural Networks (Cas-Gnn), a unified framework capable of comprehensively distilling and reasoning about the mutual benefits between these two data sources through a set of cascade graphs, to learn powerful representations for RGB-D salient object detection. Cas-Gnn processes the two data sources individually and employs a novel Cascade Graph Reasoning (CGR) module to learn powerful dense feature embeddings, from which the saliency map can be easily inferred. In contrast to previous approaches, the explicit modeling and reasoning of high-level relations between complementary data sources allow us to better overcome challenges such as occlusions and ambiguities. Extensive experiments demonstrate that Cas-Gnn achieves significantly better performance than all existing RGB-D SOD approaches on several widely used benchmarks.

10. SimPatch: A Nearest Neighbor Similarity Match between Image Patches
Aritra Banerjee
Abstract: Measuring the similarity between patches in images is a fundamental building block in various tasks. Naturally, the patch size has a major impact on the matching quality and on the consequent application performance. We use large patches instead of relatively small ones so that each patch contains more information. We use different feature extraction mechanisms to extract the features of each individual image patch, which together form a feature matrix, and find the nearest neighbor patches in the image. For a query patch in a given image, the nearest patches are computed using two different nearest neighbor algorithms, and the results are demonstrated in this paper.
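The pipeline described above (extract a feature vector per patch, stack them into a feature matrix, then search for the nearest neighbors of a query patch) can be sketched as follows. This is a minimal illustration only: using raw pixel values as features, squared Euclidean distance, and brute-force search are assumptions made here, and all function names are hypothetical; the abstract does not say which feature extractors or nearest neighbor algorithms the paper uses.

```python
def extract_patches(image, size):
    """Map each top-left position (row, col) to a flattened size x size patch."""
    rows, cols = len(image), len(image[0])
    patches = {}
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            patches[(r, c)] = [
                image[r + i][c + j] for i in range(size) for j in range(size)
            ]
    return patches


def nearest_patch(patches, query_pos):
    """Brute-force search for the patch closest to the query (excluding itself)."""
    query = patches[query_pos]
    best_pos, best_dist = None, float("inf")
    for pos, feature in patches.items():
        if pos == query_pos:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(query, feature))
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos, best_dist


# Toy 4 x 4 grayscale image: the top-left 2 x 2 block repeats at the bottom right.
image = [
    [1, 2, 9, 9],
    [3, 4, 9, 9],
    [0, 0, 1, 2],
    [0, 0, 3, 4],
]
patches = extract_patches(image, size=2)
match_pos, match_dist = nearest_patch(patches, query_pos=(0, 0))
```

For larger images, the brute-force loop would normally be replaced by an indexed search structure, which is presumably where the choice of nearest neighbor algorithm matters.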

11. Deep Ordinal Regression Forests
Haiping Zhu, Yuheng Zhang, Hongming Shan, Lingfu Che, Xiaoyang Xu, Junping Zhang, Jianbo Shi, Fei-Yue Wang
Abstract: Ordinal regression is a type of regression techniques used for predicting an ordinal variable. Recent methods formulate an ordinal regression problem as a series of binary classification problems. Such methods cannot ensure the global ordinal relationship is preserved since the relationships among different binary classifiers are neglected. We propose a novel ordinal regression approach called Deep Ordinal Regression Forests (DORFs), which is constructed with the differentiable decision trees for obtaining precise and stable global ordinal relationships. The advantages of the proposed DORFs are twofold. First, instead of learning a series of binary classifiers independently, the proposed method learns an ordinal distribution for ordinal regression. Second, the differentiable decision trees can be trained together with the ordinal distribution in an end-to-end manner. The effectiveness of the proposed DORFs is verified on two ordinal regression tasks, i.e., facial age estimation and image aesthetic assessment, showing significant improvements and better stability over the state-of-the-art ordinal regression methods.
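The binary decomposition that the abstract criticizes can be made concrete with a small sketch: a rank among K ordered levels is encoded as K - 1 answers to the question "is the rank greater than k?". The encoding below is one common choice and is an assumption here, as are the function names; it illustrates the baseline formulation, not the DORF method itself.

```python
def encode_ordinal(rank, num_ranks):
    """Encode a rank in [0, num_ranks) as num_ranks - 1 binary labels (rank > k)."""
    return [1 if rank > k else 0 for k in range(num_ranks - 1)]


def decode_ordinal(binary_outputs):
    """Recover a rank by counting the positive binary decisions."""
    return sum(binary_outputs)


def is_consistent(binary_outputs):
    """A valid encoding is a run of ones followed by a run of zeros."""
    return all(a >= b for a, b in zip(binary_outputs, binary_outputs[1:]))


labels = encode_ordinal(rank=2, num_ranks=5)   # [1, 1, 0, 0]
recovered = decode_ordinal(labels)             # 2

# Independently trained classifiers can disagree with each other:
independent_outputs = [1, 0, 1, 0]
consistent = is_consistent(independent_outputs)  # False
```

The last case shows why neglecting the relationships among the binary classifiers can break the global ordinal relationship: nothing forces the outputs to form a valid encoding.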

12. Oversampling Adversarial Network for Class-Imbalanced Fault Diagnosis
Masoumeh Zareapoor, Pourya Shamsolmoali, Jie Yang
Abstract: Data collected from industrial machines are often imbalanced, which has a negative effect on learning algorithms. The problem becomes more challenging for mixed-type data or when classes overlap. The class-imbalance problem requires a robust learning system that can predict and classify the data in a timely manner. We propose a new adversarial network for simultaneous classification and fault detection. In particular, we restore the balance in the imbalanced dataset by generating faulty samples from the proposed mixture of data distributions. We design the discriminator of our model to handle the generated faulty samples so as to prevent outliers and overfitting. We empirically demonstrate that (i) the discriminator, trained with a generator that generates samples from a mixture of normal and faulty data distributions, can be considered a fault detector; and (ii) the quality of the generated faulty samples outperforms other synthetic resampling techniques. Experimental results show that the proposed model performs well compared to other fault diagnosis methods across several evaluation metrics; in particular, combining a generative adversarial network (GAN) with a feature matching function is effective at recognizing faulty samples.

13. A Surgery of the Neural Architecture Evaluators
Xuefei Ning, Wenshuo Li, Zixuan Zhou, Tianchen Zhao, Yin Zheng, Shuang Liang, Huazhong Yang, Yu Wang
Abstract: Neural architecture search (NAS) recently received extensive attention due to its effectiveness in automatically designing effective neural architectures. A major challenge in NAS is to conduct a fast and accurate evaluation of neural architectures. Commonly used fast architecture evaluators include one-shot evaluators (including weight sharing and hypernet-based ones) and predictor-based evaluators. Despite their high evaluation efficiency, the evaluation correlation of these evaluators is still questionable. In this paper, we conduct an extensive assessment of both the one-shot and predictor-based evaluator on the NAS-Bench-201 benchmark search space, and break up how and why different factors influence the evaluation correlation and other NAS-oriented criteria.

14. Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems
Kailai Zhou, Linsen Chen, Xun Cao
Abstract: Multispectral pedestrian detection can adapt to insufficient illumination conditions by leveraging color-thermal modalities. However, in-depth insight into how to fuse the two modalities effectively is still lacking. Compared with traditional pedestrian detection, we find that multispectral pedestrian detection suffers from modality imbalance problems, which hinder the optimization process of the dual-modality network and degrade the performance of the detector. Inspired by this observation, we propose the Modality Balance Network (MBNet), which facilitates the optimization process in a much more flexible and balanced manner. Firstly, we design a novel Differential Modality Aware Fusion (DMAF) module to make the two modalities complement each other. Secondly, an illumination-aware feature alignment module selects complementary features according to the illumination conditions and aligns the two modality features adaptively. Extensive experimental results demonstrate that MBNet outperforms the state of the art on both the challenging KAIST and CVC-14 multispectral pedestrian datasets in terms of accuracy and computational efficiency. Code is available at this https URL.

15. Deep Robust Clustering by Contrastive Learning
Huasong Zhong, Chong Chen, Zhongming Jin, Xian-Sheng Hua
Abstract: Recently, many unsupervised deep learning methods have been proposed to learn clustering with unlabelled data. By introducing data augmentation, most of the latest methods look into deep clustering from the perspective that the original image and its transformation should share a similar semantic clustering assignment. However, the representation features before the softmax activation function can be quite different even when the assignment probabilities are very similar, since softmax is only sensitive to the maximum value. This may result in high intra-class diversities in the representation feature space, which lead to unstable local optima and thus harm the clustering performance. By investigating the internal relationship between mutual information and contrastive learning, we summarize a general framework that can turn any mutual information maximization into contrastive loss minimization. We apply it to both the semantic clustering assignment and the representation feature and propose a novel method named Deep Robust Clustering by Contrastive Learning (DRC). Unlike existing methods, DRC aims to increase inter-class diversities and decrease intra-class diversities simultaneously, achieving more robust clustering results. Extensive experiments on six widely adopted deep clustering benchmarks demonstrate the superiority of DRC in both stability and accuracy, e.g., attaining 71.6% mean accuracy on CIFAR-10, which is 7.1% higher than state-of-the-art results.
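The observation that pre-softmax features can differ a lot while the assignment probabilities stay almost the same can be checked numerically. The sketch below uses two hand-picked feature vectors (hypothetical values, chosen only for illustration) and compares their softmax outputs with their cosine similarity.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Two pre-softmax feature vectors that agree only on the dominant entry.
features_a = [8.0, 0.0, 0.0, 0.0]
features_b = [8.0, -8.0, -8.0, -8.0]

probs_a = softmax(features_a)
probs_b = softmax(features_b)

# Largest gap between the two assignment distributions (about 0.001).
max_prob_gap = max(abs(p - q) for p, q in zip(probs_a, probs_b))
# Similarity of the underlying features (exactly 0.5 for these vectors).
feature_similarity = cosine_similarity(features_a, features_b)
```

Both vectors are assigned to the first cluster with probability close to 1, yet the features themselves point in clearly different directions, which is the intra-class diversity the abstract refers to.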

16. A Multi-Task Learning Approach for Human Action Detection and Ergonomics Risk Assessment
Behnoosh Parsa, Ashis G. Banerjee
Abstract: We propose a new approach to Human Action Evaluation (HAE) in long videos using graph-based multi-task modeling. Previous works in activity assessment either directly compute a metric using a detected skeleton or use the scene information to regress the activity score. These approaches are insufficient for accurate activity assessment since they only compute an average score over a clip, and do not consider the correlation between the joints and body dynamics. Moreover, they are highly scene-dependent which makes the generalizability of these methods questionable. We propose a novel multi-task framework for HAE that utilizes a Graph Convolutional Network backbone to embed the interconnection between human joints in the features. In this framework, we solve the Human Action Detection (HAD) problem as an auxiliary task to improve activity assessment. The HAD head is powered by an Encoder-Decoder Temporal Convolutional Network to detect activities in long videos and HAE uses a Long-Short-Term-Memory-based architecture. We evaluate our method on the UW-IOM and TUM Kitchen datasets and discuss the success and failure cases on these two datasets.

17. Single-stage intake gesture detection using CTC loss and extended prefix beam search
Philipp V. Rouast, Marc T. P. Adam
Abstract: Accurate detection of individual intake gestures is a key step towards automatic dietary monitoring. Both inertial sensor data of wrist movements and video data depicting the upper body have been used for this purpose. The most advanced approaches to date use a two-stage approach, in which (i) frame-level intake probabilities are learned from the sensor data using a deep neural network, and then (ii) sparse intake events are detected by finding the maxima of the frame-level probabilities. In this study, we propose a single-stage approach which directly decodes the probabilities learned from sensor data into sparse intake detections. This is achieved by weakly supervised training using Connectionist Temporal Classification (CTC) loss, and decoding using a novel extended prefix beam search decoding algorithm. Benefits of this approach include (i) end-to-end training for detections, (ii) consistency with the fuzzy nature of intake gestures, and (iii) avoidance of hard-coded rules. Across two separate datasets, we quantify these benefits by showing relative $F_1$ score improvements between 2.0% and 6.2% over the two-stage approach for intake detection and eating vs. drinking recognition tasks, for both video and inertial sensors.
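To show how frame-level probabilities trained with CTC loss turn into sparse detections, the sketch below implements the standard greedy (best-path) CTC decoding: take the most likely label per frame, merge consecutive repeats, and drop the blank label. This is deliberately the simplest decoder and not the extended prefix beam search proposed in the paper; the label ids (0 = blank, 1 = eat, 2 = drink) and the probabilities are hypothetical.

```python
BLANK = 0  # assumed id of the CTC blank label


def ctc_greedy_decode(frame_probs, blank=BLANK):
    """Collapse per-frame argmax labels into a sparse sequence of events."""
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    previous = None
    for label in best_path:
        # Emit a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Seven frames, three classes: blank, eat, drink (hypothetical probabilities).
frame_probs = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.80, 0.10],
    [0.80, 0.10, 0.10],
    [0.30, 0.60, 0.10],
    [0.10, 0.20, 0.70],
    [0.90, 0.05, 0.05],
]
events = ctc_greedy_decode(frame_probs)  # two eat gestures, then one drink
```

The blank frame between the two runs of label 1 is what keeps them as two separate intake events instead of one merged event.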

18. Leveraging Localization for Multi-camera Association
Zhongang Cai, Cunjun Yu, Junzhe Zhang, Jiawei Ren, Haiyu Zhao
Abstract: We present McAssoc, a deep learning approach to the association of detection bounding boxes in different views of a multi-camera system. The vast majority of academia has been developing single-camera computer vision algorithms; however, little research attention has been directed to incorporating them into a multi-camera system. In this paper, we design a 3-branch architecture that leverages direct association and additional cross-localization information. A new metric, image-pair association accuracy (IPAA), is designed specifically for performance evaluation of cross-camera detection association. We show in the experiments that localization information is critical to successful cross-camera association, especially when similar-looking objects are present. This paper is an experimental work prior to MessyTable, which is a large-scale benchmark for instance association in multiple cameras.

19. Global Context Aware Convolutions for 3D Point Cloud Understanding
Zhiyuan Zhang, Binh-Son Hua, Wei Chen, Yibin Tian, Sai-Kit Yeung
Abstract: Recent advances in deep learning for 3D point clouds have shown great promise in scene understanding tasks thanks to the introduction of convolution operators that consume 3D point clouds directly in a neural network. Point cloud data, however, can have arbitrary rotations, especially data acquired from 3D scanning. Recent works show that it is possible to design point cloud convolutions with the rotation invariance property, but such methods generally do not perform as well as convolutions that are only translation-invariant. We found that a key reason is that, compared to point coordinates, the rotation-invariant features consumed by point cloud convolution are not as distinctive. To address this problem, we propose a novel convolution operator that enhances feature distinction by integrating global context information from the input point cloud into the convolution. To this end, a globally weighted local reference frame is constructed in each point neighborhood, in which the local point set is decomposed into bins. Anchor points are generated in each bin to represent global shape features. A convolution can then be performed to transform the points and anchor features into final rotation-invariant features. We conduct several experiments on point cloud classification, part segmentation, shape retrieval, and normal estimation to evaluate our convolution, which achieves state-of-the-art accuracy under challenging rotations.

20. Textual Description for Mathematical Equations
Ajoy Mondal, C. V. Jawahar
Abstract: Reading mathematical expressions or equations in document images is very challenging due to the large variability of mathematical symbols and expressions. In this paper, we pose the reading of a mathematical equation as the task of generating a textual description that interprets the internal meaning of the equation. Inspired by the natural image captioning problem in computer vision, we present a mathematical equation description (MED) model, a novel end-to-end trainable deep neural network based approach that learns to generate a textual description for reading mathematical equation images. Our MED model consists of a convolutional neural network as an encoder that extracts features of input mathematical equation images and a recurrent neural network with an attention mechanism that generates a description related to the input mathematical equation images. Due to the unavailability of mathematical equation image data sets with textual descriptions, we generate two data sets for experimental purposes. To validate the effectiveness of our MED model, we conduct a real-world experiment to see whether students are able to write equations by only reading or listening to their textual descriptions. The experiments show that the students are able to write most of the equations correctly by reading their textual descriptions only.

21. Exploring Rich and Efficient Spatial Temporal Interactions for Real Time Video Salient Object Detection
Chenglizhao Chen, Guotao Wang, Chong Peng, Dingwen Zhang, Yuming Fang, Hong Qin
Abstract: Current mainstream methods formulate video saliency mainly from two independent venues, i.e., the spatial and temporal branches. As a complementary component, the main task of the temporal branch is to intermittently focus the spatial branch on regions with salient movements. Thus, even though the overall video saliency quality depends heavily on the spatial branch, the performance of the temporal branch still matters. The key to improving overall video saliency is therefore how to further boost the performance of these branches efficiently. In this paper, we propose a novel spatiotemporal network to achieve such improvement in a fully interactive fashion. We integrate a lightweight temporal model into the spatial branch to coarsely locate the spatially salient regions that are correlated with trustworthy salient movements. Meanwhile, the spatial branch itself is able to recurrently refine the temporal model in a multi-scale manner. In this way, the spatial and temporal branches interact with each other, achieving mutual performance improvement. Our method is easy to implement yet effective, achieving high-quality video saliency detection in real time at 50 FPS.

22. A Novel Video Salient Object Detection Method via Semi-supervised Motion Quality Perception
Chenglizhao Chen, Jia Song, Chong Peng, Guodong Wang, Yuming Fang
Abstract: Previous video salient object detection (VSOD) approaches have mainly focused on designing fancy networks to achieve their performance improvements. However, with the recent slowdown in the development of deep learning techniques, it may become more and more difficult to anticipate another breakthrough via fancy networks alone. To this end, this paper proposes a universal learning scheme to obtain a further 3% performance improvement for all state-of-the-art (SOTA) methods. The major highlight of our method is that we resort to "motion quality", a brand new concept, to select a sub-group of video frames from the original testing set to construct a new training set. The selected frames in this new training set should all contain high-quality motions, in which the salient objects have a large probability of being successfully detected by the "target SOTA method", the one we want to improve. Consequently, we can achieve a significant performance improvement by using this new training set to start a new round of network training. During this new round of training, the VSOD results of the target SOTA method are used as the pseudo training objectives. Our novel learning scheme is simple yet effective, and its semi-supervised methodology may have large potential to inspire the VSOD community in the future.

23. Few Shot Learning Framework to Reduce Inter-observer Variability in Medical Images
Sohini Roychowdhury
Abstract: Most computer-aided pathology detection systems rely on large volumes of quality annotated data to aid diagnostics and follow-up procedures. However, quality assurance of large volumes of annotated medical image data can be subjective and expensive. In this work we present a novel standardization framework that implements three few-shot learning (FSL) models that can be iteratively trained with at most 5 images per 3D stack to generate multiple regional proposals (RPs) per test image. These FSL models include a novel parallel echo state network (ParESN) framework and an augmented U-net model. Additionally, we propose a novel target label selection algorithm (TLSA) that measures relative agreeability between RPs and the manually annotated target labels to detect the "best" quality annotation per image. Using the FSL models, our system achieves Dice coefficients of 0.28-0.64 across vendor image stacks for intra-retinal cyst segmentation. Additionally, the TLSA is capable of automatically classifying high-quality target labels from their noisy counterparts for 60-97% of the images while ensuring manual supervision on the remaining images. Also, the proposed framework with the ParESN model reduces manual annotation checking to 12-28% of the total number of images. The TLSA metrics further provide confidence scores for the automated annotation quality assurance. Thus, the proposed framework can also be extended to quality image annotation curation of other image stacks.
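
The Dice coefficient reported in this abstract is a standard overlap measure between a predicted mask and a reference mask. A minimal sketch follows, assuming flattened binary masks; the function name and the toy masks are made up for illustration and are unrelated to the paper's data.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |P intersect T| / (|P| + |T|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 0, 1, 1, 1, 0]
print(dice_coefficient(pred_mask, true_mask))  # 2 * 2 / (3 + 3), about 0.667
```

A value of 1 means the two masks coincide and 0 means they do not overlap at all, which gives a scale for reading the 0.28-0.64 range quoted above.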

24. A Deeper Look at Salient Object Detection: Bi-stream Network with a Small Training Dataset
Zhenyu Wu, Shuai Li, Chenglizhao Chen, Aimin Hao, Hong Qin
Abstract: Compared with conventional hand-crafted approaches, deep learning based methods have achieved tremendous performance improvements by training exquisitely crafted fancy networks over large-scale training sets. However, do we really need a large-scale training set for salient object detection (SOD)? In this paper, we provide deeper insight into the interrelationship between SOD performance and the training sets. To alleviate the conventional demand for large-scale training data, we provide a feasible way to construct a novel small-scale training set, which contains only 4K images. Moreover, we propose a novel bi-stream network to take full advantage of our proposed small training set; it consists of two feature backbones with different structures and achieves complementary semantic saliency fusion via the proposed gate control unit. To the best of our knowledge, this is the first attempt to use a small-scale training set to outperform state-of-the-art models trained on large-scale training sets; even so, our method still achieves leading state-of-the-art performance on five benchmark datasets.

25. Polysemy Deciphering Network for Robust Human-Object Interaction Detection
Xubin Zhong, Changxing Ding, Xian Qu, Dacheng Tao
Abstract: Human-Object Interaction (HOI) detection is important to human-centric scene understanding tasks. Existing works tend to assume that the same verb has similar visual characteristics in different HOI categories, an approach that ignores the diverse semantic meanings of the verb. To address this issue, in this paper, we propose a novel Polysemy Deciphering Network (PD-Net) that decodes the visual polysemy of verbs for HOI detection in three distinct ways. First, we refine features for HOI detection to be polysemy-aware through the use of two novel modules: namely, Language Prior-guided Channel Attention (LPCA) and Language Prior-based Feature Augmentation (LPFA). LPCA highlights important elements in human and object appearance features for each HOI category to be identified; moreover, LPFA augments human pose and spatial features for HOI detection using language priors, enabling the verb classifiers to receive language hints that reduce intra-class variation for the same verb. Second, we introduce a novel Polysemy-Aware Modal Fusion module (PAMF), which guides PD-Net to make decisions based on feature types deemed more important according to the language priors. Third, we propose to relieve the verb polysemy problem through sharing verb classifiers for semantically similar HOI categories. Furthermore, to expedite research on the verb polysemy problem, we build a new benchmark dataset named HOI-VerbPolysemy (HOI-VP), which includes common verbs (predicates) that have diverse semantic meanings in the real world. Finally, through deciphering the visual polysemy of verbs, our approach is demonstrated to outperform state-of-the-art methods by significant margins on the HICO-DET, V-COCO, and HOI-VP databases. Code and data in this paper will be released at this https URL.

26. An Indexing Scheme and Descriptor for 3D Object Retrieval Based on Local Shape Querying
Bart Iver van Blokland, Theoharis Theoharis
Abstract: A binary descriptor indexing scheme based on Hamming distance called the Hamming tree for local shape queries is presented. A new binary clutter resistant descriptor named Quick Intersection Count Change Image (QUICCI) is also introduced. This local shape descriptor is extremely small and fast to compare. Additionally, a novel distance function called Weighted Hamming applicable to QUICCI images is proposed for retrieval applications. The effectiveness of the indexing scheme and QUICCI is demonstrated on 828 million QUICCI images derived from the SHREC2017 dataset, while the clutter resistance of QUICCI is shown using the clutterbox experiment.
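
To make the role of Hamming distance in binary descriptor retrieval concrete, here is a small sketch. It uses a brute-force linear scan, not the Hamming tree index proposed in the paper, and it does not implement the Weighted Hamming function, whose definition the abstract does not give. The 8-bit descriptors and function names are hypothetical.

```python
from typing import List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two descriptors stored as integers."""
    return bin(a ^ b).count("1")


def nearest_descriptors(
    query: int, database: List[int], k: int = 2
) -> List[Tuple[int, int]]:
    """Return the `k` closest (distance, index) pairs by brute-force scan."""
    scored = [(hamming_distance(query, d), i) for i, d in enumerate(database)]
    return sorted(scored)[:k]


database = [0b10110010, 0b10110011, 0b01001101, 0b11110000]
query = 0b10110000
print(nearest_descriptors(query, database))  # [(1, 0), (1, 3)]
```

A linear scan like this costs one comparison per stored descriptor, which is why an index becomes attractive at the scale of hundreds of millions of descriptors mentioned in the abstract.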

27. Predicting Visual Importance Across Graphic Design Types
Camilo Fosco, Vincent Casser, Amish Kumar Bedi, Peter O'Donovan, Aaron Hertzmann, Zoya Bylinskii
Abstract: This paper introduces a Unified Model of Saliency and Importance (UMSI), which learns to predict visual importance in input graphic designs, and saliency in natural images, along with a new dataset and applications. Previous methods for predicting saliency or visual importance are trained individually on specialized datasets, making them limited in application and leading to poor generalization on novel image classes, while requiring a user to know which model to apply to which input. UMSI is a deep learning-based model simultaneously trained on images from different design classes, including posters, infographics, mobile UIs, as well as natural images, and includes an automatic classification module to classify the input. This allows the model to work more effectively without requiring a user to label the input. We also introduce Imp1k, a new dataset of designs annotated with importance information. We demonstrate two new design interfaces that use importance prediction, including a tool for adjusting the relative importance of design elements, and a tool for reflowing designs to new aspect ratios while preserving visual importance. The model, code, and importance dataset are available at this https URL.

28. Diagnosis of Autism in Children using Facial Analysis and Deep Learning
Madison Beary, Alex Hadsell, Ryan Messersmith, Mohammad-Parsa Hosseini
Abstract: In this paper, we introduce a deep learning model that classifies children as either healthy or potentially autistic with 94.6% accuracy. Autistic patients struggle with social skills, repetitive behaviors, and communication, both verbal and nonverbal. Although the disease is considered to be genetic, the highest rates of accurate diagnosis occur when the child is tested on behavioral characteristics and facial features. Patients have a common pattern of distinct facial deformities, allowing researchers to analyze only an image of the child to determine if the child has the disease. While there are other techniques and models used for facial analysis and autism classification on their own, our proposal bridges these two ideas, allowing classification with a cheaper, more efficient method. Our deep learning model uses MobileNet and two dense layers to perform feature extraction and image classification. The model is trained and tested using 3,014 images, evenly split between children with autism and children without it; 90% of the data is used for training and 10% for testing. Based on our accuracy, we propose that the diagnosis of autism can be done effectively using only a picture. Additionally, there may be other diseases that are similarly diagnosable.

29. Webly Supervised Semantic Embeddings for Large Scale Zero-Shot Learning
Yannick Le Cacheux, Adrian Popescu, Hervé Le Borgne
Abstract: Zero-shot learning (ZSL) makes object recognition in images possible in the absence of visual training data for a part of the classes in a dataset. When the number of classes is large, classes are usually represented by semantic class prototypes learned automatically from unannotated text collections. This typically leads to much lower performance than with manually designed semantic prototypes such as attributes. While most ZSL works focus on the visual aspect and reuse standard semantic prototypes learned from generic text collections, we focus on the problem of semantic class prototype design for large-scale ZSL. More specifically, we investigate the use of noisy textual metadata associated with photos as text collections, as we hypothesize they are likely to provide more plausible semantic embeddings for visual classes if exploited appropriately. We thus make use of a source-based voting strategy to improve the robustness of semantic prototypes. Evaluation on the large-scale ImageNet dataset shows a significant improvement in ZSL performance over two strong baselines and over the usual semantic embeddings used in previous works. We show that this improvement is obtained for several embedding methods, leading to state-of-the-art results when one uses automatically created visual and text features.

30. Improving Explainability of Image Classification in Scenarios with Class Overlap: Application to COVID-19 and Pneumonia
Edward Verenich, Alvaro Velasquez, Nazar Khan, Faraz Hussain
Abstract: Trust in predictions made by machine learning models is increased if the model generalizes well on previously unseen samples and when inference is accompanied by cogent explanations of the reasoning behind predictions. In the image classification domain, generalization can also be assessed through accuracy, sensitivity, and specificity, and one measure to assess explainability is how well the model localizes the object of interest within an image. However, in multi-class settings, both generalization and explanation through localization are degraded when the available training data contains features with significant overlap between classes. We propose a method to enhance explainability of image classification through better localization by mitigating the model uncertainty induced by class overlap. Our technique performs discriminative localization on images that contain features with significant class overlap, without explicitly training for localization. Our method is particularly promising in real-world class overlap scenarios, such as COVID-19 vs. pneumonia, where expertly labeled data for localization is not available. This can be useful for early, rapid, and trustworthy screening for COVID-19.

31. Integration of 3D Knowledge for On-Board UAV Visual Tracking
Stéphane Vujasinović, Stefan Becker, Timo Breuer, Norbert Scherer-Negenborn, Michael Arens
Abstract: Visual tracking from an unmanned aerial vehicle (UAV) poses challenges such as occlusions or background clutter. In order to achieve more robust on-board UAV visual tracking, a pipeline combining information extracted from a visual tracker and a sparse 3D reconstruction of the static environment is introduced. The 3D reconstruction is based on an image-based structure-from-motion (SfM) component and thus allows a state estimator to be used in a pseudo-3D space. This improves the handling of occlusions and background clutter. The evaluation is done on prototypical image sequences captured from a UAV with low-altitude oblique views. The experimental results demonstrate the benefit of the proposed approach compared to relying only on visual cues or using a state estimation in the image space.

32. Investigation of Speaker-adaptation methods in Transformer based ASR
Vishwas M. Shetty, Metilda Sagaya Mary N J, S. Umesh
Abstract: End-to-end models are fast replacing conventional hybrid models in automatic speech recognition. A transformer is a sequence-to-sequence framework based solely on attention that was initially applied to the machine translation task. This end-to-end framework has been shown to give promising results when used for automatic speech recognition as well. In this paper, we explore different ways of incorporating speaker information while training a transformer-based model to improve its performance. We present speaker information in the form of speaker embeddings for each of the speakers. Two broad categories of speaker embeddings are used: (i) fixed embeddings and (ii) learned embeddings. We experiment using speaker embeddings learned along with the model training, as well as one-hot vectors and x-vectors. Using these different speaker embeddings, we obtain an average relative improvement of 1% to 3% in the token error rate. We report results on the NPTEL lecture database. NPTEL is an open-source e-learning portal providing content from top Indian universities.
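
Two ideas from this abstract are easy to illustrate: a one-hot vector as the simplest fixed speaker embedding, and relative improvement as the way error-rate gains are reported. The sketch below is hedged on both counts. Concatenating the embedding to each acoustic frame is an assumption made here (the abstract does not say how the embeddings are injected), and the speaker names, feature values, and error rates are made-up numbers, not results from the paper.

```python
from typing import List


def one_hot_speaker(speaker: str, speakers: List[str]) -> List[float]:
    """Fixed embedding: 1.0 at the speaker's index, 0.0 elsewhere."""
    if speaker not in speakers:
        raise ValueError(f"unknown speaker: {speaker}")
    return [1.0 if s == speaker else 0.0 for s in speakers]


def append_speaker(
    frames: List[List[float]], embedding: List[float]
) -> List[List[float]]:
    """Concatenate the same speaker embedding to every acoustic frame.

    Concatenation is one plausible choice, assumed for illustration.
    """
    return [frame + embedding for frame in frames]


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Relative reduction in error rate, as a fraction of the baseline."""
    return (baseline_error - new_error) / baseline_error


speakers = ["spk_a", "spk_b", "spk_c"]  # hypothetical speaker inventory
embedding = one_hot_speaker("spk_b", speakers)
frames = [[0.2, 0.5], [0.1, 0.4]]  # toy 2-dimensional acoustic features
print(append_speaker(frames, embedding))
print(relative_improvement(20.0, 19.5))  # made-up error rates
```

With these toy numbers, a drop from 20.0 to 19.5 is a 2.5% relative improvement, which is how the 1% to 3% figures in the abstract should be read (relative, not absolute, reductions).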

33. ESPRESSO: Entropy and ShaPe awaRe timE-Series SegmentatiOn for processing heterogeneous sensor data
Shohreh Deldari, Daniel V. Smith, Amin Sadri, Flora D. Salim
Abstract: Extracting informative and meaningful temporal segments from high-dimensional wearable sensor data, smart devices, or IoT data is a vital preprocessing step in applications such as Human Activity Recognition (HAR), trajectory prediction, gesture recognition, and lifelogging. In this paper, we propose ESPRESSO (Entropy and ShaPe awaRe timE-Series SegmentatiOn), a hybrid segmentation model for multi-dimensional time-series that is formulated to exploit the entropy and temporal shape properties of time-series. ESPRESSO differs from existing methods that focus exclusively upon particular statistical or temporal properties of time-series. As part of model development, a novel temporal representation of time-series, $WCAC$, was introduced along with a greedy search approach that estimates segments based upon the entropy metric. ESPRESSO was shown to offer superior performance to four state-of-the-art methods across seven public datasets of wearable and wear-free sensing. In addition, we undertake a deeper investigation of these datasets to understand how ESPRESSO and its constituent methods perform with respect to different dataset characteristics. Finally, we provide two interesting case studies to show how applying ESPRESSO can assist in inferring daily activity routines and the emotional state of humans.
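
The general idea of entropy-guided segmentation can be illustrated with a toy example: a good cut point leaves each segment internally homogeneous, so the length-weighted entropy of the segments is low. This is a generic sketch, not ESPRESSO, its $WCAC$ representation, or its greedy search; the function names and the discretized toy series are assumptions introduced here.

```python
import math
from collections import Counter
from typing import Sequence


def shannon_entropy(symbols: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(symbols)
    counts = Counter(symbols)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def best_split(series: Sequence[int]) -> int:
    """Index of the single cut minimizing length-weighted segment entropy."""
    n = len(series)

    def cost(s: int) -> float:
        left, right = series[:s], series[s:]
        weighted = len(left) * shannon_entropy(left)
        weighted += len(right) * shannon_entropy(right)
        return weighted / n

    # Try every cut that leaves both segments non-empty.
    return min(range(1, n), key=cost)


series = [0, 0, 0, 0, 1, 1, 1, 1]  # toy discretized sensor readings
print(shannon_entropy(series))  # 1.0
print(best_split(series))  # 4
```

Here the whole series has one bit of entropy, while cutting at index 4 yields two segments with zero entropy each, so that cut is selected.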

34. In-Depth DCT Coefficient Distribution Analysis for First Quantization Estimation
Sebastiano Battiato, Oliver Giudice, Francesco Guarnera, Giovanni Puglisi
Abstract: The exploitation of traces in JPEG double compressed images is of utmost importance for investigations. By properly exploiting such insights, First Quantization Estimation (FQE) can be performed in order to obtain source camera model identification (CMI) and therefore reconstruct the history of a digital image. In this paper, a method able to estimate the first quantization factors for JPEG double compressed images is presented, employing a mixed statistical and Machine Learning approach. The presented solution is demonstrated to work without any a-priori assumptions about the quantization matrices. Experimental results and comparisons with the state of the art show the effectiveness of the proposed technique.
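
Why double compression leaves exploitable traces can be seen on a single DCT coefficient: quantizing with a first step q1 and then a second step q2 leaves some second-stage bins empty. The sketch below shows only this effect; it is not the estimation method of the paper, and the step sizes 5 and 3 and the coefficient range are arbitrary choices for illustration.

```python
def quantize(coefficient: float, q: int) -> int:
    """JPEG-style quantization: divide by the step and round to an integer.

    Python's round() sends exact .5 ties to the nearest even integer; the
    values used below never hit a tie.
    """
    return round(coefficient / q)


def double_compress(coefficient: float, q1: int, q2: int) -> int:
    """Quantize with step q1, dequantize, then quantize again with step q2."""
    first_pass = quantize(coefficient, q1) * q1  # value after the first decode
    return quantize(first_pass, q2)


coefficients = range(31)
single = sorted({quantize(c, 3) for c in coefficients})
double = sorted({double_compress(c, 5, 3) for c in coefficients})
print(single)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(double)  # [0, 2, 3, 5, 7, 8, 10]
```

The gaps at 1, 4, 6, and 9 in the doubly compressed case depend on the first step q1, which is the kind of statistical footprint a first quantization estimator can look for.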

35. Multi-Task Driven Explainable Diagnosis of COVID-19 using Chest X-ray Images
Aakarsh Malhotra, Surbhi Mittal, Puspita Majumdar, Saheb Chhabra, Kartik Thakral, Mayank Vatsa, Richa Singh, Santanu Chaudhury, Ashwin Pudrod, Anjali Agrawal
Abstract: With the increasing number of COVID-19 cases globally, countries are ramping up testing. While RT-PCR kits are available in sufficient quantity in several countries, others are facing challenges with the limited availability of testing kits and processing centers in remote areas. This has motivated researchers to find alternative testing methods that are reliable, easily accessible, and faster. Chest X-Ray is one of the modalities that is gaining acceptance as a screening modality. In this direction, the paper makes two primary contributions. First, we present the COVID-19 Multi-Task Network, an automated end-to-end network for COVID-19 screening. The proposed network not only predicts whether the CXR has COVID-19 features present but also performs semantic segmentation of the regions of interest to make the model explainable. Second, with the help of medical professionals, we manually annotate the lung regions of 9000 frontal chest radiographs taken from ChestXray-14, CheXpert, and a consolidated COVID-19 dataset. Further, 200 chest radiographs pertaining to COVID-19 patients are also annotated for semantic segmentation. This database will be released to the research community.

36. Image Transformation Network for Privacy-Preserving Deep Neural Networks and Its Security Evaluation
Hiroki Ito, Yuma Kinoshita, Hitoshi Kiya
Abstract: We propose a transformation network for generating visually protected images for privacy-preserving DNNs. The proposed transformation network is trained by using a plain image dataset so that plain images are transformed into visually protected ones. Conventional perceptual encryption methods have weak visual-protection performance and cause some accuracy degradation in image classification. In contrast, the proposed network enables us not only to strongly protect visual information but also to maintain the image classification accuracy achieved with plain images. In an image classification experiment on the CIFAR datasets, the proposed network is demonstrated to strongly protect visual information on plain images without any performance degradation. In addition, an experiment shows that the visually protected images are robust against a DNN-based attack called the inverse transformation network attack (ITN-Attack).

37. The Ensemble Method for Thorax Diseases Classification
Bayu A. Nugroho
Abstract: A common problem in real-world medical image classification is the inherent imbalance between positive and negative patterns in the dataset, where positive patterns are usually rare. Moreover, in multi-class classification with a neural network, a training pattern is treated as a positive pattern in one output node and as negative in all the remaining output nodes. In this paper, the weights of a training pattern in the loss function are designed based not only on the number of training patterns in the class but also on the different nodes, where one of them treats this training pattern as positive and the others treat it as negative. We propose a combined approach of a weight calculation algorithm for deep network training and training optimization from a state-of-the-art deep network architecture for the thorax disease classification problem. Experimental results on the Chest X-Ray image dataset demonstrate that this new weighting scheme improves classification performance, and the training optimization from EfficientNet improves performance further. We compare the ensemble method with several results from previous studies of thorax disease classification to provide fair comparisons against the proposed method.
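The node-wise weighting idea in this abstract can be illustrated with a weighted binary cross-entropy. The sketch below is not the paper's weight calculation algorithm, which the abstract does not specify. It assumes a common inverse-frequency scheme in which, for each output node, the positive weight is the fraction of negative samples and the negative weight is the fraction of positive samples. The function names `class_balance_weights` and `weighted_bce` are hypothetical.

```python
import math


def class_balance_weights(labels):
    """Return one (w_pos, w_neg) pair per class from multi-hot label rows.

    Assumed scheme (not taken from the paper): w_pos is the fraction of
    negative samples and w_neg the fraction of positive samples, so the
    rarer side of each output node receives the larger weight.
    """
    n = len(labels)
    num_classes = len(labels[0])
    weights = []
    for c in range(num_classes):
        pos = sum(row[c] for row in labels)
        weights.append(((n - pos) / n, pos / n))
    return weights


def weighted_bce(probs, targets, weights, eps=1e-7):
    """Weighted binary cross-entropy for one sample, summed over output nodes."""
    loss = 0.0
    for p, t, (w_pos, w_neg) in zip(probs, targets, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= w_pos * t * math.log(p) + w_neg * (1 - t) * math.log(1.0 - p)
    return loss


# Toy data: two classes, each positive in one of four samples.
labels = [[1, 0], [0, 0], [0, 1], [0, 0]]
weights = class_balance_weights(labels)

# Missing a rare positive is penalised more than an equally confident false alarm.
miss = weighted_bce([0.1, 0.1], [1, 0], weights)
false_alarm = weighted_bce([0.9, 0.1], [0, 0], weights)
```

With this scheme a confident miss on a rare positive costs more than a confident false alarm, which is the behaviour an imbalance-aware loss is meant to produce.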

38. NuI-Go: Recursive Non-Local Encoder-Decoder Network for Retinal Image Non-Uniform Illumination Removal
Chongyi Li, Huazhu Fu, Runmin Cong, Zechao Li, Qianqian Xu
Abstract: Retinal images have been widely used by clinicians for early diagnosis of ocular diseases. However, the quality of retinal images is often clinically unsatisfactory due to eye lesions and an imperfect imaging process. One of the most challenging quality degradation issues in retinal images is non-uniform illumination, which obscures pathological information and further impairs diagnosis by ophthalmologists and computer-aided analysis. To address this issue, we propose a non-uniform illumination removal network for retinal images, called NuI-Go, which consists of three Recursive Non-local Encoder-Decoder Residual Blocks (NEDRBs) for enhancing the degraded retinal images in a progressive manner. Each NEDRB contains a feature encoder module that captures the hierarchical feature representations, a non-local context module that models the context information, and a feature decoder module that recovers the details and spatial dimension. Additionally, the symmetric skip-connections between the encoder module and the decoder module provide long-range information compensation and reuse. Extensive experiments demonstrate that the proposed method can effectively remove the non-uniform illumination on retinal images while preserving the image details and color well. We further demonstrate the advantages of the proposed method for improving the accuracy of retinal vessel segmentation.
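The non-local context module mentioned in this abstract can be pictured with a generic non-local operation, in which every position aggregates features from all other positions. The sketch below is a simplified assumption: it uses dot-product similarity with a softmax and a residual connection, and it omits the learned projections that a trained network would have. It is not the NuI-Go module itself, and `non_local` is a hypothetical name.

```python
import math


def non_local(features):
    """Apply a parameter-free non-local operation with a residual connection.

    features: list of N position vectors (lists of floats of equal length).
    Each output vector is the input plus a softmax-weighted sum over all
    positions, with weights given by dot-product similarity.
    """
    out = []
    for xi in features:
        scores = [sum(a * b for a, b in zip(xi, xj)) for xj in features]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        context = [
            sum(w * xj[d] for w, xj in zip(attn, features))
            for d in range(len(xi))
        ]
        out.append([a + c for a, c in zip(xi, context)])
    return out


# Three positions with two channels each (toy values).
enhanced = non_local([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because the attention weights span all positions, each output can draw on context from anywhere in the feature map, which is the property the abstract attributes to its context module.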

39. Parts-Based Articulated Object Localization in Clutter Using Belief Propagation
Jana Pavlasek, Stanley Lewis, Karthik Desingh, Odest Chadwicke Jenkins
Abstract: Robots working in human environments must be able to perceive and act on challenging objects with articulations, such as a pile of tools. Articulated objects increase the dimensionality of the pose estimation problem, and partial observations under clutter create additional challenges. To address this problem, we present a generative-discriminative parts-based recognition and localization method for articulated objects in clutter. We formulate the problem of articulated object pose estimation as a Markov Random Field (MRF). Hidden nodes in this MRF express the pose of the object parts, and edges express the articulation constraints between parts. Localization is performed within the MRF using an efficient belief propagation method. The method is informed by both part segmentation heatmaps over the observation, generated by a neural network, and the articulation constraints between object parts. Our generative-discriminative approach allows the proposed method to function in cluttered environments by inferring the pose of occluded parts using hypotheses from the visible parts. We demonstrate the efficacy of our methods in a tabletop environment for recognizing and localizing hand tools in uncluttered and cluttered configurations.
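To make the MRF formulation concrete, the following sketch runs sum-product belief propagation on a small chain of parts with discrete states. It is a simplification under stated assumptions: the paper estimates continuous part poses, whereas here each part has a handful of discrete pose hypotheses, the unary potentials stand in for segmentation evidence, and a single shared pairwise table stands in for the articulation constraints. The function name `chain_bp` and all numbers are hypothetical.

```python
def chain_bp(unary, pairwise):
    """Sum-product belief propagation on a chain MRF with discrete states.

    unary[i][s]: evidence for part i being in state s.
    pairwise[s][t]: compatibility of neighbouring parts in states s and t
                    (left part in s, right part in t).
    Returns the normalised marginal belief for every part.
    """
    n, k = len(unary), len(unary[0])
    fwd = [[1.0] * k for _ in range(n)]  # messages arriving from the left
    bwd = [[1.0] * k for _ in range(n)]  # messages arriving from the right
    for i in range(1, n):
        fwd[i] = [
            sum(fwd[i - 1][s] * unary[i - 1][s] * pairwise[s][t] for s in range(k))
            for t in range(k)
        ]
    for i in range(n - 2, -1, -1):
        bwd[i] = [
            sum(bwd[i + 1][t] * unary[i + 1][t] * pairwise[s][t] for t in range(k))
            for s in range(k)
        ]
    marginals = []
    for i in range(n):
        belief = [fwd[i][s] * unary[i][s] * bwd[i][s] for s in range(k)]
        z = sum(belief)
        marginals.append([b / z for b in belief])
    return marginals


# Three parts, two pose hypotheses each. The middle part is "occluded":
# its own evidence is uniform, so its belief comes only from its neighbours.
unary = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
pairwise = [[0.8, 0.2], [0.2, 0.8]]
marginals = chain_bp(unary, pairwise)
```

On a chain (or any tree) these messages give exact marginals, so the belief for the occluded middle part reflects evidence from both visible neighbours.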

40. Fatigue Assessment using ECG and Actigraphy Sensors
Yang Bai, Yu Guan, Wan-Fai Ng
Abstract: Fatigue is one of the key factors in the loss of work efficiency and health-related quality of life, and most fatigue assessment methods are based on self-reporting, which may suffer from many factors such as recall bias. To address this issue, we developed an automated system using wearable sensing and machine learning techniques for objective fatigue assessment. ECG/Actigraphy data were collected from subjects in free-living environments. Preprocessing and feature engineering methods were applied before an interpretable solution and a deep learning solution were introduced. Specifically, for the interpretable solution, we proposed a feature selection approach that selects less correlated and highly informative features for a better understanding of the system's decision-making process. For the deep learning solution, we used a state-of-the-art self-attention model, based on which we further proposed a consistency self-attention (CSA) mechanism for fatigue assessment. Extensive experiments were conducted, and very promising results were achieved.
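The idea of keeping features that are informative yet weakly correlated with each other can be sketched with a greedy filter. This is an illustrative assumption, not the selection approach proposed in the paper: features are ranked by absolute Pearson correlation with the target and a feature is kept only if its correlation with every already-selected feature stays below a threshold. The feature names, the data, and the `max_corr` threshold are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def select_features(columns, target, max_corr=0.9):
    """Greedily keep relevant features that are not redundant with earlier picks.

    columns: dict mapping feature name to a list of values.
    target: list of target values (for example, fatigue scores).
    """
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    selected = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[s])) < max_corr for s in selected):
            selected.append(name)
    return selected


# Hypothetical features: hr_median is nearly redundant with hr_mean.
columns = {
    "hr_mean": [1, 2, 3, 4, 6],
    "hr_median": [1, 2, 3, 4, 7],
    "step_var": [1, -1, 1, -1, 2],
}
target = [1, 2, 3, 4, 5]
chosen = select_features(columns, target)
```

In this toy data `hr_median` is dropped because it is almost a copy of the higher-ranked `hr_mean`, while the weaker but independent `step_var` is retained.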

41. Confidence-guided Lesion Mask-based Simultaneous Synthesis of Anatomic and Molecular MR Images in Patients with Post-treatment Malignant Gliomas
Pengfei Guo, Puyang Wang, Rajeev Yasarla, Jinyuan Zhou, Vishal M. Patel, Shanshan Jiang
Abstract: Data-driven automatic approaches have demonstrated their great potential in resolving various clinical diagnostic dilemmas in neuro-oncology, especially with the help of standard anatomic and advanced molecular MR images. However, data quantity and quality remain a key determinant of, and a significant limit on, the potential of such applications. In our previous work, we explored a synthesis of anatomic and molecular MR images network (SAMR) in patients with post-treatment malignant gliomas. Now, we extend it and propose Confidence Guided SAMR (CG-SAMR), which synthesizes data from lesion information to multi-modal anatomic sequences, including T1-weighted (T1w), gadolinium enhanced T1w (Gd-T1w), T2-weighted (T2w), and fluid-attenuated inversion recovery (FLAIR), and the molecular amide proton transfer-weighted (APTw) sequence. We introduce a module that guides the synthesis based on a confidence measure of the intermediate results. Furthermore, we extend the proposed architecture for unsupervised synthesis so that unpaired data can be used for training the network. Extensive experiments on real clinical data demonstrate that the proposed model can perform better than the state-of-the-art synthesis methods.


[arXiv Papers] Computer Vision and Pattern Recognition 2020-08-10
