Contents
15. RetinotopicNet: An Iterative Attention Mechanism Using Local Descriptors with Global Context [PDF] Abstract
20. Effective and Robust Detection of Adversarial Examples via Benford-Fourier Coefficients [PDF] Abstract
24. Real-time Facial Expression Recognition "In The Wild" by Disentangling 3D Expression from Identity [PDF] Abstract
27. Combining Deep Learning with Geometric Features for Image based Localization in the Gastrointestinal Tract [PDF] Abstract
30. Target-Independent Domain Adaptation for WBC Classification using Generative Latent Search [PDF] Abstract
35. Adipose Tissue Segmentation in Unlabeled Abdomen MRI using Cross Modality Domain Adaptation [PDF] Abstract
37. Very High Resolution Land Cover Mapping of Urban Areas at Global Scale with Convolutional Neural Networks [PDF] Abstract
41. High-Fidelity Accelerated MRI Reconstruction by Scan-Specific Fine-Tuning of Physics-Based Neural Networks [PDF] Abstract
Abstracts
1. Efficient and Interpretable Infrared and Visible Image Fusion Via Algorithm Unrolling [PDF] Back to contents
Zixiang Zhao, Shuang Xu, Chunxia Zhang, Junmin Liu, Jiangshe Zhang
Abstract: Infrared and visible image fusion aims to obtain images that combine the highlighted thermal radiation information of infrared images with the texture details of visible images. In this paper, an interpretable deep network fusion model is proposed. First, two optimization models are established to accomplish two-scale decomposition, separating low-frequency base information and high-frequency detail information from the source images. Algorithm unrolling, in which each iteration of the optimization is mapped to a convolutional neural network layer so that the optimization steps become trainable, is used to solve the optimization models. In the test phase, the decomposed base and detail feature maps are each merged by the fusion layer, and the decoder then outputs the fused image. Qualitative and quantitative comparisons demonstrate the superiority of our model, which is interpretable and robustly generates fusion images containing highlighted targets and legible details, exceeding state-of-the-art methods.
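A minimal sketch of the unrolling idea, assuming PyTorch; the layer widths, iteration count, and the concrete update rule are illustrative placeholders, not the paper's architecture. Each solver iteration becomes a learnable convolutional stage, and the base/detail split falls out of the final state:

```python
import torch
import torch.nn as nn

class UnrolledDecomposition(nn.Module):
    """Maps T solver iterations to T learnable convolutional stages."""
    def __init__(self, iterations: int = 4, channels: int = 16):
        super().__init__()
        self.channels = channels
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(1 + channels, channels, 3, padding=1),
                nn.ReLU(inplace=True),
            )
            for _ in range(iterations)
        )
        self.to_base = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, x):  # x: (B, 1, H, W) source image
        state = x.new_zeros(x.size(0), self.channels, x.size(2), x.size(3))
        for stage in self.stages:
            # one unrolled iteration: refine the state given the input image
            state = stage(torch.cat([x, state], dim=1))
        base = self.to_base(state)   # low-frequency base layer
        detail = x - base            # high-frequency detail layer
        return base, detail
```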
2. Latent Fingerprint Registration via Matching Densely Sampled Points [PDF] Back to contents
Shan Gu, Jianjiang Feng, Jiwen Lu, Jie Zhou
Abstract: Latent fingerprint matching is a very important but unsolved problem. As a key step of fingerprint matching, fingerprint registration has a great impact on recognition performance. Existing latent fingerprint registration approaches are mainly based on establishing correspondences between minutiae, and hence will certainly fail when there is an insufficient number of extracted minutiae due to a small fingerprint area or poor image quality. Minutiae extraction has become the bottleneck of latent fingerprint registration. In this paper, we propose a non-minutia latent fingerprint registration method which estimates the spatial transformation between a pair of fingerprints through a dense fingerprint patch alignment and matching procedure. Given a pair of fingerprints to match, we bypass the minutiae extraction step and take uniformly sampled points as key points. The proposed patch alignment and matching algorithm then compares all pairs of sampling points and produces their similarities along with alignment parameters. Finally, a set of consistent correspondences is found by spectral clustering. Extensive experiments on the NIST27 and MOLF databases show that the proposed method achieves state-of-the-art registration performance, especially under challenging conditions.
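The final step can be illustrated with the classic spectral-matching recipe: build a pairwise compatibility matrix over candidate correspondences and read the consistent subset off the principal eigenvector. This is a hedged sketch; the compatibility function and the keep ratio are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def spectral_consistent_set(compat: np.ndarray, keep_ratio: float = 0.5):
    """compat[i, j] = geometric compatibility of candidate matches i and j."""
    # The principal eigenvector scores how strongly each match belongs to
    # the dominant (mutually consistent) cluster of correspondences.
    vals, vecs = np.linalg.eigh(compat)
    score = np.abs(vecs[:, -1])            # eigenvector of largest eigenvalue
    threshold = np.quantile(score, 1.0 - keep_ratio)
    return np.where(score >= threshold)[0]

# Toy usage: 5 candidate matches, where matches 0-2 are mutually consistent.
C = np.eye(5)
for i in (0, 1, 2):
    for j in (0, 1, 2):
        C[i, j] = 1.0
inliers = spectral_consistent_set(C)       # -> array([0, 1, 2])
```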
3. Recurrent and Spiking Modeling of Sparse Surgical Kinematics [PDF] Back to contents
Neil Getty, Zixuan Zhou, Stephan Gruessner, Liaohai Chen, Fangfang Xia
Abstract: Robot-assisted minimally invasive surgery is improving surgeon performance and patient outcomes. This innovation is also turning what has been a subjective practice into motion sequences that can be precisely measured. A growing number of studies have used machine learning to analyze video and kinematic data captured from surgical robots. In these studies, models are typically trained on benchmark datasets for representative surgical tasks to assess surgeon skill levels. While they have shown that novices and experts can be accurately classified, it is not clear whether machine learning can separate highly proficient surgeons from one another, especially without video data. In this study, we explore the possibility of using only kinematic data to predict surgeons of similar skill levels. We focus on a new dataset created from surgical exercises on a simulation device for skill training. A simple, efficient encoding scheme was devised to encode kinematic sequences so that they were amenable to edge learning. We report that it is possible to identify surgical fellows receiving near perfect scores in the simulation exercises based on their motion characteristics alone. Further, our model could be converted to a spiking neural network to train and infer on the Nengo simulation framework with no loss in accuracy. Overall, this study suggests that building neuromorphic models from sparse motion features may be a potentially useful strategy for identifying surgeons and gestures, with chips deployed on robotic systems to offer adaptive assistance during surgery and training, with additional latency and privacy benefits.
4. Neural Architecture Transfer [PDF] Back to contents
Zhichao Lu, Gautam Sreekumar, Erik Goodman, Wolfgang Banzhaf, Kalyanmoy Deb, Vishnu Naresh Boddeti
Abstract: Neural architecture search (NAS) has emerged as a promising avenue for automatically designing task-specific neural networks. Most existing NAS approaches require one complete search for each deployment specification of hardware or objective. This is a computationally impractical endeavor given the potentially large number of application scenarios. In this paper, we propose Neural Architecture Transfer (NAT) to overcome this limitation. NAT is designed to efficiently generate task-specific custom models that are competitive even under multiple conflicting objectives. To realize this goal we learn task-specific supernets from which specialized subnets can be sampled without any additional training. The key to our approach is an integrated online transfer learning and many-objective evolutionary search procedure. A pre-trained supernet is iteratively adapted while simultaneously searching for task-specific subnets. We demonstrate the efficacy of NAT on 11 benchmark image classification tasks ranging from large-scale multi-class to small-scale fine-grained datasets. In all cases, including ImageNet, NATNets improve upon the state-of-the-art under mobile settings ($\leq$ 600M Multiply-Adds). Surprisingly, small-scale fine-grained datasets benefit the most from NAT. At the same time, the architecture search and transfer is orders of magnitude more efficient than existing NAS methods. Overall, experimental evaluation indicates that across diverse image classification tasks and computational objectives, NAT is an appreciably more effective alternative to fine-tuning based transfer learning. Code is available at this https URL
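A toy illustration of the supernet/subnet idea; the search-space dimensions and the uniform random sampling below are assumptions for exposition, whereas NAT's actual encoding and its many-objective evolutionary search are more involved:

```python
import random

# Hypothetical per-stage search space; not NAT's actual encoding.
SEARCH_SPACE = {
    "depth":  [2, 3, 4],      # blocks per stage
    "width":  [16, 24, 32],   # channels per stage
    "kernel": [3, 5, 7],      # kernel size per block
}

def sample_subnet(num_stages: int = 4):
    """Draw one architecture; in NAT this choice would be guided by the
    evolutionary search rather than uniform randomness."""
    return [
        {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        for _ in range(num_stages)
    ]

config = sample_subnet()
# The supernet weights along exactly this sub-path would then be extracted
# and evaluated directly, with no additional training.
```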
5. Probabilistic Semantic Segmentation Refinement by Monte Carlo Region Growing [PDF] Back to contents
Philipe A. Dias, Henry Medeiros
Abstract: Semantic segmentation with fine-grained pixel-level accuracy is a fundamental component of a variety of computer vision applications. However, despite the large improvements provided by recent advances in the architectures of convolutional neural networks, segmentations provided by modern state-of-the-art methods still show limited boundary adherence. We introduce a fully unsupervised post-processing algorithm that exploits Monte Carlo sampling and pixel similarities to propagate high-confidence pixel labels into regions of low-confidence classification. Our algorithm, which we call probabilistic Region Growing Refinement (pRGR), is based on a rigorous mathematical foundation in which clusters are modelled as multivariate normally distributed sets of pixels. Exploiting concepts of Bayesian estimation and variance reduction techniques, pRGR performs multiple refinement iterations at varied receptive fields sizes, while updating cluster statistics to adapt to local image features. Experiments using multiple modern semantic segmentation networks and benchmark datasets demonstrate the effectiveness of our approach for the refinement of segmentation predictions at different levels of coarseness, as well as the suitability of the variance estimates obtained in the Monte Carlo iterations as uncertainty measures that are highly correlated with segmentation accuracy.
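A compact sketch of one Monte Carlo region-growing step, assuming NumPy arrays for the label map, per-pixel confidence, and image; the confidence thresholds and the color-similarity test are illustrative stand-ins for pRGR's multivariate-normal cluster model and variance-reduction machinery:

```python
import numpy as np

def grow_once(labels, conf, image, rng, conf_hi=0.9, conf_lo=0.5, sim=0.1):
    """Sample one high-confidence seed and propagate its label to
    low-confidence 4-neighbors with similar appearance."""
    h, w = conf.shape
    ys, xs = np.where(conf > conf_hi)
    if len(ys) == 0:
        return labels
    i = rng.integers(len(ys))                 # random high-confidence seed
    y, x = ys[i], xs[i]
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if 0 <= ny < h and 0 <= nx < w and conf[ny, nx] < conf_lo:
            if np.linalg.norm(image[ny, nx] - image[y, x]) < sim:
                labels[ny, nx] = labels[y, x]  # propagate the seed's label
    return labels

rng = np.random.default_rng(0)
# labels, conf, image would come from the base segmentation network's output.
```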
6. Bayesian Fusion for Infrared and Visible Images [PDF] Back to contents
Zixiang Zhao, Shuang Xu, Chunxia Zhang, Junmin Liu, Jiangshe Zhang
Abstract: Infrared and visible image fusion has long been a hot topic in image fusion. The task seeks a fused image that contains the gradient and detailed texture information of visible images as well as the thermal radiation and highlighted targets of infrared images. In this paper, a novel Bayesian fusion model is established for infrared and visible images. In our model, the image fusion task is cast into a regression problem. To measure the variable uncertainty, we formulate the model in a hierarchical Bayesian manner. To make the fused image suit the human visual system, the model incorporates a total-variation (TV) penalty. Subsequently, the model is efficiently inferred by the expectation-maximization (EM) algorithm. We test our algorithm on the TNO and NIR image fusion datasets against several state-of-the-art approaches. Compared with previous methods, the novel model can generate better fused images with highlighted targets and rich texture details, which can improve the reliability of automatic target detection and recognition systems.
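The total-variation penalty that keeps the fused image visually clean can be written down directly. The standalone anisotropic-TV function below is a sketch of that one term, not the paper's full hierarchical Bayesian model or its EM updates:

```python
import numpy as np

def tv_penalty(img: np.ndarray) -> float:
    """Anisotropic total variation: the sum of absolute vertical and
    horizontal intensity differences of a 2D image."""
    dv = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbors
    dh = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbors
    return float(dv + dh)
```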
7. A Novel Distributed Approximate Nearest Neighbor Method for Real-time Face Recognition [PDF] Back to contents
Aysan Aghazadeh, Maryam Amirmazlaghani
Abstract: Nowadays, face recognition and, more generally, image recognition have many applications in the modern world and are widely used in our daily tasks. In this paper, we propose a novel distributed approximate nearest neighbor (ANN) method for real-time face recognition on a big dataset that involves many classes. The proposed approach is based on using a clustering method to separate the dataset into different clusters, and on specifying the importance of each cluster by defining cluster weights. Reference instances are selected from each cluster based on the cluster weights and by using a maximum likelihood approach. This process leads to a more informed selection of instances, and so enhances the performance of the algorithm. Experimental results confirm the efficiency of the proposed method and its out-performance in terms of accuracy and processing time.
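A hedged sketch of the clustering-and-reference-selection stage using scikit-learn's KMeans; the size-proportional cluster weights and the centroid-distance selection rule are illustrative assumptions standing in for the paper's maximum-likelihood selection:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_references(features: np.ndarray, n_clusters: int = 8):
    """Partition gallery features, weight clusters by size, and pick
    reference instances near each centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    refs = []
    for c in range(n_clusters):
        members = features[km.labels_ == c]
        weight = len(members) / len(features)          # cluster importance
        n_refs = max(1, int(round(weight * n_clusters)))
        # choose the members closest to the centroid as references
        d = np.linalg.norm(members - km.cluster_centers_[c], axis=1)
        refs.append(members[np.argsort(d)[:n_refs]])
    return np.vstack(refs)
```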
8. One-Shot Recognition of Manufacturing Defects in Steel Surfaces [PDF] Back to contents
Aditya M. Deshpande, Ali A. Minai, Manish Kumar
Abstract: Quality control is an essential process in manufacturing to make the product defect-free as well as to meet customer needs. The automation of this process is important to maintain high quality along with the high manufacturing throughput. With recent developments in deep learning and computer vision technologies, it has become possible to detect various features from the images with near-human accuracy. However, many of these approaches are data intensive. Training and deployment of such a system on manufacturing floors may become expensive and time-consuming. The need for large amounts of training data is one of the limitations of the applicability of these approaches in real-world manufacturing systems. In this work, we propose the application of a Siamese convolutional neural network to do one-shot recognition for such a task. Our results demonstrate how one-shot learning can be used in quality control of steel by identification of defects on the steel surface. This method can significantly reduce the requirements of training data and can also be run in real-time.
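A minimal Siamese setup of the kind described, in PyTorch; the encoder architecture and input sizes are placeholders, not the paper's network. A shared encoder embeds two images, and their L2 distance scores similarity:

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, 64),
        )

    def forward(self, a, b):
        # the same weights embed both inputs ("Siamese")
        za, zb = self.net(a), self.net(b)
        return torch.norm(za - zb, dim=1)  # small distance => same defect type

model = SiameseEncoder()
query = torch.randn(4, 1, 64, 64)
exemplar = torch.randn(4, 1, 64, 64)
dist = model(query, exemplar)  # one-shot: compare against a single labeled example
```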
9. HDD-Net: Hybrid Detector Descriptor with Mutual Interactive Learning [PDF] Back to contents
Axel Barroso-Laguna, Yannick Verdie, Benjamin Busam, Krystian Mikolajczyk
Abstract: Local feature extraction remains an active research area due to the advances in fields such as SLAM, 3D reconstructions, or AR applications. The success in these applications relies on the performance of the feature detector and descriptor. While the detector-descriptor interaction of most methods is based on unifying detections and descriptors in a single network, we propose a method that treats both extractions independently and focuses on their interaction in the learning process rather than on parameter sharing. We formulate the classical hard-mining triplet loss as a new detector optimisation term to refine candidate positions based on the descriptor map. We propose a dense descriptor that uses a multi-scale approach and a hybrid combination of hand-crafted and learned features to obtain rotation and scale robustness by design. We evaluate our method extensively on different benchmarks and show improvements over the state of the art in terms of image matching on HPatches and 3D reconstruction quality, while remaining on par on camera localisation tasks.
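The hard-mining triplet loss behind the detector optimisation term can be sketched as a standard batch-hard formulation; the margin value and the batch-hard mining policy are common defaults, not necessarily the paper's exact choices:

```python
import torch

def batch_hard_triplet(desc: torch.Tensor, labels: torch.Tensor, margin=0.5):
    """desc: (N, D) descriptors; labels: (N,) identity of each descriptor."""
    dist = torch.cdist(desc, desc)                   # (N, N) pairwise distances
    same = labels[:, None] == labels[None, :]
    # hardest positive: farthest descriptor with the same label
    pos = dist.masked_fill(~same, float("-inf")).amax(dim=1)
    # hardest negative: closest descriptor with a different label
    neg = dist.masked_fill(same, float("inf")).amin(dim=1)
    return torch.relu(pos - neg + margin).mean()
```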
10. Adaptive Mixture Regression Network with Local Counting Map for Crowd Counting [PDF] Back to contents
Xiyang Liu, Jie Yang, Tieqiang Wang, Wenrui Ding
Abstract: The crowd counting task aims at estimating the number of people in an image or a video frame. Existing methods widely adopt density maps as the training targets to optimize a point-to-point loss. In the testing phase, however, we focus only on the differences between the crowd numbers and the global summation of density maps, which indicates an inconsistency between the training targets and the evaluation criteria. To solve this problem, we introduce a new target, named the local counting map (LCM), to obtain more accurate results than density map based approaches. Moreover, we also propose an adaptive mixture regression framework with three modules in a coarse-to-fine manner to further improve the precision of crowd estimation: a scale-aware module (SAM), a mixture regression module (MRM) and an adaptive soft interval module (ASIM). Specifically, SAM fully utilizes the context and multi-scale information from different convolutional features; MRM and ASIM perform more precise counting regression on local patches of images. Compared with current methods, the proposed method reports better performance on typical datasets. The source code is available at this https URL.
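The relation between a density map and a local counting map (LCM) is easy to make concrete: each LCM cell is the integral of the density map over one local patch, so it directly predicts the count inside that patch. The patch size below is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def density_to_lcm(density: torch.Tensor, patch: int = 64) -> torch.Tensor:
    """density: (B, 1, H, W) with H, W divisible by `patch`."""
    # non-overlapping average pool times the patch area = per-patch sum
    return F.avg_pool2d(density, patch) * patch * patch

density = torch.rand(1, 1, 256, 256) * 1e-3
lcm = density_to_lcm(density)              # (1, 1, 4, 4) patch-level counts
assert torch.isclose(lcm.sum(), density.sum())  # total count is preserved
```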
11. ReadNet: Towards Accurate ReID with Limited and Noisy Samples [PDF] Back to contents
Yitian Li, Ruini Xue, Mengmeng Zhu, Qing Xu, Zenglin Xu
Abstract: Person re-identification (ReID) is an essential cross-camera retrieval task to identify pedestrians. However, the number of photos per pedestrian usually differs drastically, so this data limitation and imbalance problem greatly hinders prediction accuracy. Additionally, in real-world applications, pedestrian images are captured by different surveillance cameras, so noisy camera-related information, such as lighting, perspective and resolution, results in inevitable domain gaps for ReID algorithms. These challenges make it difficult for current deep learning methods with triplet loss to cope with such problems. To address these challenges, this paper proposes ReadNet, an adversarial camera network (ACN) with an angular triplet loss (ATL). In detail, ATL focuses on learning the angular distance among different identities to mitigate the effect of data imbalance, and guarantees a linear decision boundary as well, while ACN takes the camera discriminator as a game opponent of the feature extractor to filter camera-related information and bridge the multi-camera gaps. ReadNet is designed to be flexible so that either ATL or ACN can be deployed independently or simultaneously. Experimental results on various benchmark datasets have shown that ReadNet delivers better prediction performance than current state-of-the-art methods.
12. Skeleton-Aware Networks for Deep Motion Retargeting [PDF] Back to contents
Kfir Aberman, Peizhuo Li, Dani Lischinski, Olga Sorkine-Hornung, Daniel Cohen-Or, Baoquan Chen
Abstract: We introduce a novel deep learning framework for data-driven motion retargeting between skeletons, which may have different structure, yet corresponding to homeomorphic graphs. Importantly, our approach learns how to retarget without requiring any explicit pairing between the motions in the training set. We leverage the fact that different homeomorphic skeletons may be reduced to a common primal skeleton by a sequence of edge merging operations, which we refer to as skeletal pooling. Thus, our main technical contribution is the introduction of novel differentiable convolution, pooling, and unpooling operators. These operators are skeleton-aware, meaning that they explicitly account for the skeleton's hierarchical structure and joint adjacency, and together they serve to transform the original motion into a collection of deep temporal features associated with the joints of the primal skeleton. In other words, our operators form the building blocks of a new deep motion processing framework that embeds the motion into a common latent space, shared by a collection of homeomorphic skeletons. Thus, retargeting can be achieved simply by encoding to, and decoding from this latent space. Our experiments show the effectiveness of our framework for motion retargeting, as well as motion processing in general, compared to existing approaches. Our approach is also quantitatively evaluated on a synthetic dataset that contains pairs of motions applied to different skeletons. To the best of our knowledge, our method is the first to perform retargeting between skeletons with differently sampled kinematic chains, without any paired examples.
13. IterDet: Iterative Scheme for Object Detection in Crowded Environments [PDF] Back to contents
Danila Rukhovich, Konstantin Sofiiuk, Danil Galeev, Olga Barinova, Anton Konushin
Abstract: Deep learning-based detectors usually produce a redundant set of object bounding boxes including many duplicate detections of the same object. These boxes are then filtered using non-maximum suppression (NMS) in order to select exactly one bounding box per object of interest. This greedy scheme is simple and provides sufficient accuracy for isolated objects but often fails in crowded environments, since one needs to both preserve boxes for different objects and suppress duplicate detections. In this work we develop an alternative iterative scheme, where a new subset of objects is detected at each iteration. Detected boxes from the previous iterations are passed to the network at the following iterations to ensure that the same object would not be detected twice. This iterative scheme can be applied to both one-stage and two-stage object detectors with just minor modifications of the training and inference procedures. We perform extensive experiments with two different baseline detectors on four datasets and show significant improvement over the baseline, leading to state-of-the-art performance on CrowdHuman and WiderPerson datasets. The source code and the trained models are available at this https URL.
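The iterative scheme reduces to a short loop; `detector` here is a hypothetical callable that takes the image together with the boxes found so far and returns only new detections, as the abstract describes:

```python
def iterative_detect(image, detector, max_iterations=3):
    """Detect a new subset of objects at each iteration, conditioning the
    network on the boxes from earlier iterations to avoid duplicates."""
    history = []                               # boxes from earlier iterations
    for _ in range(max_iterations):
        new_boxes = detector(image, history)   # conditioned on history
        if not new_boxes:
            break                              # nothing new found: stop early
        history.extend(new_boxes)
    return history
```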
14. Automatic clustering of Celtic coins based on 3D point cloud pattern analysis [PDF] Back to contents
Sofiane Horache, Jean-Emmanuel Deschaud, François Goulette, Katherine Gruel, Thierry Lejars
Abstract: The recognition and clustering of coins which have been struck by the same die is of interest for archeological studies. Nowadays, this work can only be performed by experts and is very tedious. In this paper, we propose a method to automatically cluster dies based on 3D scans of coins. It consists of three steps: registration, comparison and graph-based clustering. Experimental results on 90 coins from a Celtic treasury of the 2nd-1st century BC show a clustering quality equivalent to an expert's work.
15. RetinotopicNet: An Iterative Attention Mechanism Using Local Descriptors with Global Context [PDF] Back to contents
Thomas Kurbiel, Shahrzad Khaleghian
Abstract: Convolutional Neural Networks (CNNs) were the driving force behind many advancements in Computer Vision research in recent years. This progress has spawned many practical applications, and we see an increased need to efficiently move CNNs to embedded systems today. However, traditional CNNs lack the property of scale and rotation invariance: two of the most frequently encountered transformations in natural images. As a consequence, CNNs have to learn different features for the same objects at different scales. This redundancy is the main reason why CNNs need to be very deep in order to achieve the desired accuracy. In this paper we develop an efficient solution by reproducing how nature has solved the problem in the human brain. To this end we let our CNN operate on small patches extracted using the log-polar transform, which is known to be scale- and rotation-equivariant. Patches extracted in this way have the nice property of magnifying the central field and compressing the periphery. Hence we obtain local descriptors with global context information. However, the processing of a single patch is usually not sufficient to achieve high accuracies in e.g. classification tasks. We therefore successively jump to several different locations, called saccades, thus building an understanding of the whole image. Since log-polar patches contain global context information, we can efficiently calculate the following saccades using only the small patches. Saccades efficiently compensate for the lack of translation equivariance of the log-polar transform.
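A sketch of extracting a log-polar patch around a fixation point: radii are sampled exponentially, so the central field is magnified and the periphery compressed, which is exactly the property the abstract describes. The patch geometry below is illustrative:

```python
import numpy as np

def log_polar_patch(image, cy, cx, n_r=32, n_theta=64, r_max=100.0):
    """Resample a 2D image on a log-polar grid centered at (cy, cx)."""
    rs = np.exp(np.linspace(0.0, np.log(r_max), n_r))        # log-spaced radii
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    ys = (cy + rs[:, None] * np.sin(thetas)[None, :]).astype(int)
    xs = (cx + rs[:, None] * np.cos(thetas)[None, :]).astype(int)
    ys = np.clip(ys, 0, image.shape[0] - 1)   # clamp samples to image bounds
    xs = np.clip(xs, 0, image.shape[1] - 1)
    return image[ys, xs]                      # (n_r, n_theta) patch
```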
16. Stillleben: Realistic Scene Synthesis for Deep Learning in Robotics [PDF] Back to contents
Max Schwarz, Sven Behnke
Abstract: Training data is the key ingredient for deep learning approaches, but difficult to obtain for the specialized domains often encountered in robotics. We describe a synthesis pipeline capable of producing training data for cluttered scene perception tasks such as semantic segmentation, object detection, and correspondence or pose estimation. Our approach arranges object meshes in physically realistic, dense scenes using physics simulation. The arranged scenes are rendered using high-quality rasterization with randomized appearance and material parameters. Noise and other transformations introduced by the camera sensors are simulated. Our pipeline can be run online during training of a deep neural network, yielding applications in life-long learning and in iterative render-and-compare approaches. We demonstrate the usability by learning semantic segmentation on the challenging YCB-Video dataset without actually using any training frames, where our method achieves performance comparable to a conventionally trained model. Additionally, we show successful application in a real-world regrasping system.
17. Detecting CNN-Generated Facial Images in Real-World Scenarios [PDF] Back to contents
Nils Hulzebosch, Sarah Ibrahimi, Marcel Worring
Abstract: Artificial, CNN-generated images are now of such high quality that humans have trouble distinguishing them from real images. Several algorithmic detection methods have been proposed, but these appear to generalize poorly to data from unknown sources, making them infeasible for real-world scenarios. In this work, we present a framework for evaluating detection methods under real-world conditions, consisting of cross-model, cross-data, and post-processing evaluation, and we evaluate state-of-the-art detection methods using the proposed framework. Furthermore, we examine the usefulness of commonly used image pre-processing methods. Lastly, we evaluate human performance on detecting CNN-generated images, along with factors that influence this performance, by conducting an online survey. Our results suggest that CNN-based detection methods are not yet robust enough to be used in real-world scenarios.
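A sketch of what the cross-model, cross-data, and post-processing evaluation grid can look like in code; the detector interface and dataset layout below are assumptions, not the authors' API:

from itertools import product

def evaluate_grid(detector_factory, datasets, postprocess_ops):
    # Train a detector on images from one generator/source, then test it on
    # every source, with each common post-processing operation applied.
    results = {}
    for train_src in datasets:
        model = detector_factory().fit(*datasets[train_src]["train"])
        for test_src, (op_name, op) in product(datasets, postprocess_ops.items()):
            images, labels = datasets[test_src]["test"]
            results[(train_src, test_src, op_name)] = model.score(
                [op(im) for im in images], labels)
    return results

Off-diagonal entries (train_src != test_src) expose exactly the cross-source generalization gap the abstract reports.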
18. Unsupervised Multi-label Dataset Generation from Web Data [PDF] 返回目录
Carlos Roig, David Varas, Issey Masuda, Juan Carlos Riveiro, Elisenda Bou-Balust
Abstract: This paper presents a system for generating multi-label datasets from web data in an unsupervised manner. To achieve this objective, this work comprises two main contributions, namely: a) the generation of a low-noise unsupervised single-label dataset from web data, and b) the augmentation of labels in such a dataset (from single label to multi label). The generation of the single-label dataset uses an unsupervised noise reduction phase (clustering and selection of clusters using anchors), obtaining 85% correctly labeled images. An unsupervised label augmentation process is then performed to assign new labels to the images in the dataset using the class activation maps and the uncertainty associated with each class. This process is applied to the dataset generated in this paper and to a public dataset (Places365), yielding 9.5% and 27% extra labels in each dataset respectively, therefore demonstrating that the presented system can robustly enrich the initial dataset.
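The label-augmentation step lends itself to a compact sketch: compute class activation maps (CAMs) from the final convolutional features and classifier weights, then add any class whose map responds strongly and whose uncertainty is low. Shapes and thresholds here are illustrative assumptions:

import numpy as np

def augment_labels(feature_maps, fc_weights, base_label,
                   act_thresh=0.6, uncert=None, uncert_thresh=0.3):
    # feature_maps: (K, H, W) conv features of one image
    # fc_weights:   (C, K) final-layer weights, one row per class
    # uncert:       optional per-class uncertainty in [0, 1]
    cams = np.tensordot(fc_weights, feature_maps, axes=([1], [0]))  # (C, H, W)
    peak = cams.reshape(len(fc_weights), -1).max(axis=1)
    peak = (peak - peak.min()) / (peak.ptp() + 1e-8)  # normalize across classes
    labels = {base_label}
    for c, score in enumerate(peak):
        if c != base_label and score >= act_thresh \
                and (uncert is None or uncert[c] <= uncert_thresh):
            labels.add(c)
    return sorted(labels)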
19. Discriminative Multi-modality Speech Recognition [PDF] 返回目录
Bo Xu, Cheng Lu, Yandong Guo, Jacob Wang
Abstract: Vision is often used as a complementary modality for audio speech recognition (ASR), especially in noisy environments where the performance of the audio-only modality deteriorates significantly. After incorporating the visual modality, ASR is upgraded to multi-modality speech recognition (MSR). In this paper, we propose a two-stage speech recognition model. In the first stage, the target voice is separated from background noises with help from the corresponding visual information of lip movements, so that the model can understand the speech clearly. At the second stage, the audio modality is combined with the visual modality again by an MSR sub-network to better understand the speech, further improving the recognition rate. There are some other key contributions: we introduce a pseudo-3D residual convolution (P3D)-based visual front-end to extract more discriminative features; we upgrade the temporal convolution block from 1D ResNet to the temporal convolutional network (TCN), which is more suitable for temporal tasks; the MSR sub-network is built on top of the Element-wise-Attention Gated Recurrent Unit (EleAtt-GRU), which is more effective than the Transformer on long sequences. We conducted extensive experiments on the LRS3-TED and LRW datasets. Our two-stage model (audio-enhanced multi-modality speech recognition, AE-MSR) consistently achieves state-of-the-art performance by a significant margin, which demonstrates the necessity and effectiveness of AE-MSR.
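A shape-level PyTorch sketch of the two-stage data flow: stage one enhances the audio using lip features, stage two fuses the enhanced audio with the visual stream for recognition. The real P3D front-end, TCN blocks, and EleAtt-GRU are replaced by plain GRUs here, so this only illustrates the wiring:

import torch
import torch.nn as nn

class TwoStageAVSR(nn.Module):
    def __init__(self, d_audio=80, d_visual=256, d_hid=256, n_tokens=500):
        super().__init__()
        self.enhance = nn.GRU(d_audio + d_visual, d_hid, batch_first=True)
        self.to_clean = nn.Linear(d_hid, d_audio)      # stage-1 separated voice
        self.recognize = nn.GRU(d_audio + d_visual, d_hid, batch_first=True)
        self.classifier = nn.Linear(d_hid, n_tokens)

    def forward(self, audio, visual):                  # both (B, T, d_*)
        h, _ = self.enhance(torch.cat([audio, visual], dim=-1))
        clean = self.to_clean(h)                       # stage 1: denoised audio
        h2, _ = self.recognize(torch.cat([clean, visual], dim=-1))
        return self.classifier(h2)                     # stage 2: per-frame logits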
20. Effective and Robust Detection of Adversarial Examples via Benford-Fourier Coefficients [PDF] 返回目录
Chengcheng Ma, Baoyuan Wu, Shibiao Xu, Yanbo Fan, Yong Zhang, Xiaopeng Zhang, Zhifeng Li
Abstract: Adversarial examples have been well known as a serious threat to deep neural networks (DNNs). In this work, we study the detection of adversarial examples based on the assumption that the output and internal responses of one DNN model for both adversarial and benign examples follow the generalized Gaussian distribution (GGD), but with different parameters (i.e., shape factor, mean, and variance). GGD is a general distribution family that covers many popular distributions (e.g., Laplacian, Gaussian, or uniform), and it is more likely to approximate the intrinsic distribution of internal responses than any specific distribution. Besides, since the shape factor is more robust across different databases than the other two parameters, we propose to construct discriminative features via the shape factor for adversarial detection, employing the magnitude of Benford-Fourier coefficients (MBF), which can be easily estimated using responses. Finally, a support vector machine is trained as the adversarial detector by leveraging the MBF features. Extensive experiments on image classification demonstrate that the proposed detector is much more effective and robust at detecting adversarial examples from different crafting methods and different sources, compared to state-of-the-art adversarial detection methods.
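A heavily hedged sketch of one plausible reading of the MBF feature: treat the magnitudes of internal responses as generalized-Benford data and take the magnitudes of the first few Fourier coefficients of the fractional part of log10 of the responses, then train an SVM on those features; the paper's exact coefficient definition may differ:

import numpy as np
from sklearn.svm import SVC

def mbf_features(responses, n_coeffs=10, eps=1e-12):
    # responses: 1-D array of internal activations for one input
    s = np.mod(np.log10(np.abs(responses) + eps), 1.0)
    n = np.arange(1, n_coeffs + 1)
    coeffs = np.exp(-2j * np.pi * np.outer(n, s)).mean(axis=1)
    return np.abs(coeffs)  # keep magnitudes only

# Hypothetical usage with benign/adversarial response collections:
# X = np.stack([mbf_features(r) for r in benign_responses + adv_responses])
# y = np.array([0] * len(benign_responses) + [1] * len(adv_responses))
# detector = SVC(kernel="rbf").fit(X, y)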
21. DeepFaceLab: A simple, flexible and extensible face swapping framework [PDF] 返回目录
Ivan Petrov, Daiheng Gao, Nikolay Chervoniy, Kunlin Liu, Sugasa Marangonda, Chris Umé, Jian Jiang, Luis RP, Sheng Zhang, Pingyu Wu, Weiming Zhang
Abstract: DeepFaceLab is an open-source deepfake system for face swapping created by iperov, with more than 3,000 forks and 13,000 stars on Github: it provides an imperative and easy-to-use pipeline that people can use without a comprehensive understanding of deep learning frameworks or any model implementation, while remaining a flexible, loosely coupled structure for people who need to strengthen their own pipeline with other features without writing complicated boilerplate code. In this paper, we detail the principles that drive the implementation of DeepFaceLab and introduce its pipeline, through which every aspect can be modified painlessly by users to achieve their customization purposes. Notably, DeepFaceLab can achieve results with high fidelity that are indeed indiscernible by mainstream forgery detection approaches. We demonstrate the advantage of our system by comparing our approach with current prevailing systems. For more information, please visit: this https URL.
22. PSDet: Efficient and Universal Parking Slot Detection [PDF] 返回目录
Zizhang Wu, Weiwei Sun, Man Wang, Xiaoquan Wang, Lizhu Ding, Fan Wang
Abstract: While real-time parking slot detection plays a critical role in valet parking systems, existing methods have limited success in real-world applications. We attribute the unsatisfactory performance to two reasons: (i) the available datasets have limited diversity, which causes low generalization ability; (ii) expert knowledge for parking slot detection is under-estimated. Thus, we annotate a large-scale benchmark for training the network and release it for the benefit of the community. Driven by the observation of various parking lots in our benchmark, we propose the circular descriptor to regress the coordinates of parking slot vertexes and accordingly localize slots accurately. To further boost the performance, we develop a two-stage deep architecture to localize vertexes in a coarse-to-fine manner. On our benchmark and other datasets, it achieves state-of-the-art accuracy while running in real time in practice. The benchmark is available at: this https URL
23. A Novel Granular-Based Bi-Clustering Method of Deep Mining the Co-Expressed Genes [PDF] 返回目录
Kaijie Xu, Witold Pedrycz, Zhiwu Li, Yinghui Quan, Weike Nie
Abstract: Traditional clustering methods are limited when dealing with huge and heterogeneous groups of gene expression data, which motivates the development of bi-clustering methods. Bi-clustering methods are used to mine bi-clusters whose subsets of samples (genes) are co-regulated under their test conditions. Studies show that mining bi-clusters of consistent trends and trends with similar degrees of fluctuations from the gene expression data is essential in bioinformatics research. Unfortunately, traditional bi-clustering methods are not fully effective in discovering such bi-clusters. Therefore, we propose a novel bi-clustering method by involving here the theory of Granular Computing. In the proposed scheme, the gene data matrix, considered as a group of time series, is transformed into a series of ordered information granules. With the information granules we build a characteristic matrix of the gene data to capture the fluctuation trend of the expression value between consecutive conditions to mine the ideal bi-clusters. The experimental results are in agreement with the theoretical analysis, and show the excellent performance of the proposed method.
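One way to picture the characteristic-matrix idea: encode, for every gene, the direction of change between consecutive conditions, then group genes whose trend pattern agrees over a chosen window of conditions. This is a simplification for illustration, not the paper's granular algorithm:

import numpy as np
from collections import defaultdict

def trend_matrix(expr):
    # Sign of change between consecutive conditions: +1 up, -1 down, 0 flat.
    return np.sign(np.diff(expr, axis=1)).astype(int)

def constant_trend_biclusters(expr, cols):
    # Genes (rows) sharing one trend pattern over the column window `cols`
    # form a candidate co-expressed bi-cluster.
    t = trend_matrix(expr)[:, cols]
    groups = defaultdict(list)
    for gene, pattern in enumerate(map(tuple, t)):
        groups[pattern].append(gene)
    return [g for g in groups.values() if len(g) > 1]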
24. Real-time Facial Expression Recognition "In The Wild'' by Disentangling 3D Expression from Identity [PDF] 返回目录
Mohammad Rami Koujan, Luma Alharbawee, Giorgos Giannakakis, Nicolas Pugeault, Anastasios Roussos
Abstract: Human emotion analysis has been the focus of many studies, especially in the field of Affective Computing, and is important for many applications, e.g. human-computer intelligent interaction, stress analysis, interactive games, animations, etc. Solutions for automatic emotion analysis have also benefited from the development of deep learning approaches and the availability of vast amounts of visual facial data on the internet. This paper proposes a novel method for human emotion recognition from a single RGB image. We construct a large-scale dataset of facial videos (FaceVid), rich in facial dynamics, identities, expressions, appearance and 3D pose variations. We use this dataset to train a deep Convolutional Neural Network for estimating expression parameters of a 3D Morphable Model and combine it with an effective back-end emotion classifier. Our proposed framework runs at 50 frames per second and is capable of robustly estimating parameters of 3D expression variation and accurately recognizing facial expressions from in-the-wild images. We present an extensive experimental evaluation showing that the proposed method outperforms the compared techniques in estimating the 3D expression parameters and achieves state-of-the-art performance in recognising the basic emotions from facial images, as well as recognising stress from facial videos.
25. 3DV: 3D Dynamic Voxel for Action Recognition in Depth Video [PDF] 返回目录
Yancheng Wang, Yang Xiao, Fu Xiong, Wenxiang Jiang, Zhiguo Cao, Joey Tianyi Zhou, Junsong Yuan
Abstract: To facilitate depth-based 3D action recognition, 3D dynamic voxel (3DV) is proposed as a novel 3D motion representation. With 3D space voxelization, the key idea of 3DV is to compactly encode the 3D motion information within a depth video into a regular voxel set (i.e., 3DV) via temporal rank pooling. Each available 3DV voxel intrinsically encodes 3D spatial and motion features jointly. 3DV is then abstracted as a point set and input into PointNet++ for 3D action recognition, in an end-to-end learning way. The intuition for transferring 3DV into the point set form is that PointNet++ is lightweight and effective for deep feature learning on point sets. Since 3DV may lose appearance clues, a multi-stream 3D action recognition manner is also proposed to learn motion and appearance features jointly. To extract richer temporal order information of actions, we also divide the depth video into temporal splits and encode this procedure in 3DV integrally. The extensive experiments on 4 well-established benchmark datasets demonstrate the superiority of our proposition. Impressively, we achieve accuracies of 82.4% and 93.5% on NTU RGB+D 120 [13] under the cross-subject and cross-setup test settings, respectively. 3DV's code is available at this https URL.
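The temporal rank pooling step can be sketched with the standard approximate rank pooling coefficients (frame t of T weighted by 2t - T - 1): voxelize each depth frame's points and sum the weighted occupancy grids into a single motion volume. Grid size and normalization are illustrative choices:

import numpy as np

def voxelize(points, grid=32, lo=-1.0, hi=1.0):
    # Binary occupancy grid for one frame's point cloud (N, 3) in [lo, hi]^3.
    v = np.zeros((grid, grid, grid), dtype=np.float32)
    idx = ((points - lo) / (hi - lo) * grid).astype(int).clip(0, grid - 1)
    v[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return v

def dynamic_voxels(frames, grid=32):
    # Approximate temporal rank pooling over per-frame occupancy grids.
    T = len(frames)
    out = np.zeros((grid, grid, grid), dtype=np.float32)
    for t, pts in enumerate(frames, start=1):
        out += (2 * t - T - 1) * voxelize(pts, grid)
    return out  # flatten nonzero voxels into a point set for PointNet++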
26. Train and Deploy an Image Classifier for Disaster Response [PDF] 返回目录
Jianyu Mao, Kiana Harris, Nae-Rong Chang, Caleb Pennell, Yiming Ren
Abstract: With deep learning image classification becoming more powerful each year, it is apparent that its introduction to disaster response can increase the efficiency with which responders work. Using several neural network models, including AlexNet, ResNet, MobileNet, DenseNets, and a 4-layer CNN, we have classified flood disaster images from a large image dataset with up to 79% accuracy. Our models and tutorials for working with the dataset have created a foundation for others to classify other types of disasters contained in the images.
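For a task like this, the usual starting point is fine-tuning a pretrained backbone; a minimal torchvision sketch, not necessarily the authors' exact training setup:

import torch.nn as nn
from torchvision import models

def build_flood_classifier(num_classes=2):
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for p in model.parameters():
        p.requires_grad = False                     # freeze the backbone
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # trainable head
    return model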
27. Combining Deep Learning with Geometric Features for Image based Localization in the Gastrointestinal Tract [PDF] 返回目录
Jingwei Song, Mitesh Patel, Andreas Girgensohn, Chelhwon Kim
Abstract: Tracking a monocular colonoscope in the gastrointestinal (GI) tract is a challenging problem, as the images suffer from deformation, blurred textures, and significant changes in appearance, which greatly restrict the tracking ability of conventional geometry-based methods. Even though deep learning (DL) can overcome these issues, limited labeled data is a roadblock for state-of-the-art DL methods. Considering this, we propose a novel approach that combines a DL method with a traditional feature-based approach to achieve better localization with small training data. Our method fully exploits the best of both worlds by introducing a Siamese network structure to perform few-shot classification to the closest zone in the segmented training image set. The classified label is then used to initialize the pose of the scope. To fully use the training dataset, pre-generated triangulated map points within the zone in the training set are registered with the observation and contribute to estimating the optimal pose of the test image. The proposed hybrid method is extensively tested and compared with existing methods, and the results show significant improvement over traditional geometry-based or DL-based localization. The accuracy is improved by 28.94% (position) and 10.97% (orientation) with respect to the state-of-the-art method.
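The few-shot localization step reduces to a nearest-prototype lookup in an embedding space; a sketch where the hypothetical embed function stands in for the trained Siamese branch:

import numpy as np

def closest_zone(embed, query_img, zone_prototypes):
    # Assign the test frame to the training zone with the most similar
    # prototype embedding; that zone's map then initializes the scope pose.
    q = embed(query_img)
    q = q / np.linalg.norm(q)
    scores = {zone: float(proto @ q / np.linalg.norm(proto))
              for zone, proto in zone_prototypes.items()}
    return max(scores, key=scores.get)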
28. VIDIT: Virtual Image Dataset for Illumination Transfer [PDF] 返回目录
Majed El Helou, Ruofan Zhou, Johan Barthas, Sabine Süsstrunk
Abstract: Deep image relighting is gaining more interest lately, as it allows photo enhancement through illumination-specific retouching without human effort. Aside from aesthetic enhancement and photo montage, image relighting is valuable for domain adaptation, whether to augment datasets for training or to normalize input test data. Accurate relighting is, however, very challenging for various reasons, such as the difficulty in removing and recasting shadows and the modeling of different surfaces. We present a novel dataset, the Virtual Image Dataset for Illumination Transfer (VIDIT), in an effort to create a reference evaluation benchmark and to push forward the development of illumination manipulation methods. Virtual datasets are not only an important step towards achieving real-image performance but have also proven capable of improving training even when real datasets can be acquired and are available. VIDIT contains 300 virtual scenes used for training, where every scene is captured 40 times in total: from 8 equally-spaced azimuthal angles, each lit with 5 different illuminants.
29. Online Monitoring for Neural Network Based Monocular Pedestrian Pose Estimation [PDF] 返回目录
Arjun Gupta, Luca Carlone
Abstract: Several autonomy pipelines now have core components that rely on deep learning approaches. While these approaches work well in nominal conditions, they tend to have unexpected and severe failure modes that create concerns when used in safety-critical applications, including self-driving cars. There are several works that aim to characterize the robustness of networks offline, but currently there is a lack of tools to monitor the correctness of network outputs online during operation. We investigate the problem of online output monitoring for neural networks that estimate 3D human shapes and poses from images. Our first contribution is to present and evaluate model-based and learning-based monitors for a human-pose-and-shape reconstruction network, and assess their ability to predict the output loss for a given test input. As a second contribution, we introduce an Adversarially-Trained Online Monitor (ATOM) that learns how to effectively predict losses from data. ATOM dominates model-based baselines and can detect bad outputs, leading to substantial improvements in human pose output quality. Our final contribution is an extensive experimental evaluation that shows that discarding outputs flagged as incorrect by ATOM improves the average error by 12.5%, and the worst-case error by 126.5%.
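The monitor interface can be sketched as a loss regressor plus a threshold; ATOM's adversarial training is omitted here, and the regressor choice is an assumption:

from sklearn.ensemble import GradientBoostingRegressor

class LossMonitor:
    # Regress the (unobservable at test time) reconstruction loss from cheap
    # per-image features, then flag outputs whose predicted loss is too high.
    def __init__(self, threshold):
        self.reg = GradientBoostingRegressor()
        self.threshold = threshold

    def fit(self, features, true_losses):
        self.reg.fit(features, true_losses)
        return self

    def flag(self, features):
        return self.reg.predict(features) > self.threshold

Discarding the flagged outputs is what produces the error reductions quoted above.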
30. Target-Independent Domain Adaptation for WBC Classification using Generative Latent Search [PDF] 返回目录
Prashant Pandey, Prathosh AP, Vinay Kyatham, Deepak Mishra, Tathagato Rai Dastidar
Abstract: Automating the classification of camera-obtained microscopic images of White Blood Cells (WBCs) and related cell subtypes has assumed importance since it aids the laborious manual process of review and diagnosis. Several State-Of-The-Art (SOTA) methods developed using Deep Convolutional Neural Networks suffer from the problem of domain shift - severe performance degradation when they are tested on data (target) obtained in a setting different from that of the training (source). The change in the target data might be caused by factors such as differences in camera/microscope types, lenses, lighting conditions, etc. This problem can potentially be solved using Unsupervised Domain Adaptation (UDA) techniques, although standard algorithms presuppose the existence of a sufficient amount of unlabelled target data, which is not always the case with medical images. In this paper, we propose a method for UDA that does not require target data. Given a test image from the target data, we obtain its 'closest clone' from the source data, which is used as a proxy in the classifier. We prove the existence of such a clone given that an infinite number of data points can be sampled from the source distribution. We propose a method in which a latent-variable generative model based on variational inference is used to simultaneously sample and find the 'closest clone' from the source distribution through an optimization procedure in the latent space. We demonstrate the efficacy of the proposed method over several SOTA UDA methods for WBC classification on datasets captured using different imaging modalities under multiple settings.
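A minimal PyTorch sketch of generative latent search: optimize a latent code so that a source-trained decoder reconstructs the target-domain test image, then hand the decoded 'closest clone' to the source classifier. Optimizer, step count, and initialization are assumptions:

import torch
import torch.nn.functional as F

def closest_clone(decoder, target, z_dim=64, steps=200, lr=0.1):
    # decoder: generative model trained on source-domain images only
    # target:  (1, C, H, W) unlabeled target-domain test image
    z = torch.zeros(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(decoder(z), target)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return decoder(z)  # source-domain proxy for the classifier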
31. Optimizing Vessel Trajectory Compression [PDF] 返回目录
Giannis Fikioris, Kostas Patroumpas, Alexander Artikis
Abstract: In previous work we introduced a trajectory detection module that can provide summarized representations of vessel trajectories by consuming AIS positional messages online. This methodology can provide reliable trajectory synopses with little deviation from the original course by discarding at least 70% of the raw data as redundant. However, such trajectory compression is very sensitive to parametrization. In this paper, our goal is to fine-tune the selection of these parameter values. We take into account the type of each vessel in order to provide a suitable configuration that can yield improved trajectory synopses, both in terms of approximation error and compression ratio. Furthermore, we employ a genetic algorithm converging to a suitable configuration per vessel type. Our tests against a publicly available AIS dataset have shown that compression efficiency is comparable to or even better than that obtained with the default parametrization, without resorting to laborious data inspection.
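A tiny genetic-algorithm loop of the kind used for per-vessel-type tuning: individuals are parameter vectors (e.g. the compression thresholds), and fitness is a user-supplied score trading approximation error against compression ratio. The operators below are generic, not the paper's:

import random

def tune_parameters(fitness, bounds, pop=20, gens=30, seed=0):
    rng = random.Random(seed)
    rand_ind = lambda: [rng.uniform(lo, hi) for lo, hi in bounds]
    population = [rand_ind() for _ in range(pop)]
    for _ in range(gens):
        parents = sorted(population, key=fitness, reverse=True)[:pop // 2]
        children = []
        for _ in range(pop - len(parents)):
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]    # crossover
            i = rng.randrange(len(child))                  # Gaussian mutation
            lo, hi = bounds[i]
            child[i] = min(hi, max(lo, child[i] + rng.gauss(0, 0.1 * (hi - lo))))
            children.append(child)
        population = parents + children
    return max(population, key=fitness)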
32. A Parallel Hybrid Technique for Multi-Noise Removal from Grayscale Medical Images [PDF] 返回目录
Nora Youssef, Abeer M. Mahmoud, El-Sayed M. El-Horbaty
Abstract: Medical imaging is the technique used to create images of the human body or parts of it for clinical purposes. Medical images are typically large, and they are commonly corrupted by single or multiple noise types at the same time; these two factors motivate the move toward parallel image processing and the search for alternative image de-noising techniques. This paper presents a parallel hybrid filter implementation for gray-scale medical image de-noising. The hybridization combines adaptive median and Wiener filters. Parallelization is applied to the adaptive median filter to overcome the latency of its neighborhood operation, using the implicit parallelism of MATLAB 2013a's parfor. The implementation is tested on an image of 2.5 MB, which is divided into 2, 4, and 8 partitions; a comparison between the proposed implementation and a sequential implementation is given in terms of time. Each case achieves its best time when the number of threads equals the number of its partitions. Moreover, speed-up and efficiency are calculated for the algorithm and show a measurable enhancement.
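The partitioning scheme translates naturally to Python's multiprocessing (the paper uses MATLAB's parfor). A sketch with library filters standing in for the adaptive median stage; strip borders, which a faithful version would handle with halo rows, are ignored for brevity:

import numpy as np
from multiprocessing import Pool
from scipy.ndimage import median_filter
from scipy.signal import wiener

def denoise_strip(strip):
    # Hybrid stage stand-ins: median pass (impulse noise), then Wiener pass
    # (Gaussian noise); the paper's median stage is adaptive, this one is fixed.
    return wiener(median_filter(strip, size=3), mysize=3)

def parallel_denoise(img, parts=4):
    strips = np.array_split(img, parts, axis=0)
    with Pool(parts) as pool:   # invoke under `if __name__ == "__main__":`
        return np.vstack(pool.map(denoise_strip, strips))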
33. Planning to Explore via Self-Supervised World Models [PDF] 返回目录
Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak
Abstract: Reinforcement learning allows solving complex tasks; however, the learning tends to be task-specific and sample efficiency remains a challenge. We present Plan2Explore, a self-supervised reinforcement learning agent that tackles both these challenges through a new approach to self-supervised exploration and fast adaptation to new tasks, which need not be known during exploration. During exploration, unlike prior methods which retrospectively compute the novelty of observations after the agent has already reached them, our agent acts efficiently by leveraging planning to seek out expected future novelty. After exploration, the agent quickly adapts to multiple downstream tasks in a zero-shot or a few-shot manner. We evaluate on challenging control tasks from high-dimensional image inputs. Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods and, in fact, almost matches the performance of an oracle that has access to rewards. Videos and code at this https URL
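The exploration signal itself fits in a few lines: an ensemble of learned one-step dynamics models predicts the next latent state, and the planner is rewarded by the ensemble's disagreement, i.e. expected future novelty. A schematic sketch, not the full latent-space planner:

import numpy as np

def disagreement_reward(ensemble, state, action):
    # ensemble: list of one-step models, each mapping (state, action) -> next
    # latent state; high variance marks transitions the world model is
    # unsure about, which is where the agent should plan to go.
    preds = np.stack([model(state, action) for model in ensemble])  # (E, d)
    return float(preds.var(axis=0).mean())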
34. Localized convolutional neural networks for geospatial wind forecasting [PDF] 返回目录
Arnas Uselis, Mantas Lukoševičius, Lukas Stasytis
Abstract: Convolutional Neural Networks (CNN) possess many positive qualities when it comes to spatial raster data. Translation invariance enables CNNs to detect features regardless of their position in the scene. But in some domains, like geospatial, not all locations are exactly equal. In this work we propose localized convolutional neural networks that enable convolutional architectures to learn local features in addition to the global ones. We investigate their instantiations in the form of learnable inputs, local weights, and a more general form. They can be added to any convolutional layers, easily end-to-end trained, introduce minimal additional complexity, and let CNNs retain most of their benefits to the extent that they are needed. In this work we address spatio-temporal prediction: test the effectiveness of our methods on a synthetic benchmark dataset and tackle three real-world wind prediction datasets. For one of them we propose a method to spatially order the unordered data. We compare against the recent state-of-the-art spatio-temporal prediction models on the same data. Models that use convolutional layers can be and are extended with our localizations. In all these cases our extensions improve the results, and thus often the state-of-the-art. We share all the code at a public repository.
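The 'learnable inputs' instantiation is straightforward to sketch in PyTorch: append a trainable per-location map as an extra input channel so an otherwise translation-invariant convolution can specialize by position. Spatial size is fixed to one resolution for simplicity:

import torch
import torch.nn as nn

class LocalizedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k, height, width):
        super().__init__()
        self.location = nn.Parameter(torch.zeros(1, 1, height, width))
        self.conv = nn.Conv2d(in_ch + 1, out_ch, k, padding=k // 2)

    def forward(self, x):                               # x: (B, in_ch, H, W)
        loc = self.location.expand(x.size(0), -1, -1, -1)
        return self.conv(torch.cat([x, loc], dim=1))

Such a layer can replace any nn.Conv2d in an existing stack, which matches the claim that the localizations can be added to any convolutional layer.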
35. Adipose Tissue Segmentation in Unlabeled Abdomen MRI using Cross Modality Domain Adaptation [PDF] 返回目录
Samira Masoudi, Syed M. Anwar, Stephanie A. Harmon, Peter L. Choyke, Baris Turkbey, Ulas Bagci
Abstract: Abdominal fat quantification is critical since multiple vital organs are located within this region. Although computed tomography (CT) is a highly sensitive modality for segmenting body fat, it involves ionizing radiation, which makes magnetic resonance imaging (MRI) a preferable alternative for this purpose. Additionally, the superior soft tissue contrast in MRI could lead to more accurate results. Yet, it is highly labor intensive to segment fat in MRI scans. In this study, we propose an algorithm based on deep learning techniques to automatically quantify fat tissue from MR images through cross-modality adaptation. Our method does not require supervised labeling of MR scans; instead, we utilize a cycle generative adversarial network (C-GAN) to construct a pipeline that transforms the existing MR scans into their equivalent synthetic CT (s-CT) images, where fat segmentation is relatively easier due to the descriptive nature of HU (Hounsfield unit) in CT images. The fat segmentation results for MRI scans were evaluated by an expert radiologist. Qualitative evaluation of our segmentation results shows average success scores of 3.80/5 and 4.54/5 for visceral and subcutaneous fat segmentation in MR images, respectively.
36. Unpaired Motion Style Transfer from Video to Animation [PDF]
Kfir Aberman, Yijia Weng, Dani Lischinski, Daniel Cohen-Or, Baoquan Chen
Abstract: Transferring the motion style from one animation clip to another, while preserving the motion content of the latter, has been a long-standing problem in character animation. Most existing data-driven approaches are supervised and rely on paired data, where motions with the same content are performed in different styles. In addition, these approaches are limited to transfer of styles that were seen during training. In this paper, we present a novel data-driven framework for motion style transfer, which learns from an unpaired collection of motions with style labels, and enables transferring motion styles not observed during training. Furthermore, our framework is able to extract motion styles directly from videos, bypassing 3D reconstruction, and apply them to the 3D input motion. Our style transfer network encodes motions into two latent codes, for content and for style, each of which plays a different role in the decoding (synthesis) process. While the content code is decoded into the output motion by several temporal convolutional layers, the style code modifies deep features via temporally invariant adaptive instance normalization (AdaIN). Moreover, while the content code is encoded from 3D joint rotations, we learn a common embedding for style from either 3D or 2D joint positions, enabling style extraction from videos. Our results are comparable to the state-of-the-art, despite not requiring paired training data, and outperform other methods when transferring previously unseen styles. To our knowledge, we are the first to demonstrate style transfer directly from videos to 3D animations - an ability which enables one to extend the set of style examples far beyond motions captured by MoCap systems.
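The AdaIN step described above has a compact form: normalize the content features per channel, then re-scale and re-shift them with statistics predicted from the style code. A minimal sketch for temporal (1-D) features, with the style statistics held constant over time (the "temporally invariant" part):

```python
import torch

def adain(content: torch.Tensor, style_mean: torch.Tensor,
          style_std: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization over the temporal axis.

    content:               (batch, channels, time) deep content features
    style_mean, style_std: (batch, channels, 1), predicted from the style code
                           and broadcast over time.
    """
    mu = content.mean(dim=2, keepdim=True)
    sigma = content.std(dim=2, keepdim=True)
    return style_std * (content - mu) / (sigma + eps) + style_mean
```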
37. Very High Resolution Land Cover Mapping of Urban Areas at Global Scale with Convolutional Neural Networks [PDF]
Thomas Tilak, Arnaud Braun, David Chandler, Nicolas David, Sylvain Galopin, Amélie Lombard, Michaël Michaud, Camille Parisel, Matthieu Porte, Marjorie Robert
Abstract: This paper describes a methodology to produce a 7-class land cover map of urban areas from very high resolution images and limited noisy labeled data. The objective is to make a segmentation map of a large area (a French department) with the following classes: asphalt, bare soil, building, grassland, mineral material (permeable artificialized areas), forest, and water, from 20 cm aerial images and a Digital Height Model. We created a training dataset on a few areas of interest by aggregating databases, semi-automatic classification, and manual annotation to obtain complete ground truth for each class. A comparative study of different encoder-decoder architectures (U-Net, U-Net with ResNet encoders, DeepLab v3+) with different loss functions is presented. The final product is a highly valuable land cover map computed from model predictions stitched together, binarized, and refined before vectorization.
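Producing a department-scale map from a patch-based segmentation model implies tiled inference followed by stitching. A minimal sketch under the assumption of a fully convolutional model that accepts the tile sizes it is given; in practice, overlapping tiles with blending reduce seam artifacts:

```python
import torch

def predict_large_raster(model: torch.nn.Module, image: torch.Tensor,
                         tile: int = 512, n_classes: int = 7) -> torch.Tensor:
    """Run a segmentation model over a large aerial image tile by tile.

    image: (channels, H, W) tensor; returns a per-pixel class map (H, W)
    which would later be binarized per class and vectorized.
    """
    _, h, w = image.shape
    logits = torch.zeros(n_classes, h, w)
    with torch.no_grad():
        for y in range(0, h, tile):
            for x in range(0, w, tile):
                patch = image[:, y:y + tile, x:x + tile].unsqueeze(0)
                logits[:, y:y + tile, x:x + tile] = model(patch)[0]
    return logits.argmax(dim=0)
```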
38. Invertible Image Rescaling [PDF]
Mingqing Xiao, Shuxin Zheng, Chang Liu, Yaolong Wang, Di He, Guolin Ke, Jiang Bian, Zhouchen Lin, Tie-Yan Liu
Abstract: High-resolution digital images are usually downscaled to fit various display screens or to save storage and bandwidth costs, while post-upscaling is adopted to recover the original resolution or the details in zoomed-in images. However, typical image downscaling is a non-injective mapping due to the loss of high-frequency information, which leads to the ill-posed problem of the inverse upscaling procedure and poses great challenges for recovering details from the downscaled low-resolution images. Simply upscaling with image super-resolution methods results in unsatisfactory recovery performance. In this work, we propose to solve this problem by modeling the downscaling and upscaling processes from a new perspective, i.e. as an invertible bijective transformation, which can largely mitigate the ill-posed nature of image upscaling. We develop an Invertible Rescaling Net (IRN) with a deliberately designed framework and objectives to produce visually pleasing low-resolution images and meanwhile capture the distribution of the lost information using a latent variable following a specified distribution in the downscaling process. In this way, upscaling is made tractable by inversely passing a randomly drawn latent variable together with the low-resolution image through the network. Experimental results demonstrate the significant improvement of our model over existing methods in terms of both quantitative and qualitative evaluations of image upscaling reconstruction from downscaled images.
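The key property is an exactly invertible mapping between the HR image and an (LR image, latent z) pair, with z pushed toward a fixed distribution during training so that a fresh sample can replace it at upscaling time. A toy additive-coupling block illustrating the invertibility (not IRN's actual architecture):

```python
import torch

class AdditiveCoupling(torch.nn.Module):
    """Toy invertible block: y1 = x1, y2 = x2 + f(x1); exactly invertible."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x1, x2):
        return x1, x2 + self.f(x1)

    def inverse(self, y1, y2):
        return y1, y2 - self.f(y1)

# Downscaling: (x1, x2) -> (lr_branch, z), with z trained toward N(0, I).
# Upscaling:   sample z ~ N(0, I) and run inverse(lr_branch, z).
```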
39. Understanding and Correcting Low-quality Retinal Fundus Images for Clinical Analysis [PDF]
Ziyi Shen, Huazhu Fu, Jianbing Shen, Ling Shao
Abstract: Retinal fundus images are widely used for clinical screening and diagnosis of eye diseases. However, fundus images captured by operators with various levels of experience vary greatly in quality. Low-quality fundus images increase uncertainty in clinical observation and lead to a risk of misdiagnosis. Due to the special optical beam of fundus imaging and the retinal structure, natural image enhancement methods cannot be utilized directly. In this paper, we first analyze the ophthalmoscope imaging system and reliably model the degradation caused by the major quality-degrading factors, including uneven illumination, blur, and artifacts. Then, based on the degradation model, a clinical-oriented fundus enhancement network (cofe-Net) is proposed to suppress the global degradation factors while simultaneously preserving anatomical retinal structures and pathological characteristics for clinical observation and analysis. Experiments on both synthetic and real fundus images demonstrate that our algorithm effectively corrects low-quality fundus images without losing retinal details. Moreover, we also show that the fundus correction method can benefit medical image analysis applications, e.g., retinal vessel segmentation and optic disc/cup detection.
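Such a degradation model can be used to synthesize low-quality training pairs from clean fundus images. A hedged sketch covering two of the named factors (uneven illumination as a smooth multiplicative field, plus blur); all parameter values are illustrative, not the paper's calibrated model:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade_fundus(img: np.ndarray, sigma_blur: float = 3.0,
                   illum_strength: float = 0.4, seed: int = 0) -> np.ndarray:
    """Synthesize a low-quality fundus image: uneven illumination + blur.

    img: (H, W, 3) float array in [0, 1].
    """
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    # Smooth random field -> multiplicative uneven-illumination mask.
    field = gaussian_filter(rng.standard_normal((h, w)), sigma=min(h, w) / 8)
    field = 1.0 + illum_strength * field / (np.abs(field).max() + 1e-8)
    blurred = gaussian_filter(img, sigma=(sigma_blur, sigma_blur, 0))
    return np.clip(blurred * field[..., None], 0.0, 1.0)
```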
40. Multi-Channel Transfer Learning of Chest X-ray Images for Screening of COVID-19 [PDF]
Sampa Misra, Seungwan Jeon, Seiyon Lee, Ravi Managuli, Chulhong Kim
Abstract: The 2019 novel coronavirus (COVID-19) has spread rapidly around the world and is affecting society as a whole. The current gold standard test for screening COVID-19 patients is the polymerase chain reaction test. However, COVID-19 test kits are not widely available, and testing is time-consuming. Thus, as an alternative, chest X-rays are being considered for quick screening. Since the presentation of COVID-19 in chest X-rays varies in its features, and specialization is required to read COVID-19 chest X-rays, their use for diagnosis is limited. To help radiologists read chest X-rays quickly, we present a multi-channel transfer learning model based on the ResNet architecture to facilitate the diagnosis of COVID-19 from chest X-rays. Three ResNet-based models (Models a, b, and c) were retrained using Dataset_A (1579 normal and 4429 diseased), Dataset_B (4245 pneumonia and 1763 non-pneumonia), and Dataset_C (184 COVID-19 and 5824 non-COVID-19), respectively, to classify (a) normal or diseased, (b) pneumonia or non-pneumonia, and (c) COVID-19 or non-COVID-19. Finally, these three models were ensembled and fine-tuned using Dataset_D (1579 normal, 4245 pneumonia, and 184 COVID-19) to classify normal, pneumonia, and COVID-19 cases. Our results show that the ensemble model is more accurate than the single ResNet model, which is also re-trained using Dataset_D, as it extracts more relevant semantic features for each class. Our approach provides a precision of 94% and a recall of 100%. Thus, our method could potentially help clinicians in screening patients for COVID-19, thus facilitating immediate triaging and treatment for better outcomes.
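A minimal sketch of how three retrained ResNet branches might be fused into a single 3-way classifier; feature concatenation with a linear head is one plausible fusion, assumed here rather than taken from the paper:

```python
import torch
import torchvision

def resnet_backbone() -> torch.nn.Module:
    m = torchvision.models.resnet18(weights=None)
    m.fc = torch.nn.Identity()  # expose the 512-d feature vector
    return m

class EnsembleCXR(torch.nn.Module):
    """Three retrained ResNet branches fused into one 3-way classifier."""
    def __init__(self):
        super().__init__()
        # Each branch would be initialized from Models a, b, and c.
        self.branches = torch.nn.ModuleList(resnet_backbone() for _ in range(3))
        self.head = torch.nn.Linear(3 * 512, 3)  # normal / pneumonia / COVID-19

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.head(feats)
```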
41. High-Fidelity Accelerated MRI Reconstruction by Scan-Specific Fine-Tuning of Physics-Based Neural Networks [PDF]
Seyed Amir Hossein Hosseini, Burhaneddin Yaman, Steen Moeller, Mehmet Akçakaya
Abstract: Long scan duration remains a challenge for high-resolution MRI. Deep learning has emerged as a powerful means for accelerated MRI reconstruction by providing data-driven regularizers that are directly learned from data. These data-driven priors typically remain unchanged for future data in the testing phase once they are learned during training. In this study, we propose to use a transfer learning approach to fine-tune these regularizers for new subjects using a self-supervision approach. While the proposed approach can compromise the extremely fast reconstruction time of deep learning MRI methods, our results on knee MRI indicate that such adaptation can substantially reduce the remaining artifacts in reconstructed images. In addition, the proposed approach has the potential to reduce the risk of poor generalization to rare pathological conditions that may be absent from the training data.
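The self-supervised fine-tuning can be driven purely by data consistency with the acquired k-space samples of the test scan itself. A much-simplified single-coil sketch (real implementations handle multi-coil data, coil sensitivities, and complex-valued networks):

```python
import torch

def finetune_on_scan(net: torch.nn.Module, y_kspace: torch.Tensor,
                     mask: torch.Tensor, steps: int = 50, lr: float = 1e-5):
    """Scan-specific self-supervised fine-tuning (simplified, single-coil).

    y_kspace: acquired undersampled k-space (complex); mask: 0/1 sampling
    pattern. The net (assumed to accept complex images, or adapted to
    real/imag channels) is tuned so its reconstruction agrees with the
    measured k-space samples of this one scan.
    """
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    x_zero_filled = torch.fft.ifft2(y_kspace * mask)
    for _ in range(steps):
        recon = net(x_zero_filled)
        loss = torch.mean(torch.abs(mask * torch.fft.fft2(recon)
                                    - mask * y_kspace))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```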
42. Making Robots Draw A Vivid Portrait In Two Minutes [PDF]
Fei Gao, Jingjie Zhu, Zeyuan Yu, Peng Li, Tao Wang
Abstract: Significant progress has been made with artistic robots. However, existing robots fail to produce high-quality portraits in a short time. In this work, we present a drawing robot, which can automatically transform a facial photograph into a vivid portrait and then draw it on paper within two minutes on average. At the heart of our system is a novel portrait synthesis algorithm based on deep learning. Innovatively, we employ a self-consistency loss, which makes the algorithm capable of generating continuous and smooth brush-strokes. Besides, we propose a componential-sparsity constraint to reduce the number of brush-strokes over insignificant areas. We also implement a local sketch synthesis algorithm, and several pre- and post-processing techniques to deal with the background and details. The portrait produced by our algorithm successfully captures individual characteristics by using a sparse set of continuous brush-strokes. Finally, the portrait is converted to a sequence of trajectories and reproduced by a 3-degree-of-freedom robotic arm. The whole portrait-drawing robotic system is named AiSketcher. Extensive experiments show that AiSketcher can produce considerably high-quality sketches for a wide range of pictures, including faces in the wild and universal images of arbitrary content. To our best knowledge, AiSketcher is the first portrait-drawing robot that uses deep learning techniques. AiSketcher has been presented at a number of exhibitions and has shown remarkable performance under diverse circumstances.
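The stroke-to-trajectory conversion at the end of the pipeline can be as simple as inserting pen-lift moves between stroke polylines; a sketch with illustrative pen heights (the robot's actual kinematics and units are not specified here):

```python
from typing import List, Tuple

Point = Tuple[float, float]

def strokes_to_trajectory(strokes: List[List[Point]],
                          z_draw: float = 0.0, z_lift: float = 10.0):
    """Turn 2D brush strokes into (x, y, z) waypoints with pen lifts."""
    traj = []
    for stroke in strokes:
        x0, y0 = stroke[0]
        traj.append((x0, y0, z_lift))                     # approach above start
        traj.extend((x, y, z_draw) for x, y in stroke)    # pen down, draw
        xn, yn = stroke[-1]
        traj.append((xn, yn, z_lift))                     # lift before next stroke
    return traj
```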
43. Jigsaw-VAE: Towards Balancing Features in Variational Autoencoders [PDF]
Saeid Asgari Taghanaki, Mohammad Havaei, Alex Lamb, Aditya Sanghi, Ara Danielyan, Tonya Custis
Abstract: The latent variables learned by VAEs have seen considerable interest as an unsupervised way of extracting features, which can then be used for downstream tasks. There is a growing interest in the question of whether features learned on one environment will generalize across different environments. We demonstrate here that VAE latent variables often focus on some factors of variation at the expense of others - in this case we refer to the features as "imbalanced". Feature imbalance leads to poor generalization when the latent variables are used in an environment where the presence of features changes. Similarly, latent variables trained with imbalanced features induce the VAE to generate less diverse (i.e. biased towards dominant features) samples. To address this, we propose a regularization scheme for VAEs, which we show substantially addresses the feature imbalance problem. We also introduce a simple metric to measure the balance of features in generated images.
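The paper's balance metric operates on generated images; a related, generic diagnostic (not the paper's metric) is to inspect how unevenly the KL term is distributed across latent dimensions, since a handful of dimensions carrying most of the KL is one symptom of the imbalance discussed above:

```python
import torch

def per_dim_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL(q(z|x) || N(0, I)) per latent dimension, averaged over a batch.

    mu, logvar: (batch, z_dim) encoder outputs. A few dimensions carrying
    most of the KL suggests imbalanced features (generic VAE diagnostic).
    """
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar)  # (batch, z_dim)
    return kl.mean(dim=0)
```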
44. Deep Medical Image Analysis with Representation Learning and Neuromorphic Computing [PDF]
Neil Getty, Thomas Brettin, Dong Jin, Rick Stevens, Fangfang Xia
Abstract: We explore three representative lines of research and demonstrate the utility of our methods on a classification benchmark of brain cancer MRI data. First, we present a capsule network that explicitly learns a representation robust to rotation and affine transformation. This model requires less training data and outperforms both the original convolutional baseline and a previous capsule network implementation. Second, we leverage the latest domain adaptation techniques to achieve a new state-of-the-art accuracy. Our experiments show that non-medical images can be used to improve model performance. Finally, we design a spiking neural network trained on the Intel Loihi neuromorphic chip (Fig. 1 shows an inference snapshot). This model consumes much lower power while achieving reasonable accuracy given the model reduction. We posit that more research in this direction combining hardware and learning advancements will power future medical imaging (on-device AI, few-shot prediction, adaptive scanning).
45. Identifying Mechanical Models through Differentiable Simulations [PDF]
Changkyu Song, Abdeslam Boularias
Abstract: This paper proposes a new method for manipulating unknown objects through a sequence of non-prehensile actions that displace an object from its initial configuration to a given goal configuration on a flat surface. The proposed method leverages recent progress in differentiable physics models to identify unknown mechanical properties of manipulated objects, such as the inertia matrix, friction coefficients, and external forces acting on the object. To this end, a recently proposed differentiable physics engine for two-dimensional objects is adopted in this work and extended to handle forces in three-dimensional space. The proposed model identification technique analytically computes the gradient of the distance between the forecasted poses of objects and their actual observed poses, and utilizes that gradient to search for values of the mechanical properties that reduce the reality gap. Experiments with real objects, using a real robot to gather data, show that the proposed approach can identify the mechanical properties of heterogeneous objects on the fly.
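The identification loop reduces to gradient descent on simulation parameters through a differentiable simulator. A toy 1-D sketch with friction as the only unknown (values illustrative; the paper's engine handles full 2-D/3-D rigid-body dynamics):

```python
import torch

def simulate(x0, v0, mu, dt=0.01, steps=100):
    """Toy differentiable simulator: object sliding forward under Coulomb friction."""
    x, v = x0, v0
    for _ in range(steps):
        v = torch.clamp(v - mu * 9.81 * dt, min=0.0)  # friction decelerates, never reverses
        x = x + v * dt
    return x

# Identify the friction coefficient from one observed push (illustrative values).
mu = torch.tensor(0.5, requires_grad=True)
opt = torch.optim.Adam([mu], lr=0.01)
x_observed = torch.tensor(0.17)        # measured final pose of the real object
for _ in range(200):
    x_pred = simulate(torch.tensor(0.0), torch.tensor(1.0), mu)
    loss = (x_pred - x_observed) ** 2  # distance between forecasted and observed pose
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(mu))                       # converges near the true coefficient
```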
46. MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning [PDF]
Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit Bansal
Abstract: Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history so as to help better prediction of the next sentence (w.r.t. coreference and repetition aspects), thus encouraging coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets ActivityNet Captions and YouCookII show that MART generates more coherent and less repetitive paragraph captions than baseline methods, while maintaining relevance to the input video events. All code is available open-source at: this https URL
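MART's memory module maintains a summarized state across video segments; the sketch below shows a generic gated (GRU-style) memory update as one plausible form of such a recurrence, assumed for illustration rather than copied from the paper:

```python
import torch

class GatedMemory(torch.nn.Module):
    """Gated memory update between video segments (GRU-style sketch)."""
    def __init__(self, d: int):
        super().__init__()
        self.gate = torch.nn.Linear(2 * d, d)
        self.cand = torch.nn.Linear(2 * d, d)

    def forward(self, memory: torch.Tensor, segment_feats: torch.Tensor):
        # Summarize the segment's token features, then gate it into memory.
        summary = segment_feats.mean(dim=1)                           # (batch, d)
        z = torch.sigmoid(self.gate(torch.cat([memory, summary], dim=-1)))
        h = torch.tanh(self.cand(torch.cat([memory, summary], dim=-1)))
        return (1 - z) * memory + z * h                               # next memory state
```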