Table of Contents
4. BioMetricNet: deep unconstrained face verification through learning of metrics regularized onto Gaussian distributions [PDF] Abstract
7. Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [PDF] Abstract
9. Estimating Magnitude and Phase of Automotive Radar Signals under Multiple Interference Sources with Fully Convolutional Networks [PDF] Abstract
16. Recurrent Deconvolutional Generative Adversarial Networks with Application to Text Guided Video Generation [PDF] Abstract
25. SkeletonNet: A Topology-Preserving Solution for Learning Mesh Reconstruction of Object Surfaces from RGB Images [PDF] Abstract
30. Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition [PDF] Abstract
31. Alleviating Human-level Shift : A Robust Domain Adaptation Method for Multi-person Pose Estimation [PDF] Abstract
33. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D [PDF] Abstract
37. Visual Localization for Autonomous Driving: Mapping the Accurate Location in the City Maze [PDF] Abstract
39. Feature Binding with Category-Dependant MixUp for Semantic Segmentation and Adversarial Robustness [PDF] Abstract
41. Sparse Coding Driven Deep Decision Tree Ensembles for Nuclear Segmentation in Digital Pathology Images [PDF] Abstract
42. ISIA Food-500: A Dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network [PDF] Abstract
43. Few shot clustering for indoor occupancy detection with extremely low-quality images from battery free cameras [PDF] Abstract
45. Self-Path: Self-supervision for Classification of Pathology Images with Limited Annotations [PDF] Abstract
47. Facial Expression Recognition Under Partial Occlusion from Virtual Reality Headsets based on Transfer Learning [PDF] Abstract
53. Multi-Mask Self-Supervised Learning for Physics-Guided Neural Networks in Highly Accelerated MRI [PDF] Abstract
55. Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations [PDF] Abstract
57. Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [PDF] Abstract
58. Multi-Modality Pathology Segmentation Framework: Application to Cardiac Magnetic Resonance Images [PDF] Abstract
62. Towards Modality Transferable Visual Information Representation with Optimal Model Compression [PDF] Abstract
Abstracts
1. Full-Body Awareness from Partial Observations [PDF] Back to Contents
Chris Rockwell, David F. Fouhey
Abstract: There has been great progress in human 3D mesh recovery and great interest in learning about the world from consumer video data. Unfortunately current methods for 3D human mesh recovery work rather poorly on consumer video data, since on the Internet, unusual camera viewpoints and aggressive truncations are the norm rather than a rarity. We study this problem and make a number of contributions to address it: (i) we propose a simple but highly effective self-training framework that adapts human 3D mesh recovery systems to consumer videos and demonstrate its application to two recent systems; (ii) we introduce evaluation protocols and keypoint annotations for 13K frames across four consumer video datasets for studying this task, including evaluations on out-of-image keypoints; and (iii) we show that our method substantially improves PCK and human-subject judgments compared to baselines, both on test videos from the dataset it was trained on, as well as on three other datasets without further adaptation. Project website: this https URL
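The abstract does not spell out the self-training framework, so here is a minimal sketch of the generic pseudo-labeling pattern such a framework could follow. Everything here is an assumption for illustration: the hypothetical model is taken to return 2D keypoints with per-keypoint confidences, and the confidence threshold is arbitrary.

```python
import torch

def self_training_epoch(model, optimizer, unlabeled_loader, keypoint_loss, conf_thresh=0.9):
    """One pass of pseudo-label self-training on unlabeled consumer video frames."""
    for frames in unlabeled_loader:
        with torch.no_grad():
            pseudo_kp, conf = model(frames)   # pseudo-labels from the current model
        mask = conf > conf_thresh             # keep only confident keypoints
        if not mask.any():
            continue
        pred_kp, _ = model(frames)            # second forward pass, with gradients
        loss = keypoint_loss(pred_kp[mask], pseudo_kp[mask])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```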
2. DSDNet: Deep Structured self-Driving Network [PDF] Back to Contents
Wenyuan Zeng, Shenlong Wang, Renjie Liao, Yun Chen, Bin Yang, Raquel Urtasun
Abstract: In this paper, we propose the Deep Structured self-Driving Network (DSDNet), which performs object detection, motion prediction, and motion planning with a single neural network. Towards this goal, we develop a deep structured energy based model which considers the interactions between actors and produces socially consistent multimodal future predictions. Furthermore, DSDNet explicitly exploits the predicted future distributions of actors to plan a safe maneuver by using a structured planning cost. Our sample-based formulation allows us to overcome the difficulty in probabilistic inference of continuous random variables. Experiments on a number of large-scale self driving datasets demonstrate that our model significantly outperforms the state-of-the-art.
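As a rough illustration of what a sample-based planning formulation looks like, the sketch below scores candidate ego trajectories against Monte-Carlo samples drawn from predicted actor distributions and picks the cheapest one. The cost weighting and the comfort_cost callback are assumptions, not the paper's actual structured planning cost.

```python
import numpy as np

def choose_maneuver(candidate_trajs, actor_samples, comfort_cost, collision_radius=2.0):
    # candidate_trajs: (K, T, 2) ego trajectories; actor_samples: (S, A, T, 2)
    # trajectory samples for A actors drawn from the predicted future distribution.
    d = np.linalg.norm(candidate_trajs[:, None, None] - actor_samples[None], axis=-1)
    p_collision = (d < collision_radius).any(axis=(2, 3)).mean(axis=1)  # (K,) Monte-Carlo estimate
    cost = comfort_cost(candidate_trajs) + 100.0 * p_collision          # comfort_cost: one scalar
    return candidate_trajs[np.argmin(cost)]                             # per candidate; weight illustrative
```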
3. Towards Visually Explaining Similarity Models [PDF] Back to Contents
Meng Zheng, Srikrishna Karanam, Terrence Chen, Richard J. Radke, Ziyan Wu
Abstract: We consider the problem of visually explaining similarity models, i.e., explaining why a model predicts two images to be similar in addition to producing a scalar score. While much recent work in visual model interpretability has focused on gradient-based attention, these methods rely on a classification module to generate visual explanations. Consequently, they cannot readily explain other kinds of models that do not use or need classification-like loss functions (e.g., similarity models trained with a metric learning loss). In this work, we bridge this crucial gap, presenting the first method to generate gradient-based visual explanations for image similarity predictors. By relying solely on the learned feature embedding, we show that our approach can be applied to any kind of CNN-based similarity architecture, an important step towards generic visual explainability. We show that our resulting visual explanations serve more than just interpretability; they can be infused into the model learning process itself with new trainable constraints based on our similarity explanations. We show that the resulting similarity models perform, and can be visually explained, better than the corresponding baseline models trained without our explanation constraints. We demonstrate our approach using extensive experiments on three different kinds of tasks: generic image retrieval, person re-identification, and low-shot semantic segmentation.
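Since the method derives explanations from gradients of the similarity score through the learned embedding, a Grad-CAM-style stand-in conveys the idea; this is not the authors' exact formulation, and `backbone` is assumed to return the last convolutional feature map.

```python
import torch
import torch.nn.functional as F

def similarity_cam(backbone, img_a, img_b):
    feats_a = backbone(img_a)                       # (1, C, H, W) conv features
    feats_b = backbone(img_b)
    emb_a = F.normalize(feats_a.mean(dim=(2, 3)), dim=1)
    emb_b = F.normalize(feats_b.mean(dim=(2, 3)), dim=1)
    score = (emb_a * emb_b).sum()                   # cosine similarity of the pair
    grads = torch.autograd.grad(score, feats_a)[0]  # gradients w.r.t. activations
    weights = grads.mean(dim=(2, 3), keepdim=True)  # per-channel importance
    cam = F.relu((weights * feats_a).sum(dim=1))    # (1, H, W) saliency for image A
    return cam / (cam.max() + 1e-8)
```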
4. BioMetricNet: deep unconstrained face verification through learning of metrics regularized onto Gaussian distributions [PDF] Back to Contents
Arslan Ali, Matteo Testa, Tiziano Bianchi, Enrico Magli
Abstract: We present BioMetricNet: a novel framework for deep unconstrained face verification which learns a regularized metric to compare facial features. Differently from popular methods such as FaceNet, the proposed approach does not impose any specific metric on facial features; instead, it shapes the decision space by learning a latent representation in which matching and non-matching pairs are mapped onto clearly separated and well-behaved target distributions. In particular, the network jointly learns the best feature representation, and the best metric that follows the target distributions, to be used to discriminate face images. In this paper we present this general framework, first of its kind for facial verification, and tailor it to Gaussian distributions. This choice enables the use of a simple linear decision boundary that can be tuned to achieve the desired trade-off between false alarm and genuine acceptance rate, and leads to a loss function that can be written in closed form. Extensive analysis and experimentation on publicly available datasets such as Labeled Faces in the Wild (LFW), YouTube Faces (YTF), Celebrities in Frontal-Profile in the Wild (CFP), and challenging datasets like cross-age LFW (CALFW), cross-pose LFW (CPLFW), and the In-the-wild Age Dataset (AgeDB) show a significant performance improvement and confirm the effectiveness and superiority of BioMetricNet over existing state-of-the-art methods.
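To make "regularizing the metric onto Gaussian target distributions" concrete, here is a toy loss that pulls the network's decision statistic for matching and non-matching pairs toward two separated Gaussians. The closed-form loss in the paper is more elaborate; the means and variance below are placeholders.

```python
import torch

def gaussian_target_loss(z, is_match, mu_match=-1.0, mu_nonmatch=1.0, sigma=0.1):
    # z: (B,) latent statistic the metric network outputs for each face pair;
    # is_match: (B,) bool. Negative log-likelihood under the assigned Gaussian target.
    mu = torch.where(is_match, torch.full_like(z, mu_match), torch.full_like(z, mu_nonmatch))
    return ((z - mu) ** 2 / (2 * sigma ** 2)).mean()
```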
5. Testing the Safety of Self-driving Vehicles by Simulating Perception and Prediction [PDF] Back to Contents
Kelvin Wong, Qiang Zhang, Ming Liang, Bin Yang, Renjie Liao, Abbas Sadat, Raquel Urtasun
Abstract: We present a novel method for testing the safety of self-driving vehicles in simulation. We propose an alternative to sensor simulation, as sensor simulation is expensive and has large domain gaps. Instead, we directly simulate the outputs of the self-driving vehicle's perception and prediction system, enabling realistic motion planning testing. Specifically, we use paired data in the form of ground truth labels and real perception and prediction outputs to train a model that predicts what the online system will produce. Importantly, the inputs to our system consists of high definition maps, bounding boxes, and trajectories, which can be easily sketched by a test engineer in a matter of minutes. This makes our approach a much more scalable solution. Quantitative results on two large-scale datasets demonstrate that we can realistically test motion planning using our simulations.
6. Black Magic in Deep Learning: How Human Skill Impacts Network Training [PDF] Back to Contents
Kanav Anand, Ziqi Wang, Marco Loog, Jan van Gemert
Abstract: How does a user's prior experience with deep learning impact accuracy? We present an initial study based on 31 participants with different levels of experience. Their task is to perform hyperparameter optimization for a given deep learning architecture. The results show a strong positive correlation between the participant's experience and the final performance. They additionally indicate that an experienced participant finds better solutions using fewer resources on average. The data suggests furthermore that participants with no prior experience follow random strategies in their pursuit of optimal hyperparameters. Our study investigates the subjective human factor in comparisons of state of the art results and scientific reproducibility in deep learning.
7. Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [PDF] Back to Contents
Ling-An Zeng, Fa-Ting Hong, Wei-Shi Zheng, Qi-Zhi Yu, Wei Zeng, Yao-Wei Wang, Jian-Huang Lai
Abstract: The objective of action quality assessment is to score sports videos. However, most existing works focus only on video dynamic information (i.e., motion information) but ignore the specific postures that an athlete is performing in a video, which is important for action assessment in long videos. In this work, we present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos. To learn more discriminative representations for videos, we not only learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames, which represent the action quality at certain moments, with the help of the proposed hybrid dynamic-static architecture. Moreover, we leverage a context-aware attention module, consisting of a temporal instance-wise graph convolutional network unit and an attention unit, for both streams to extract more robust stream features, where the former explores the relations between instances and the latter assigns a proper weight to each instance. Finally, we combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts. Additionally, we have collected and annotated the new Rhythmic Gymnastics dataset, which contains videos of four different types of gymnastics routines, for evaluation of action quality assessment in long videos. Extensive experimental results validate the efficacy of our proposed method, which outperforms related approaches. The codes and dataset are available at this https URL.
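As a small illustration of the attention unit's job of assigning a proper weight to each instance, the following sketch pools clip-level instance features with learned attention weights before score regression; shapes and the scorer are hypothetical, not the released code.

```python
import torch
import torch.nn.functional as F

def attention_pool(instance_feats, attn_scorer):
    # instance_feats: (B, N, D) features of N temporal instances;
    # attn_scorer: any module mapping D -> 1, e.g. torch.nn.Linear(D, 1).
    w = F.softmax(attn_scorer(instance_feats).squeeze(-1), dim=1)  # (B, N) weights
    return (w.unsqueeze(-1) * instance_feats).sum(dim=1)           # (B, D) pooled feature
```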
8. SIDOD: A Synthetic Image Dataset for 3D Object Pose Recognition with Distractors [PDF] Back to Contents
Mona Jalal, Josef Spjut, Ben Boudaoud, Margrit Betke
Abstract: We present a new, publicly-available image dataset generated by the NVIDIA Deep Learning Data Synthesizer intended for use in object detection, pose estimation, and tracking applications. This dataset contains 144k stereo image pairs that synthetically combine 18 camera viewpoints of three photorealistic virtual environments with up to 10 objects (chosen randomly from the 21 object models of the YCB dataset [1]) and flying distractors. Object and camera pose, scene lighting, and quantity of objects and distractors were randomized. Each provided view includes RGB, depth, segmentation, and surface normal images, all pixel level. We describe our approach for domain randomization and provide insight into the decisions that produced the dataset.
9. Estimating Magnitude and Phase of Automotive Radar Signals under Multiple Interference Sources with Fully Convolutional Networks [PDF] Back to Contents
Nicolae-Cătălin Ristea, Andrei Anghel, Radu Tudor Ionescu
Abstract: Radar sensors are gradually becoming widespread equipment in road vehicles, playing a crucial role in autonomous driving and road safety. The broad adoption of radar sensors increases the chance of interference among sensors from different vehicles, generating corrupted range profiles and range-Doppler maps. In order to extract the distance and velocity of multiple targets from range-Doppler maps, the interference affecting each range profile needs to be mitigated. In this paper, we propose a fully convolutional neural network for automotive radar interference mitigation. In order to train our network in a real-world scenario, we introduce a new data set of realistic automotive radar signals with multiple targets and multiple interferers. To our knowledge, this is the first work to mitigate interference from multiple sources. Furthermore, we introduce a new training regime that eliminates noisy weights, showing superior results compared to the widely-used dropout. While some previous works successfully estimated the magnitude of automotive radar signals, we are the first to propose a deep learning model that can accurately estimate the phase. For instance, our novel approach reduces the phase estimation error with respect to the commonly-adopted zeroing technique by half, from 12.55 degrees to 6.58 degrees. Considering the lack of databases for automotive radar interference mitigation, we release as open source our large-scale data set, which closely replicates the real-world automotive scenario for multiple interference cases, allowing others to objectively compare their future work in this domain. Our data set is available for download at: this http URL.
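The abstract specifies only that the network is fully convolutional; a minimal 1-D variant that regresses a cleaned complex range profile (real/imaginary parts as channels), from which magnitude and phase follow, might look like this. Layer count and widths are assumptions.

```python
import torch.nn as nn

class InterferenceMitigationFCN(nn.Module):
    """Illustrative 1-D fully convolutional net for radar interference mitigation."""

    def __init__(self, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, width, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(width, width, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(width, 2, kernel_size=7, padding=3),
        )

    def forward(self, x):       # x: (B, 2, N) real/imag parts of the interfered signal
        return self.net(x)      # cleaned signal; take abs()/angle() for magnitude/phase
```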
10. On failures of RGB cameras and their effects in autonomous driving applications [PDF] Back to Contents
Francesco Secci, Andrea Ceccarelli
Abstract: RGB cameras are arguably one of the most relevant sensors for autonomous driving applications. It is undeniable that failures of vehicle cameras may compromise the autonomous driving task, possibly leading to unsafe behaviors when images that are subsequently processed by the driving system are altered. To support the definition of safe and robust vehicle architectures and intelligent systems, in this paper we define the failures model of a vehicle camera, together with an analysis of effects and known mitigations. Further, we build a software library for the generation of the corresponding failed images and we feed them to the trained agent of an autonomous driving simulator: the misbehavior of the trained agent allows a better understanding of failures effects and especially of the resulting safety risk.
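In the spirit of the failure-injection library described above, a toy image corruptor is sketched below; the three modes are generic examples and not the paper's actual failure taxonomy.

```python
import numpy as np

def inject_failure(img, mode="banding", rng=None):
    # img: (H, W, 3) uint8 frame; returns a corrupted copy.
    rng = rng if rng is not None else np.random.default_rng(0)
    out = img.copy()
    if mode == "banding":         # dead rows, as from a readout fault
        rows = rng.choice(img.shape[0], size=max(img.shape[0] // 20, 1), replace=False)
        out[rows] = 0
    elif mode == "overexposure":  # global brightness clipping
        out = np.clip(img.astype(np.int16) + 120, 0, 255).astype(np.uint8)
    elif mode == "blackout":      # total sensor failure
        out[:] = 0
    return out
```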
11. End-to-end Contextual Perception and Prediction with Interaction Transformer [PDF] Back to Contents
Lingyun Luke Li, Bin Yang, Ming Liang, Wenyuan Zeng, Mengye Ren, Sean Segal, Raquel Urtasun
Abstract: In this paper, we tackle the problem of detecting objects in 3D and forecasting their future motion in the context of self-driving. Towards this goal, we design a novel approach that explicitly takes into account the interactions between actors. To capture their spatial-temporal dependencies, we propose a recurrent neural network with a novel Transformer architecture, which we call the Interaction Transformer. Importantly, our model can be trained end-to-end, and runs in real-time. We validate our approach on two challenging real-world datasets: ATG4D and nuScenes. We show that our approach can outperform the state-of-the-art on both datasets. In particular, we significantly improve the social compliance between the estimated future trajectories, resulting in far fewer collisions between the predicted actors.
12. DFEW: A Large-Scale Database for Recognizing Dynamic Facial Expressions in the Wild [PDF] Back to Contents
Xingxun Jiang, Yuan Zong, Wenming Zheng, Chuangao Tang, Wanchuang Xia, Cheng Lu, Jiateng Liu
Abstract: Recently, facial expression recognition (FER) in the wild has gained a lot of researchers' attention because it is a valuable topic for moving FER techniques from the laboratory to real applications. In this paper, we focus on this challenging but interesting topic and make contributions from three aspects. First, we present a new large-scale 'in-the-wild' dynamic facial expression database, DFEW (Dynamic Facial Expression in the Wild), consisting of over 16,000 video clips from thousands of movies. These video clips contain various challenging interferences in practical scenarios such as extreme illumination, occlusions, and capricious pose changes. Second, we propose a novel method called the Expression-Clustered Spatiotemporal Feature Learning (EC-STFL) framework to deal with dynamic FER in the wild. Third, we conduct extensive benchmark experiments on DFEW using many spatiotemporal deep feature learning methods as well as our proposed EC-STFL. Experimental results show that DFEW is a well-designed and challenging database, and the proposed EC-STFL can promisingly improve the performance of existing spatiotemporal deep neural networks in coping with the problem of dynamic FER in the wild. Our DFEW database is publicly available and can be freely downloaded from this https URL.
13. LGNN: a Context-aware Line Segment Detector [PDF] Back to Contents
Quan Meng, Jiakai Zhang, Qiang Hu, Xuming He, Jingyi Yu
Abstract: We present a novel real-time line segment detection scheme called Line Graph Neural Network (LGNN). Existing approaches require a computationally expensive verification or postprocessing step. Our LGNN employs a deep convolutional neural network (DCNN) for proposing line segments directly, with a graph neural network (GNN) module for reasoning about their connectivities. Specifically, LGNN exploits a new quadruplet representation for each line segment, where the GNN module takes the predicted candidates as vertexes and constructs a sparse graph to enforce structural context. Compared with the state-of-the-art, LGNN achieves near real-time performance without compromising accuracy. LGNN further enables time-sensitive 3D applications. When a 3D point cloud is accessible, we present a multi-modal line segment classification technique for extracting a 3D wireframe of the environment robustly and efficiently.
14. DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis [PDF] Back to Contents
Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Fei Wu, Xiao-Yuan Jing
Abstract: Synthesizing high-resolution realistic images from text descriptions is a challenging task. Almost all existing text-to-image methods employ stacked generative adversarial networks as the backbone, utilize cross-modal attention mechanisms to fuse text and image features, and use extra networks to ensure text-image semantic consistency. The existing text-to-image models have three problems: 1) For the backbone, multiple generators and discriminators are stacked to generate images at different scales, making the training process slow and inefficient. 2) For semantic consistency, the existing models employ extra networks, increasing the training complexity and adding computational cost. 3) For text-image feature fusion, cross-modal attention is only applied a few times during the generation process because of its computational cost, which prevents the text and image features from being fused deeply. To address these limitations, we propose 1) a novel simplified text-to-image backbone that synthesizes high-quality images directly with a single pair of generator and discriminator, 2) a novel regularization method called Matching-Aware zero-centered Gradient Penalty, which encourages the generator to synthesize more realistic and semantically consistent images without introducing extra networks, and 3) a novel fusion module called Deep Text-Image Fusion Block, which can exploit the semantics of text descriptions effectively and fuse text and image features deeply during the generation process. Compared with previous text-to-image models, our DF-GAN is simpler, more efficient, and achieves better performance. Extensive experiments and ablation studies on both the Caltech-UCSD Birds 200 and COCO datasets demonstrate the superiority of the proposed model over state-of-the-art models.
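A compact rendering of a matching-aware, zero-centered gradient penalty: it penalizes the discriminator's gradient norm at real images paired with their matching text embeddings. The exponent and scale below are illustrative constants, and the discriminator interface is assumed.

```python
import torch

def matching_aware_gp(discriminator, real_imgs, text_emb, k=2.0, p=6.0):
    # Zero-centered gradient penalty on real (image, matching-text) pairs.
    real_imgs = real_imgs.detach().requires_grad_(True)
    text_emb = text_emb.detach().requires_grad_(True)
    out = discriminator(real_imgs, text_emb)
    grads = torch.autograd.grad(out.sum(), [real_imgs, text_emb], create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    return k * grad_norm.pow(p).mean()
```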
15. Self-supervised Video Representation Learning by Pace Prediction [PDF] Back to Contents
Jiangliu Wang, Jianbo Jiao, Yun-Hui Liu
Abstract: This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction. It stems from the observation that the human visual system is sensitive to video pace, e.g., slow motion, a widely used technique in film making. Specifically, given a video played at natural pace, we randomly sample training clips at different paces and ask a neural network to identify the pace of each video clip. The assumption here is that the network can only succeed in such a pace reasoning task when it understands the underlying video content and learns representative spatio-temporal features. In addition, we further introduce contrastive learning to push the model towards discriminating different paces by maximizing the agreement on similar video content. To validate the effectiveness of the proposed method, we conduct extensive experiments on action recognition and video retrieval tasks with several alternative network architectures. Experimental evaluations show that our approach achieves state-of-the-art performance for self-supervised video representation learning across different network architectures and different benchmarks. The code and pre-trained models are available at this https URL.
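The pretext task itself is easy to state in code: sample a clip at a random frame stride (the "pace") and train a classifier to recover that pace. A minimal sampler, assuming the video is long enough for the largest pace, is sketched below.

```python
import torch

def sample_pace_clip(video, clip_len=16, paces=(1, 2, 4, 8)):
    # video: (T, C, H, W) tensor. Returns a clip sampled at a random pace and the
    # pace class index, used as the label for the self-supervised pace classifier.
    pace_idx = int(torch.randint(len(paces), (1,)))
    stride = paces[pace_idx]
    span = clip_len * stride
    start = int(torch.randint(video.shape[0] - span + 1, (1,)))  # assumes T >= span
    return video[start : start + span : stride], pace_idx
```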
16. Recurrent Deconvolutional Generative Adversarial Networks with Application to Text Guided Video Generation [PDF] Back to Contents
Hongyuan Yu, Yan Huang, Lihong Pi, Liang Wang
Abstract: This paper proposes a novel model for video generation and, in particular, attempts to deal with the problem of video generation from text descriptions, i.e., synthesizing realistic videos conditioned on given texts. Existing video generation methods cannot be easily adapted to handle this task well, due to the frame discontinuity issue and their text-free generation schemes. To address these problems, we propose a recurrent deconvolutional generative adversarial network (RD-GAN), which includes a recurrent deconvolutional network (RDN) as the generator and a 3D convolutional neural network (3D-CNN) as the discriminator. The RDN is a deconvolutional version of the conventional recurrent neural network, which can effectively model the long-range temporal dependency of generated video frames and make good use of conditional information. The proposed model can be jointly trained by pushing the RDN to generate realistic videos so that the 3D-CNN cannot distinguish them from real ones. We apply the proposed RD-GAN to a series of tasks including conventional video generation, conditional video generation, video prediction, and video classification, and demonstrate its effectiveness by achieving good performance.
17. Localizing the Common Action Among a Few Videos [PDF] Back to Contents
Pengwan Yang, Vincent Tao Hu, Pascal Mettes, Cees G. M. Snoek
Abstract: This paper strives to localize the temporal extent of an action in a long untrimmed video. Where existing work leverages many examples with their start, their ending, and/or the class of the action during training time, we propose few-shot common action localization. The start and end of an action in a long untrimmed video is determined based on just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module to simultaneously complement the representation of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module to weigh the importance of different support videos. Evaluation of few-shot common action localization in untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal.
18. Shift Equivariance in Object Detection [PDF] 返回目录
Marco Manfredi, Yu Wang
Abstract: Robustness to small image translations is a highly desirable property for object detectors. However, recent works have shown that CNN-based classifiers are not shift invariant. It is unclear to what extent this could impact object detection, mainly because of the architectural differences between the two and the dimensionality of the prediction space of modern detectors. To assess shift equivariance of object detection models end-to-end, in this paper we propose an evaluation metric, built upon a greedy search of the lower and upper bounds of the mean average precision on a shifted image set. Our new metric shows that modern object detection architectures, no matter if one-stage or two-stage, anchor-based or anchor-free, are sensitive to even one pixel shift to the input images. Furthermore, we investigate several possible solutions to this problem, both taken from the literature and newly proposed, quantifying the effectiveness of each one with the suggested metric. Our results indicate that none of these methods can provide full shift equivariance. Measuring and analyzing the extent of shift variance of different models and the contributions of possible factors, is a first step towards being able to devise methods that mitigate or even leverage such variabilities.
摘要:鲁棒性小的图像转换为物体的探测器高度期望的性质。然而,最近的工作表明,基于CNN-分类不平移不变。目前还不清楚到什么程度,这可能影响到的主要是因为这两个与现代探测器的预测空间的维度之间的架构差异物体检测。为了评估对象检测模型的偏移同变性端至端,在本文中我们提出了一种评估度量,当在一个移动图像集的贪婪搜索的中值平均精度下界和上界的建造。我们的新的指标显示,现代对象检测架构,不管一阶段或两阶段,锚基或无锚,是连一个像素移动,以输入图像敏感。此外,我们研究这个问题的几种可能的解决方案,无论是从文献中得到与新提出的,量化每一个与建议的指标的有效性。我们的研究结果表明,这些方法都可以提供完整的转移同变性。测量和分析不同的模型的偏移方差和的可能因素的贡献的程度,是迈向能够设计方法可减轻或甚至杠杆这样变性的第一步。
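The metric can be pictured as scoring the detector under every small translation of the evaluation set and reading off the extremes; the sketch below does an exhaustive sweep where the paper uses a greedy search, and `eval_fn` is a hypothetical caller-supplied routine that returns, e.g., mAP on the shifted set.

```python
import itertools

def shift_metric_bounds(eval_fn, max_shift=3):
    """Score a detector under all integer shifts up to max_shift pixels and
    return the lower/upper bounds of the metric. eval_fn(dx, dy) -> float
    must shift both the images and the ground-truth boxes consistently."""
    r = range(-max_shift, max_shift + 1)
    scores = {(dx, dy): eval_fn(dx, dy) for dx, dy in itertools.product(r, r)}
    return min(scores.values()), max(scores.values()), scores
```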
19. Predicting Visual Overlap of Images Through Interpretable Non-Metric Box Embeddings [PDF] 返回目录
Anita Rau, Guillermo Garcia-Hernando, Danail Stoyanov, Gabriel J. Brostow, Daniyar Turmukhambetov
Abstract: To what extent are two images picturing the same 3D surfaces? Even when this is a known scene, the answer typically requires an expensive search across scale space, with matching and geometric verification of large sets of local features. This expense is further multiplied when a query image is evaluated against a gallery, e.g. in visual relocalization. While we don't obviate the need for geometric verification, we propose an interpretable image-embedding that cuts the search in scale space to essentially a lookup. Our approach measures the asymmetric relation between two images. The model then learns a scene-specific measure of similarity, from training examples with known 3D visible-surface overlaps. The result is that we can quickly identify, for example, which test image is a close-up version of another, and by what scale factor. Subsequently, local features need only be detected at that scale. We validate our scene-specific model by showing how this embedding yields competitive image-matching results, while being simpler, faster, and also interpretable by humans.
摘要:在何种程度上是两个图像描绘相同的3D面?即使这是一个已知的场景,答案通常需要跨越尺度空间昂贵的搜索,以匹配和大型成套局部特征的几何验证。当查询图像与图库评估了此费用还相乘,例如在视觉重新定位。虽然我们不排除对几何的验证中,我们提出了一个解释的图像嵌入的是削减尺度空间基本上查找搜索。我们的方法测量两个图像之间的不对称关系。然后,该模型学习相似性的特定场景的措施,从训练实例与已知的3D可见表面重叠。其结果是,我们可以快速识别,例如,该测试图像是另一个特写版,以及以何种比例因子。随后,本地特性只需要在这种规模检测。我们通过展示验证我们的现场,具体型号是如何嵌入产量有竞争力的图像匹配的结果,而被更简单,更快,也可解释人类。
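The asymmetry of the relation is natural once images are embedded as boxes: the fraction of box A covered by box B differs from the fraction of B covered by A. A sketch with a (min, max)-corner parameterization, which is one common box-embedding convention and not necessarily the paper's:

```python
import torch

def box_overlap(box_a, box_b, eps=1e-8):
    """Asymmetric overlap from box embeddings: the fraction of box A's
    volume covered by box B. Boxes: (..., 2, D) tensors of (min, max)."""
    inter = (torch.minimum(box_a[..., 1, :], box_b[..., 1, :]) -
             torch.maximum(box_a[..., 0, :], box_b[..., 0, :])).clamp(min=0)
    vol_a = (box_a[..., 1, :] - box_a[..., 0, :]).clamp(min=0).prod(-1)
    return inter.prod(-1) / (vol_a + eps)   # overlap(A, B) != overlap(B, A)
```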
20. CycleMorph: Cycle Consistent Unsupervised Deformable Image Registration [PDF] 返回目录
Boah Kim, Dong Hwan Kim, Seong Ho Park, Jieun Kim, June-Goo Lee, Jong Chul Ye
Abstract: Image registration is a fundamental task in medical image analysis. Recently, deep learning based image registration methods have been extensively investigated due to their excellent performance despite the ultra-fast computational time. However, the existing deep learning methods still have limitations in preserving the original topology during deformation with registration vector fields. To address these issues, here we present a cycle-consistent deformable image registration method. The cycle consistency enhances image registration performance by providing an implicit regularization to preserve topology during the deformation. The proposed method is so flexible that it can be applied to both 2D and 3D registration problems for various applications, and can be easily extended to a multi-scale implementation to deal with the memory issues in large volume registration. Experimental results on various datasets from medical and non-medical applications demonstrate that the proposed method provides effective and accurate registration on diverse image pairs within a few seconds. Qualitative and quantitative evaluations on deformation fields also verify the effectiveness of the cycle consistency of the proposed method.
摘要:图像配准是医学图像分析的根本任务。近日,深基础的学习图像配准方法已广泛,由于其优异的性能,尽管超高速计算时间的影响。然而,现有的深学习方法仍然有登记的矢量场的变形过程中原始拓扑的保存的限制。为了解决这个问题,在这里我们提出一个周期一致的变形图像配准。通过提供一种隐式正则化的周期一致性增强了图像配准的性能在变形过程中保持拓扑。所提出的方法是如此灵活,可以应用于用于各种应用的2D和3D配准的问题,并且可以容易地扩展到多大规模实施以处理大体积配准的存储器的问题。从医疗和非医疗应用中的各种数据集的实验结果表明,所提出的方法提供了在几秒钟之内不同图像对有效和准确的配准。上变形场的定性和定量评价也验证所提出的方法的周期一致性的有效性。
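A 2D sketch of the cycle-consistency term: deforming A onto B and the result back onto A should recover A, and symmetrically for B. The registration-network interface `net(x, y) -> displacement field` is an assumption of this sketch; the paper also handles 3D volumes.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Warp (B, C, H, W) images by a (B, 2, H, W) displacement field given
    in normalized [-1, 1] coordinates."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).to(img).expand(B, H, W, 2)
    return F.grid_sample(img, base + flow.permute(0, 2, 3, 1),
                         align_corners=True)

def cycle_loss(net, a, b):
    a2b = warp(a, net(a, b))
    b2a = warp(b, net(b, a))
    return (warp(a2b, net(a2b, a)) - a).abs().mean() + \
           (warp(b2a, net(b2a, b)) - b).abs().mean()
```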
21. Weakly Supervised Generative Network for Multiple 3D Human Pose Hypotheses [PDF] 返回目录
Chen Li, Gim Hee Lee
Abstract: 3D human pose estimation from a single image is an inverse problem due to the inherent ambiguity of the missing depth. Several previous works addressed the inverse problem by generating multiple hypotheses. However, these works are strongly supervised and require ground truth 2D-to-3D correspondences which can be difficult to obtain. In this paper, we propose a weakly supervised deep generative network to address the inverse problem and circumvent the need for ground truth 2D-to-3D correspondences. To this end, we design our network to model a proposal distribution which we use to approximate the unknown multi-modal target posterior distribution. We achieve the approximation by minimizing the KL divergence between the proposal and target distributions, and this leads to a 2D reprojection error and a prior loss term that can be weakly supervised. Furthermore, we determine the most probable solution as the conditional mode of the samples using the mean-shift algorithm. We evaluate our method on three benchmark datasets -- Human3.6M, MPII and MPI-INF-3DHP. Experimental results show that our approach is capable of generating multiple feasible hypotheses and achieves state-of-the-art results compared to existing weakly supervised approaches. Our source code is available at the project website.
摘要:三维人体姿势从单个图像估计的逆问题,由于缺少深度的固有的模糊性。以前的几部作品通过生成多个假设解决的逆问题。然而,这些作品是强监督,需要地面实况2D到3D的对应关系可以是很难获得的。在本文中,我们提出了一个弱监督深深生成网络解决逆问题和规避地面实况2D到3D对应的需求。为此,我们设计的网络模型我们用它来近似未知的多模态目标后验分布的建议分布。我们通过最小化的建议和目标分布,这导致了2D投影误差和可以弱监督的事先损失项之间的KL散度达到近似。此外,我们确定最可能的解决方案,使用均值漂移算法样品的条件模式。我们评估我们对三个标准数据集的方法 - Human3.6M,MPII和MPI-INF-3DHP。实验结果表明,我们的方法能够产生多种可行的假设和实现先进的最先进的结果相比,现有的弱监督的方法。我们的源代码可在项目网站。
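The weak supervision hinges on the reprojection term: every sampled 3D hypothesis, pushed through a camera model, should land on the observed 2D keypoints. A sketch with a weak-perspective camera, which is a common simplification and an assumption here:

```python
import torch

def reprojection_loss(pose3d, pose2d, scale):
    """pose3d: (B, S, J, 3) sampled hypotheses, pose2d: (B, J, 2) observed
    keypoints, scale: (B, 1, 1, 1) weak-perspective camera scale."""
    proj = scale * pose3d[..., :2]                  # drop depth, apply scale
    return (proj - pose2d[:, None]).norm(dim=-1).mean()
```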
22. Powers of layers for image-to-image translation [PDF] 返回目录
Hugo Touvron, Matthijs Douze, Matthieu Cord, Hervé Jégou
Abstract: We propose a simple architecture to address unpaired image-to-image translation tasks: style or class transfer, denoising, deblurring, deblocking, etc. We start from an image autoencoder architecture with fixed weights. For each task we learn a residual block operating in the latent space, which is iteratively called until the target domain is reached. A specific training schedule is required to alleviate the exponentiation effect of the iterations. At test time, it offers several advantages: the number of weight parameters is limited and the compositional design allows one to modulate the strength of the transformation with the number of iterations. This is useful, for instance, when the type or amount of noise to suppress is not known in advance. Experimentally, we provide proofs of concept showing the interest of our method for many transformations. The performance of our model is comparable to or better than CycleGAN, with significantly fewer parameters.
摘要:我们提出了一个简单的架构,地址配对的图像到影像翻译任务:样式或类别转移,噪声,去模糊,去块效应等,我们从与固定权重的图像自动编码器架构开始。对于每一个任务中,我们学到了残余块的潜在空间,这是反复调用,直到达到目标域中运行。具体的训练计划需要缓解迭代幂效果。在测试时,它提供了几个优点:重量参数的数量是有限的,成分设计允许一个调制与迭代次数转型的力量。这是有用的,例如,当噪声的抑制的类型或量预先是未知的。在实验中,我们提供的概念,显示我们方法的许多变革的兴趣证明。我们的模型的性能相当或比CycleGAN有显著较少的参数更好。
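The architecture is compact enough to sketch end to end: a frozen autoencoder plus one learnable residual block applied n times in latent space, where n tunes the strength of the transformation. The block's layout below is an illustrative choice.

```python
import torch
import torch.nn as nn

class PowersOfLayers(nn.Module):
    """Iterate one residual block R inside a frozen autoencoder (E, D)."""
    def __init__(self, encoder, decoder, dim):
        super().__init__()
        self.E, self.D = encoder, decoder
        for p in list(self.E.parameters()) + list(self.D.parameters()):
            p.requires_grad_(False)                 # autoencoder stays fixed
        self.R = nn.Sequential(nn.Conv2d(dim, dim, 3, 1, 1), nn.ReLU(True),
                               nn.Conv2d(dim, dim, 3, 1, 1))

    def forward(self, x, n_iter=4):
        z = self.E(x)
        for _ in range(n_iter):                     # n modulates strength
            z = z + self.R(z)
        return self.D(z)
```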
23. Adversarial Knowledge Transfer from Unlabeled Data [PDF] 返回目录
Akash Gupta, Rameswar Panda, Sujoy Paul, Jianming Zhang, Amit K. Roy-Chowdhury
Abstract: While machine learning approaches to visual recognition offer great promise, most of the existing methods rely heavily on the availability of large quantities of labeled training data. However, in the vast majority of real-world settings, manually collecting such large labeled datasets is infeasible due to the cost of labeling data or the paucity of data in a given domain. In this paper, we present a novel Adversarial Knowledge Transfer (AKT) framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier on a given visual recognition task. The proposed adversarial learning framework aligns the feature space of the unlabeled source data with the labeled target data such that the target classifier can be used to predict pseudo labels on the source data. An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task, unlike some existing approaches. Extensive experiments demonstrate that models learned using our approach hold a lot of promise across a variety of visual recognition tasks on multiple standard datasets.
摘要:虽然机器学习方法,以视觉识别提供了极大的承诺,大多数现有的方法严重依赖于大量标注的训练数据的可用性。然而,在绝大多数的真实世界设置,手动收集如此大的数据集标记不可行由于标记数据的成本或数据的在给定域的缺乏。在本文中,我们提出了从互联网规模未标记的数据传输知识,提高在给定的视觉识别任务分类性能的新型对抗性知识转移(AKT)的框架。所提出的对抗性学习框架与对齐,使得目标分类器可以用于对源数据预测伪标签标记的靶数据未标记的源数据的特征空间。我们的方法的一个重要新颖方面是,未标记的源数据可以从那些标记的目标数据的不同类别的,并且没有必要定义一个单独的借口任务,不像一些现有的方法。大量的实验也证明,模型中使用我们的方法保持很多的承诺在各种对多种标准数据集视觉识别任务的经验教训。
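One hedged reading of the training loop: a domain discriminator pulls unlabeled-source features toward the labeled-target feature space, after which the target classifier's own predictions serve as pseudo labels on the source. The step below is a generic adversarial-alignment sketch, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def akt_step(feat_net, classifier, disc, x_src, x_tgt, y_tgt, opt_f, opt_d):
    f_src, f_tgt = feat_net(x_src), feat_net(x_tgt)

    # 1) Discriminator: source features -> 0, target features -> 1.
    d_s, d_t = disc(f_src.detach()), disc(f_tgt.detach())
    d_loss = F.binary_cross_entropy_with_logits(d_s, torch.zeros_like(d_s)) + \
             F.binary_cross_entropy_with_logits(d_t, torch.ones_like(d_t))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Features/classifier: fool the discriminator, fit the target labels,
    #    and self-train on pseudo labels predicted for the aligned source.
    g_s = disc(f_src)
    pseudo = classifier(f_src).argmax(1).detach()
    g_loss = F.binary_cross_entropy_with_logits(g_s, torch.ones_like(g_s)) + \
             F.cross_entropy(classifier(f_tgt), y_tgt) + \
             F.cross_entropy(classifier(f_src), pseudo)
    opt_f.zero_grad(); g_loss.backward(); opt_f.step()
```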
24. Pose Estimation for Vehicle-mounted Cameras via Horizontal and Vertical Planes [PDF] 返回目录
Istvan Gergo Gal, Daniel Barath, Levente Hajder
Abstract: We propose two novel solvers for estimating the egomotion of a calibrated camera, mounted on a moving vehicle, from a single affine correspondence by recovering special homographies. For the first class of solvers, the sought plane is expected to be perpendicular to one of the camera axes. For the second class, the plane is orthogonal to the ground with unknown normal, e.g., it is a building facade. Both formulations reduce to a linear system with a small coefficient matrix and are thus extremely efficient to solve. Both the minimal and over-determined cases can be handled by the proposed methods. They are tested on synthetic data and on publicly available real-world datasets. The novel methods are more accurate than or comparable to the traditional algorithms, and are faster when included in state-of-the-art robust estimators.
摘要:我们提出了两种新的求解器用于估计校准的相机的自我运动经由回收特殊单应性地安装到移动车辆从单个仿射对应。对于第一类解算器,所寻求的平面预期为垂直于相机轴中的一个。对于第二类,所述平面垂直于与未知正常,例如地面,它是一种建筑物的外墙。这两种方法都是通过线性系统来解决具有小系数矩阵,从而,是极其有效的。无论是最小的,并且在确定的情况下,可以通过所提出的方法来解决。他们是在综合的数据和公开可用的现实世界的数据集进行测试。的新颖的方法是更精确的或可比的传统算法并且包括在本领域稳健估计的状态时更快。
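Solver-specific derivations aside, the final numerical step of such linear solvers is standard: stack the constraints from the affine correspondence into a small matrix A and take the null-space direction via SVD. Building A itself is the paper-specific part and is omitted here.

```python
import torch

def solve_homogeneous(A):
    """Least-squares solution of the homogeneous system A h = 0: the right
    singular vector belonging to the smallest singular value."""
    _, _, vh = torch.linalg.svd(A)
    return vh[-1] / vh[-1].norm()
```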
25. SkeletonNet: A Topology-Preserving Solution for Learning Mesh Reconstruction of Object Surfaces from RGB Images [PDF] 返回目录
Jiapeng Tang, Xiaoguang Han, Mingkui Tan, Xin Tong, Kui Jia
Abstract: This paper focuses on the challenging task of learning 3D object surface reconstructions from RGB images. Existing methods achieve varying degrees of success by using different surface representations. However, they all have their own drawbacks, and cannot properly reconstruct the surface shapes of complex topologies, arguably due to a lack of constraints on the topological structures in their learning frameworks. To this end, we propose to learn and use the topology-preserved, skeletal shape representation to assist the downstream task of object surface reconstruction from RGB images. Technically, we propose the novel SkeletonNet design that learns a volumetric representation of a skeleton via a bridged learning of a skeletal point set, where we use parallel decoders each responsible for the learning of points on 1D skeletal curves and 2D skeletal sheets, as well as an efficient module of globally guided subvolume synthesis for a refined, high-resolution skeletal volume; we present a differentiable Point2Voxel layer to make SkeletonNet end-to-end and trainable. With the learned skeletal volumes, we propose two models, the Skeleton-Based Graph Convolutional Neural Network (SkeGCNN) and the Skeleton-Regularized Deep Implicit Surface Network (SkeDISN), which respectively build upon and improve over the existing frameworks of explicit mesh deformation and implicit field learning for the downstream surface reconstruction task. We conduct thorough experiments that verify the efficacy of our proposed SkeletonNet. SkeGCNN and SkeDISN outperform existing methods as well, and they have their own merits when measured by different metrics. Additional results in generalized task settings further demonstrate the usefulness of our proposed methods. We have made both our implementation code and the ShapeNet-Skeleton dataset publicly available at this https URL.
摘要:本文着重从RGB图像学习三维物体表面重建的艰巨任务。 Existingmethods实现通过使用不同的表面表示不同程度的成功。然而,他们都有自己的缺点,并不能很好地重建复杂的拓扑结构的表面形状,可以说是由于缺乏对他们的学习框架的topologicalstructures约束。为此,我们提出了学习和使用的拓扑结构保存完好,骨骼形状representationto从RGB图像辅助物体表面重建的下游任务。从技术上讲,我们建议,学习的骨架经由桥接学习骨骼点集合,在这里我们使用paralleldecoders各自负责点的一维骨架曲线和2D骨架片的学习,以及作为有效的模块的体积表示的novelSkeletonNetdesign ofglobally引导用于精制的,高分辨率的骨骼体积的子体积的合成;我们提出了一个differentiablePoint2Voxellayer tomake SkeletonNet端至端和接受训练的。随着学习的骨骼量,我们提出了两种型号,骨架为基础GraphConvolutional神经网络(SkeGCNN)和骷髅正则深隐式曲面网(SkeDISN),其中respectivelybuild后,提高了明确的网格变形和隐场的现有框架学习下游surfacereconstruction任务。我们进行充分的实验证明,验证我们提出SkeletonNet的功效。 SkeGCNN和SkeDISNoutperform现有的方法为好,而当通过不同的指标衡量他们有自己的优点。 ingeneralized任务设置附加结果进一步证明我们提出的方法的有效性。我们已经取得了我们的执行codeand在此HTTPS URL的ShapeNet骨架数据集中公布于BLE。
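A sketch of what a differentiable Point2Voxel layer can look like: trilinear splatting of a point set into a voxel grid, so gradients flow to the points through the interpolation weights. The resolution and the clamped soft-occupancy output are assumptions of this sketch rather than the paper's exact layer.

```python
import torch

def point2voxel(points, res=32):
    """points: (B, N, 3) in [-1, 1] -> (B, res, res, res) soft occupancy."""
    B, N, _ = points.shape
    grid = (points.clamp(-1, 1) * 0.5 + 0.5) * (res - 1)   # voxel coordinates
    lo = grid.floor().long().clamp(0, res - 2)             # lower corner
    frac = grid - lo.float()
    vol = points.new_zeros(B, res ** 3)
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wx = frac[..., 0] if dx else 1 - frac[..., 0]
                wy = frac[..., 1] if dy else 1 - frac[..., 1]
                wz = frac[..., 2] if dz else 1 - frac[..., 2]
                idx = ((lo[..., 0] + dx) * res +
                       (lo[..., 1] + dy)) * res + (lo[..., 2] + dz)
                vol.scatter_add_(1, idx, wx * wy * wz)      # trilinear weight
    return vol.view(B, res, res, res).clamp(max=1.0)
```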
26. Reliability of Decision Support in Cross-spectral Biometric-enabled Systems [PDF] 返回目录
Kenneth Lai, Svetlana N. Yanushkevich, Vlad Shmerko
Abstract: This paper addresses the evaluation of the performance of decision support systems that utilize face and facial expression biometrics. The evaluation criteria include risk of error and related reliability of decision, as well as their contribution to changes in the perceived operator's trust in the decision. The relevant applications include human behavior monitoring and stress detection in individuals and teams, and situational awareness systems. Using an available database of cross-spectral videos of faces and facial expressions, we conducted a series of experiments that demonstrate the phenomenon of biases in biometrics that affect the evaluated measures of performance in human-machine systems.
摘要:本文针对利用脸和表情生物的决策支持系统的性能进行评估。评估标准包括错误的风险与决策相关的可靠性,以及他们在感知操作者在决定信任的变化贡献。相关的应用包括人的行为进行监控并在个人和团队应力检测,并在态势感知系统。使用的面孔和表情的互谱视频可用的数据库,我们进行了一系列的展示影响在人机系统的性能的评估措施,生物识别偏差的现象实验。
27. An Ensemble of Knowledge Sharing Models for Dynamic Hand Gesture Recognition [PDF] 返回目录
Kenneth Lai, Svetlana Yanushkevich
Abstract: The focus of this paper is dynamic gesture recognition in the context of the interaction between humans and machines. We propose a model consisting of two sub-networks, a transformer and an ordered-neuron long-short-term-memory (ON-LSTM) based recurrent neural network (RNN). Each sub-network is trained to perform the task of gesture recognition using only skeleton joints. Since each sub-network extracts different types of features due to the difference in architecture, the knowledge can be shared between the sub-networks. Through knowledge distillation, the features and predictions from each sub-network are fused together into a new fusion classifier. In addition, a cyclical learning rate can be used to generate a series of models that are combined in an ensemble, in order to yield a more generalizable prediction. The proposed ensemble of knowledge-sharing models exhibits an overall accuracy of 86.11% using only skeleton information, as tested on the Dynamic Hand Gesture-14/28 dataset.
摘要:本文研究的重点是人与机器之间的交互的情况下动态手势识别。我们建议由两个子网络,变压器和基于回归神经网络(RNN)的有序 - 神经元的长短期记忆(ON-LSTM)的模型。每个子网络进行训练,以执行仅使用骨骼关节手势识别的任务。由于每个子网络由于在体系结构中的差提取不同类型的特征,所述知识可以在子网络之间共享。通过知识蒸馏,从各子网络的特点和预测融合在一起成为一种新的融合分类。此外,周期性学习速率可以用来产生一系列的被组合在合奏模型中,为了产生更一般化预测。所提出的合奏知识共享模型表现出的整体的86.11%仅使用骨骼信息,作为使用的动态手势-14/28型的数据集进行测试精度
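The cyclical-learning-rate ensemble can be sketched as snapshot ensembling: train with a cyclical schedule, keep a model copy at the end of each cycle, and average softmax outputs at test time. Optimizer settings and cycle lengths below are illustrative.

```python
import copy
import torch

def train_snapshot_ensemble(model, loader, loss_fn, cycles=3, cycle_steps=1000):
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    sched = torch.optim.lr_scheduler.CyclicLR(
        opt, base_lr=1e-4, max_lr=1e-2, step_size_up=cycle_steps // 2)
    snapshots, step = [], 0
    for x, y in loader:
        loss = loss_fn(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step(); sched.step()
        step += 1
        if step % cycle_steps == 0:                 # end of a cycle
            snapshots.append(copy.deepcopy(model).eval())
        if step >= cycles * cycle_steps:
            break
    return snapshots

def ensemble_predict(snapshots, x):
    # Average the members' softmax outputs for a more generalizable prediction.
    return torch.stack([m(x).softmax(-1) for m in snapshots]).mean(0)
```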
28. ExplAIn: Explanatory Artificial Intelligence for Diabetic Retinopathy Diagnosis [PDF] 返回目录
Gwenolé Quellec, Hassan Al Hajj, Mathieu Lamard, Pierre-Henri Conze, Pascale Massin, Béatrice Cochener
Abstract: In recent years, Artificial Intelligence (AI) has proven its relevance for medical decision support. However, the "black-box" nature of successful AI algorithms still holds back their widespread deployment. In this paper, we describe an eXplanatory Artificial Intelligence (XAI) that reaches the same level of performance as black-box AI, for the task of Diabetic Retinopathy (DR) diagnosis using Color Fundus Photography (CFP). This algorithm, called ExplAIn, learns to segment and categorize lesions in images; the final diagnosis is directly derived from these multivariate lesion segmentations. The novelty of this explanatory framework is that it is trained from end to end, with image supervision only, just like black-box AI algorithms: the concepts of lesions and lesion categories emerge by themselves. For improved lesion localization, foreground/background separation is trained through self-supervision, in such a way that occluding foreground pixels transforms the input image into a healthy-looking image. The advantage of such an architecture is that automatic diagnoses can be explained simply by an image and/or a few sentences. ExplAIn is evaluated at the image level and at the pixel level on various CFP image datasets. We expect this new framework, which jointly offers high classification performance and explainability, to facilitate AI deployment.
摘要:近年来,人工智能(AI)已经证明了它的医疗决策支持的相关性。然而,成功的AI算法的“黑盒子”的性质仍持有其背部广泛分布的部署。在本文中,我们描述的解释人工智能(XAI)到达的性能黑箱AI同一水平线上,使用彩色眼底摄影(CFP)的糖尿病视网膜病变(DR)诊断的任务。这种算法,称为讲解,学会细分和分类病变图像;最终的诊断是直接从这些多元病变分割而得。这个解释框架的新颖之处在于它是从端到端的培训,以图像监控而已,就像黑匣子AI算法:病变和病变类别的概念出现自己。对于改进的病灶定位,前景/背景分离是通过自检的训练,在这样一种方式,闭塞前景像素变换输入图像分割为看起来健康的图像。这样的体系结构的优点在于,自动诊断可以通过图像和/或几个句子简单地说明。解释在图像水平和在各种CFP图像数据组的像素级进行评价。我们希望这个新的框架,共同提供了较高的分类性能和explainability,便于AI部署。
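The self-supervised foreground/background idea can be captured in a single loss term: occluding the pixels that the network marks as lesions should make the image look healthy to the downstream classifier. Names, the multiplicative occlusion, and the class-index convention are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def occlusion_consistency_loss(segmenter, classifier, image, healthy_idx=0):
    """segmenter(image) -> (B, 1, H, W) soft lesion mask in [0, 1]."""
    mask = segmenter(image)
    occluded = image * (1 - mask)              # suppress predicted lesions
    target = torch.full((image.size(0),), healthy_idx,
                        dtype=torch.long, device=image.device)
    return F.cross_entropy(classifier(occluded), target)
```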
29. Contextual Diversity for Active Learning [PDF] 返回目录
Sharat Agarwal, Himanshu Arora, Saket Anand, Chetan Arora
Abstract: The requirement of large annotated datasets restricts the use of deep convolutional neural networks (CNNs) in many practical applications. The problem can be mitigated by using active learning (AL) techniques which, under a given annotation budget, allow one to select a subset of data that yields maximum accuracy upon fine tuning. State-of-the-art AL approaches typically rely on measures of visual diversity or prediction uncertainty, which are unable to effectively capture the variations in spatial context. On the other hand, modern CNN architectures make heavy use of spatial context for achieving highly accurate predictions. Since the context is difficult to evaluate in the absence of ground-truth labels, we introduce the notion of contextual diversity that captures the confusion associated with spatially co-occurring classes. Contextual Diversity (CD) hinges on a crucial observation that the probability vector predicted by a CNN for a region of interest typically contains information from a larger receptive field. Exploiting this observation, we use the proposed CD measure within two AL frameworks: (1) a core-set based strategy and (2) a reinforcement learning based policy, for active frame selection. Our extensive empirical evaluation establishes state-of-the-art results for active learning on benchmark datasets for Semantic Segmentation, Object Detection and Image Classification. Our ablation studies show clear advantages of using contextual diversity for active learning. The source code and additional results are available at this https URL.
摘要:大型注释的数据集要求限制使用对于许多实际应用深卷积神经网络(细胞神经网络)的。该问题可以通过使用主动学习(AL)技术,该技术中,给定注解预算下,允许选择使得在微调产生最大精度数据的子集来减轻。艺术AL通常接近的国家依赖于视觉的多样性和不确定性预测的措施,无法有效地捕捉空间环境的变化。在另一方面,现代CNN架构使实现高度精确的预测大量使用空间上下文。由于背景是难以在没有地面实况标签的评估,我们引入背景的多样性的概念,即捕获与空间上共发生类相关联的混乱。上下文分集(CD)的铰链上的关键观察,即由CNN对感兴趣的区域的预测的概率载体通常含有从较大的感受域的信息。利用这一观察,我们使用了两个AL框架内提出的CD的措施:(1)核心集基于策略和(2)强化学习基础的政策,对于主动帧选择。我们丰富的经验评估建立技术成果对语义分割,目标检测和图像分类的标准数据集主动学习的状态。我们的消融研究表明,使用背景的多样性为主动学习的明显优势。源代码和其他结果可在此HTTPS URL。
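Within the core-set branch, a confusion-aware selection can be sketched as greedy k-center under a divergence between class-probability vectors; the symmetric KL below is a stand-in for the paper's contextual diversity measure, which is more involved.

```python
import torch

def symmetric_kl(p, q, eps=1e-8):
    p, q = p + eps, q + eps
    return ((p * (p / q).log()).sum(-1) + (q * (q / p).log()).sum(-1)) / 2

def coreset_select(probs, budget):
    """probs: (B, C) aggregated class probabilities per unlabeled image.
    Greedy farthest-first traversal picks `budget` diverse images."""
    dist = symmetric_kl(probs[:, None, :], probs[None, :, :])   # (B, B)
    chosen = [int(dist.sum(1).argmax())]
    min_d = dist[chosen[0]].clone()
    while len(chosen) < budget:
        nxt = int(min_d.argmax())               # farthest from the chosen set
        chosen.append(nxt)
        min_d = torch.minimum(min_d, dist[nxt])
    return chosen
```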
30. Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition [PDF] 返回目录
Taeoh Kim, Hyeongmin Lee, MyeongAh Cho, Ho Seong Lee, Dong Heon Cho, Sangyoun Lee
Abstract: Deep-learning-based video recognition has shown promising improvements along with the development of large-scale datasets and spatiotemporal network architectures. In image recognition, learning spatially invariant features is a key factor in improving recognition performance and robustness. Data augmentation based on visual inductive priors, such as cropping, flipping, rotating, or photometric jittering, is a representative approach to achieve these features. Recent state-of-the-art recognition solutions have relied on modern data augmentation strategies that exploit a mixture of augmentation operations. In this study, we extend these strategies to the temporal dimension for videos to learn temporally invariant or temporally localizable features to cover temporal perturbations or complex actions in videos. Based on our novel temporal data augmentation algorithms, video recognition performance is improved using only a limited amount of training data compared to spatial-only data augmentation algorithms, including on the 1st Visual Inductive Priors (VIPriors) for data-efficient action recognition challenge. Furthermore, learned features are temporally localizable, which cannot be achieved using spatial augmentation algorithms. Our source code is available at this https URL.
摘要:基于深学习型视频识别已经显示出有希望与大型数据集和时空网络架构的发展而改善。在图像识别,学习空间不变的功能是提高识别性能和稳健性的关键因素。基于视觉感应先验,诸如裁剪,翻转,旋转,或光度数据抖动增大,是代表性的方法来实现这些功能。国家的最先进的最新识别解决方案都依赖于该利用增强操作的混合现代数据增强策略。在这项研究中,我们这些策略延伸到时间维度的视频,了解时间不变或时间本地化的功能覆盖时间扰动或影片复杂的动作。基于我们的新的时间数据增强算法,视频识别性能使用相对于仅空间数据增强算法,包括第一视觉感应先验(VIPriors)进行数据高效动作识别挑战训练数据的只有有限的量提高。此外,了解到特征是不能使用空间增强算法来实现时间上本地化。我们的源代码可在此HTTPS URL。
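Two representative temporal analogues of spatial augmentations, sketched for clips shaped (C, T, H, W); the specific operations and ranges are illustrative, not the paper's full augmentation suite.

```python
import torch

def temporal_crop_resample(clip, min_scale=0.5):
    """Random temporal crop, resampled back to the original length."""
    T = clip.size(1)
    t_len = int(T * (min_scale + (1 - min_scale) * torch.rand(1).item()))
    t0 = torch.randint(0, T - t_len + 1, (1,)).item()
    idx = torch.linspace(t0, t0 + t_len - 1, T).round().long()
    return clip[:, idx]

def temporal_cutmix(clip_a, clip_b):
    """Paste a contiguous temporal segment of clip_b into clip_a; the
    returned ratio can mix the labels, as in spatial CutMix."""
    T = clip_a.size(1)
    t_len = torch.randint(1, T, (1,)).item()
    t0 = torch.randint(0, T - t_len + 1, (1,)).item()
    out = clip_a.clone()
    out[:, t0:t0 + t_len] = clip_b[:, t0:t0 + t_len]
    return out, t_len / T
```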
31. Alleviating Human-level Shift : A Robust Domain Adaptation Method for Multi-person Pose Estimation [PDF] 返回目录
Xixia Xu, Qi Zou, Xue Lin
Abstract: Human pose estimation has been widely studied, with much focus on supervised learning requiring sufficient annotations. However, in real applications, a pretrained pose estimation model usually needs to be adapted to a novel domain with no labels or only sparse labels. Such domain adaptation for 2D pose estimation has not been explored. The main reason is that a pose, by nature, has a typical topological structure and needs fine-grained features at local keypoints, while existing adaptation methods do not consider the topological structure of the object of interest and align whole images only coarsely. Therefore, we propose a novel domain adaptation method for multi-person pose estimation that conducts human-level topological structure alignment and fine-grained feature alignment. Our method consists of three modules: Cross-Attentive Feature Alignment (CAFA), Intra-domain Structure Adaptation (ISA) and Inter-domain Human-Topology Alignment (IHTA). The CAFA adopts a bidirectional spatial attention module (BSAM) that focuses on fine-grained local feature correlation between two humans to adaptively aggregate consistent features for adaptation. We adopt ISA only in semi-supervised domain adaptation (SSDA) to exploit the corresponding keypoint semantic relationship for reducing the intra-domain bias. Most importantly, we propose IHTA to learn a more domain-invariant human topological representation for reducing the inter-domain discrepancy. We model the human topological structure via a graph convolutional network (GCN); by passing messages on the graph, high-order relations can be considered. This structure-preserving alignment based on the GCN is beneficial for occluded or extreme pose inference. Extensive experiments conducted on two popular benchmarks demonstrate the competency of our method compared with existing supervised approaches.
摘要:人体姿势估计已经有很多重点监督学习需要足够的注释被广泛研究。然而,在实际应用中,预训练姿势估计模型通常需要适应,没有标签或标签稀疏一种新的域名。二维姿态估计这样的领域适应性尚未探索。主要的原因是,一个姿势,自然,具有典型的拓扑结构,并在当地的关键点需要细粒度的功能。虽然现有的适应方法没有考虑对象感兴趣的拓扑结构和它们粗略对齐整个图像。因此,我们提出了多方人士的姿态估计的新领域适应性方法进行人类水平的拓扑结构调整和细粒度的功能定位。我们的方法包括三个模块:跨细心特征对齐(CAFA),域内结构适应(ISA)和域间人类拓扑结构对齐(IHTA)模块。中央美术学院采用的是集中在两个人之间,以适应自适应骨料一致的特点细粒度的局部特征相关的双向空间注意模块(甲胺磷)。我们采用ISA只在半监督领域适应性(SSDA)利用减少域内偏置相应的关键点语义关系。最重要的是,我们提出了一个IHTA了解更多域不变的人类拓扑表示为了减少域间的差异。我们人类拓扑结构经由所述图形卷积网络(GDN),通过使在其上,高阶关系可以被认为是消息建模。基于GCN这种结构保持对齐到闭塞或极端姿势推断有益。大量的实验是在两种流行的基准进行,结果证明我们的方法的能力与现有的监督方法相比。
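The skeleton-as-graph modeling boils down to standard GCN message passing over the joint adjacency; below is a generic single layer, not the paper's full IHTA module.

```python
import torch
import torch.nn as nn

class SkeletonGCNLayer(nn.Module):
    """H' = relu(A H W) with A a row-normalized joint adjacency
    (self-loops added)."""
    def __init__(self, adj, in_dim, out_dim):
        super().__init__()
        a = adj + torch.eye(adj.size(0))
        self.register_buffer("A", a / a.sum(1, keepdim=True))
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, h):          # h: (B, J, in_dim) joint features
        return torch.relu(self.lin(self.A @ h))
```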
32. Modeling Caricature Expressions by 3D Blendshape and Dynamic Texture [PDF] 返回目录
Keyu Chen, Jianmin Zheng, Jianfei Cai, Juyong Zhang
Abstract: The problem of deforming an artist-drawn caricature according to a given normal face expression is of interest in applications such as social media, animation and entertainment. This paper presents a solution to the problem, with an emphasis on enhancing the ability to create desired expressions while preserving the identity-exaggeration style of the caricature, which is challenging due to the complicated nature of caricatures. The key to our solution is a novel method for modeling caricature expressions, which extends the traditional 3DMM representation to the caricature domain. The method consists of shape modelling and texture generation for caricatures. Geometric optimization is developed to create identity-preserving blendshapes for reconstructing accurate and stable geometric shape, and a conditional generative adversarial network (cGAN) is designed for generating dynamic textures under target expressions. The combination of the shape and texture components lets non-trivial expressions of a caricature be effectively defined by the extension of the popular 3DMM representation, so a caricature can be flexibly deformed into arbitrary expressions with visually good results in both shape and color spaces. The experiments demonstrate the effectiveness of the proposed method.
摘要:根据给定的正常面部表情变形艺术家绘制漫画的问题是在应用如社交媒体,动画和娱乐的兴趣。本文提出了一种解决问题的办法,对提高创建所需的表情,同时保持漫画,其中规定挑战的身份夸张风格的能力为重点,由于漫画的复杂性。我们的解决方案的关键是一种新颖的方法,以模型漫画的表达,这扩展了传统3DMM表示以漫画域。该方法包括形状建模和纹理生成用于漫画的。几何优化发展到用于重建精确和稳定的几何形状创建身份保留blendshapes,和条件生成对抗网络(cGAN)被设计用于目标表达式下生成动态纹理。形状和纹理分量的组合使得一个漫画的非平凡的表达式由流行3DMM表示的扩展被有效地定义,并且一个漫画因此可以灵活地变形而与在视觉上在形状和颜色空间良好的效果的任意表达式。实验证明了该方法的有效性。
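The shape side of the model is the classic linear blendshape formula, V = V0 + Σ_i w_i (B_i − V0), applied to the identity-exaggerated caricature mesh; a generic sketch of that component only, with the dynamic texture left to the cGAN:

```python
import torch

def blendshape_deform(base_verts, blendshapes, weights):
    """base_verts: (V, 3), blendshapes: (K, V, 3), weights: (K,) in [0, 1]."""
    return base_verts + (weights[:, None, None] *
                         (blendshapes - base_verts)).sum(0)
```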
33. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D [PDF] 返回目录
Jonah Philion, Sanja Fidler
Abstract: The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single "bird's-eye-view" coordinate frame for consumption by motion planning. We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a rasterized bird's-eye-view grid. By training on the entire camera rig, we provide evidence that our model is able to learn not only how to represent images but how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error. On standard bird's-eye-view tasks such as object segmentation and map segmentation, our model outperforms all baselines and prior work. In pursuit of the goal of learning dense representations for motion planning, we show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network. We benchmark our approach against models that use oracle depth from lidar. Project page with code: this https URL .
摘要:感知为自主车辆的目标是从多个传感器和保险丝这些表示语义表示提取到一个单一的“鸟瞰视图”由运动规划坐标消费帧。我们提出了一个新的终端到终端的架构,直接提取场景的图象数据的鸟瞰视图表示从摄像机的任意数量。我们的做法背后的核心思想是“提升”每幅图像分别为特征为每个摄像机截头,然后选择“图示”的所有视锥成光栅化鸟瞰视图网格。通过在整个摄像头钻机培训,我们提供的证据表明,我们的模型能够学习不仅是如何表现的图像,但如何从所有摄像机保险丝预测到场景中的一个单一的凝聚力表现,同时强大的校准误差。在标准的鸟瞰视图任务,如对象分割和地图分割,我们的模型优于所有基线和以前的工作。在追求学习密集表示对运动规划的目标,我们表明,我们的模型推导出的表示能够通过“拍摄”模板轨迹可解释终端到终端的运动规划纳入我们的网络鸟瞰视野输出的成本地图。我们的基准对我们的模型使用Oracle深度从激光雷达的方法。与代码项目页面:这个HTTPS URL。
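The lift and splat steps can be sketched compactly: a categorical depth distribution per pixel lifts image features into a frustum of weighted features, which are scatter-pooled into BEV cells. Camera geometry is assumed precomputed into per-point BEV indices (`geom`), which hides the paper's unprojection math.

```python
import torch

def lift_splat(feats, depth_logits, geom, bev_size=200):
    """feats: (N, C, H, W) per-camera features, depth_logits: (N, D, H, W),
    geom: (N, D, H, W) long indices into the flattened BEV grid."""
    N, C, H, W = feats.shape
    depth = depth_logits.softmax(1)                      # (N, D, H, W)
    lifted = depth.unsqueeze(1) * feats.unsqueeze(2)     # (N, C, D, H, W)
    bev = feats.new_zeros(bev_size * bev_size, C)
    bev.index_add_(0, geom.reshape(-1),                  # sum-pool per cell
                   lifted.permute(0, 2, 3, 4, 1).reshape(-1, C))
    return bev.t().reshape(C, bev_size, bev_size)
```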
34. Robust Image Matching By Dynamic Feature Selection [PDF] 返回目录
Hao Huang, Jianchun Chen, Xiang Li, Lingjing Wang, Yi Fang
Abstract: Estimating dense correspondences between images is a long-standing image understanding task. Recent works introduce convolutional neural networks (CNNs) to extract high-level feature maps and find correspondences through feature matching. However, high-level feature maps are of low spatial resolution and therefore insufficient to provide accurate and fine-grained features to distinguish intra-class variations for correspondence matching. To address this problem, we generate robust features by dynamically selecting features at different scales. To resolve the two critical issues in feature selection, i.e., how many and which scales of features to select, we frame the feature selection process as a sequential Markov decision-making process (MDP) and introduce an optimal selection strategy using reinforcement learning (RL). We define an RL environment for image matching in which each individual action either requires new features or terminates the selection episode by returning a matching score. Deep neural networks are incorporated into our method and trained for decision making. Experimental results show that our method achieves comparable or superior performance to state-of-the-art methods on three benchmarks, demonstrating the effectiveness of our feature selection strategy.
摘要:估计图像之间的密集的对应是一个长期的图像下的独立任务。近期作品介绍卷积神经网络(细胞神经网络)中提取的高层次功能的地图,发现通过特征匹配对应。然而,高级特征映射是在低空间分辨率,并且因此不足以提供准确和细粒度特征来区分的类内的变化为对应的匹配。为了解决这个问题,我们通过不同尺度动态选择功能,产生强大的功能。要解决特征选择,即两个关键问题,多少和哪些被选中的特征尺度,我们框架的特征选择过程为连续马尔可夫决策过程(MDP),并利用强化学习引进的最佳选择策略(RL )。我们定义的图像匹配的RL的环境,使每一个人的行动既需要新的功能或通过参考匹配得分终止选择插曲。深层神经网络被纳入我们的方法和训练有素的决策。实验结果表明,我们的方法实现对三个基准国家的最先进的方法相同/优越的性能,证明了我们的特征选择策略的有效性。
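A minimal sketch of the selection episode described above: each action either adds a feature scale or terminates, and the episode returns the selected scales together with a matching score. The candidate scales, the stub scoring function, and the random policy are all placeholders, not the paper's learned components.

```python
import random

SCALES = [1, 2, 3, 4, 5]  # candidate feature-map levels (assumed)

def matching_score(selected):
    # stand-in for the feature-matching score the real environment returns
    return sum(1.0 / s for s in selected)

def run_episode(policy, max_steps=len(SCALES)):
    """One selection episode: each action adds a new scale or terminates."""
    selected, done = [], False
    while not done and len(selected) < max_steps:
        action = policy(selected)          # a remaining scale, or 'stop'
        if action == 'stop':
            done = True
        else:
            selected.append(action)
    return selected, matching_score(selected)

random_policy = lambda sel: 'stop' if sel and random.random() < 0.3 else \
    random.choice([s for s in SCALES if s not in sel])
print(run_episode(random_policy))
```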
35. Network Architecture Search for Domain Adaptation [PDF] 返回目录
Yichen Li, Xingchao Peng
Abstract: Deep networks have been used to learn transferable representations for domain adaptation. Existing deep domain adaptation methods systematically employ popular hand-crafted networks designed specifically for image-classification tasks, leading to sub-optimal domain adaptation performance. In this paper, we present Neural Architecture Search for Domain Adaptation (NASDA), a principled framework that leverages differentiable neural architecture search to derive the optimal network architecture for the domain adaptation task. NASDA is designed with two novel training strategies: neural architecture search with multi-kernel Maximum Mean Discrepancy to derive the optimal architecture, and adversarial training between a feature generator and a batch of classifiers to consolidate the feature generator. We demonstrate experimentally that NASDA leads to state-of-the-art performance on several domain adaptation benchmarks.
摘要:深网络已被用来学习的领域适应性转让交涉。现有深厚的适应方法系统采用流行的手工制作的网络专为图像分类任务设计的,从而导致次优域自适应性能。在本文中,我们提出的神经结构搜索领域适应性(NASDA),一个原则框架,利用微神经结构的搜索来推导领域适应性任务的最佳网络架构。宇宙开发事业团设计了两个新的培训策略:神经结构搜索具有多内核的最大平均差异推导出最佳的架构和功能发生器和批次分类的巩固特征生成之间的对抗训练。我们通过实验证明,NASDA导致国家的最先进的性能在几个领域适应性的基准。
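For reference, the multi-kernel Maximum Mean Discrepancy used as one of NASDA's training signals can be sketched as follows; the Gaussian kernel mixture and the bandwidths below are assumptions.

```python
import torch

def mk_mmd(x, y, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Multi-kernel Maximum Mean Discrepancy between two feature batches.

    x: (n, d) source features, y: (m, d) target features.
    Uses a uniform mixture of Gaussian kernels (bandwidths are assumed).
    """
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2   # pairwise squared distances
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas) / len(sigmas)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

src = torch.randn(32, 128)          # features of a source-domain batch
tgt = torch.randn(32, 128) + 0.5    # shifted target-domain batch
print(mk_mmd(src, tgt).item())      # larger value = larger domain gap
```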
36. What leads to generalization of object proposals? [PDF] 返回目录
Rui Wang, Dhruv Mahajan, Vignesh Ramanathan
Abstract: Object proposal generation is often the first step in many detection models. It is lucrative to train a good proposal model that generalizes to unseen classes. This could help scale detection models to a larger number of classes with fewer annotations. Motivated by this, we study how a detection model trained on a small set of source classes can provide proposals that generalize to unseen classes. We systematically study the dataset properties, namely visual diversity and label space granularity, required for good generalization. We show the trade-off between using fine-grained labels and coarse labels. We introduce the idea of prototypical classes: a set of sufficient and necessary classes required to train a detection model to obtain generalized proposals in a more data-efficient way. On the Open Images V4 dataset, we show that only 25% of the classes can be selected to form such a prototypical set. The resulting proposals from a model trained with these classes are only 4.3% worse than using all the classes, in terms of average recall (AR). We also demonstrate that the Faster R-CNN model leads to better generalization of proposals compared to a single-stage network like RetinaNet.
摘要:对象建议的生成往往在许多检测模型的第一步。它是有利可图培养出不错的建议模式,即推广到看不见的类。这将有助于缩放检测模型更大的用更少的注解类数。这个启发,我们研究如何培训了一小源类的检测模型可以提供推广到看不见类的建议。我们系统地研究数据集的视觉多样性和标签空间粒度的性质 - 需要良好的推广。我们显示了使用细粒度标签和粗标签之间的权衡。我们引进的原型类的想法:一组训练的检测模型,以获得更数据高效的方式推广提案需要有足够和必要的类。在打开的图像数据集V4中,我们显示的类的,只有25%可以被选择为形成这样的原型集。从这些类训练的模型得到的建议是比使用平均召回(AR)方面的所有类,更糟糕的只有4.3%。我们还表明,更快的R-CNN模型导致相比单级网络状RetinaNet的建议,更好地推广。
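Average recall (AR), the metric quoted above, is the recall of ground-truth boxes averaged over a range of IoU thresholds; a small sketch using the standard 0.5:0.05:0.95 thresholds (assumed here):

```python
import numpy as np

def average_recall(ious, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Average recall (AR): recall of GT boxes averaged over IoU thresholds.

    ious: (num_gt,) best proposal IoU for each ground-truth box.
    """
    return float(np.mean([(ious >= t).mean() for t in thresholds]))

best_ious = np.array([0.9, 0.55, 0.3, 0.75])  # toy best-match IoUs
print(average_recall(best_ious))
```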
37. Visual Localization for Autonomous Driving: Mapping the Accurate Location in the City Maze [PDF] 返回目录
Dongfang Liu, Yiming Cui, Xiaolei Guo, Wei Ding, Baijian Yang, Yingjie Chen
Abstract: Accurate localization is a foundational capacity required for autonomous vehicles to accomplish other tasks such as navigation or path planning. It is a common practice for vehicles to use GPS to acquire location information. However, the application of GPS can result in severe challenges when vehicles run within the inner city, where different kinds of structures may shadow the GPS signal and lead to inaccurate location results. To address the localization challenges of urban settings, we propose a novel feature voting technique for visual localization. Different from the conventional front-view-based method, our approach employs views from three directions (front, left, and right) and thus significantly improves the robustness of location prediction. In our work, we craft the proposed feature voting method into three state-of-the-art visual localization networks and modify their architectures properly so that they can be applied to vehicular operation. Extensive field test results indicate that our approach can predict location robustly even in challenging inner-city settings. Our research sheds light on using the visual localization approach to help autonomous vehicles find accurate location information in a city maze, within a desirable time constraint.
摘要:准确的定位是基本能力,对自主车来完成其他任务,如导航或路径规划要求。这是车辆使用GPS获取位置信息的普遍做法。然而,GPS的应用可能会导致严重的挑战时,车辆内的城市,不同类型的结构可能遮盖GPS信号,并导致不准确的定位结果中运行。为了解决城市环境的本地化挑战,我们提出了视觉定位的新特点投票技术。从传统的基于正视方法不同,我们的方法从三个方向(前,左,右)使用的观点,因此显著提高位置预测的鲁棒性。在我们的工作中,我们工艺的特点提出的投票方法为国家的最先进的三个可视定位网络,并修改它们的架构适当,使他们可以应用于车辆的操作。广泛的现场测试结果表明,我们的方法可以有力即使在富有挑战性的内城设置预测位置。我们的研究阐明了使用可视定位方法,帮助自主车找到一个城市的迷宫精确的位置信息,所需的时间限制内。
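A toy illustration of fusing predictions from the three views; confidence-weighted averaging stands in for the paper's feature voting, so the weighting scheme is an assumption:

```python
import numpy as np

def vote_location(view_preds):
    """Fuse per-view location predictions by confidence-weighted voting.

    view_preds: list of (xy, confidence) pairs from the front/left/right
    branches (this simple weighting is an assumption, not the paper's).
    """
    xs = np.array([p for p, _ in view_preds])
    ws = np.array([c for _, c in view_preds])
    return (xs * ws[:, None]).sum(axis=0) / ws.sum()

front = (np.array([10.0, 4.0]), 0.9)
left  = (np.array([10.5, 3.8]), 0.6)
right = (np.array([ 9.8, 4.1]), 0.7)
print(vote_location([front, left, right]))
```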
38. Forest R-CNN: Large-Vocabulary Long-Tailed Object Detection and Instance Segmentation [PDF] 返回目录
Jialian Wu, Liangchen Song, Tiancai Wang, Qian Zhang, Junsong Yuan
Abstract: Despite the previous success of object analysis, detecting and segmenting a large number of object categories with a long-tailed data distribution remains a challenging problem and is less investigated. For a large-vocabulary classifier, the chance of obtaining noisy logits is much higher, which can easily lead to wrong recognition. In this paper, we exploit prior knowledge of the relations among object categories to cluster fine-grained classes into coarser parent classes, and construct a classification tree that is responsible for parsing an object instance into a fine-grained category via its parent class. In the classification tree, since there are significantly fewer parent class nodes, their logits are less noisy and can be utilized to suppress the wrong/noisy logits that exist in the fine-grained class nodes. As the way to construct the parent class is not unique, we further build multiple trees to form a classification forest where each tree contributes its vote to the fine-grained classification. To alleviate the imbalanced learning caused by the long-tail phenomena, we propose a simple yet effective resampling method, NMS Resampling, to re-balance the data distribution. Our method, termed Forest R-CNN, can serve as a plug-and-play module applied to most object recognition models for recognizing more than 1000 categories. Extensive experiments are performed on the large vocabulary dataset LVIS. Compared with the Mask R-CNN baseline, the Forest R-CNN significantly boosts the performance with 11.5% and 3.9% AP improvements on the rare categories and overall categories, respectively. Moreover, we achieve state-of-the-art results on the LVIS dataset. Code is available at this https URL.
摘要:尽管对象分析的先前成功,检测和分割大量对象类别的具有长尾数据分布保持一个具有挑战性的问题,并且不太影响。对于大词汇分类,获得嘈杂logits的几率要高得多,这很容易导致一个错误的认识。在本文中,我们利用对象类别之间的关系,以集群细粒度类成较粗的父类的先验知识,构建分类树是负责解析对象实例为通过它的父类细粒度的范畴。在分类树,作为父类节点的数量少显著,他们logits不太嘈杂,可以用来抑制细粒级节点中存在的错误/嘈杂logits。由于构造父类的方法不是唯一的,我们进一步构建多棵树,形成分类森林里每一棵树有助于其投票的细粒度分类。为了减轻不平衡的长尾现象造成的学习,我们提出了一个简单而有效的重采样方法,NMS重采样,以重新平衡数据分布。为应用于最物体识别模型用于识别多于1000个类别的插头和播放模块我们的方法,称为森林R-CNN,可以服务。大量的实验都在大词汇量的数据集进行LVIS。与掩模R-CNN基线相比,森林R-CNN显著提升与AP的改进分别在罕见类别和整体类,11.5%和3.9%的性能。此外,我们实现了对LVIS数据集的国家的最先进的成果。代码可在此HTTPS URL。
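One common way to realize the classification-tree idea is to gate fine-grained scores with parent-class probabilities, i.e. P(fine) = P(parent) * P(fine | parent); the toy hierarchy and the exact gating rule below are assumptions for illustration:

```python
import torch

# assumed toy hierarchy: parent index for each of 6 fine-grained classes
PARENT = torch.tensor([0, 0, 1, 1, 1, 2])

def tree_calibrate(fine_logits, parent_logits):
    """Suppress noisy fine-grained scores with the (less noisy) parent scores.

    Final score for a fine class = P(parent) * P(fine | parent); details of
    the real Forest R-CNN calibration differ from this sketch.
    """
    parent_p = parent_logits.softmax(-1)               # (B, num_parents)
    fine_p = torch.zeros_like(fine_logits)
    for p in range(parent_logits.shape[-1]):
        mask = PARENT == p
        fine_p[:, mask] = fine_logits[:, mask].softmax(-1) * parent_p[:, p:p+1]
    return fine_p

scores = tree_calibrate(torch.randn(2, 6), torch.randn(2, 3))
print(scores.sum(-1))  # rows sum to 1: a valid distribution over fine classes
```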
39. Feature Binding with Category-Dependant MixUp for Semantic Segmentation and Adversarial Robustness [PDF] 返回目录
Md Amirul Islam, Matthew Kowal, Konstantinos G. Derpanis, Neil D. B. Bruce
Abstract: In this paper, we present a strategy for training convolutional neural networks to effectively resolve interference arising from competing hypotheses relating to inter-categorical information throughout the network. The premise is based on the notion of feature binding, which is defined as the process by which activations spread across space and layers in the network are successfully integrated to arrive at a correct inference decision. In our work, this is accomplished for the task of dense image labelling by blending images based on their class labels, and then training a feature binding network, which simultaneously segments and separates the blended images. Subsequent feature denoising to suppress noisy activations reveals additional desirable properties and high degrees of successful predictions. Through this process, we reveal a general mechanism, distinct from any prior methods, for boosting the performance of the base segmentation network while simultaneously increasing robustness to adversarial attacks.
摘要:在本文中,我们提出了训练卷积神经网络从与整个网络跨类别的信息竞争假设产生有效解决干扰的策略。前提是基于特征的结合的概念,它被定义为过程,其中活化的跨越空间和层在网络中传播的成功整合,以在正确的推理做出决定。在我们的工作,这是实现通过混合根据他们的阶级标签图像,然后训练功能结合网络,同时段密集的图像标记的任务,并分离混合图像。随后的去噪功能来抑制噪声的激活揭示了额外所需的特性和高程度的成功预测。在这个过程中,我们揭示了一个通用的机制,不同于任何现有方法,为提高基础分割网络的性能,同时提高耐用性,以对抗攻击。
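A minimal sketch of the blending setup: two images are mixed and the binding network is asked to recover each source's dense label map. The fixed blend weight and the pairing rule are simplifications of the paper's category-dependent mixup:

```python
import torch

def blend_for_binding(img_a, mask_a, img_b, mask_b, alpha=0.5):
    """Blend two images and keep both dense label maps as separate targets.

    The binding network must segment each source image back out of the blend
    (alpha and the pairing rule by class label are simplifying assumptions).
    """
    blended = alpha * img_a + (1 - alpha) * img_b
    targets = torch.stack([mask_a, mask_b])  # (2, H, W): one map per source
    return blended, targets

img_a, img_b = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
mask_a = torch.randint(0, 21, (64, 64))
mask_b = torch.randint(0, 21, (64, 64))
x, y = blend_for_binding(img_a, mask_a, img_b, mask_b)
print(x.shape, y.shape)  # torch.Size([3, 64, 64]) torch.Size([2, 64, 64])
```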
40. What Should Not Be Contrastive in Contrastive Learning [PDF] 返回目录
Tete Xiao, Xiaolong Wang, Alexei A. Efros, Trevor Darrell
Abstract: Recent self-supervised contrastive methods have been able to produce impressive transferable visual representations by learning to be invariant to different data augmentations. However, these methods implicitly assume a particular set of representational invariances (e.g., invariance to color), and can perform poorly when a downstream task violates this assumption (e.g., distinguishing red vs. yellow cars). We introduce a contrastive learning framework which does not require prior knowledge of specific, task-dependent invariances. Our model learns to capture varying and invariant factors for visual representations by constructing separate embedding spaces, each of which is invariant to all but one augmentation. We use a multi-head network with a shared backbone which captures information across each augmentation and alone outperforms all baselines on downstream tasks. We further find that the concatenation of the invariant and varying spaces performs best across all tasks we investigate, including coarse-grained, fine-grained, and few-shot downstream classification tasks, and various data corruptions.
摘要:最近的自我监督对比方法已经能够通过学习,是不变的不同的数据扩充产生令人印象深刻的视觉转让交涉。然而,这些方法暗含的假设特定的一组代表性的不变性(例如,不变性颜色),而当下游任务违反了这一假设(例如,区分红色与黄色汽车)可以表现不佳。我们引进了对比学习框架,它不需要特定的先验知识,取决于任务的不变性。我们的模型学会捕捉变和视觉表现的不变因素通过构建独立的嵌入空间,每一个都是不变的所有,但一个增强。我们用一个多头网络共享骨干捕捉在每个增强和孤独性能优于下游任务的所有基线的信息。我们进一步发现,不变和变化的空间进行跨我们会调查所有任务,包括粗粒,细粒,很少拍下游分类任务,以及各种数据损坏最好的串联。
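The multi-head design can be sketched as a shared backbone with one projection head per left-out augmentation; the toy backbone and all dimensions below are placeholders:

```python
import torch
import torch.nn as nn

AUGS = ['color', 'crop', 'rotation']  # leave-one-out embedding spaces

class LeaveOneOutHeads(nn.Module):
    """Shared backbone with one projection head per left-out augmentation.

    Head i is trained to be invariant to every augmentation except AUGS[i]
    (the backbone and dimensions here are placeholders, not the paper's).
    """
    def __init__(self, feat_dim=512, emb_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(feat_dim, emb_dim) for _ in AUGS)

    def forward(self, x):
        h = self.backbone(x)                      # shared representation
        return {aug: head(h) for aug, head in zip(AUGS, self.heads)}

embs = LeaveOneOutHeads()(torch.rand(4, 3, 32, 32))
print({k: v.shape for k, v in embs.items()})
```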
41. Sparse Coding Driven Deep Decision Tree Ensembles for Nuclear Segmentation in Digital Pathology Images [PDF] 返回目录
Jie Song, Liang Xiao, Mohsen Molaei, Zhichao Lian
Abstract: In this paper, we propose an easily trained yet powerful representation learning approach with performance highly competitive with deep neural networks in a digital pathology image segmentation task. The method, called sparse coding driven deep decision tree ensembles and abbreviated as ScD2TE, provides a new perspective on representation learning. We explore the possibility of stacking several layers based on non-differentiable pairwise modules and generate a densely concatenated architecture holding the characteristics of feature map reuse and end-to-end dense learning. Under this architecture, fast convolutional sparse coding is used to extract multi-level features from the output of each layer. In this way, rich image appearance models together with more contextual information are integrated by learning a series of decision tree ensembles. The appearance and the high-level context features of all the previous layers are seamlessly combined by concatenating them to feed forward as input, which in turn makes the outputs of subsequent layers more accurate and the whole model efficient to train. Compared with deep neural networks, our proposed ScD2TE does not require back-propagation computation and depends on fewer hyper-parameters. ScD2TE is able to achieve fast end-to-end pixel-wise training in a layer-wise manner. We demonstrate the superiority of our segmentation technique by evaluating it on a multi-disease state and multi-organ dataset, where it consistently obtained higher performance in comparison against several state-of-the-art deep learning methods such as convolutional neural networks (CNNs) and fully convolutional networks (FCNs).
摘要:在本文中,我们提出了在数字病理图像分割任务性能深层神经网络竞争激烈的容易训练,但强大的代表性的学习方法。该方法被称为稀疏编码深驱动决策树合奏,我们简称为ScD2TE,提供学习代表一个新的视角。我们探讨堆叠基于不可微分成对模块若干层的可能性,并产生一个密集级联架构保持特征地图重用和端至端密学习的特点。在此架构下,快速卷积稀疏编码被用于从每一层的输出中提取多级的功能。通过这种方式,丰富的图像外观模型与更多的上下文信息一起通过学习一系列决策树合奏的集成。外观和所有以前的层的高级别上下文特征无缝地通过将它们串联到前馈作为输入,这又使得后续层的输出更准确和整个模型有效训练相结合。深神经网络相比,我们提出的ScD2TE不需要反向传播的计算和依赖较少的超参数。 ScD2TE是能够实现在一个层式的方式快速的端至端逐像素训练。我们通过对其中针对一些国家的最先进的深度学习的方法,如卷积神经网络相比较,得到一致的更高的性能多疾病状态和多器官数据集评估它展示了我们的分割技术的优越性(CNN)的,完全卷积网络(FCN)等
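The dense layer-wise stacking can be illustrated with off-the-shelf tree ensembles: each layer's class-probability output is concatenated onto all previous representations. Random forests stand in for the paper's ensembles, the convolutional sparse-coding step is omitted, and (unlike careful deep-forest training) no cross-validation is used here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def scd2te_like(X, y, n_layers=3):
    """Layer-wise stacking with dense connectivity (feature-map reuse):
    each layer sees the raw input plus all previous layers' outputs.
    """
    layers, rep = [], X
    for _ in range(n_layers):
        clf = RandomForestClassifier(n_estimators=50).fit(rep, y)
        layers.append(clf)
        rep = np.hstack([rep, clf.predict_proba(rep)])  # dense concatenation
    return layers

X = np.random.rand(200, 16)
y = (X[:, 0] > 0.5).astype(int)
print(len(scd2te_like(X, y)))  # 3 stacked ensembles
```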
42. ISIA Food-500: A Dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network [PDF] 返回目录
Weiqing Min, Linhu Liu, Zhiling Wang, Zhengdong Luo, Xiaoming Wei, Xiaolin Wei, Shuqiang Jiang
Abstract: Food recognition has received more and more attention in the multimedia community for its various real-world applications, such as diet management and self-service restaurants. A large-scale ontology of food images is urgently needed for developing advanced large-scale food recognition algorithms, as well as for providing the benchmark dataset for such algorithms. To encourage further progress in food recognition, we introduce the dataset ISIA Food-500 with 500 categories from the list in Wikipedia and 399,726 images, a more comprehensive food dataset that surpasses existing popular benchmark datasets in category coverage and data volume. Furthermore, we propose a stacked global-local attention network, which consists of two sub-networks for food recognition. One subnetwork first utilizes hybrid spatial-channel attention to extract more discriminative features, and then aggregates these multi-scale discriminative features from multiple layers into a global-level representation (e.g., texture and shape information about food). The other one generates attentional regions (e.g., ingredient-relevant regions) from different regions via cascaded spatial transformers, and further aggregates these multi-scale regional features from different layers into a local-level representation. These two types of features are finally fused as a comprehensive representation for food recognition. Extensive experiments on ISIA Food-500 and two other popular benchmark datasets demonstrate the effectiveness of our proposed method, which can thus be considered a strong baseline. The dataset, code and models can be found at this http URL.
摘要:食品承认已经收到越来越多的关注,在多媒体社会对各种现实世界的应用,如饮食管理,自我服务的餐厅。大规模本体食物图像,迫切需要发展先进的大型食品识别算法,以及用于为这些算法的基准数据集。为了鼓励食品识别方面取得进一步进展,我们引入了数据集ISIA罐头500与500类从维基百科和399726倍的图像,更全面的食品数据集超过现有的按类别的覆盖范围和数据量的流行风向标的数据集列表。此外,我们提出了一个堆叠全局 - 局部关注网络,它由两个子网络,对食品的认可。一个子网第一使用混合空间信道注意提取更多的判别特征,然后聚集来自多个层,这些多尺度判别特征为全局级表示(约食品例如,纹理和形状信息)。另一种通过级联空间变压器,并进一步从聚集体不同的层,这些多尺度区域特征入本地级表示生成来自不同区域的注视区域(例如,成分相关的区域)。这两种类型的特点最终融合为食品识别综合表现。在ISIA食品-500和其他两个流行的基准数据集大量的实验证明我们提出的方法的有效性,因此可以被视为一个强大的基础。该数据集,代码和模式可以在这个HTTP URL中找到。
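As a rough sketch of the final fusion step, regional features can be aggregated and concatenated with the global representation; plain mean pooling replaces the paper's attention modules here, and the dimensions are placeholders:

```python
import torch

def fuse_global_local(global_feats, regional_feats):
    """Fuse a global-level representation with aggregated multi-scale
    regional features by concatenation (mean pooling is a stand-in for
    the paper's stacked attention)."""
    local = torch.stack(regional_feats).mean(0)    # aggregate attentional regions
    return torch.cat([global_feats, local], dim=-1)

g = torch.randn(4, 512)                            # global (texture/shape) branch
regions = [torch.randn(4, 512) for _ in range(3)]  # e.g. ingredient-relevant regions
print(fuse_global_local(g, regions).shape)         # torch.Size([4, 1024])
```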
43. Few shot clustering for indoor occupancy detection with extremely low-quality images from battery free cameras [PDF] 返回目录
Homagni Saha, Sin Yon Tan, Ali Saffari, Mohamad Katanbaf, Joshua R. Smith, Soumik Sarkar
Abstract: Reliable detection of human occupancy in indoor environments is critical for various energy efficiency, security, and safety applications. We consider the challenge of occupancy detection using extremely low-quality, privacy-preserving images from low-power image sensors. We propose a combined few-shot learning and clustering algorithm to address this challenge that has very low commissioning and maintenance cost. While the few-shot learning concept enables us to commission our system with a few labeled examples, the clustering step serves the purpose of online adaptation to a changing imaging environment over time. Apart from validating and comparing our algorithm on benchmark datasets, we also demonstrate the performance of our algorithm on streaming images collected from real homes using our novel battery-free camera hardware.
摘要:在室内环境中可靠地检测人入住的是各种能源效率,安全性,和安全应用的关键。我们用极低的品质,从低功耗图像传感器隐私保护图像考虑占用检测的这一挑战。我们提出了一个联合几个镜头的学习和聚类算法,以解决具有非常低的调试和维护成本这一挑战。虽然很少出手学习理念,使我们能够委托我们的系统有一些标记示例中,聚类步骤提供在线适应随时间变化的成像环境的目的。除了验证和我们的算法对标准数据集进行比较,我们也证明了我们在流媒体使用我们的新型电池免费摄像头硬件真正的家采集到的图像算法的性能。
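A nearest-centroid sketch of the few-shot idea: build one prototype per occupancy class from a few labeled embeddings, then assign unlabeled images to the closest prototype. The embedding space and two-class setup are assumptions:

```python
import numpy as np

def assign_by_prototype(labeled, labels, unlabeled):
    """Few-shot style occupancy detection: one prototype per class from a
    few labeled image embeddings, then nearest-prototype assignment for
    unlabeled embeddings (a nearest-centroid sketch of the idea)."""
    protos = {c: labeled[labels == c].mean(0) for c in np.unique(labels)}
    cls = list(protos)
    d = np.stack([np.linalg.norm(unlabeled - protos[c], axis=1) for c in cls])
    return np.array(cls)[d.argmin(0)]

emb = np.random.rand(6, 32)          # 6 labeled image embeddings
y = np.array([0, 0, 0, 1, 1, 1])     # 0 = empty, 1 = occupied
print(assign_by_prototype(emb, y, np.random.rand(4, 32)))
```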
44. We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos [PDF] 返回目录
Alex Andonian, Camilo Fosco, Mathew Monfort, Allen Lee, Rogerio Feris, Carl Vondrick, Aude Oliva
Abstract: Identifying common patterns among events is a key ability in human and machine perception, as it underlies intelligent decision making. We propose an approach for learning semantic relational set abstractions on videos, inspired by human learning. We combine visual features with natural language supervision to generate high-level representations of similarities across a set of videos. This allows our model to perform cognitive tasks such as set abstraction (which general concept is in common among a set of videos?), set completion (which new video goes well with the set?), and odd one out detection (which video does not belong to the set?). Experiments on two video benchmarks, Kinetics and Multi-Moments in Time, show that robust and versatile representations emerge when learning to recognize commonalities among sets. We compare our model to several baseline algorithms and show that significant improvements result from explicitly learning relational abstractions with semantic supervision.
摘要:确定事件中常见的模式是人类和机器感知的关键能力,它强调智能决策。我们提出了学习上的视频语义关系的抽象集合,由人类学习的启发的方法。我们结合视觉特征与自然语言监督以产生跨越的一组影片相似的高层表示。这使得我们的模型来进行认知任务,如集抽象(这一般概念是常见的一组影片中?),设置完成(其新的视频与设定顺利?),以及单只的检测(其中视频确实不属于集合?)。在时间上的两个视频基准,动力学和多矩实验,表明稳健和灵活的交涉学会识别套之间的共性时出现。我们我们的模型比较基准数的算法和显示显著的改善,从学习明确语义监督关系的抽象的结果。
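One of the set-level tasks, odd-one-out detection, reduces to distance from the set's mean embedding in a simple instantiation (the encoder and the distance choice are assumptions, not the paper's model):

```python
import torch

def odd_one_out(video_embs):
    """Flag the video farthest from the set's mean embedding: a simple
    instance of reasoning over a relational set abstraction.

    video_embs: (N, D) embeddings of a set of videos (any encoder).
    """
    abstraction = video_embs.mean(0, keepdim=True)   # the set's common concept
    dists = (video_embs - abstraction).norm(dim=1)
    return dists.argmax().item()

embs = torch.randn(5, 64)
embs[3] += 5.0                      # make one video an outlier
print(odd_one_out(embs))            # 3
```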
45. Self-Path: Self-supervision for Classification of Pathology Images with Limited Annotations [PDF] 返回目录
Navid Alemi Koohbanani, Balagopal Unnikrishnan, Syed Ali Khurram, Pavitra Krishnaswamy, Nasir Rajpoot
Abstract: While high-resolution pathology images lend themselves well to 'data hungry' deep learning algorithms, obtaining exhaustive annotations on these images is a major challenge. In this paper, we propose a self-supervised CNN approach to leverage unlabeled data for learning generalizable and domain invariant representations in pathology images. The proposed approach, which we term Self-Path, is a multi-task learning approach where the main task is tissue classification and the pretext tasks are a variety of self-supervised tasks with labels inherent to the input data. We introduce novel domain-specific self-supervision tasks that leverage contextual, multi-resolution and semantic features in pathology images for semi-supervised learning and domain adaptation. We investigate the effectiveness of Self-Path on 3 different pathology datasets. Our results show that Self-Path with the domain-specific pretext tasks achieves state-of-the-art performance for semi-supervised learning when small amounts of labeled data are available. Further, we show that Self-Path improves domain adaptation for classification of histology image patches when there is no labeled data available for the target domain. This approach can potentially be employed for other applications in computational pathology, where the annotation budget is often limited or a large amount of unlabeled image data is available.
摘要:虽然高分辨率图像病理借给自己很好的'数据饥渴”深度学习算法,对这些图像获取详尽的注释是一项重大挑战。在本文中,我们提出了一个自我监督CNN的方法来利用未标记的数据,在病理图像学习普及和域恒定表征。所提出的方法,这是我们长期的自我路径,是一个多任务的学习方法,把主要任务是组织分类和借口任务是各种具有固有的输入数据标签自我监督的任务。我们引入新的特定领域的自我监督的任务,充分利用情境,多分辨率和病理图像语义特征的半监督学习和领域适应性。我们调查自路径对3个不同的病理数据集的有效性。我们的研究结果表明,与特定领域的借口任务自路径实现国家的最先进的性能半监督学习时,可以使用少量的标签数据。此外,我们表明,自路径提高领域适应性的组织学图像块的分类时,有没有标注可用于目标域数据。这种方法可以潜在地被用于在计算病理学,其他应用中注释预算通常是有限的或大量的未标记的图象数据的是可用的。
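One plausible domain-specific pretext task of the kind described, predicting how much a patch was downsampled as a proxy for magnification, yields labels for free; the levels and the resampling details below are assumptions:

```python
import torch
import torch.nn.functional as F

def magnification_pretext_batch(patches, levels=(1, 2, 4)):
    """Generate a pretext batch: each patch is downsampled by a random
    factor and upsampled back, and the network must predict the factor
    (a proxy for magnification; labels come for free from the data).

    patches: (B, 3, H, W) unlabeled tiles; returns (inputs, pretext labels).
    """
    xs, ys = [], []
    for img in patches:
        lvl = torch.randint(len(levels), (1,)).item()
        f = levels[lvl]
        down = F.interpolate(img[None], scale_factor=1 / f,
                             mode='bilinear', align_corners=False)
        up = F.interpolate(down, size=img.shape[-2:],
                           mode='bilinear', align_corners=False)
        xs.append(up[0]); ys.append(lvl)
    return torch.stack(xs), torch.tensor(ys)

x, y = magnification_pretext_batch(torch.rand(8, 3, 64, 64))
print(x.shape, y.shape)
```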
46. Generating Person-Scene Interactions in 3D Scenes [PDF] 返回目录
Siwei Zhang, Yan Zhang, Qianli Ma, Michael J. Black, Siyu Tang
Abstract: High fidelity digital 3D environments have been proposed in recent years; however, it remains extremely challenging to automatically equip such environments with realistic human bodies. Existing work utilizes images, depths, or semantic maps to represent the scene, and parametric human models to represent 3D bodies in the scene. While being straightforward, their generated human-scene interactions often lack naturalness and physical plausibility. Our key observation is that humans interact with the world through body-scene contact. Explicitly and effectively representing the physical contact between the body and the world is essential for modeling human-scene interaction. To that end, we propose a novel interaction representation, which explicitly encodes the proximity between the human body and the 3D scene around it. Specifically, given a set of basis points on a scene mesh, we leverage a conditional variational autoencoder to synthesize the distance from every basis point to its closest point on a human body. The synthesized proximal relationship between the human body and the scene can indicate which region a person tends to contact. Furthermore, based on such synthesized proximity, we can effectively obtain expressive 3D human bodies that naturally interact with the 3D scene. Our perceptual study shows that our model significantly improves on the state-of-the-art method, approaching the realism of real human-scene interaction. We believe our method makes an important step towards the fully automatic synthesis of realistic 3D human bodies in 3D scenes. Our code and model will be publicly available for research purposes.
摘要:高保真数字3D环境在近几年被提出;但是,它仍然极度逼真人体具有挑战性的自动装备这样的环境。现有的工作利用图像,深度或语义图来表示的情景,和参数人体模型来表示场景的3D体。虽然是简单的,其产生的人类场景互动往往缺乏自然和物理合理性。我们的主要发现是,人类通过身体接触的场景在世界的互动。要明确和有效地代表身体和世界之间的身体接触是人的造型,场景互动是必不可少的。为此,我们提出了一种新的互动表现,其中明确编码人体和周围的3D场景之间的接近程度。具体地,给出一套基点场景网,我们利用条件变的自动编码合成从每一个基点的距离对人体的最接近点。人体和场景之间的合成近端关系可指示一个人趋向于接触哪个区域。此外,基于这种合成的接近,我们可以有效地获得表现的3D人体自然与3D场景互动。我们的知觉的研究表明,我们的模型显著提高了国家的最先进的方法,接近真人现场互动的真实感。我们相信,我们的方法使得对逼真的三维人体的3D场景全自动合成的重要一步。我们的代码和型号将公开的研究目的。
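A minimal conditional VAE over per-basis-point distances, in the spirit of the description above; all dimensions, the single-linear-layer encoder/decoder, and the scene encoding are placeholders:

```python
import torch
import torch.nn as nn

class ProximityCVAE(nn.Module):
    """Conditional VAE: given scene features, synthesize per-basis-point
    distances to the closest point on a human body (dims are assumed)."""
    def __init__(self, n_basis=256, scene_dim=128, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(n_basis + scene_dim, 2 * z_dim)
        self.dec = nn.Linear(z_dim + scene_dim, n_basis)
        self.z_dim = z_dim

    def forward(self, dists, scene):
        mu, logvar = self.enc(torch.cat([dists, scene], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(torch.cat([z, scene], -1)), mu, logvar

    def sample(self, scene):
        z = torch.randn(scene.shape[0], self.z_dim)
        return self.dec(torch.cat([z, scene], -1))

model = ProximityCVAE()
recon, mu, logvar = model(torch.rand(4, 256), torch.randn(4, 128))
print(recon.shape)  # torch.Size([4, 256])
```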
47. Facial Expression Recognition Under Partial Occlusion from Virtual Reality Headsets based on Transfer Learning [PDF] 返回目录
Bita Houshmand, Naimul Khan
Abstract: Facial expressions of emotion are a major channel in our daily communications, and they have been the subject of intense research in recent years. To automatically infer facial expressions, convolutional neural network based approaches have become widely adopted due to their proven applicability to the Facial Expression Recognition (FER) task. On the other hand, Virtual Reality (VR) has gained popularity as an immersive multimedia platform, where FER can provide enriched media experiences. However, recognizing facial expression while wearing a head-mounted VR headset is a challenging task due to the upper half of the face being completely occluded. In this paper we attempt to overcome these issues and focus on facial expression recognition in the presence of severe occlusion where the user is wearing a head-mounted display in a VR setting. We propose a geometric model to simulate occlusion resulting from a Samsung Gear VR headset that can be applied to existing FER datasets. Then, we adopt a transfer learning approach, starting from two pretrained networks, namely VGG and ResNet. We further fine-tune the networks on the FER+ and RAF-DB datasets. Experimental results show that our approach achieves comparable results to existing methods while training on three modified benchmark datasets that adhere to realistic occlusion resulting from wearing a commodity VR headset. Code for this paper is available at: this https URL
摘要:情感的面部表情是在我们的日常沟通的重要渠道,并已经深入研究的课题在最近几年。要自动推断出的面部表情,卷积基于神经网络的方法已被广泛采用,由于其成熟的适用于面部表情识别(FER)task.On另一方面虚拟现实(VR)已经得到普及作为一种身临其境的多媒体平台,在这里可以FER提供丰富的媒体体验。然而,认识的面部表情,而穿着的头戴式耳机VR是一个具有挑战性的任务,由于面部的上半部分被完全封闭。在本文中,我们试图克服这些问题,并集中在面部表情识别在用户戴着一个严重阻塞的存在头戴式在VR设定显示。我们提出了一个几何模型从可应用于现有的FER数据集三星齿轮VR耳机产生的模拟闭塞。然后,我们采取转移学习方法,从两个预先训练网络,即VGG和RESNET开始。我们进一步微调的FER +和RAF-DB数据集的网络。实验结果表明,该方法实现了类似的结果,以现有的方法,同时在三个修改基准数据集即坚持现实遮挡从穿着一件商品VR耳机导致训练。代码本文,请访问:此HTTPS URL
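A crude stand-in for the geometric occlusion model: blacking out the upper part of an aligned face crop. The cover ratio is an assumption; the paper models the Samsung Gear VR geometry explicitly:

```python
import numpy as np

def simulate_headset_occlusion(face_img, cover_ratio=0.45):
    """Black out the upper part of an aligned face crop to mimic the
    region hidden by a head-mounted display (the ratio is assumed)."""
    out = face_img.copy()
    h = face_img.shape[0]
    out[: int(h * cover_ratio)] = 0
    return out

face = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
occluded = simulate_headset_occlusion(face)
print((occluded[:100] == 0).all())  # True: upper region is occluded
```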
48. Continual Class Incremental Learning for CT Thoracic Segmentation [PDF] 返回目录
Abdelrahman Elskhawy, Aneta Lisowska, Matthias Keicher, Josep Henry, Paul Thomson, Nassir Navab
Abstract: Deep learning organ segmentation approaches require large amounts of annotated training data, which is limited in supply due to reasons of confidentiality and the time required for expert manual annotation. Therefore, being able to train models incrementally without having access to previously used data is desirable. A common form of sequential training is fine-tuning (FT). In this setting, a model learns a new task effectively, but loses performance on previously learned tasks. The Learning without Forgetting (LwF) approach addresses this issue by replaying its own predictions for past tasks during model training. In this work, we evaluate FT and LwF for class incremental learning in multi-organ segmentation using the publicly available AAPM dataset. We show that LwF can successfully retain knowledge on previous segmentations; however, its ability to learn a new class decreases with the addition of each class. To address this problem we propose an adversarial continual learning segmentation approach (ACLSeg), which disentangles the feature space into task-specific and task-invariant features. This enables preservation of performance on past tasks and effective acquisition of new knowledge.
摘要:深学习器官分割方法要求,由于保密和专家手动标注所需的时间的原因,大量的注释的训练数据,这是限量供应的。因此,能够训练模型递增,而无需访问以前使用的数据是理想的。连续训练的常见的形式是微调(FT)。在这种背景下,模型有效地学习新的任务,但失去了对先前学习的任务中的表现。没有遗忘(LWF)的方式解决了学习通过模型训练期间重播了自己过去的工作预测这个问题。在这项工作中,我们评估FT和LWF类增量使用可公开获得的数据集AAPM在多器官分割学习。我们表明,LWF能够成功留住知识以前的分割,然而,其学习新类的能力与每增加一个班的降低。为了解决这个问题,我们提出了一种对抗性不断学习的分割方法(ACLSeg),它理顺了那些纷繁功能空间划分为任务的特定和任务不变特征。这使得对过去的任务和有效获取新知识的保鲜性能。
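For reference, the LwF objective combines a supervised loss on the new task with a distillation term that replays the frozen old model's predictions; the temperature and weighting below are typical assumed values, and the loss is shown per-image rather than per-pixel for brevity:

```python
import torch
import torch.nn.functional as F

def lwf_loss(new_logits, old_logits_frozen, new_target, T=2.0, lam=1.0):
    """Learning-without-Forgetting loss: cross-entropy on the new task
    plus distillation against the frozen old model's soft predictions."""
    ce = F.cross_entropy(new_logits, new_target)
    old_p = F.softmax(old_logits_frozen / T, dim=1)
    new_logp = F.log_softmax(new_logits[:, : old_p.shape[1]] / T, dim=1)
    distill = F.kl_div(new_logp, old_p, reduction='batchmean') * T * T
    return ce + lam * distill

logits = torch.randn(4, 5, requires_grad=True)   # 4 old classes + 1 new
old = torch.randn(4, 4)                          # frozen model's outputs
print(lwf_loss(logits, old, torch.randint(5, (4,))).item())
```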
49. Co-training for On-board Deep Object Detection [PDF] 返回目录
Gabriel Villalonga, Antonio M. Lopez
Abstract: Providing ground truth supervision to train visual models has been a bottleneck over the years, exacerbated by domain shifts which degenerate the performance of such models. This was the case when visual tasks relied on handcrafted features and shallow machine learning and, despite its unprecedented performance gains, the problem remains open within the deep learning paradigm due to its data-hungry nature. Best-performing deep vision-based object detectors are trained in a supervised manner by relying on human-labeled bounding boxes which localize class instances (i.e. objects) within the training images. Thus, object detection is one such task for which human labeling is a major bottleneck. In this paper, we assess co-training as a semi-supervised learning method for self-labeling objects in unlabeled images, so reducing the human-labeling effort for developing deep object detectors. Our study pays special attention to a scenario involving domain shift; in particular, when we have automatically generated virtual-world images with object bounding boxes and we have real-world images which are unlabeled. Moreover, we are particularly interested in using co-training for deep object detection in the context of driver assistance systems and/or self-driving vehicles. Thus, using well-established datasets and protocols for object detection in these application contexts, we will show how co-training is a paradigm worth pursuing for alleviating object labeling, working both alone and together with task-agnostic domain adaptation.
摘要:提供地面实况监督训练可视化模型一直是多年来的一个瓶颈,通过这种退化模型的性能域的变化而加剧。这是情况下,当视觉任务依赖于手工制作的特点和浅机器学习,尽管其前所未有的性能提升,问题依旧深度学习模式中打开,由于其大量数据的性质。最好执行深度基于视觉的对象检测器通过依赖于该训练images.Thus内本地化类实例(ieobjects)人标记的边界框在监督方式训练,物体检测是这样的任务的一个为其中人工标记的是一个主要瓶颈。在本文中,我们评估协同训练为在未标记的图像自动标注对象半监督学习方法,因此减少了对深发展对象探测器人类标签的努力。我们的研究特别关注涉及领域转移的情形;特别是,当我们自动生成与对象边界框虚拟世界的图像和我们有未标记现实世界的图像。此外,我们特别感兴趣的驾驶员辅助系统和/或自动驾驶车辆的情况下使用深对象检测协同训练。因此,利用已有的数据集和协议,这些应用程序上下文对象检测,我们将展示协同训练是怎样一个范例值得追求缓解对象标记,工作单独和与工作无关的领域适应性在一起。
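A minimal two-view co-training loop: each round, the models pseudo-label the unlabeled samples they are most confident about, and those labels grow the shared training set. Logistic regression and the confidence rule stand in for the paper's object detectors and selection policy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1, X2, y, U1, U2, rounds=3, k=5):
    """Two-view co-training sketch: per round, pick the k unlabeled samples
    with the highest confidence under either view, pseudo-label them, and
    add them to the labeled pool seen by both views."""
    for _ in range(rounds):
        m1 = LogisticRegression().fit(X1, y)
        m2 = LogisticRegression().fit(X2, y)
        p1, p2 = m1.predict_proba(U1), m2.predict_proba(U2)
        pick = np.argsort(np.maximum(p1.max(1), p2.max(1)))[-k:]
        pseudo = np.where(p1.max(1)[pick] >= p2.max(1)[pick],
                          p1.argmax(1)[pick], p2.argmax(1)[pick])
        X1, X2 = np.vstack([X1, U1[pick]]), np.vstack([X2, U2[pick]])
        y = np.concatenate([y, pseudo])
        keep = np.setdiff1d(np.arange(len(U1)), pick)
        U1, U2 = U1[keep], U2[keep]
    return m1, m2

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8)); y = (X[:, 0] > 0).astype(int)
U = rng.normal(size=(30, 8))
co_train(X[:, :4], X[:, 4:], y, U[:, :4], U[:, 4:])
```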
50. Mitigating Dataset Imbalance via Joint Generation and Classification [PDF] 返回目录
Aadarsh Sahoo, Ankit Singh, Rameswar Panda, Rogerio Feris, Abir Das
Abstract: Supervised deep learning methods are enjoying enormous success in many practical applications of computer vision and have the potential to revolutionize robotics. However, their marked performance degradation under biases and imbalanced data calls the reliability of these methods into question. In this work we address these questions from the perspective of dataset imbalance resulting from severe under-representation of annotated training data for certain classes and its effect on both deep classification and generation methods. We introduce a joint dataset repairment strategy by combining a neural network classifier with Generative Adversarial Networks (GAN) that makes up for the deficit of training examples from the under-represented class by producing additional training examples. We show that the combined training helps to improve the robustness of both the classifier and the GAN against severe class imbalance. We show the effectiveness of our proposed approach on three very different datasets with different degrees of imbalance in them. The code is available at this https URL.
摘要:深监督学习方法,正享受着计算机视觉的许多实际应用了巨大的成功,并有可能彻底改变机器人的潜力。然而,标记的性能下降的偏见和不平衡数据问题的方法的可靠性。在这项工作中,我们从解决某些类别及其两个深分类和生成方法效果注释的训练数据的严重代表性不足而导致了数据集不平衡的角度来看这些问题。我们通过结合创成对抗性网络(GAN)通过产生额外的培训例子弥补了从下representated一流的培训例子赤字神经网络分类介绍了联合数据集修复策略。我们表明,联合训练有助于提高分类和危害严重的类不平衡GAN二者的鲁棒性。我们显示我们提出的方法对不同程度的在他们的不平衡三个非常不同的数据集的有效性。该代码可在此HTTPS URL。
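下面用 PyTorch 勾勒“分类器 + 生成器联合训练以补足弱势类”的核心思路。仅为示意:省略了判别器与对抗损失,网络规模与数据均为玩具设置,并非论文的实际模型。

```python
import torch
import torch.nn as nn

# 玩具示例:G 为弱势类别合成样本,C 同时在真实与合成样本上训练。
# 此处省略了对抗(判别器)分支,仅演示“生成补足弱势类”的联合更新。
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))  # 噪声 -> 伪样本
C = nn.Sequential(nn.Linear(8, 2))                                  # 样本 -> 类别logits
opt = torch.optim.Adam(list(G.parameters()) + list(C.parameters()), lr=1e-3)
ce = nn.CrossEntropyLoss()

real_x = torch.randn(64, 8)              # 代替真实样本特征
real_y = torch.randint(0, 2, (64,))      # 实际场景中该分布是严重不平衡的
minority = 1                             # 假设类别1为弱势类

for _ in range(10):
    fake_x = G(torch.randn(32, 16))      # 为弱势类合成额外训练样本
    fake_y = torch.full((32,), minority)
    loss = ce(C(real_x), real_y) + ce(C(fake_x), fake_y)
    opt.zero_grad(); loss.backward(); opt.step()
```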
51. Free View Synthesis [PDF] 返回目录
Gernot Riegler, Vladlen Koltun
Abstract: We present a method for novel view synthesis from input images that are freely distributed around a scene. Our method does not rely on a regular arrangement of input views, can synthesize images for free camera movement through the scene, and works for general scenes with unconstrained geometric layouts. We calibrate the input images via SfM and erect a coarse geometric scaffold via MVS. This scaffold is used to create a proxy depth map for a novel view of the scene. Based on this depth map, a recurrent encoder-decoder network processes reprojected features from nearby views and synthesizes the new view. Our network does not need to be optimized for a given scene. After training on a dataset, it works in previously unseen environments with no fine-tuning or per-scene optimization. We evaluate the presented approach on challenging real-world datasets, including Tanks and Temples, where we demonstrate successful view synthesis for the first time and substantially outperform prior and concurrent work.
摘要:我们提出一种从在场景周围自由分布的输入图像进行新视角合成的方法。我们的方法不依赖于输入视角的规则排列,可以为穿越场景的自由相机运动合成图像,并适用于几何布局不受约束的一般场景。我们通过SfM标定输入图像,并通过MVS构建粗糙的几何支架。该支架用于为场景的新视角生成代理深度图。基于该深度图,一个循环编码器-解码器网络处理从邻近视角重投影而来的特征并合成新视角。我们的网络无需针对给定场景进行优化:在某个数据集上训练后,它即可在此前未见过的环境中工作,无需微调或逐场景优化。我们在具有挑战性的真实世界数据集(包括Tanks and Temples)上评估了所提方法,首次在其上展示了成功的视角合成,并大幅超越了此前及同期的工作。
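该方法的核心步骤之一,是利用代理深度图把邻近视角的特征重投影到新视角。下面给出这一重投影操作的一个最小化示意(针孔相机模型、最近邻取样,均为演示性假设,与论文的网络实现无关):

```python
import numpy as np

def reproject(src_feat, depth_tgt, K, T_tgt2src):
    """按目标视角的代理深度,把源视角特征图重投影到目标视角(最近邻)。"""
    H, W, _ = src_feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T  # 齐次像素坐标
    pts = np.linalg.inv(K) @ pix * depth_tgt.reshape(-1)             # 目标相机系下3D点
    pts = T_tgt2src[:3, :3] @ pts + T_tgt2src[:3, 3:4]               # 变换到源相机系
    uv = K @ pts
    uv = np.round(uv[:2] / uv[2]).astype(int)
    ok = (0 <= uv[0]) & (uv[0] < W) & (0 <= uv[1]) & (uv[1] < H)
    out = np.zeros_like(src_feat).reshape(-1, src_feat.shape[-1])
    out[ok] = src_feat[uv[1, ok], uv[0, ok]]
    return out.reshape(H, W, -1)

K = np.array([[100., 0, 32], [0, 100., 32], [0, 0, 1]])
warped = reproject(np.random.rand(64, 64, 8), np.full((64, 64), 2.0), K, np.eye(4))
```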
52. Multi-level Stress Assessment Using Multi-domain Fusion of ECG Signal [PDF] 返回目录
Zeeshan Ahmad, Naimul Khan
Abstract: Stress analysis and assessment of affective states of mind using ECG as a physiological signal is a burning research topic in biomedical signal processing. However, existing literature provides only binary assessment of stress, while multiple levels of assessment may be more beneficial for healthcare applications. Furthermore, in present research, the ECG signal for stress analysis is examined independently in the spatial domain or in transform domains, but the advantage of fusing these domains has not been fully utilized. To get the maximum advantage of fusing different domains, we introduce a dataset with multiple stress levels and then classify these levels using a novel deep learning approach, converting the ECG signal into signal images based on R-R peaks without any feature extraction. Moreover, we make the signal images multimodal and multidomain by converting them into the time-frequency and frequency domains using the Gabor wavelet transform (GWT) and the Discrete Fourier Transform (DFT), respectively. Convolutional Neural Networks (CNNs) are used to extract features from the different modalities, and decision-level fusion is then performed to improve the classification accuracy. The experimental results on an in-house dataset collected from 15 users show that with the proposed fusion framework and ECG signal-to-image conversion, we reach an average accuracy of 85.45%.
摘要:以心电图(ECG)作为生理信号进行压力分析和情感状态评估,是生物医学信号处理领域的热门研究课题。然而,现有文献仅提供二元的压力评估,而多级评估可能对医疗应用更有价值。此外,现有研究仅在空间域或变换域中单独分析用于压力分析的ECG信号,尚未充分利用融合这些域所带来的优势。为了最大限度地发挥多域融合的优势,我们引入了一个包含多个压力级别的数据集,并提出一种新的深度学习方法:基于R-R峰将ECG信号转换为信号图像,无需任何特征提取即可对压力级别分类。此外,我们分别使用Gabor小波变换(GWT)和离散傅里叶变换(DFT)将信号图像转换到时频域和频域,使其成为多模态、多域的表示。我们使用卷积神经网络(CNN)从不同模态中提取特征,再通过决策级融合来提高分类精度。在由15名用户采集的内部数据集上的实验结果表明,借助所提出的融合框架和ECG信号到图像的转换,我们取得了85.45%的平均准确率。
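摘要中的“决策级融合”可以概括为:每个域各训练一个CNN,推理时对各自的类别后验取平均。下面是一个独立可运行的示意(用随机logits代替三个CNN的输出,压力等级数为假设):

```python
import torch
import torch.nn.functional as F

def fuse_decisions(logits_list):
    """决策级融合:对各模态的softmax后验取平均。"""
    probs = [F.softmax(l, dim=1) for l in logits_list]
    return torch.stack(probs).mean(dim=0)

spatial = torch.randn(1, 4)   # 空间域(R-R信号图像)CNN 的logits
gwt     = torch.randn(1, 4)   # 时频域(Gabor小波图像)CNN 的logits
dft     = torch.randn(1, 4)   # 频域(DFT图像)CNN 的logits
pred = fuse_decisions([spatial, gwt, dft]).argmax(dim=1)  # 预测的压力等级
```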
53. Multi-Mask Self-Supervised Learning for Physics-Guided Neural Networks in Highly Accelerated MRI [PDF] 返回目录
Burhaneddin Yaman, Seyed Amir Hossein Hosseini, Steen Moeller, Jutta Ellermann, Kâmil Uğurbil, Mehmet Akçakaya
Abstract: Purpose: To develop an improved self-supervised learning strategy that efficiently uses the acquired data for training a physics-guided reconstruction network without a database of fully-sampled data. Methods: Currently self-supervised learning for physics-guided reconstruction networks splits acquired undersampled data into two disjoint sets, where one is used for data consistency (DC) in the unrolled network and the other to define the training loss. The proposed multi-mask self-supervised learning via data undersampling (SSDU) splits acquired measurements into multiple pairs of disjoint sets for each training sample, while using one of these sets for DC units and the other for defining loss, thereby more efficiently using the undersampled data. Multi-mask SSDU is applied on fully-sampled 3D knee and prospectively undersampled 3D brain MRI datasets, which are retrospectively subsampled to acceleration rate (R)=8, and compared to CG-SENSE and single-mask SSDU DL-MRI, as well as supervised DL-MRI when fully-sampled data is available. Results: Results on knee MRI show that the proposed multi-mask SSDU outperforms SSDU and performs closely with supervised DL-MRI, while significantly outperforming CG-SENSE. A clinical reader study further ranks the multi-mask SSDU higher than supervised DL-MRI in terms of SNR and aliasing artifacts. Results on brain MRI show that multi-mask SSDU achieves better reconstruction quality compared to SSDU and CG-SENSE. Reader study demonstrates that multi-mask SSDU at R=8 significantly improves reconstruction compared to single-mask SSDU at R=8, as well as CG-SENSE at R=2. Conclusion: The proposed multi-mask SSDU approach enables improved training of physics-guided neural networks without fully-sampled data, by enabling efficient use of the undersampled data with multiple masks.
摘要:目的:开发一种改进的自监督学习策略,在没有全采样数据库的情况下,高效利用已采集数据来训练物理引导的重建网络。方法:现有的物理引导重建网络自监督学习方法将采集到的欠采样数据划分为两个不相交的集合,其中一个用于展开网络中的数据一致性(DC)项,另一个用于定义训练损失。所提出的基于数据欠采样的多掩模自监督学习(multi-mask SSDU)为每个训练样本将采集的测量数据多次划分为成对的不相交集合,每对中一个集合用于DC单元,另一个用于定义损失,从而更高效地利用欠采样数据。多掩模SSDU被应用于回顾性降采样至加速率R=8的全采样3D膝关节数据集,以及前瞻性欠采样的3D脑部MRI数据集,并与CG-SENSE和单掩模SSDU DL-MRI进行比较;在全采样数据可用时还与有监督DL-MRI进行比较。结果:膝关节MRI结果表明,所提出的多掩模SSDU优于单掩模SSDU,性能接近有监督DL-MRI,并显著优于CG-SENSE。一项临床阅片研究进一步表明,在信噪比和混叠伪影方面,多掩模SSDU的评分高于有监督DL-MRI。脑部MRI结果表明,多掩模SSDU的重建质量优于单掩模SSDU和CG-SENSE。阅片研究显示,R=8下的多掩模SSDU相比R=8下的单掩模SSDU以及R=2下的CG-SENSE均显著改善了重建效果。结论:所提出的多掩模SSDU方法通过多个掩模高效利用欠采样数据,无需全采样数据即可改进物理引导神经网络的训练。
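多掩模SSDU的要点是:把同一份欠采样k空间数据反复划分成多对互不相交的(DC集合, 损失集合)。下面是这一划分思路的示意(均匀随机划分与比例均为假设,并非论文的具体采样方案):

```python
import numpy as np

def multi_mask_split(acquired_idx, k=4, loss_frac=0.4, seed=0):
    """对已采集的k空间位置做k次不相交的(DC, loss)划分。"""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(k):
        perm = rng.permutation(acquired_idx)
        n_loss = int(loss_frac * len(perm))
        pairs.append((perm[n_loss:], perm[:n_loss]))  # (DC索引, 损失索引)
    return pairs

acquired = np.flatnonzero(np.random.default_rng(1).random(256) < 0.25)  # 玩具欠采样掩模
for dc, loss in multi_mask_split(acquired):
    assert set(dc).isdisjoint(loss)               # 每对集合互不相交
```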
54. Deep Learning to Quantify Pulmonary Edema in Chest Radiographs [PDF] 返回目录
Steven Horng, Ruizhi Liao, Xin Wang, Sandeep Dalal, Polina Golland, Seth J Berkowitz
Abstract: Background: Clinical management decisions for acutely decompensated CHF patients are often based on grades of pulmonary edema severity, rather than its mere absence or presence. The grading of pulmonary edema on chest radiographs is based on well-known radiologic findings. Purpose: We develop a clinical machine learning task to grade pulmonary edema severity and release both the underlying data and code to serve as a benchmark for future algorithmic developments in machine vision. Materials and Methods: We collected 369,071 chest radiographs and their associated radiology reports from 64,581 patients from the MIMIC-CXR chest radiograph dataset. We extracted pulmonary edema severity labels from the associated radiology reports as 4 ordinal levels: no edema (0), vascular congestion (1), interstitial edema (2), and alveolar edema (3). We developed machine learning models using two standard approaches: 1) a semi-supervised model using a variational autoencoder and 2) a pre-trained supervised learning model using a dense neural network. Results: We measured the area under the receiver operating characteristic curve (AUROC) from the semi-supervised model and the pre-trained model. AUROC for differentiating alveolar edema from no edema was 0.99 and 0.87 (semi-supervised and pre-trained models). Performance of the algorithm was inversely related to the difficulty in categorizing milder states of pulmonary edema: 2 vs 0 (0.88, 0.81), 1 vs 0 (0.79, 0.66), 3 vs 1 (0.93, 0.82), 2 vs 1 (0.69, 0.73), 3 vs 2 (0.88, 0.63). Conclusion: Accurate grading of pulmonary edema on chest radiographs is a clinically important task. Application of state-of-the-art machine learning techniques can produce a novel quantitative imaging biomarker from one of the oldest and most widely available imaging modalities.
摘要:背景:急性失代偿性心力衰竭(CHF)患者的临床管理决策往往基于肺水肿的严重程度分级,而非仅仅判断其有无。胸片上肺水肿的分级依据公认的影像学表现。目的:我们构建了一项对肺水肿严重程度进行分级的临床机器学习任务,并公开其底层数据和代码,作为机器视觉领域未来算法发展的基准。材料与方法:我们从MIMIC-CXR胸片数据集中收集了64,581名患者的369,071张胸片及其对应的放射学报告,并从报告中提取出4个有序等级的肺水肿严重程度标签:无水肿(0)、血管充血(1)、间质性水肿(2)和肺泡性水肿(3)。我们采用两种标准方法构建机器学习模型:1)基于变分自编码器的半监督模型;2)基于稠密神经网络的预训练有监督学习模型。结果:我们测量了半监督模型和预训练模型的受试者工作特征曲线下面积(AUROC)。区分肺泡性水肿与无水肿的AUROC分别为0.99和0.87(半监督模型与预训练模型)。算法性能与区分较轻肺水肿状态的难度呈负相关:2对0(0.88,0.81)、1对0(0.79,0.66)、3对1(0.93,0.82)、2对1(0.69,0.73)、3对2(0.88,0.63)。结论:在胸片上对肺水肿进行准确分级是一项重要的临床任务。应用最先进的机器学习技术,可以从这一最古老、最普及的成像方式之一中得到一种新的定量影像生物标志物。
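摘要中诸如“2对0”的成对AUROC,可按如下方式计算:只保留两个等级的样本,以较高等级为正类评分。以下为用合成数据演示的示意(标签与评分均为随机生成,仅说明指标的计算方式):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.random.randint(0, 4, size=500)       # 序数水肿等级 0..3(合成数据)
scores = y + np.random.randn(500)           # 代替模型输出的连续评分

def pairwise_auroc(y, s, lo, hi):
    """仅在等级 lo 与 hi 的样本上计算AUROC,正类为 hi。"""
    mask = (y == lo) | (y == hi)
    return roc_auc_score((y[mask] == hi).astype(int), s[mask])

print(pairwise_auroc(y, scores, 0, 2))      # 对应摘要中的 “2 对 0”
```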
55. Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations [PDF] 返回目录
Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, Raquel Urtasun
Abstract: In this paper we propose a novel end-to-end learnable network that performs joint perception, prediction and motion planning for self-driving vehicles and produces interpretable intermediate representations. Unlike existing neural motion planners, our motion planning costs are consistent with our perception and prediction estimates. This is achieved by a novel differentiable semantic occupancy representation that is explicitly used as cost by the motion planning process. Our network is learned end-to-end from human demonstrations. The experiments in a large-scale manual-driving dataset and closed-loop simulation show that the proposed model significantly outperforms state-of-the-art planners in imitating the human behaviors while producing much safer trajectories.
摘要:本文提出一种新颖的端到端可学习网络,为自动驾驶车辆联合执行感知、预测和运动规划,并产生可解释的中间表示。与现有的神经运动规划器不同,我们的运动规划代价与感知和预测的估计结果保持一致。这通过一种新颖的可微分语义占据表示来实现,该表示被运动规划过程显式地用作代价。我们的网络以端到端方式从人类驾驶示范中学习。在大规模人工驾驶数据集和闭环仿真中的实验表明,所提模型在模仿人类驾驶行为方面显著优于最先进的规划器,同时产生更安全的轨迹。
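“把语义占据表示显式用作规划代价”可以用一个极简例子说明:对每条候选轨迹,累加其经过的BEV栅格的占据概率,取代价最小者。栅格尺寸、分辨率与直线候选轨迹均为演示性假设:

```python
import numpy as np

occupancy = np.random.rand(100, 100)        # 每个BEV栅格的占据概率(演示用)

def trajectory_cost(traj_xy, grid, res=0.5):
    """沿轨迹累加占据概率作为代价(可微版本中用双线性采样)。"""
    cells = np.clip((traj_xy / res).astype(int), 0, np.array(grid.shape) - 1)
    return grid[cells[:, 0], cells[:, 1]].sum()

candidates = [np.stack([np.linspace(0, 40, 20), np.full(20, y)], axis=1)
              for y in (18.0, 25.0, 32.0)]   # 三条平行直线候选轨迹
best = min(candidates, key=lambda t: trajectory_cost(t, occupancy))
```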
56. Motion Similarity Modeling -- A State of the Art Report [PDF] 返回目录
Anna Sebernegg, Peter Kán, Hannes Kaufmann
Abstract: The analysis of human motion opens up a wide range of possibilities, such as realistic training simulations or authentic motions in robotics or animation. One of the problems underlying motion analysis is the meaningful comparison of actions based on similarity measures. Since the motion analysis is application-dependent, it is essential to find the appropriate motion similarity method for the particular use case. This state of the art report provides an overview of human motion analysis and different similarity modeling methods, while mainly focusing on approaches that work with 3D motion data. The survey summarizes various similarity aspects and features of motion and describes approaches to measuring the similarity between two actions.
摘要:对人体运动的分析开辟了广泛的应用可能,例如逼真的训练仿真,或机器人与动画中的真实运动。运动分析的基础问题之一,是基于相似性度量对动作进行有意义的比较。由于运动分析依赖于具体应用,为特定用例找到合适的运动相似性方法至关重要。这份技术现状报告综述了人体运动分析和各种相似性建模方法,重点关注处理3D运动数据的方法。报告总结了运动的各种相似性维度与特征,并介绍了度量两个动作之间相似性的各类方法。
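作为此类综述通常涵盖的相似性度量的一个具体例子,下面给出动作分析中常用的动态时间规整(DTW)的朴素实现(O(nm)动态规划;数据为随机生成,仅作演示,并非报告原文内容):

```python
import numpy as np

def dtw(a, b):
    """两段运动序列(帧 x 特征)之间的DTW距离。"""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # 帧间姿态距离
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

motion_a = np.random.rand(50, 17 * 3)   # 50帧、17个3D关节
motion_b = np.random.rand(60, 17 * 3)
print(dtw(motion_a, motion_b))
```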
57. Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [PDF] 返回目录
Ying Cheng, Ruize Wang, Zhihao Pan, Rui Feng, Yuejie Zhang
Abstract: When watching videos, the occurrence of a visual event is often accompanied by an audio event, e.g., the voice accompanying lip motion or the music of playing instruments. There is an underlying correlation between audio and visual events, which can be utilized as free supervised information to train a neural network by solving the pretext task of audio-visual synchronization. In this paper, we propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos in the wild, and further benefit downstream tasks. Specifically, we explore three different co-attention modules to focus on discriminative visual regions correlated to the sounds and introduce the interactions between them. Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods. To further evaluate the generalizability and transferability of our approach, we apply the pre-trained model on two downstream tasks, i.e., sound source localization and action recognition. Extensive experiments demonstrate that our model provides competitive results with other self-supervised methods, and also indicate that our approach can tackle the challenging scenes which contain multiple sound sources.
摘要:观看视频时,视觉事件的发生往往伴随着音频事件,例如伴随唇部运动的语音、演奏乐器的音乐。音频与视觉事件之间存在内在关联,可将其作为免费的监督信息,通过求解视听同步这一前置任务(pretext task)来训练神经网络。本文提出一种带有协同注意力机制的新型自监督框架,从真实环境中的无标注视频中学习通用的跨模态表示,并进一步惠及下游任务。具体而言,我们探索了三种不同的协同注意力模块,用于聚焦与声音相关的判别性视觉区域,并引入它们之间的交互。实验表明,我们的模型在前置任务上达到了最先进的性能,且与现有方法相比参数更少。为进一步评估方法的泛化性和可迁移性,我们将预训练模型应用于两个下游任务:声源定位和动作识别。大量实验表明,我们的模型取得了与其他自监督方法相当的竞争性结果,并且能够应对包含多个声源的困难场景。
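“协同注意力”的一般形态可以概括为:音频特征作为query,在视觉空间特征上做注意力,以聚焦与声音相关的区域。下面是一个通用实现的示意(维度均为假设,并非论文的具体模块):

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """音频->视觉方向的注意力:用音频query加权视觉空间位置。"""
    def __init__(self, d_a=128, d_v=512, d=256):
        super().__init__()
        self.q = nn.Linear(d_a, d)
        self.k = nn.Linear(d_v, d)
        self.v = nn.Linear(d_v, d)

    def forward(self, audio, visual):   # audio: (B, d_a), visual: (B, N, d_v)
        attn = torch.softmax(
            self.q(audio).unsqueeze(1) @ self.k(visual).transpose(1, 2)
            / self.q.out_features ** 0.5, dim=-1)          # (B, 1, N)
        return (attn @ self.v(visual)).squeeze(1)          # 聚合后的视觉表示 (B, d)

m = CoAttention()
out = m(torch.randn(2, 128), torch.randn(2, 49, 512))      # 7x7 视觉网格
```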
58. Multi-Modality Pathology Segmentation Framework: Application to Cardiac Magnetic Resonance Images [PDF] 返回目录
Zhen Zhang, Chenyu Liu, Wangbin Ding, Sihan Wang, Chenhao Pei, Mingjing Yang, Liqin Huang
Abstract: Multi-sequence cardiac magnetic resonance (CMR) images can provide complementary information for myocardial pathology (scar and edema). However, it remains challenging to effectively fuse this underlying information for pathology segmentation. This work presents an automatic cascade pathology segmentation framework based on multi-modality CMR images. It mainly consists of two neural networks: an anatomical structure segmentation network (ASSN) and a pathological region segmentation network (PRSN). Specifically, the ASSN aims to segment the anatomical structure where the pathology may exist, and it can provide a spatial prior for the pathological region segmentation. In addition, we integrate a denoising auto-encoder (DAE) into the ASSN to generate segmentation results with plausible shapes. The PRSN is designed to segment the pathological region based on the result of the ASSN, in which a fusion block based on channel attention is proposed to better aggregate multi-modality information from multi-modality CMR images. Experiments on the MyoPS2020 challenge dataset show that our framework can achieve promising performance for myocardial scar and edema segmentation.
摘要:多序列心脏磁共振(CMR)图像可为心肌病变(疤痕和水肿)提供互补信息。然而,如何有效融合这些底层信息以进行病变分割仍具挑战性。本工作提出一种基于多模态CMR图像的自动级联病变分割框架,主要由两个神经网络组成:解剖结构分割网络(ASSN)和病变区域分割网络(PRSN)。具体而言,ASSN旨在分割病变可能存在的解剖结构,为病变区域分割提供空间先验;此外,我们在ASSN中集成了去噪自编码器(DAE),以生成形状合理的分割结果。PRSN基于ASSN的结果分割病变区域,其中提出了一种基于通道注意力的融合模块,以更好地聚合来自多模态CMR图像的信息。在MyoPS2020挑战赛数据集上的实验表明,我们的框架在心肌疤痕和水肿分割上能够取得可观的性能。
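PRSN中“基于通道注意力的融合模块”可按squeeze-and-excitation的思路示意:拼接各模态特征图后,学习逐通道的权重再加权。通道数等均为演示性假设:

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """拼接多模态特征,并用全局池化+门控MLP做逐通道加权。"""
    def __init__(self, c_per_mod=32, n_mod=3, r=4):
        super().__init__()
        c = c_per_mod * n_mod
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c, c // r), nn.ReLU(),
            nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, feats):            # feats: 每个模态 (B, c_per_mod, H, W)
        x = torch.cat(feats, dim=1)
        w = self.gate(x)[..., None, None]
        return x * w                     # 按学习到的通道注意力加权

fuse = ChannelAttentionFusion()
y = fuse([torch.randn(1, 32, 64, 64) for _ in range(3)])
```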
59. Weight Equalizing Shift Scaler-Coupled Post-training Quantization [PDF] 返回目录
Jihun Oh, SangJeong Lee, Meejeong Park, Pooni Walagaurav, Kiseok Kwon
Abstract: Post-training, layer-wise quantization is preferable because it is free from retraining and is hardware-friendly. Nevertheless, accuracy degradation occurs when a neural network model has large differences among per-output-channel weight ranges. In particular, the MobileNet family suffers a catastrophic drop in top-1 accuracy, from 70.60%~71.87% to 0.1%, on the ImageNet dataset after 8-bit weight quantization. To mitigate this significant accuracy loss, we propose a new weight equalizing shift scaler, i.e., rescaling the weight range of each channel by a 4-bit binary shift prior to layer-wise quantization. To recover the original output range, the inverse binary shift is efficiently fused into the existing per-layer scale compounding in the fixed-point convolutional operator of the custom neural processing unit. The binary shift is a key feature of our algorithm, which significantly improves accuracy without increasing the memory footprint. As a result, our proposed method achieved a top-1 accuracy of 69.78%~70.96% on MobileNets and showed robust performance across varying network models and tasks, competitive with channel-wise quantization results.
摘要:训练后的逐层量化因无需重新训练且对硬件友好而更受青睐。然而,当神经网络模型各输出通道的权重范围差异很大时,会出现精度下降。特别地,经过8位权重量化后,MobileNet系列在ImageNet数据集上的top-1准确率从70.60%~71.87%灾难性地跌至0.1%。为缓解这种显著的精度损失,我们提出一种新的权重均衡移位缩放器:在逐层量化之前,通过4位二进制移位对每个通道的权重范围进行重新缩放。为恢复原始输出范围,逆二进制移位被高效地融合进自定义神经处理单元定点卷积算子中已有的逐层缩放中。二进制移位是本算法的关键特性,它在不增加内存占用的情况下显著提升了精度。最终,所提方法在MobileNets上取得了69.78%~70.96%的top-1准确率,并在不同网络模型和任务上表现稳健,可与逐通道量化的结果相媲美。
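“权重均衡移位缩放”的核心可以示意如下:先用4位二进制移位(2的幂)把各输出通道的权重范围拉齐,再做一次逐层量化;推理时把逆移位折叠进逐层输出缩放。参考范围的取法等细节为假设:

```python
import numpy as np

def equalize_and_quantize(w, n_bits=8):
    """w: (out_ch, in_ch)。先逐通道2^shift均衡,再逐层对称量化。"""
    ranges = np.abs(w).max(axis=1)
    shifts = np.clip(np.round(np.log2(ranges.max() / ranges)), 0, 15)  # 4位移位
    w_eq = w * (2.0 ** shifts)[:, None]            # 逐通道二进制移位
    scale = np.abs(w_eq).max() / (2 ** (n_bits - 1) - 1)
    q = np.round(w_eq / scale).astype(np.int8)     # 单一的逐层量化
    return q, scale, shifts                        # 逆移位在输出端折叠回去

w = np.random.randn(16, 64) * np.logspace(-2, 0, 16)[:, None]  # 通道范围差异很大
q, scale, shifts = equalize_and_quantize(w)
```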
60. Revisiting Temporal Modeling for Video Super-resolution [PDF] 返回目录
Takashi Isobe, Fang Zhu, Shengjin Wang
Abstract: Video super-resolution plays an important role in surveillance video analysis and ultra-high-definition video display, and has drawn much attention in both the research and industrial communities. Although many deep learning-based VSR methods have been proposed, it is hard to compare these methods directly, since the different loss functions and training datasets have a significant impact on the super-resolution results. In this work, we carefully study and compare three temporal modeling methods (2D CNN with early fusion, 3D CNN with slow fusion, and Recurrent Neural Network) for video super-resolution. We also propose a novel Recurrent Residual Network (RRN) for efficient video super-resolution, in which residual learning is utilized to stabilize the training of the RNN and meanwhile boost the super-resolution performance. Extensive experiments show that the proposed RRN is highly computationally efficient and produces temporally consistent VSR results with finer details than other temporal modeling methods. Besides, the proposed method achieves state-of-the-art results on several widely used benchmarks.
摘要:视频超分辨率在监控视频分析和超高清视频显示中发挥着重要作用,受到学术界和工业界的广泛关注。尽管已有许多基于深度学习的视频超分辨率(VSR)方法被提出,但由于各方法的损失函数和训练数据集不同,且二者对超分辨率结果影响显著,这些方法难以直接比较。在这项工作中,我们细致地研究并比较了三种用于视频超分辨率的时间建模方法(带早期融合的2D CNN、带慢融合的3D CNN、以及循环神经网络)。我们还提出一种新的循环残差网络(RRN)用于高效的视频超分辨率,其中残差学习既用于稳定RNN的训练,又能提升超分辨率性能。大量实验表明,所提出的RRN计算效率高,其产生的时间一致的VSR结果比其他时间建模方法具有更精细的细节。此外,该方法在多个广泛使用的基准上取得了最先进的结果。
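“循环残差”的基本单元可示意如下:隐藏状态随帧传递,每一步在双线性上采样结果之上预测残差(残差学习用于稳定RNN训练)。尺寸为玩具设置,并非论文的确切结构:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RRNCell(nn.Module):
    """随帧传递隐藏状态,在上采样图像之上预测残差。"""
    def __init__(self, c=16, scale=4):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(
            nn.Conv2d(3 + c, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU())
        self.to_res = nn.Conv2d(c, 3 * scale * scale, 3, padding=1)

    def forward(self, lr_frame, hidden):
        h = self.body(torch.cat([lr_frame, hidden], dim=1))
        up = F.interpolate(lr_frame, scale_factor=self.scale, mode='bilinear',
                           align_corners=False)
        sr = up + F.pixel_shuffle(self.to_res(h), self.scale)  # 残差学习
        return sr, h

cell, hidden = RRNCell(), torch.zeros(1, 16, 32, 32)
for t in range(5):                                   # 沿时间展开
    sr, hidden = cell(torch.randn(1, 3, 32, 32), hidden)
```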
61. AdaIN-Switchable CycleGAN for Efficient Unsupervised Low-Dose CT Denoising [PDF] 返回目录
Jawook Gu, Jong Chul Ye
Abstract: Recently, deep learning approaches have been extensively studied for low-dose CT denoising, thanks to their superior performance and fast computation. In particular, cycleGAN has been demonstrated as a powerful unsupervised learning scheme to improve the low-dose CT image quality without requiring matched high-dose reference data. Unfortunately, one of the main limitations of the cycleGAN approach is that it requires two deep neural network generators at the training phase, although only one of them is used at the inference phase. The secondary auxiliary generator is needed to enforce the cycle-consistency, but the additional memory requirement and the increase in learnable parameters are the main hurdles for cycleGAN training. To address this issue, here we propose a novel cycleGAN architecture using a single switchable generator. In particular, a single generator is implemented using adaptive instance normalization (AdaIN) layers, so that the baseline generator converting a low-dose CT image to a routine-dose CT image can be switched to a generator converting high-dose to low-dose by simply changing the AdaIN code. Thanks to the shared baseline network, the additional memory requirement and weight increases are minimized, and the training can be done more stably even with small training data. Experimental results show that the proposed method outperforms the previous cycleGAN approaches while using only about half the parameters.
摘要:近年来,深度学习方法凭借其优越的性能和较快的计算速度,在低剂量CT去噪领域得到了广泛研究。特别地,cycleGAN已被证明是一种强大的无监督学习方案,无需匹配的高剂量参考数据即可提升低剂量CT图像质量。遗憾的是,cycleGAN方法的主要局限之一在于训练阶段需要两个深度神经网络生成器,而推理阶段只使用其中一个。第二个辅助生成器用于强制循环一致性,但由此带来的额外内存需求和可学习参数的增加是cycleGAN训练的主要障碍。针对这一问题,本文提出一种使用单个可切换生成器的新型cycleGAN架构。具体而言,单个生成器通过自适应实例归一化(AdaIN)层实现,使得将低剂量CT图像转换为常规剂量CT图像的基线生成器,只需改变AdaIN编码即可切换为将高剂量转换为低剂量的生成器。得益于共享的基线网络,额外的内存需求和权重增长被降至最低,即使训练数据较少也能更稳定地训练。实验结果表明,所提方法仅用约一半的参数便优于以往的cycleGAN方法。
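“可切换生成器”的思路可示意如下:共享同一骨干,仅用不同的AdaIN编码(每个转换方向一组gamma/beta)来调制实例归一化后的特征。以下为玩具尺寸的草图,并非论文的实际网络:

```python
import torch
import torch.nn as nn

class SwitchableGen(nn.Module):
    """单一骨干 + 按方向查表的AdaIN编码(0: 低->常规剂量, 1: 高->低剂量)。"""
    def __init__(self, c=32):
        super().__init__()
        self.conv = nn.Conv2d(1, c, 3, padding=1)
        self.norm = nn.InstanceNorm2d(c, affine=False)
        self.codes = nn.Embedding(2, 2 * c)       # 每个方向一组 (gamma, beta)
        self.out = nn.Conv2d(c, 1, 3, padding=1)

    def forward(self, x, direction):              # direction: LongTensor, 0 或 1
        gamma, beta = self.codes(direction).chunk(2, dim=-1)
        h = self.norm(self.conv(x))
        h = h * (1 + gamma[..., None, None]) + beta[..., None, None]
        return self.out(torch.relu(h))

g = SwitchableGen()
a = g(torch.randn(1, 1, 64, 64), torch.tensor([0]))   # 去噪方向
b = g(torch.randn(1, 1, 64, 64), torch.tensor([1]))   # 反方向,仅切换AdaIN编码
```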
62. Towards Modality Transferable Visual Information Representation with Optimal Model Compression [PDF] 返回目录
Rongqun Lin, Linwei Zhu, Shiqi Wang, Sam Kwong
Abstract: Compactly representing the visual signals is of fundamental importance in various image/video-centered applications. Although numerous approaches were developed for improving the image and video coding performance by removing the redundancies within visual signals, much less work has been dedicated to the transformation of the visual signals to another well-established modality for better representation capability. In this paper, we propose a new scheme for visual signal representation that leverages the philosophy of transferable modality. In particular, the deep learning model, which characterizes and absorbs the statistics of the input scene with online training, could be efficiently represented in the sense of rate-utility optimization to serve as the enhancement layer in the bitstream. As such, the overall performance can be further guaranteed by optimizing the new modality incorporated. The proposed framework is implemented on the state-of-the-art video coding standard (i.e., versatile video coding), and significantly better representation capability has been observed based on extensive evaluations.
摘要:紧凑地表示视觉信号在各类以图像/视频为中心的应用中至关重要。尽管已有大量方法通过去除视觉信号内部的冗余来提升图像和视频的编码性能,但将视觉信号转换为另一种成熟的模态以获得更好表示能力的工作却少得多。本文提出一种新的视觉信号表示方案,利用了可迁移模态的思想。具体而言,通过在线训练来刻画并吸收输入场景统计特性的深度学习模型,可以在码率-效用优化的意义下被高效表示,作为码流中的增强层。如此,通过优化所引入的新模态,可以进一步保证整体性能。所提框架在最先进的视频编码标准(即多功能视频编码)上实现,大量评估表明其表示能力显著更优。
63. Procedural Urban Forestry [PDF] 返回目录
Till Niese, Sören Pirk, Bedrich Benes, Oliver Deussen
Abstract: The placement of vegetation plays a central role in the realism of virtual scenes. We introduce procedural placement models (PPMs) for vegetation in urban layouts. PPMs are environmentally sensitive to city geometry and allow identifying plausible plant positions based on structural and functional zones in an urban layout. PPMs can either be directly used by defining their parameters or can be learned from satellite images and land register data. Together with approaches for generating buildings and trees, this allows us to populate urban landscapes with complex 3D vegetation. The effectiveness of our framework is shown through examples of large-scale city scenes and close-ups of individually grown tree models; we also validate it by a perceptual user study.
摘要:植被的布置对虚拟场景的真实感起着核心作用。我们为城市布局中的植被引入了程序化布置模型(PPM)。PPM对城市几何具有环境敏感性,能够基于城市布局中的结构和功能分区确定合理的植物位置。PPM既可以通过定义参数直接使用,也可以从卫星图像和土地登记数据中学习得到。结合生成建筑和树木的方法,这使我们能够用复杂的3D植被填充城市景观。我们通过大规模城市场景的示例和单独生长的树木模型的特写展示了该框架的有效性,并通过一项感知用户研究对其进行了验证。
64. DSM-Net: Disentangled Structured Mesh Net for Controllable Generation of Fine Geometry [PDF] 返回目录
Jie Yang, Kaichun Mo, Yu-Kun Lai, Leonidas J. Guibas, Lin Gao
Abstract: 3D shape generation is a fundamental operation in computer graphics. While significant progress has been made, especially with recent deep generative models, it remains a challenge to synthesize high-quality geometric shapes with rich detail and complex structure, in a controllable manner. To tackle this, we introduce DSM-Net, a deep neural network that learns a disentangled structured mesh representation for 3D shapes, where two key aspects of shapes, geometry and structure, are encoded in a synergistic manner to ensure plausibility of the generated shapes, while also being disentangled as much as possible. This supports a range of novel shape generation applications with intuitive control, such as interpolation of structure (geometry) while keeping geometry (structure) unchanged. To achieve this, we simultaneously learn structure and geometry through variational autoencoders (VAEs) in a hierarchical manner for both, with bijective mappings at each level. In this manner we effectively encode geometry and structure in separate latent spaces, while ensuring their compatibility: the structure is used to guide the geometry and vice versa. At the leaf level, the part geometry is represented using a conditional part VAE, to encode high-quality geometric details, guided by the structure context as the condition. Our method not only supports controllable generation applications, but also produces high-quality synthesized shapes, outperforming state-of-the-art methods.
摘要:三维形状生成是计算机图形学中的一项基础操作。尽管已经取得了显著进展,尤其是在近期深度生成模型的推动下,以可控方式合成细节丰富、结构复杂的高质量几何形状仍然是一项挑战。为此,我们提出DSM-Net,一种为三维形状学习解耦的结构化网格表示的深度神经网络:形状的两个关键方面(几何与结构)以协同方式编码,以保证生成形状的合理性,同时二者尽可能地相互解耦。这支持了一系列具有直观控制的新颖形状生成应用,例如在保持几何(结构)不变的同时对结构(几何)进行插值。为实现这一点,我们通过分层的变分自编码器(VAE)同时学习结构和几何,并在每一层级建立双射映射。通过这种方式,我们在各自独立的隐空间中有效编码几何与结构,同时确保二者的兼容性:结构用于引导几何,反之亦然。在叶子层级,部件几何由条件化的部件VAE表示,以结构上下文为条件来编码高质量的几何细节。我们的方法不仅支持可控的生成应用,还能产生高质量的合成形状,优于最先进的方法。
注:中文为机器翻译结果!封面为论文标题词云图!