Contents
1. Precision Gating: Improving Neural Network Efficiency with Dynamic Dual-Precision Activations [PDF] Abstract
2. Lake Ice Detection from Sentinel-1 SAR with Deep Learning [PDF] Abstract
3. Learning Architectures for Binary Networks [PDF] Abstract
4. Patient-Specific Finetuning of Deep Learning Models for Adaptive Radiotherapy in Prostate CT [PDF] Abstract
5. Amplifying The Uncanny [PDF] Abstract
6. Hierarchical Rule Induction Network for Abstract Visual Reasoning [PDF] Abstract
7. DeepDualMapper: A Gated Fusion Network for Automatic Map Extraction using Aerial Images and Trajectories [PDF] Abstract
8. Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting [PDF] Abstract
9. On the Similarity of Deep Learning Representations Across Didactic and Adversarial Examples [PDF] Abstract
10. Discernible Compressed Images via Deep Perception Consistency [PDF] Abstract
11. CQ-VQA: Visual Question Answering on Categorized Questions [PDF] Abstract
12. Deep Domain Adaptive Object Detection: a Survey [PDF] Abstract
13. Unsupervised Image-generation Enhanced Adaptation for Object Detection in Thermal images [PDF] Abstract
14. Superpixel Segmentation via Convolutional Neural Networks with Regularized Information Maximization [PDF] Abstract
15. Directional Deep Embedding and Appearance Learning for Fast Video Object Segmentation [PDF] Abstract
16. Generator From Edges: Reconstruction of Facial Images [PDF] Abstract
17. AOL: Adaptive Online Learning for Human Trajectory Prediction in Dynamic Video Scenes [PDF] Abstract
18. PeelNet: Textured 3D reconstruction of human body using single view RGB image [PDF] Abstract
19. Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings [PDF] Abstract
20. Block Annotation: Better Image Annotation for Semantic Segmentation with Sub-Image Decomposition [PDF] Abstract
24. Automated Labelling using an Attention model for Radiology reports of MRI scans (ALARM) [PDF] Abstract
36. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [PDF] Abstract
37. Cell R-CNN V3: A Novel Panoptic Paradigm for Instance Segmentation in Biomedical Images [PDF] Abstract
39. Historical Document Processing: A Survey of Techniques, Tools, and Trends [PDF] Abstract
40. Single Unit Status in Deep Convolutional Neural Network Codes for Face Identification: Sparseness Redefined [PDF] Abstract
43. Social-WaGDAT: Interaction-aware Trajectory Prediction via Wasserstein Graph Double-Attention Network [PDF] Abstract
47. PCSGAN: Perceptual Cyclic-Synthesized Generative Adversarial Networks for Thermal and NIR to Visible Image Transformation [PDF] Abstract
48. Large-scale biometry with interpretable neural network regression on UK Biobank body MRI [PDF] Abstract
52. Gaussian Smoothen Semantic Features (GSSF) -- Exploring the Linguistic Aspects of Visual Captioning in Indian Languages (Bengali) Using MSCOCO Framework [PDF] Abstract
55. 3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans [PDF] Abstract
56. Learning representations of irregular particle-detector geometry with distance-weighted graph networks [PDF] Abstract
Abstracts
1. Precision Gating: Improving Neural Network Efficiency with Dynamic Dual-Precision Activations [PDF] Back to Contents
Yichi Zhang, Ritchie Zhao, Weizhe Hua, Nayun Xu, G. Edward Suh, Zhiru Zhang
Abstract: We propose precision gating (PG), an end-to-end trainable dynamic dual-precision quantization technique for deep neural networks. PG computes most features in a low precision and only a small proportion of important features in a higher precision to preserve accuracy. The proposed approach is applicable to a variety of DNN architectures and significantly reduces the computational cost of DNN execution with almost no accuracy loss. Our experiments indicate that PG achieves excellent results on CNNs, including statically compressed mobile-friendly networks such as ShuffleNet. Compared to the state-of-the-art prediction-based quantization schemes, PG achieves the same or higher accuracy with 2.4$\times$ less compute on ImageNet. PG furthermore applies to RNNs. Compared to 8-bit uniform quantization, PG obtains a 1.2% improvement in perplexity per word with 2.7$\times$ computational cost reduction on LSTM on the Penn Tree Bank dataset.
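The gist of dual-precision computation can be illustrated with a short sketch. Note that the selection rule below (treating the largest-magnitude low-precision values as "important") and the plain uniform quantizer are illustrative assumptions; PG itself learns its gating end-to-end.

```python
import numpy as np

def quantize(x, bits):
    # Uniform quantizer (assumes non-negative activations, e.g. post-ReLU;
    # illustrative, not the paper's learned scheme).
    scale = (2 ** bits - 1) / max(float(x.max()), 1e-8)
    return np.round(x * scale) / scale

def precision_gated(x, low_bits=4, high_bits=8, ratio=0.1):
    # Compute all features at low precision, then recompute only a small
    # fraction `ratio` of "important" features at high precision.
    low = quantize(x, low_bits)
    k = max(1, int(ratio * x.size))
    idx = np.argsort(np.abs(low).ravel())[-k:]   # proxy importance: magnitude
    out = low.ravel().copy()
    out[idx] = quantize(x, high_bits).ravel()[idx]
    return out.reshape(x.shape)
```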
2. Lake Ice Detection from Sentinel-1 SAR with Deep Learning [PDF] Back to Contents
Manu Tom, Roberto Aguilar, Pascal Imhof, Silvan Leinss, Emmanuel Baltsavias, Konrad Schindler
Abstract: Lake ice, as part of the Essential Climate Variable (ECV) lakes, is an important indicator to monitor climate change and global warming. The spatio-temporal extent of lake ice cover, along with the timings of key phenological events such as freeze-up and break-up, provides important cues about the local and global climate. We present a lake ice monitoring system based on the automatic analysis of Sentinel-1 Synthetic Aperture Radar (SAR) data with a deep neural network. In previous studies that used optical satellite imagery for lake ice monitoring, frequent cloud cover was a main limiting factor, which we overcome thanks to the ability of microwave sensors to penetrate clouds and observe the lakes regardless of the weather and illumination conditions. We cast ice detection as a two class (frozen, non-frozen) semantic segmentation problem and solve it using a state-of-the-art deep convolutional network (CNN). We report results on two winters ($2016-17$ and $2017-18$) and three alpine lakes in Switzerland, including cross-validation tests to assess the generalisation to unseen lakes and winters. The proposed model reaches mean Intersection-over-Union (mIoU) scores >90% on average, and >84% even for the most difficult lake.
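The headline metric, mean Intersection-over-Union over the two classes, is standard; the snippet below only makes the reported scores concrete for integer label maps.

```python
import numpy as np

def mean_iou(pred, target, num_classes=2):
    # Mean IoU over the (frozen, non-frozen) classes.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```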
3. Learning Architectures for Binary Networks [PDF] Back to Contents
Kunal Pratap Singh, Dahyun Kim, Jonghyun Choi
Abstract: Backbone architectures of most binary networks are well-known floating point architectures, such as the ResNet family. Questioning that the architectures designed for floating-point networks would not be the best for binary networks, we propose to search architectures for binary networks (BNAS). Specifically, based on the cell based search method, we define a new set of layer types, design a new cell template, and rediscover the utility of and propose to use the Zeroise layer to learn well-performing binary networks. In addition, we propose to diversify early search to learn better performing binary architectures. We show that our searched binary networks outperform state-of-the-art binary networks on CIFAR10 and ImageNet datasets.
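Among the layer types in the search space, the Zeroise layer is simple enough to sketch. The PyTorch rendering below is an assumption based only on the name and the abstract: it maps any input to zeros, so a cell edge whose chosen operation is Zeroise is effectively removed.

```python
import torch
import torch.nn as nn

class Zeroise(nn.Module):
    # Candidate operation that outputs zeros regardless of its input,
    # effectively dropping an edge from the searched cell.
    def forward(self, x):
        return torch.zeros_like(x)
```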
4. Patient-Specific Finetuning of Deep Learning Models for Adaptive Radiotherapy in Prostate CT [PDF] Back to Contents
Mohamed S. Elmahdy, Tanuj Ahuja, U. A. van der Heide, Marius Staring
Abstract: Contouring of the target volume and Organs-At-Risk (OARs) is a crucial step in radiotherapy treatment planning. In an adaptive radiotherapy setting, updated contours need to be generated based on daily imaging. In this work, we leverage personalized anatomical knowledge accumulated over the treatment sessions, to improve the segmentation accuracy of a pre-trained Convolution Neural Network (CNN), for a specific patient. We investigate a transfer learning approach, fine-tuning the baseline CNN model to a specific patient, based on imaging acquired in earlier treatment fractions. The baseline CNN model is trained on a prostate CT dataset from one hospital of 379 patients. This model is then fine-tuned and tested on an independent dataset of another hospital of 18 patients, each having 7 to 10 daily CT scans. For the prostate, seminal vesicles, bladder and rectum, the model fine-tuned on each specific patient achieved a Mean Surface Distance (MSD) of $1.64 \pm 0.43$ mm, $2.38 \pm 2.76$ mm, $2.30 \pm 0.96$ mm, and $1.24 \pm 0.89$ mm, respectively, which was significantly better than the baseline model. The proposed personalized model adaptation is therefore very promising for clinical implementation in the context of adaptive radiotherapy of prostate cancer.
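The transfer-learning step amounts to continuing training of the baseline segmentation CNN on one patient's earlier fractions. A minimal sketch, assuming a PyTorch model and a per-patient loader of (CT, mask) pairs; `patient_loader`, the learning rate, and the step budget are hypothetical:

```python
import torch
import torch.nn as nn

def finetune_for_patient(model, patient_loader, lr=1e-4, max_steps=100):
    # Continue training the baseline CNN on the patient's own daily scans.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for step, (ct, mask) in enumerate(patient_loader):
        opt.zero_grad()
        loss = loss_fn(model(ct), mask)
        loss.backward()
        opt.step()
        if step + 1 >= max_steps:
            break
    return model
```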
5. Amplifying The Uncanny [PDF] Back to Contents
Terence Broad, Frederic Fol Leymarie, Mick Grierson
Abstract: Deep neural networks have become remarkably good at producing realistic deepfakes, images of people that are (to the untrained eye) indistinguishable from real images. These are produced by algorithms that learn to distinguish between real and fake images and are optimised to generate samples that the system deems realistic. This paper, and the resulting series of artworks Being Foiled explore the aesthetic outcome of inverting this process and instead optimising the system to generate images that it sees as being fake. Maximising the unlikelihood of the data and in turn, amplifying the uncanny nature of these machine hallucinations.
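Inverting the usual generator objective, so the generator is rewarded when the discriminator calls its output fake, fits in one line. This is a sketch of the general idea only; the paper's exact training setup may differ.

```python
import torch
import torch.nn.functional as F

def inverted_generator_loss(d_logits_on_fakes):
    # Standard GAN training pushes D(G(z)) toward "real" (target 1);
    # here we push it toward "fake" (target 0), maximising unlikelihood.
    return F.binary_cross_entropy_with_logits(
        d_logits_on_fakes, torch.zeros_like(d_logits_on_fakes))
```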
6. Hierarchical Rule Induction Network for Abstract Visual Reasoning [PDF] Back to Contents
Sheng Hu, Yuqing Ma, Xianglong Liu, Yanlu Wei, Shihao Bai
Abstract: Abstract reasoning refers to the ability to analyze information, discover rules at an intangible level, and solve problems in innovative ways. Raven's Progressive Matrices (RPM) test is typically used to examine the capability of abstract reasoning. In the test, the subject is asked to identify the correct choice from the answer set to fill the missing panel at the bottom right of RPM (e.g., a 3$\times$3 matrix), following the underlying rules inside the matrix. Recent studies, taking advantage of Convolutional Neural Networks (CNNs), have achieved encouraging progress to accomplish the RPM test problems. Unfortunately, simply relying on the relation extraction at the matrix level, they fail to recognize the complex attribute patterns inside or across rows/columns of RPM. To address this problem, in this paper we propose a Hierarchical Rule Induction Network (HriNet), by intimating human induction strategies. HriNet extracts multiple granularity rule embeddings at different levels and integrates them through a gated embedding fusion module. We further introduce a rule similarity metric based on the embeddings, so that HriNet can not only be trained using a tuplet loss but also infer the best answer according to the similarity score. To comprehensively evaluate HriNet, we first fix the defects contained in the very recent RAVEN dataset and generate a new one named Balanced-RAVEN. Then extensive experiments are conducted on the large-scale dataset PGM and our Balanced-RAVEN, the results of which show that HriNet outperforms the state-of-the-art models by a large margin.
7. DeepDualMapper: A Gated Fusion Network for Automatic Map Extraction using Aerial Images and Trajectories [PDF] Back to Contents
Hao Wu, Hanyuan Zhang, Xinyu Zhang, Weiwei Sun, Baihua Zheng, Yuning Jiang
Abstract: Automatic map extraction is of great importance to urban computing and location-based services. Aerial image and GPS trajectory data refer to two different data sources that could be leveraged to generate the map, although they carry different types of information. Most previous works on data fusion between aerial images and data from auxiliary sensors do not fully utilize the information of both modalities and hence suffer from the issue of information loss. We propose a deep convolutional neural network called DeepDualMapper which fuses the aerial image and trajectory data in a more seamless manner to extract the digital map. We design a gated fusion module to explicitly control the information flows from both modalities in a complementary-aware manner. Moreover, we propose a novel densely supervised refinement decoder to generate the prediction in a coarse-to-fine way. Our comprehensive experiments demonstrate that DeepDualMapper can fuse the information of images and trajectories much more effectively than existing approaches, and is able to generate maps with higher accuracy.
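A gated fusion module of the kind described, with a learned gate deciding per location how much to weight each modality, might look like the following sketch; the channel sizes and the single-gate convex blend are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Learn a per-pixel gate that blends image features with trajectory
    # features in a complementary-aware way (illustrative sketch).
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, img_feat, traj_feat):
        g = self.gate(torch.cat([img_feat, traj_feat], dim=1))
        return g * img_feat + (1.0 - g) * traj_feat
```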
8. Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting [PDF] Back to Contents
Liang Qiao, Sanli Tang, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Shiliang Pu, Fei Wu
Abstract: Many approaches have recently been proposed to detect irregular scene text and achieved promising results. However, their localization results may not well satisfy the following text recognition part mainly because of two reasons: 1) recognizing arbitrary shaped text is still a challenging task, and 2) prevalent non-trainable pipeline strategies between text detection and text recognition will lead to suboptimal performances. To handle this incompatibility problem, in this paper we propose an end-to-end trainable text spotting approach named Text Perceptron. Concretely, Text Perceptron first employs an efficient segmentation-based text detector that learns the latent text reading order and boundary information. Then a novel Shape Transform Module (abbr. STM) is designed to transform the detected feature regions into regular morphologies without extra parameters. It unites text detection and the following recognition part into a whole framework, and helps the whole network achieve global optimization. Experiments show that our method achieves competitive performance on two standard text benchmarks, i.e., ICDAR 2013 and ICDAR 2015, and also obviously outperforms existing methods on irregular text benchmarks SCUT-CTW1500 and Total-Text.
9. On the Similarity of Deep Learning Representations Across Didactic and Adversarial Examples [PDF] Back to Contents
Pamela K. Douglas, Farzad Vasheghani Farahani
Abstract: The increasing use of deep neural networks (DNNs) has motivated a parallel endeavor: the design of adversaries that profit from successful misclassifications. However, not all adversarial examples are crafted for malicious purposes. For example, real world systems often contain physical, temporal, and sampling variability across instrumentation. Adversarial examples in the wild may inadvertently prove deleterious for accurate predictive modeling. Conversely, naturally occurring covariance of image features may serve didactic purposes. Here, we studied the stability of deep learning representations for neuroimaging classification across didactic and adversarial conditions characteristic of MRI acquisition variability. We show that representational similarity and performance vary according to the frequency of adversarial examples in the input space.
10. Discernible Compressed Images via Deep Perception Consistency [PDF] Back to Contents
Zhaohui Yang, Yunhe Wang, Chao Xu, Chang Xu
Abstract: Image compression, as one of the fundamental low-level image processing tasks, is very essential for computer vision. Conventional image compression methods tend to obtain compressed images by minimizing their appearance discrepancy with the corresponding original images, but pay little attention to their efficacy in downstream perception tasks, e.g., image recognition and object detection. In contrast, this paper aims to produce compressed images by pursuing both appearance and perception consistency. Based on the encoder-decoder framework, we propose using a pre-trained CNN to extract features of original and compressed images. In addition, the maximum mean discrepancy (MMD) is employed to minimize the difference between feature distributions. The resulting compression network can generate images with high image quality and preserve the consistent perception in the feature domain, so that these images can be well recognized by pre-trained machine learning models. Experiments on benchmarks demonstrate the superiority of the proposed algorithm over comparison methods.
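The distribution-matching term is the maximum mean discrepancy between features of original and compressed images. Below is a biased MMD^2 estimate with a Gaussian kernel; the kernel choice and bandwidth are assumptions, since the abstract does not specify them.

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    # Biased MMD^2 estimate between two feature batches of shape (N, D).
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```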
11. CQ-VQA: Visual Question Answering on Categorized Questions [PDF] Back to Contents
Aakansha Mishra, Ashish Anand, Prithwijit Guha
Abstract: This paper proposes CQ-VQA, a novel 2-level hierarchical but end-to-end model to solve the task of visual question answering (VQA). The first level of CQ-VQA, referred to as question categorizer (QC), classifies questions to reduce the potential answer search space. The QC uses attended and fused features of the input question and image. The second level, referred to as answer predictor (AP), comprises of a set of distinct classifiers corresponding to each question category. Depending on the question category predicted by QC, only one of the classifiers of AP remains active. The loss functions of QC and AP are aggregated together to make it an end-to-end model. The proposed model (CQ-VQA) is evaluated on the TDIUC dataset and is benchmarked against state-of-the-art approaches. Results indicate competitive or better performance of CQ-VQA.
12. Deep Domain Adaptive Object Detection: a Survey [PDF] Back to Contents
Wanyi Li, Fuyu Li, Yongkang Luo, Peng Wang
Abstract: Deep learning (DL) based object detection has achieved great progress. These methods typically assume that large amount of labeled training data is available, and training and test data are drawn from an identical distribution. However, the two assumptions are not always hold in practice. Deep domain adaptive object detection (DDAOD) has emerged as a new learning paradigm to address the above mentioned challenges. This paper aims to review the state-of-the-art progress on deep domain adaptive object detection approaches. Firstly, we introduce briefly the basic concepts of deep domain adaptation. Secondly, the deep domain adaptive detectors are classified into four categories and detailed descriptions of representative methods in each category are provided. Finally, insights for future research trend are presented.
13. Unsupervised Image-generation Enhanced Adaptation for Object Detection in Thermal images [PDF] Back to Contents
Wanyi Li, Fuyu Li, Yongkang Luo, Peng Wang
Abstract: Object detection in thermal images is an important computer vision task and has many applications such as unmanned vehicles, robotics, surveillance and night vision. Deep learning based detectors have achieved major progress, which usually need large amount of labelled training data. However, labelled data for object detection in thermal images is scarce and expensive to collect. How to take advantage of the large number labelled visible images and adapt them into thermal image domain, is expected to solve. This paper proposes an unsupervised image-generation enhanced adaptation method for object detection in thermal images. To reduce the gap between visible domain and thermal domain, the proposed method manages to generate simulated fake thermal images that are similar to the target images, and preserves the annotation information of the visible source domain. The image generation includes a CycleGAN based image-to-image translation and an intensity inversion transformation. Generated fake thermal images are used as renewed source domain. And then the off-the-shelf Domain Adaptive Faster RCNN is utilized to reduce the gap between generated intermediate domain and the thermal target domain. Experiments demonstrate the effectiveness and superiority of the proposed method.
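Of the two image-generation steps, the intensity inversion transformation is the easy one to show: on an 8-bit image it flips bright and dark so the translated image's polarity better matches thermal imagery. A sketch of the obvious reading of the abstract:

```python
import numpy as np

def invert_intensity(img_uint8):
    # Map each 8-bit pixel value v to 255 - v (flip bright/dark polarity).
    return np.uint8(255) - img_uint8
```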
14. Superpixel Segmentation via Convolutional Neural Networks with Regularized Information Maximization [PDF] Back to Contents
Teppei Suzuki
Abstract: We propose an unsupervised superpixel segmentation method by optimizing a randomly-initialized convolutional neural network (CNN) in inference time. Our method generates superpixels via CNN from a single image without any labels by minimizing a proposed objective function for superpixel segmentation in inference time. There are three advantages to our method compared with many of existing methods: (i) leverages an image prior of CNN for superpixel segmentation, (ii) adaptively changes the number of superpixels according to the given images, and (iii) controls the property of superpixels by adding an auxiliary cost to the objective function. We verify the advantages of our method quantitatively and qualitatively on BSDS500 and SBD datasets.
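Regularized information maximization for clustering typically maximizes the mutual information between inputs and (here) superpixel assignments: confident per-pixel assignments (low conditional entropy) plus balanced cluster usage (high marginal entropy). A minimal sketch of such an objective, not necessarily the paper's exact formulation or auxiliary cost:

```python
import torch

def rim_loss(probs, lam=1.0, eps=1e-8):
    # probs: (num_pixels, num_superpixels) soft assignments from the CNN.
    cond_ent = -(probs * (probs + eps).log()).sum(dim=1).mean()
    marginal = probs.mean(dim=0)
    marg_ent = -(marginal * (marginal + eps).log()).sum()
    return cond_ent - lam * marg_ent  # minimising this maximises I(x; c)
```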
15. Directional Deep Embedding and Appearance Learning for Fast Video Object Segmentation [PDF] Back to Contents
Yingjie Yin, De Xu, Xingang Wang, Lei Zhang
Abstract: Most recent semi-supervised video object segmentation (VOS) methods rely on fine-tuning deep convolutional neural networks online using the given mask of the first frame or predicted masks of subsequent frames. However, the online fine-tuning process is usually time-consuming, limiting the practical use of such methods. We propose a directional deep embedding and appearance learning (DDEAL) method, which is free of the online fine-tuning process, for fast VOS. First, a global directional matching module, which can be efficiently implemented by parallel convolutional operations, is proposed to learn a semantic pixel-wise embedding as an internal guidance. Second, an effective directional appearance model based statistics is proposed to represent the target and background on a spherical embedding space for VOS. Equipped with the global directional matching module and the directional appearance model learning module, DDEAL learns static cues from the labeled first frame and dynamically updates cues of the subsequent frames for object segmentation. Our method exhibits state-of-the-art VOS performance without using online fine-tuning. Specifically, it achieves a J & F mean score of 74.8% on DAVIS 2017 dataset and an overall score G of 71.3% on the large-scale YouTube-VOS dataset, while retaining a speed of 25 fps with a single NVIDIA TITAN Xp GPU. Furthermore, our faster version runs 31 fps with only a little accuracy loss. Our code and trained networks are available at this https URL.
16. Generator From Edges: Reconstruction of Facial Images [PDF] Back to Contents
Nao Takano, Gita Alaghband
Abstract: Applications that involve supervised training require paired images. Researchers of single image super-resolution (SISR) create such images by artificially generating blurry input images from the corresponding ground truth. Similarly we can create paired images with the canny edge. We propose Generator From Edges (GFE) [Figure 2]. Our aim is to determine the best architecture for GFE, along with reviews of perceptual loss [1, 2]. To this end, we conducted three experiments. First, we explored the effects of the adversarial loss often used in SISR. In particular, we uncovered that it is not an essential component to form a perceptual loss. Eliminating adversarial loss will lead to a more effective architecture from the perspective of hardware resource. It also means that considerations for the problems pertaining to generative adversarial network (GAN) [3], such as mode collapse, are not necessary. Second, we reexamined VGG loss and found that the mid-layers yield the best results. By extracting the full potential of VGG loss, the overall performance of perceptual loss improves significantly. Third, based on the findings of the first two experiments, we reevaluated the dense network to construct GFE. Using GFE as an intermediate process, reconstructing a facial image from a pencil sketch can become an easy task.
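Creating the paired training data with Canny edges is straightforward with OpenCV; the thresholds below are illustrative, and error handling for missing files is omitted.

```python
import cv2

def make_edge_pair(image_path, low=100, high=200):
    # Build a (canny-edge input, ground-truth image) training pair.
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)
    return edges, img
```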
17. AOL: Adaptive Online Learning for Human Trajectory Prediction in Dynamic Video Scenes [PDF] Back to Contents
Manh Huynh, Gita Alaghband
Abstract: We present a novel adaptive online learning (AOL) framework to predict human movement trajectories in dynamic video scenes. Our framework learns and adapts to changes in the scene environment and generates best network weights for different scenarios. The framework can be applied to prediction models and improve their performance as it dynamically adjusts when it encounters changes in the scene and can apply the best training weights for predicting the next locations. We demonstrate this by integrating our framework with two existing prediction models: LSTM [3] and Future Person Location (FPL) [1]. Furthermore, we analyze the number of network weights for optimal performance and show that we can achieve real-time with a fixed number of networks using the least recently used (LRU) strategy for maintaining the most recently trained network weights. With extensive experiments, we show that our framework increases prediction accuracies of LSTM and FPL by ~17% and 28% on average, and up to ~50% for FPL on the worst case while achieving real-time (20fps).
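The fixed-size pool of recently trained network weights can be maintained with a textbook LRU policy; below is a sketch using `OrderedDict`, with the capacity and keying scheme as illustrative assumptions.

```python
from collections import OrderedDict

class LRUWeightStore:
    # Keep at most `capacity` weight sets; evict the least recently used.
    def __init__(self, capacity=5):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, weights):
        self.store[key] = weights
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # drop the stalest weights
```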
18. PeelNet: Textured 3D reconstruction of human body using single view RGB image [PDF] Back to Contents
Sai Sagar Jinka, Rohan Chacko, Avinash Sharma, P. J. Narayanan
Abstract: Reconstructing human shape and pose from a single image is a challenging problem due to issues like severe self-occlusions, clothing variations, and changes in lighting to name a few. Many applications in the entertainment industry, e-commerce, health-care (physiotherapy), and mobile-based AR/VR platforms can benefit from recovering the 3D human shape, pose, and texture. In this paper, we present PeelNet, an end-to-end generative adversarial framework to tackle the problem of textured 3D reconstruction of the human body from a single RGB image. Motivated by ray tracing for generating realistic images of a 3D scene, we tackle this problem by representing the human body as a set of peeled depth and RGB maps which are obtained by extending rays beyond the first intersection with the 3D object. This formulation allows us to handle self-occlusions efficiently. Current parametric model-based approaches fail to model loose clothing and surface-level details and are proposed for the underlying naked human body. Majority of non-parametric approaches are either computationally expensive or provide unsatisfactory results. We present a simple non-parametric solution where the peeled maps are generated from a single RGB image as input. Our proposed peeled depth maps are back-projected to 3D volume to obtain a complete 3D shape. The corresponding RGB maps provide vertex-level texture details. We compare our method against current state-of-the-art methods in 3D reconstruction and demonstrate the effectiveness of our method on BUFF and MonoPerfCap datasets.
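The peeled representation itself is easy to illustrate numerically. Below is a toy NumPy sketch (not the authors' network, which predicts such maps with a GAN): for each pixel ray, it records both intersections with an analytic sphere as two peeled depth maps, so the second peel captures the self-occluded back surface. The camera setup and sphere are illustrative assumptions.

```python
import numpy as np

def peeled_depths_sphere(h, w, center=(0.0, 0.0, 3.0), radius=1.0, focal=1.0):
    """For each pixel ray, record the depths of BOTH intersections with a
    sphere (closed-form quadratic): the first peel is the visible surface,
    the second peel the self-occluded back surface."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    dirs = np.stack([xs, ys, np.full_like(xs, focal)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    c = np.asarray(center)
    # Ray p(t) = t * d hits the sphere when t^2 - 2 t (d.c) + |c|^2 - r^2 = 0.
    b = dirs @ c
    disc = b ** 2 - (c @ c - radius ** 2)
    hit = disc > 0
    root = np.sqrt(np.where(hit, disc, 0.0))
    near = np.where(hit, b - root, 0.0)   # peel 1: front surface
    far = np.where(hit, b + root, 0.0)    # peel 2: occluded back surface
    return near, far

d1, d2 = peeled_depths_sphere(64, 64)
assert d2.max() > d1.max()  # the second peel lies behind the first
```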
19. Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings [PDF] 返回目录
Shweta Mahajan, Iryna Gurevych, Stefan Roth
Abstract: Learned joint representations of images and text form the backbone of several important cross-domain tasks such as image captioning. Prior work mostly maps both domains into a common latent representation in a purely supervised fashion. This is rather restrictive, however, as the two domains follow distinct generative processes. Therefore, we propose a novel semi-supervised framework, which models shared information between domains and domain-specific information separately. The information shared between the domains is aligned with an invertible neural network. Our model integrates normalizing flow-based priors for the domain-specific information, which allows us to learn diverse many-to-many mappings between the two domains. We demonstrate the effectiveness of our model on diverse tasks, including image captioning and text-to-image synthesis.
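Normalizing-flow priors of the kind described here are typically built from invertible coupling layers. Below is a minimal PyTorch sketch of a RealNVP-style affine coupling layer with its exact log-determinant, shown as a generic building block rather than the authors' architecture; the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style coupling: transform half the dimensions conditioned on
    the other half; the Jacobian log-determinant is the sum of log-scales."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),   # emits scale and shift together
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                 # keep scales bounded for stability
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)           # exact contribution to log p(x)
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        s, t = self.net(y1).chunk(2, dim=-1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=-1)

layer = AffineCoupling(dim=64)
z, log_det = layer(torch.randn(8, 64))
```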
20. Block Annotation: Better Image Annotation for Semantic Segmentation with Sub-Image Decomposition [PDF] 返回目录
Hubert Lin, Paul Upchurch, Kavita Bala
Abstract: Image datasets with high-quality pixel-level annotations are valuable for semantic segmentation: labelling every pixel in an image ensures that rare classes and small objects are annotated. However, full-image annotations are expensive, with experts spending up to 90 minutes per image. We propose block sub-image annotation as a replacement for full-image annotation. Despite the attention cost of frequent task switching, we find that block annotations can be crowdsourced at higher quality than full-image annotations at equal monetary cost, using existing annotation tools developed for full-image annotation. Surprisingly, we find that annotating 50% of pixels with blocks allows semantic segmentation to achieve performance equivalent to annotating 100% of pixels. Furthermore, annotating as little as 12% of pixels yields up to 98% of the performance of dense annotation. In weakly-supervised settings, block annotation outperforms existing methods by 3-4% (absolute) given equivalent annotation time. To recover the global structure necessary for applications such as characterizing spatial context and affordance relationships, we propose an effective method to inpaint block-annotated images with high-quality labels without additional human effort. As such, fewer annotations can also be used for these applications compared to full-image annotation.
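As a rough illustration of the protocol, the NumPy sketch below marks random non-overlapping blocks of a label map for annotation until about half the pixels are covered, leaving the rest at an ignore index. The block size and ignore value are illustrative assumptions, not values from the paper.

```python
import numpy as np

def block_annotation_mask(h, w, block=32, coverage=0.5, seed=0):
    """Mark random non-overlapping blocks for annotation until roughly
    `coverage` of all pixels are covered."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((h, w), dtype=bool)
    cells = [(r, c) for r in range(0, h, block) for c in range(0, w, block)]
    rng.shuffle(cells)
    for r, c in cells:
        if mask.mean() >= coverage:
            break
        mask[r:r + block, c:c + block] = True
    return mask

dense = np.random.randint(0, 20, (512, 512))   # stand-in dense label map
m = block_annotation_mask(512, 512)
sparse = np.where(m, dense, 255)               # 255 = ignore index
print(f"annotated fraction: {m.mean():.2f}")
```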
21. CRL: Class Representative Learning for Image Classification [PDF] 返回目录
Mayanka Chandrashekar, Yugyung Lee
Abstract: Building robust, real-time classifiers from diverse datasets is one of the most significant challenges for deep learning researchers, because there is a considerable gap between a model built with training (seen) data and the real (unseen) data encountered in applications. Recent works, including Zero-Shot Learning (ZSL), have attempted to overcome this gap through transfer learning. In this paper, we propose a novel model influenced by ZSL, called the Class Representative Learning Model (CRL), that can be especially effective in image classification. In the CRL model, the learning step first builds class representatives that represent the classes in a dataset by aggregating prominent features extracted from a Convolutional Neural Network (CNN). Second, the inferencing step matches the class representatives against new data. The proposed CRL model demonstrated superior performance compared to the current state of the art in ZSL and mobile deep learning. It has been implemented and evaluated in a parallel environment, using Apache Spark, for both distributed learning and recognition. An extensive experimental study on the benchmark datasets ImageNet-1K, CalTech-101, CalTech-256, and CIFAR-100 shows that CRL can build a class distribution model with a drastic improvement in learning and recognition performance, without sacrificing accuracy relative to state-of-the-art image classification results.
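The two-step scheme (aggregate features into class representatives, then match new data against them) can be sketched in a few lines. The snippet below uses mean pooling and cosine similarity as plausible stand-ins; the paper's exact aggregation of "prominent" features may differ.

```python
import numpy as np

def build_representatives(features, labels):
    """Aggregate per-class CNN features into one representative vector per
    class (mean pooling here, as a stand-in for the paper's aggregation)."""
    classes = np.unique(labels)
    reps = np.stack([features[labels == c].mean(axis=0) for c in classes])
    reps /= np.linalg.norm(reps, axis=1, keepdims=True)
    return classes, reps

def classify(features, classes, reps):
    """Inference: assign each sample to the class whose representative is
    most similar under cosine similarity."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return classes[np.argmax(f @ reps.T, axis=1)]

X = np.random.randn(100, 512)            # toy 512-d "CNN" features
y = np.random.randint(0, 5, 100)
cls, R = build_representatives(X, y)
pred = classify(X, cls, R)
```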
22. Key Points Estimation and Point Instance Segmentation Approach for Lane Detection [PDF] 返回目录
Yeongmin Ko, Jiwon Jun, Donghwuy Ko, Moongu Jeon
Abstract: State-of-the-art lane detection methods achieve strong performance. Despite their advantages, these methods have critical deficiencies such as a limited number of detectable lanes and high false positive rates; in particular, high false positives can cause wrong and dangerous control. In this paper, we propose a novel deep learning-based lane detection method for an arbitrary number of lanes, which yields fewer false positives than other recent lane detection methods. The proposed architecture has shared feature extraction layers and several branches for detection and for embeddings used to cluster lanes. The method generates exact points on the lanes, and we cast clustering of the generated points as a point-cloud instance segmentation problem. The approach is also compact because it generates fewer points than the original image pixel count. Our proposed post-processing method successfully eliminates outliers and notably increases performance. The whole framework achieves competitive results on the tuSimple dataset.
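A minimal sketch of the clustering step, assuming the network outputs per-point embeddings: key points are greedily assigned to the nearest lane cluster in embedding space, or start a new lane. The greedy strategy and distance threshold are illustrative, not the paper's exact procedure.

```python
import numpy as np

def cluster_lane_points(points, embeddings, thresh=0.5):
    """Greedy clustering: a key point joins the nearest existing lane if its
    embedding is within `thresh` of the lane's running-mean embedding,
    otherwise it starts a new lane."""
    lanes, centers = [], []
    for p, e in zip(points, embeddings):
        dists = [np.linalg.norm(e - c) for c in centers]
        if dists and min(dists) < thresh:
            i = int(np.argmin(dists))
            lanes[i].append(p)
            n = len(lanes[i])
            centers[i] = centers[i] * (n - 1) / n + e / n  # running mean
        else:
            lanes.append([p])
            centers.append(e.astype(float))
    return lanes

pts = np.random.rand(50, 2)   # predicted (x, y) key points
emb = np.random.rand(50, 4)   # per-point embeddings from the network
print(f"{len(cluster_lane_points(pts, emb))} lanes found")
```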
23. Analytic Marching: An Analytic Meshing Solution from Deep Implicit Surface Networks [PDF] 返回目录
Jiabao Lei, Kui Jia
Abstract: This paper studies the problem of learning surface meshes via implicit functions in the emerging field of deep learning surface reconstruction, where implicit functions are popularly implemented as multi-layer perceptrons (MLPs) with rectified linear units (ReLU). To achieve meshing from learned implicit functions, existing methods adopt the de-facto standard algorithm of marching cubes; while promising, they suffer from loss of the precision learned in the MLPs, due to the discretization nature of marching cubes. Motivated by the knowledge that a ReLU-based MLP partitions its input space into a number of linear regions, we identify from these regions analytic cells and analytic faces that are associated with the zero-level isosurface of the implicit function, and characterize the theoretical conditions under which the identified analytic faces are guaranteed to connect and form a closed, piecewise planar surface. Based on our theorem, we propose a naturally parallelizable algorithm of analytic marching, which marches among analytic cells to exactly recover the mesh captured by a learned MLP. Experiments on deep learning mesh reconstruction verify the advantages of our algorithm over existing ones.
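The key fact exploited here, that a ReLU MLP is affine within each activation region, can be checked directly. The NumPy sketch below composes the layer weights through the activation pattern fixed at a query point, recovering the local affine map whose zero set is a planar face of the learned isosurface; the full analytic marching algorithm (cell enumeration and face connection) is beyond this sketch.

```python
import numpy as np

def local_affine(weights, biases, x):
    """A ReLU MLP is piecewise affine: on the linear region (analytic cell)
    containing x it equals f(z) = A @ z + c. Fixing the activation pattern
    at x and composing the layers recovers A and c exactly."""
    A = np.eye(len(x))
    c = np.zeros(len(x))
    a = np.asarray(x, dtype=float)
    for i, (W, b) in enumerate(zip(weights, biases)):
        pre = W @ a + b
        A, c = W @ A, W @ c + b
        if i < len(weights) - 1:               # ReLU on hidden layers only
            mask = (pre > 0).astype(float)
            a = pre * mask
            A, c = mask[:, None] * A, mask * c
    return A, c

# Tiny 3 -> 8 -> 1 implicit-surface MLP with random weights.
Ws = [np.random.randn(8, 3), np.random.randn(1, 8)]
bs = [np.random.randn(8), np.random.randn(1)]
A, c = local_affine(Ws, bs, np.zeros(3))
# On this analytic cell, the zero level set is the plane A @ z + c = 0,
# i.e. a candidate planar face of the reconstructed mesh.
```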
24. Automated Labelling using an Attention model for Radiology reports of MRI scans (ALARM) [PDF] 返回目录
David A. Wood, Jeremy Lynch, Sina Kafiabadi, Emily Guilhem, Aisha Al Busaidi, Antanas Montvila, Thomas Varsavsky, Juveria Siddiqui, Naveen Gadapa, Matthew Townend, Martin Kiik, Keena Patel, Gareth Barker, Sebastian Ourselin, James H. Cole, Thomas C. Booth
Abstract: Labelling large datasets for training high-capacity neural networks is a major obstacle to the development of deep learning-based medical imaging applications. Here we present a transformer-based network for magnetic resonance imaging (MRI) radiology report classification which automates this task by assigning image labels on the basis of free-text expert radiology reports. Our model's performance is comparable to that of an expert radiologist, and better than that of an expert physician, demonstrating the feasibility of this approach. We make code available online for researchers to label their own MRI datasets for medical imaging applications.
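For orientation, a minimal PyTorch sketch of a transformer encoder classifying tokenized report text is given below. The actual ALARM model fine-tunes a pretrained language model rather than training an encoder from scratch, so every size here is a placeholder.

```python
import torch
import torch.nn as nn

class ReportClassifier(nn.Module):
    """Transformer encoder over tokenized report text with mean pooling and
    a linear classification head."""
    def __init__(self, vocab_size, num_labels, d_model=256, nhead=8, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.head = nn.Linear(d_model, num_labels)

    def forward(self, token_ids):              # (batch, seq_len)
        h = self.encoder(self.embed(token_ids))
        return self.head(h.mean(dim=1))        # one logit per label

model = ReportClassifier(vocab_size=30000, num_labels=2)
logits = model(torch.randint(0, 30000, (4, 128)))
```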
25. Reinforced active learning for image segmentation [PDF] 返回目录
Arantxa Casanova, Pedro O. Pinheiro, Negar Rostamzadeh, Christopher J. Pal
Abstract: Learning-based approaches for semantic segmentation have two inherent challenges. First, acquiring pixel-wise labels is expensive and time-consuming. Second, realistic segmentation datasets are highly unbalanced: some categories are much more abundant than others, biasing the performance to the most represented ones. In this paper, we are interested in focusing human labelling effort on a small subset of a larger pool of data, minimizing this effort while maximizing performance of a segmentation model on a hold-out set. We present a new active learning strategy for semantic segmentation based on deep reinforcement learning (RL). An agent learns a policy to select a subset of small informative image regions -- as opposed to entire images -- to be labeled, from a pool of unlabeled data. The region selection decision is made based on predictions and uncertainties of the segmentation model being trained. Our method proposes a new modification of the deep Q-network (DQN) formulation for active learning, adapting it to the large-scale nature of semantic segmentation problems. We test the proof of concept on CamVid and provide results on the large-scale Cityscapes dataset. On Cityscapes, our deep RL region-based DQN approach requires roughly 30% less additional labeled data than our most competitive baseline to reach the same performance. Moreover, we find that our method asks for more labels of under-represented categories compared to the baselines, improving their performance and helping to mitigate class imbalance.
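A minimal sketch of the region-selection idea, assuming each candidate region is summarized by a feature vector derived from the segmentation model's predictions and uncertainties: a small Q-network scores regions and an epsilon-greedy rule picks which to annotate. The feature construction and the DQN training loop are omitted.

```python
import torch
import torch.nn as nn

class RegionQNet(nn.Module):
    """Q-network scoring candidate image regions for annotation."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                               nn.Linear(128, 1))

    def forward(self, region_feats):            # (num_regions, feat_dim)
        return self.q(region_feats).squeeze(-1)

def select_regions(qnet, region_feats, k=5, eps=0.1):
    """Epsilon-greedy selection of the k regions to send for labelling."""
    with torch.no_grad():
        q = qnet(region_feats)
    if torch.rand(()) < eps:                    # explore
        return torch.randperm(len(region_feats))[:k]
    return torch.topk(q, k).indices             # exploit

picks = select_regions(RegionQNet(), torch.randn(100, 64))
```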
26. Topological Mapping for Manhattan-like Repetitive Environments [PDF] 返回目录
Sai Shubodh Puligilla, Satyajit Tourani, Tushar Vaidya, Udit Singh Parihar, Ravi Kiran Sarvadevabhatla, K. Madhava Krishna
Abstract: We showcase a topological mapping framework for a challenging indoor warehouse setting. At the most abstract level, the warehouse is represented as a Topological Graph where the nodes of the graph represent a particular warehouse topological construct (e.g. rackspace, corridor) and the edges denote the existence of a path between two neighbouring nodes or topologies. At the intermediate level, the map is represented as a Manhattan Graph, where the nodes and edges are characterized by Manhattan properties, and at the lowest level of detail as a Pose Graph. The topological constructs are learned via a Deep Convolutional Network, while the relational properties between topological instances are learnt via a Siamese-style Neural Network. In the paper, we show that maintaining abstractions such as the Topological Graph and Manhattan Graph helps in recovering an accurate Pose Graph starting from a highly erroneous and unoptimized one. We show how this is achieved by embedding topological and Manhattan relations, as well as Manhattan Graph-aided loop closure relations, as constraints in the backend Pose Graph optimization framework. The recovery of near-ground-truth Pose Graphs on real-world indoor warehouse scenes vindicates the efficacy of the proposed framework.
27. Facial Attribute Capsules for Noise Face Super Resolution [PDF] 返回目录
Jingwei Xin, Nannan Wang, Xinrui Jiang, Jie Li, Xinbo Gao, Zhifeng Li
Abstract: Existing face super-resolution (SR) methods mainly assume the input image to be noise-free. Their performance degrades drastically when applied to real-world scenarios where the input image is always contaminated by noise. In this paper, we propose a Facial Attribute Capsules Network (FACN) to deal with the problem of high-scale super-resolution of noisy face images. A capsule is a group of neurons whose activity vector models different properties of the same entity. Inspired by the concept of capsules, we propose an integrated representation model of facial information, named the Facial Attribute Capsule (FAC). In SR processing, we first generate a group of FACs from the input LR face, and then reconstruct the HR face from this group of FACs. To effectively improve the robustness of FACs to noise, we generate them in semantic, probabilistic, and facial-attribute manners by means of an integrated learning strategy. Each FAC can be divided into two sub-capsules: a Semantic Capsule (SC) and a Probabilistic Capsule (PC). They describe an explicit facial attribute in detail from the two aspects of semantic representation and probability distribution. The group of FACs models an image as a combination of facial attribute information in the semantic space and probabilistic space in an attribute-disentangling way. The diverse FACs can better combine facial prior information to generate face images with fine-grained semantic attributes. Extensive benchmark experiments show that our method achieves superior hallucination results and outperforms the state of the art for very low resolution (LR) noisy face image super-resolution.
28. A Real-Time Deep Network for Crowd Counting [PDF] 返回目录
Xiaowen Shi, Xin Li, Caili Wu, Shuchen Kong, Jing Yang, Liang He
Abstract: Automatic analysis of highly crowded scenes has attracted extensive attention in computer vision research. Previous approaches for crowd counting have already achieved promising performance across various benchmarks. However, to deal with real situations, we want the model to run as fast as possible while maintaining accuracy. In this paper, we propose a compact convolutional neural network for crowd counting which learns a more efficient model with a small number of parameters. With three parallel filters executing the convolutional operation on the input image simultaneously at the front of the network, our model achieves nearly real-time speed and saves more computing resources. Experiments on two benchmarks show that our proposed method not only strikes a balance between performance and efficiency, making it more suitable for actual scenes, but is also faster than existing lightweight models.
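The parallel front end described above is straightforward to write down. The PyTorch sketch below runs three convolutions with different kernel sizes on the input image simultaneously and concatenates their outputs; the kernel sizes and channel counts are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ParallelFrontEnd(nn.Module):
    """Three convolutions with different receptive fields applied to the
    input image in parallel; their outputs are concatenated."""
    def __init__(self, out_ch=8):
        super().__init__()
        self.b3 = nn.Conv2d(3, out_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(3, out_ch, kernel_size=5, padding=2)
        self.b7 = nn.Conv2d(3, out_ch, kernel_size=7, padding=3)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(torch.cat([self.b3(x), self.b5(x), self.b7(x)], dim=1))

feats = ParallelFrontEnd()(torch.randn(1, 3, 384, 512))  # (1, 24, 384, 512)
```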
29. Face Recognition: Too Bias, or Not Too Bias? [PDF] 返回目录
Joseph P Robinson, Gennady Livitz, Yann Henon, Can Qin, Yun Fu, Samson Timoner
Abstract: We reveal critical insights into problems of bias in state-of-the-art facial recognition (FR) systems using a novel Balanced Faces In the Wild (BFW) dataset: data balanced for gender and ethnic groups. We show variations in the optimal scoring threshold for face-pairs across different subgroups. Thus, the conventional approach of learning a global threshold for all pairs results in performance gaps among subgroups. By learning subgroup-specific thresholds, we not only mitigate problems in performance gaps but also show a notable boost in overall performance. Furthermore, we conduct a human evaluation to measure the bias in humans, which supports the hypothesis that such a bias exists in human perception. For the BFW database, source code, and more, visit this http URL.
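The subgroup-specific thresholding idea can be sketched directly: estimate, per subgroup, the score threshold that holds a target false positive rate on impostor pairs, instead of one global threshold for all pairs. The subgroup tags and target rate below are hypothetical.

```python
import numpy as np

def subgroup_thresholds(scores, is_match, subgroup, target_fpr=1e-3):
    """For each subgroup, choose the similarity threshold at which impostor
    (non-match) pairs are accepted at `target_fpr`."""
    return {g: np.quantile(scores[(subgroup == g) & ~is_match], 1 - target_fpr)
            for g in np.unique(subgroup)}

# Toy face-pair data: similarity scores, match flags, subgroup tags.
s = np.random.rand(10000)
m = np.random.rand(10000) < 0.5
g = np.random.choice(["AF", "AM", "BF", "BM"], 10000)   # hypothetical tags
th = subgroup_thresholds(s, m, g)
accept = s >= np.vectorize(th.get)(g)  # each pair judged by its own threshold
```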
30. Learning to Group: A Bottom-Up Framework for 3D Part Discovery in Unseen Categories [PDF] 返回目录
Tiange Luo, Kaichun Mo, Zhiao Huang, Jiarui Xu, Siyu Hu, Liwei Wang, Hao Su
Abstract: We address the problem of discovering 3D parts for objects in unseen categories. Being able to learn the geometry prior of parts and transfer this prior to unseen categories poses fundamental challenges for data-driven shape segmentation approaches. Formulated as a contextual bandit problem, we propose a learning-based agglomerative clustering framework which learns a grouping policy to progressively group small part proposals into bigger ones in a bottom-up fashion. At the core of our approach is to restrict the local context for extracting part-level features, which encourages generalizability to unseen categories. On the large-scale fine-grained 3D part dataset PartNet, we demonstrate that our method can transfer knowledge of parts learned from 3 training categories to 21 unseen testing categories without seeing any annotated samples. Quantitative comparisons against four shape segmentation baselines show that our approach achieves state-of-the-art performance.
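A minimal sketch of bottom-up agglomerative grouping: repeatedly merge the pair of part proposals with the highest predicted merge score until no pair qualifies. The overlap-based `score` lambda below is a hypothetical stand-in for the learned grouping policy.

```python
import numpy as np

def agglomerate(parts, score_fn, thresh=0.3):
    """Merge the highest-scoring pair of part proposals until no pair
    scores above `thresh`."""
    parts = [set(p) for p in parts]
    while len(parts) > 1:
        best, pair = thresh, None
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                s = score_fn(parts[i], parts[j])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:
            break
        i, j = pair
        parts[i] |= parts.pop(j)
    return parts

# Hypothetical scorer: overlap ratio between point-index sets.
score = lambda a, b: len(a & b) / len(a | b)
print(agglomerate([{1, 2, 3}, {2, 3, 4}, {7, 8}], score))
# -> [{1, 2, 3, 4}, {7, 8}]
```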
31. A Multiple Decoder CNN for Inverse Consistent 3D Image Registration [PDF] 返回目录
Abdullah Nazib, Clinton Fookes, Olivier Salvado, Dimitri Perrin
Abstract: The recent application of deep learning technologies in medical image registration has exponentially decreased registration time and gradually increased registration accuracy compared to traditional counterparts. Most learning-based registration approaches consider this task as a one-directional problem; as a result, only the correspondence from the moving image to the target image is considered. However, some medical procedures require bidirectional registration. Unlike other learning-based registration methods, we propose a registration framework with inverse consistency. The proposed method simultaneously learns the forward and backward transformations in an unsupervised manner. We train and test the method on the publicly available LPBA40 MRI dataset and demonstrate stronger performance than baseline registration methods.
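One standard way to express inverse consistency is as a loss: composing the forward and backward displacement fields should give the identity. Below is a PyTorch sketch assuming 2D displacement fields in normalized grid coordinates (the paper works with 3D volumes, and its exact loss may differ).

```python
import torch
import torch.nn.functional as F

def warp(field, by):
    """Resample a displacement field at positions displaced by another field
    (2D, normalized coordinates in [-1, 1])."""
    b, _, h, w = field.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
    return F.grid_sample(field, grid + by.permute(0, 2, 3, 1),
                         align_corners=True)

def inverse_consistency_loss(flow_fwd, flow_bwd):
    """Composing forward and backward transforms should be the identity:
    v(x + u(x)) + u(x) ~ 0, and symmetrically for the backward field."""
    res_f = warp(flow_bwd, flow_fwd) + flow_fwd
    res_b = warp(flow_fwd, flow_bwd) + flow_bwd
    return res_f.pow(2).mean() + res_b.pow(2).mean()

# Identity transforms (zero flow) incur zero inverse-consistency loss.
loss = inverse_consistency_loss(torch.zeros(1, 2, 64, 64),
                                torch.zeros(1, 2, 64, 64))
```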
32. HighRes-net: Recursive Fusion for Multi-Frame Super-Resolution of Satellite Imagery [PDF] 返回目录
Michel Deudon, Alfredo Kalaitzis, Israel Goytom, Md Rifat Arefin, Zhichao Lin, Kris Sankaran, Vincent Michalski, Samira E. Kahou, Julien Cornebise, Yoshua Bengio
Abstract: Generative deep learning has sparked a new wave of Super-Resolution (SR) algorithms that enhance single images with impressive aesthetic results, albeit with imaginary details. Multi-frame Super-Resolution (MFSR) offers a more grounded approach to the ill-posed problem, by conditioning on multiple low-resolution views. This is important for satellite monitoring of human impact on the planet -- from deforestation, to human rights violations -- that depend on reliable imagery. To this end, we present HighRes-net, the first deep learning approach to MFSR that learns its sub-tasks in an end-to-end fashion: (i) co-registration, (ii) fusion, (iii) up-sampling, and (iv) registration-at-the-loss. Co-registration of low-resolution views is learned implicitly through a reference-frame channel, with no explicit registration mechanism. We learn a global fusion operator that is applied recursively on an arbitrary number of low-resolution pairs. We introduce a registered loss, by learning to align the SR output to a ground-truth through ShiftNet. We show that by learning deep representations of multiple views, we can super-resolve low-resolution signals and enhance Earth Observation data at scale. Our approach recently topped the European Space Agency's MFSR competition on real-world satellite imagery.
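The recursive fusion of an arbitrary number of low-resolution views can be sketched as repeated application of one shared pairwise fusion block until a single hidden state remains. The channel counts and the duplicate-last-view rule for odd counts below are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class RecursiveFusion(nn.Module):
    """Fuse any number of encoded low-resolution views by repeatedly applying
    one shared pairwise fusion operator."""
    def __init__(self, ch=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU(),
        )

    def forward(self, states):                # (batch, views, ch, h, w)
        while states.size(1) > 1:
            if states.size(1) % 2 == 1:       # duplicate last view if odd
                states = torch.cat([states, states[:, -1:]], dim=1)
            pairs = torch.cat([states[:, 0::2], states[:, 1::2]], dim=2)
            bsz, n, c, h, w = pairs.shape
            fused = self.fuse(pairs.reshape(bsz * n, c, h, w))
            states = fused.reshape(bsz, n, -1, h, w)
        return states[:, 0]                   # single fused representation

out = RecursiveFusion()(torch.randn(2, 7, 64, 32, 32))  # (2, 64, 32, 32)
```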
33. An End-to-End Framework for Unsupervised Pose Estimation of Occluded Pedestrians [PDF] 返回目录
Sudip Das, Perla Sai Raj Kishore, Ujjwal Bhattacharya
Abstract: Pose estimation in the wild is a challenging problem, particularly in situations of (i) occlusions of varying degrees and (ii) crowded outdoor scenes. Most existing studies of pose estimation did not report performance in such situations. Moreover, pose annotations for occluded parts of human figures have not been provided in any of the relevant standard datasets, which in turn creates further difficulties for studying pose estimation of the entire figure of occluded humans. Well-known pedestrian detection datasets such as CityPersons contain samples of outdoor scenes but do not include pose annotations. Here, we propose a novel multi-task framework for end-to-end training towards entire-pose estimation of pedestrians, including in situations of any kind of occlusion. To train the network, we make use of a pose estimation dataset, MS-COCO, and employ unsupervised adversarial instance-level domain adaptation for estimating the entire pose of occluded pedestrians. The experimental studies show that the proposed framework outperforms the SOTA results for pose estimation, instance segmentation and pedestrian detection in cases of heavy occlusions (HO) and reasonable + heavy occlusions (R + HO) on the two benchmark datasets.
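Unsupervised adversarial domain adaptation of this kind is commonly implemented with a gradient reversal layer between the feature extractor and a domain classifier. A standard PyTorch sketch follows, shown as a generic building block and not necessarily the authors' exact mechanism.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, negated (scaled) gradient on the
    backward pass: the domain classifier trains normally while the feature
    extractor is pushed toward domain-invariant features."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Features flow through unchanged; gradients from the domain loss flip sign.
feats = torch.randn(8, 256, requires_grad=True)
grad_reverse(feats).sum().backward()
print(feats.grad[0, 0])  # -1: reversed unit gradient
```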
34. Scale-Invariant Multi-Oriented Text Detection in Wild Scene Images [PDF] 返回目录
Kinjal Dasgupta, Sudip Das, Ujjwal Bhattacharya
Abstract: Automatic detection of scene texts in the wild is a challenging problem, particularly due to difficulties in handling (i) occlusions of varying percentages, (ii) widely different scales and orientations, and (iii) severe degradations in image quality. In this article, we propose a fully convolutional neural network architecture consisting of a novel Feature Representation Block (FRB) capable of efficient abstraction of information. The proposed network has been trained using curriculum learning with respect to the difficulty of image samples and gradual pixel-wise blurring. It is capable of detecting texts of different scales and orientations affected by blurring from multiple possible sources, non-uniform illumination, as well as partial occlusions of varying percentages. Text detection performance of the proposed framework on various benchmark sample databases, including ICDAR 2015, ICDAR 2017 MLT, COCO-Text and MSRA-TD500, significantly improves the respective state-of-the-art results. Source code of the proposed architecture will be made available on GitHub.
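One plausible reading of the blurring curriculum is sketched below, assuming "gradual pixel-wise blurring" means the amount of Gaussian blur applied to training samples grows with the epoch so that samples get harder over time; the schedule and maximum sigma are assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def curriculum_blur(image, epoch, max_epochs, max_sigma=3.0):
    """Gradual pixel-wise blurring: start training on sharp images and
    increase the Gaussian blur linearly with the epoch."""
    sigma = max_sigma * epoch / max(max_epochs - 1, 1)
    if sigma == 0:
        return image
    # Blur the two spatial axes only; channels stay independent.
    return gaussian_filter(image, sigma=(sigma, sigma, 0))

img = np.random.rand(480, 640, 3).astype(np.float32)
for epoch in range(5):
    blurred = curriculum_blur(img, epoch, max_epochs=5)
```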
35. Video Face Super-Resolution with Motion-Adaptive Feedback Cell [PDF] 返回目录
Jingwei Xin, Nannan Wang, Jie Li, Xinbo Gao, Zhifeng Li
Abstract: Video super-resolution (VSR) methods have recently achieved remarkable success due to the development of deep convolutional neural networks (CNN). Current state-of-the-art CNN methods usually treat the VSR problem as a large number of separate multi-frame super-resolution tasks, in which a batch of low resolution (LR) frames is utilized to generate a single high resolution (HR) frame, and running a sliding window to select LR frames over the entire video obtains a series of HR frames. However, due to the complex temporal dependency between frames, as the number of LR input frames increases, the quality of the reconstructed HR frames becomes worse. The reason is that these methods lack the ability to model complex temporal dependencies and struggle to give accurate motion estimation and compensation for the VSR process, which makes performance degrade drastically when the motion in frames is complex. In this paper, we propose a Motion-Adaptive Feedback Cell (MAFC), a simple but effective block, which can efficiently capture the motion compensation and feed it back to the network in an adaptive way. Our approach efficiently utilizes the information of inter-frame motion, and the dependence of the network on motion estimation and compensation methods can be avoided. In addition, benefiting from the excellent nature of the MAFC, the network can achieve better performance in the case of extremely complex motion scenarios. Extensive evaluations and comparisons validate the strengths of our approach, and the experimental results demonstrate that the proposed framework outperforms state-of-the-art methods.
36. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [PDF] 返回目录
Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Xilin Chen, Ming Zhou
Abstract: We propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training techniques for NLP and image-language tasks, VideoBERT and CBT were proposed to exploit the BERT model for video and language pre-training using narrated instructional videos. Different from their works, which only pre-train the understanding task, we propose a unified video-language pre-training model for both understanding and generation tasks. Our model comprises four components, including two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone. We first pre-train our model to learn the universal representation for both video and language on a large instructional video dataset. Then we fine-tune the model on two multimodal tasks, including an understanding task (text-based video retrieval) and a generation task (multimodal video captioning). Our extensive experiments show that our method can improve the performance of both understanding and generation tasks and achieves state-of-the-art results.
摘要:本文提出UniViLM:一个统一的视频和语言前的训练模式的多模态的理解和生成。受近期对NLP和图像语言任务,VideoBERT和CBT基于BERT前培训技术的成功激励提出了利用使用讲述了教学视频,视频和语言训练前BERT模式。从他们的作品只预列车任务的理解不同,我们提出了两种理解和生成任务统一视频语言前的训练模式。我们的模型包括4个组分,包括两个单峰编码器,一个编码器的交叉,并与变压器主链的解码器的。我们先预先训练我们的模型来学习的视频和语言上大量的教学视频数据集中的普遍代表性。然后,我们微调在两个多任务,其中包括理解任务(基于文本的视频检索)和生成任务(多视频字幕)模型。我们广泛的实验表明,我们的方法可以提高双方的理解和生成任务的性能和实现国家的艺术效果。
Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Xilin Chen, Ming Zhou
Abstract: We propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT based pre-training technique for NLP and image-language tasks, VideoBERT and CBT are proposed to exploit BERT model for video and language pre-training using narrated instructional videos. Different from their works which only pre-train understanding task, we propose a unified video-language pre-training model for both understanding and generation tasks. Our model comprises of 4 components including two single-modal encoders, a cross encoder and a decoder with the Transformer backbone. We first pre-train our model to learn the universal representation for both video and language on a large instructional video dataset. Then we fine-tune the model on two multimodal tasks including understanding task (text-based video retrieval) and generation task (multimodal video captioning). Our extensive experiments show that our method can improve the performance of both understanding and generation tasks and achieves the state-of-the art results.
摘要:本文提出UniViLM:一个统一的视频和语言前的训练模式的多模态的理解和生成。受近期对NLP和图像语言任务,VideoBERT和CBT基于BERT前培训技术的成功激励提出了利用使用讲述了教学视频,视频和语言训练前BERT模式。从他们的作品只预列车任务的理解不同,我们提出了两种理解和生成任务统一视频语言前的训练模式。我们的模型包括4个组分,包括两个单峰编码器,一个编码器的交叉,并与变压器主链的解码器的。我们先预先训练我们的模型来学习的视频和语言上大量的教学视频数据集中的普遍代表性。然后,我们微调在两个多任务,其中包括理解任务(基于文本的视频检索)和生成任务(多视频字幕)模型。我们广泛的实验表明,我们的方法可以提高双方的理解和生成任务的性能和实现国家的艺术效果。
37. Cell R-CNN V3: A Novel Panoptic Paradigm for Instance Segmentation in Biomedical Images [PDF] 返回目录
Dongnan Liu, Donghao Zhang, Yang Song, Heng Huang, Weidong Cai
Abstract: Instance segmentation is an important task for biomedical image analysis. Due to the complicated background components, the high variability of object appearances, numerous overlapping objects, and ambiguous object boundaries, this task still remains challenging. Recently, deep learning based methods have been widely employed to solve these problems and can be categorized into proposal-free and proposal-based methods. However, both proposal-free and proposal-based methods suffer from information loss, as they focus on either global-level semantic or local-level instance features. To tackle this issue, we present a panoptic architecture that unifies the semantic and instance features in this work. Specifically, our proposed method contains a residual attention feature fusion mechanism to incorporate the instance prediction with the semantic features, in order to facilitate the semantic contextual information learning in the instance branch. Then, a mask quality branch is designed to align the confidence score of each object with the quality of the mask prediction. Furthermore, a consistency regularization mechanism is designed between the semantic segmentation tasks in the semantic and instance branches, for the robust learning of both tasks. Extensive experiments demonstrate the effectiveness of our proposed method, which outperforms several state-of-the-art methods on various biomedical datasets.
摘要:实例分割是生物医学图像分析的一项重要任务。由于复杂的背景成分,对象出场,众多重叠对象,暧昧对象边界的高可变性,这一任务仍然具有挑战性。近日,深基础的学习方法已被广泛采用,以解决这些问题,可以分成提案,免费和建议的方法。然而,无论是免费的建议和提案为基础的方法,从信息遭受损失,因为他们专注于为全局级别的语义或地方级实例特性。为了解决这个问题,我们提出了一个全景架构相结合语义和实例的功能在此工作。具体来说,我们提出的方法中包含的剩余关注特征融合机制,结合实例预测与语义特征,以便于语义上下文信息的情况下分支学习。然后,掩模质量分支被设计来对准置信度得分的每个对象的与掩模预测的质量。此外,一致性正规化机制的语义分割任务之间在语义和实例分支机构设计的,对于这两项任务的强大的学习。大量的实验证明我们提出的方法,它优于国家的最先进的几种各种生物医学数据集方法的有效性。
Dongnan Liu, Donghao Zhang, Yang Song, Heng Huang, Weidong Cai
Abstract: Instance segmentation is an important task for biomedical image analysis. Due to the complicated background components, the high variability of object appearances, numerous overlapping objects, and ambiguous object boundaries, this task still remains challenging. Recently, deep learning based methods have been widely employed to solve these problems and can be categorized into proposal-free and proposal-based methods. However, both proposal-free and proposal-based methods suffer from information loss, as they focus on either global-level semantic or local-level instance features. To tackle this issue, we present a panoptic architecture that unifies the semantic and instance features in this work. Specifically, our proposed method contains a residual attention feature fusion mechanism to incorporate the instance prediction with the semantic features, in order to facilitate the semantic contextual information learning in the instance branch. Then, a mask quality branch is designed to align the confidence score of each object with the quality of the mask prediction. Furthermore, a consistency regularization mechanism is designed between the semantic segmentation tasks in the semantic and instance branches, for the robust learning of both tasks. Extensive experiments demonstrate the effectiveness of our proposed method, which outperforms several state-of-the-art methods on various biomedical datasets.
摘要:实例分割是生物医学图像分析的一项重要任务。由于复杂的背景成分,对象出场,众多重叠对象,暧昧对象边界的高可变性,这一任务仍然具有挑战性。近日,深基础的学习方法已被广泛采用,以解决这些问题,可以分成提案,免费和建议的方法。然而,无论是免费的建议和提案为基础的方法,从信息遭受损失,因为他们专注于为全局级别的语义或地方级实例特性。为了解决这个问题,我们提出了一个全景架构相结合语义和实例的功能在此工作。具体来说,我们提出的方法中包含的剩余关注特征融合机制,结合实例预测与语义特征,以便于语义上下文信息的情况下分支学习。然后,掩模质量分支被设计来对准置信度得分的每个对象的与掩模预测的质量。此外,一致性正规化机制的语义分割任务之间在语义和实例分支机构设计的,对于这两项任务的强大的学习。大量的实验证明我们提出的方法,它优于国家的最先进的几种各种生物医学数据集方法的有效性。
38. Recognizing Families In the Wild (RFIW): The 4th Edition [PDF] 返回目录
Joseph P. Robinson, Yu Yin, Zaid Khan, Ming Shao, Siyu Xia, Michael Stopa, Samson Timoner, Matthew A. Turk, Rama Chellappa, Yun Fu
Abstract: Recognizing Families In the Wild (RFIW): an annual large-scale, multi-track automatic kinship recognition evaluation that supports various visual kin-based problems on scales much higher than ever before. Organized in conjunction with the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG) as a Challenge, RFIW provides a platform for publishing original work and the gathering of experts for a discussion of the next steps. This paper summarizes the supported tasks (i.e., kinship verification, tri-subject verification, and search & retrieval of missing children) in the evaluation protocols, which include the practical motivation, technical background, data splits, metrics, and benchmark results. Furthermore, top submissions (i.e., leader-board stats) are listed and reviewed as a high-level analysis on the state of the problem. In the end, the purpose of this paper is to describe the 2020 RFIW challenge, end-to-end, along with forecasts in promising future directions.
摘要:鉴于家庭在野外(RFIW):每年的大规模,多跟踪自动识别血缘关系的评价,关于尺度支持各种基于视觉健的问题比以往任何时候都高。在第15届IEEE国际会议上的自动脸部和手势识别(FG)作为挑战共同举办,RFIW提供了发布的原创作品和专家进行的下一阶段的讨论,聚会的平台。本文总结了支持的任务(即血缘关系鉴定,三主题验证,并搜索和检索失踪儿童)在评估协议,其中包括实际的动机,技术背景,数据分片,度量和基准测试结果。此外,顶部的提交(即,前导板数据)被列出并审查作为对这个问题的状态的高层次分析。最后,本文的目的是描述2020年RFIW挑战,最终到年底,预期沿着充满希望的未来方向。
Joseph P. Robinson, Yu Yin, Zaid Khan, Ming Shao, Siyu Xia, Michael Stopa, Samson Timoner, Matthew A. Turk, Rama Chellappa, Yun Fu
Abstract: Recognizing Families In the Wild (RFIW): an annual large-scale, multi-track automatic kinship recognition evaluation that supports various visual kin-based problems on scales much higher than ever before. Organized in conjunction with the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG) as a Challenge, RFIW provides a platform for publishing original work and the gathering of experts for a discussion of the next steps. This paper summarizes the supported tasks (i.e., kinship verification, tri-subject verification, and search & retrieval of missing children) in the evaluation protocols, which include the practical motivation, technical background, data splits, metrics, and benchmark results. Furthermore, top submissions (i.e., leader-board stats) are listed and reviewed as a high-level analysis on the state of the problem. In the end, the purpose of this paper is to describe the 2020 RFIW challenge, end-to-end, along with forecasts in promising future directions.
摘要:鉴于家庭在野外(RFIW):每年的大规模,多跟踪自动识别血缘关系的评价,关于尺度支持各种基于视觉健的问题比以往任何时候都高。在第15届IEEE国际会议上的自动脸部和手势识别(FG)作为挑战共同举办,RFIW提供了发布的原创作品和专家进行的下一阶段的讨论,聚会的平台。本文总结了支持的任务(即血缘关系鉴定,三主题验证,并搜索和检索失踪儿童)在评估协议,其中包括实际的动机,技术背景,数据分片,度量和基准测试结果。此外,顶部的提交(即,前导板数据)被列出并审查作为对这个问题的状态的高层次分析。最后,本文的目的是描述2020年RFIW挑战,最终到年底,预期沿着充满希望的未来方向。
39. Historical Document Processing: Historical Document Processing: A Survey of Techniques, Tools, and Trends [PDF] 返回目录
James P. Philips, Nasseh Tabrizi
Abstract: Historical Document Processing is the process of digitizing written material from the past for future use by historians and other scholars. It incorporates algorithms and software tools from various subfields of computer science, including computer vision, document analysis and recognition, natural language processing, and machine learning, to convert images of ancient manuscripts, letters, diaries, and early printed texts automatically into a digital format usable in data mining and information retrieval systems. Within the past twenty years, as libraries, museums, and other cultural heritage institutions have scanned an increasing volume of their historical document archives, the need to transcribe the full text from these collections has become acute. Since Historical Document Processing encompasses multiple sub-domains of computer science, knowledge relevant to its purpose is scattered across numerous journals and conference proceedings. This paper surveys the major phases of, standard algorithms, tools, and datasets in the field of Historical Document Processing, discusses the results of a literature review, and finally suggests directions for further research.
摘要:历史文档处理是从过去的数字化的书面材料,为今后的历史学家和其他学者使用的过程。它采用的算法和计算机科学的不同子领域,包括计算机视觉,文档分析与识别,自然语言处理和机器学习软件工具,古代手稿,书信,日记,和早期的印刷文本的转换图像自动转换成数字格式可用在数据挖掘和信息检索系统。在过去的二十年里,图书馆,博物馆和其他文化遗产机构已经扫描他们的历史文献档案的数量不断增加,从这些藏品抄录全文的需要日益突出。由于历史文档处理包括计算机科学的多个子域,目的相关的知识分散在众多的期刊和会议论文。本文的调查,标准算法,工具和数据集的主要阶段在历史文档处理领域,讨论了文学审查的结果,最后提出了进一步研究的方向。
James P. Philips, Nasseh Tabrizi
Abstract: Historical Document Processing is the process of digitizing written material from the past for future use by historians and other scholars. It incorporates algorithms and software tools from various subfields of computer science, including computer vision, document analysis and recognition, natural language processing, and machine learning, to convert images of ancient manuscripts, letters, diaries, and early printed texts automatically into a digital format usable in data mining and information retrieval systems. Within the past twenty years, as libraries, museums, and other cultural heritage institutions have scanned an increasing volume of their historical document archives, the need to transcribe the full text from these collections has become acute. Since Historical Document Processing encompasses multiple sub-domains of computer science, knowledge relevant to its purpose is scattered across numerous journals and conference proceedings. This paper surveys the major phases of, standard algorithms, tools, and datasets in the field of Historical Document Processing, discusses the results of a literature review, and finally suggests directions for further research.
摘要:历史文档处理是从过去的数字化的书面材料,为今后的历史学家和其他学者使用的过程。它采用的算法和计算机科学的不同子领域,包括计算机视觉,文档分析与识别,自然语言处理和机器学习软件工具,古代手稿,书信,日记,和早期的印刷文本的转换图像自动转换成数字格式可用在数据挖掘和信息检索系统。在过去的二十年里,图书馆,博物馆和其他文化遗产机构已经扫描他们的历史文献档案的数量不断增加,从这些藏品抄录全文的需要日益突出。由于历史文档处理包括计算机科学的多个子域,目的相关的知识分散在众多的期刊和会议论文。本文的调查,标准算法,工具和数据集的主要阶段在历史文档处理领域,讨论了文学审查的结果,最后提出了进一步研究的方向。
40. Single Unit Status in Deep Convolutional Neural Network Codes for Face Identification: Sparseness Redefined [PDF] 返回目录
Connor J. Parde, Y. Ivette Colón, Matthew Q. Hill, Carlos D. Castillo, Prithviraj Dhar, Alice J. O'Toole
Abstract: Deep convolutional neural networks (DCNNs) trained for face identification develop representations that generalize over variable images, while retaining subject (e.g., gender) and image (e.g., viewpoint) information. Identity, gender, and viewpoint codes were studied at the "neural unit" and ensemble levels of a face-identification network. At the unit level, identification, gender classification, and viewpoint estimation were measured by deleting units to create variably-sized, randomly-sampled subspaces at the top network layer. Identification of 3,531 identities remained high (area under the ROC approximately 1.0) as dimensionality decreased from 512 units to 16 (0.95), 4 (0.80), and 2 (0.72) units. Individual identities separated statistically on every top-layer unit. Cross-unit responses were minimally correlated, indicating that units code non-redundant identity cues. This "distributed" code requires only a sparse, random sample of units to identify faces accurately. Gender classification declined gradually and viewpoint estimation fell steeply as dimensionality decreased. Individual units were weakly predictive of gender and viewpoint, but ensembles proved effective predictors. Therefore, distributed and sparse codes co-exist in the network units to represent different face attributes. At the ensemble level, principal component analysis of face representations showed that identity, gender, and viewpoint information separated into high-dimensional subspaces, ordered by explained variance. Identity, gender, and viewpoint information contributed to all individual unit responses, undercutting a neural tuning analogy for face attributes. Interpretation of neural-like codes from DCNNs, and by analogy, high-level visual codes, cannot be inferred from single unit responses. Instead, "meaning" is encoded by directions in the high-dimensional space.
摘要:训练面部识别深层卷积神经网络(DCNNs)开发推广了可变图像表示,同时保留主题(例如,性别)和图像(例如视点)的信息。身份,性别和视点码均在“神经单元”和面部识别网络的合奏水平的研究。在单元级别,识别,性别分类,和视点估计被删除单位创建可变尺寸的,顶部网络层中随机取样的子空间测量。的3531名的身份标识(该ROC约1.0下的面积)保持高维数从512个单位减少到16(0.95),4(0.80),和2(0.72)单元。个体身份统计学分离每顶层单元上。交叉单元应答最低限度相关,这表明单元代码非冗余身份线索。这个“分布式”码只需要一个稀疏的,单位随机样本准确识别的面。性别分类的逐步下降和观点估计急剧下跌维度下降。个别单位呈弱预测性别和观点,但合奏证明是有效的预测。因此,分布式和稀疏代码并存的网络单元来代表不同的面部属性。在合奏水平,面表示的主成分分析表明,身份,性别和视点信息分离成高维子空间,通过有序解释方差。身份,性别和视点信息促成所有个别单位的响应,削弱面部属性的神经调节比喻。的神经样从DCNNs码,依此类推,高层次视觉代码解释,不能从单个单元响应来推断。相反,“意思是”由在高维空间方向上进行编码。
Connor J. Parde, Y. Ivette Colón, Matthew Q. Hill, Carlos D. Castillo, Prithviraj Dhar, Alice J. O'Toole
Abstract: Deep convolutional neural networks (DCNNs) trained for face identification develop representations that generalize over variable images, while retaining subject (e.g., gender) and image (e.g., viewpoint) information. Identity, gender, and viewpoint codes were studied at the "neural unit" and ensemble levels of a face-identification network. At the unit level, identification, gender classification, and viewpoint estimation were measured by deleting units to create variably-sized, randomly-sampled subspaces at the top network layer. Identification of 3,531 identities remained high (area under the ROC approximately 1.0) as dimensionality decreased from 512 units to 16 (0.95), 4 (0.80), and 2 (0.72) units. Individual identities separated statistically on every top-layer unit. Cross-unit responses were minimally correlated, indicating that units code non-redundant identity cues. This "distributed" code requires only a sparse, random sample of units to identify faces accurately. Gender classification declined gradually and viewpoint estimation fell steeply as dimensionality decreased. Individual units were weakly predictive of gender and viewpoint, but ensembles proved effective predictors. Therefore, distributed and sparse codes co-exist in the network units to represent different face attributes. At the ensemble level, principal component analysis of face representations showed that identity, gender, and viewpoint information separated into high-dimensional subspaces, ordered by explained variance. Identity, gender, and viewpoint information contributed to all individual unit responses, undercutting a neural tuning analogy for face attributes. Interpretation of neural-like codes from DCNNs, and by analogy, high-level visual codes, cannot be inferred from single unit responses. Instead, "meaning" is encoded by directions in the high-dimensional space.
摘要:训练面部识别深层卷积神经网络(DCNNs)开发推广了可变图像表示,同时保留主题(例如,性别)和图像(例如视点)的信息。身份,性别和视点码均在“神经单元”和面部识别网络的合奏水平的研究。在单元级别,识别,性别分类,和视点估计被删除单位创建可变尺寸的,顶部网络层中随机取样的子空间测量。的3531名的身份标识(该ROC约1.0下的面积)保持高维数从512个单位减少到16(0.95),4(0.80),和2(0.72)单元。个体身份统计学分离每顶层单元上。交叉单元应答最低限度相关,这表明单元代码非冗余身份线索。这个“分布式”码只需要一个稀疏的,单位随机样本准确识别的面。性别分类的逐步下降和观点估计急剧下跌维度下降。个别单位呈弱预测性别和观点,但合奏证明是有效的预测。因此,分布式和稀疏代码并存的网络单元来代表不同的面部属性。在合奏水平,面表示的主成分分析表明,身份,性别和视点信息分离成高维子空间,通过有序解释方差。身份,性别和视点信息促成所有个别单位的响应,削弱面部属性的神经调节比喻。的神经样从DCNNs码,依此类推,高层次视觉代码解释,不能从单个单元响应来推断。相反,“意思是”由在高维空间方向上进行编码。
41. Layered Embeddings for Amodal Instance Segmentation [PDF] 返回目录
Yanfeng Liu, Eric Psota, Lance Pérez
Abstract: The proposed method extends upon the representational output of semantic instance segmentation by explicitly including both visible and occluded parts. A fully convolutional network is trained to produce consistent pixel-level embedding across two layers such that, when clustered, the results convey the full spatial extent and depth ordering of each instance. Results demonstrate that the network can accurately estimate complete masks in the presence of occlusion and outperform leading top-down bounding-box approaches. Source code available at this https URL
摘要:所提出的方法通过明确地包括可见和遮挡部件在语义实例分割的代表性输出延伸。一个完全卷积网络进行训练,以产生横跨两层一致的像素级嵌入,使得聚集的情况下,结果传达每个实例的完整空间范围和深度排序。结果表明,该网络能够精确地估计在存在遮挡时完整面具和跑赢领先自上而下边界框接近。在此可用的源代码HTTPS URL
Yanfeng Liu, Eric Psota, Lance Pérez
Abstract: The proposed method extends upon the representational output of semantic instance segmentation by explicitly including both visible and occluded parts. A fully convolutional network is trained to produce consistent pixel-level embedding across two layers such that, when clustered, the results convey the full spatial extent and depth ordering of each instance. Results demonstrate that the network can accurately estimate complete masks in the presence of occlusion and outperform leading top-down bounding-box approaches. Source code available at this https URL
摘要:所提出的方法通过明确地包括可见和遮挡部件在语义实例分割的代表性输出延伸。一个完全卷积网络进行训练,以产生横跨两层一致的像素级嵌入,使得聚集的情况下,结果传达每个实例的完整空间范围和深度排序。结果表明,该网络能够精确地估计在存在遮挡时完整面具和跑赢领先自上而下边界框接近。在此可用的源代码HTTPS URL
42. Why Do Line Drawings Work? A Realism Hypothesis [PDF] 返回目录
Aaron Hertzmann
Abstract: Why is it that we can recognize object identity and 3D shape from line drawings, even though they do not exist in the natural world? This paper hypothesizes that the human visual system perceives line drawings as if they were approximately realistic images. Moreover, the techniques of line drawing are chosen to accurately convey shape to a human observer. Several implications and variants of this hypothesis are explored.
摘要:为什么我们能够识别对象的身份和3D形状从线条图,即使他们没有在自然界存在吗?本文推测,人的视觉系统感知到排队附图,好像他们是约逼真的图像。此外,画线的技术被选择以准确传达形状对人类观察者。一些影响和这一假设的变种进行了探索。
Aaron Hertzmann
Abstract: Why is it that we can recognize object identity and 3D shape from line drawings, even though they do not exist in the natural world? This paper hypothesizes that the human visual system perceives line drawings as if they were approximately realistic images. Moreover, the techniques of line drawing are chosen to accurately convey shape to a human observer. Several implications and variants of this hypothesis are explored.
摘要:为什么我们能够识别对象的身份和3D形状从线条图,即使他们没有在自然界存在吗?本文推测,人的视觉系统感知到排队附图,好像他们是约逼真的图像。此外,画线的技术被选择以准确传达形状对人类观察者。一些影响和这一假设的变种进行了探索。
43. Social-WaGDAT: Interaction-aware Trajectory Prediction via Wasserstein Graph Double-Attention Network [PDF] 返回目录
Jiachen Li, Hengbo Ma, Zhihao Zhang, Masayoshi Tomizuka
Abstract: Effective understanding of the environment and accurate trajectory prediction of surrounding dynamic obstacles are indispensable for intelligent mobile systems (like autonomous vehicles and social robots) to achieve safe and high-quality planning when they navigate in highly interactive and crowded scenarios. Due to the existence of frequent interactions and uncertainty in the scene evolution, it is desired for the prediction system to enable relational reasoning on different entities and provide a distribution of future trajectories for each agent. In this paper, we propose a generic generative neural system (called Social-WaGDAT) for multi-agent trajectory prediction, which makes a step forward to explicit interaction modeling by incorporating relational inductive biases with a dynamic graph representation and leverages both trajectory and scene context information. We also employ an efficient kinematic constraint layer applied to vehicle trajectory prediction which not only ensures physical feasibility but also enhances model performance. The proposed system is evaluated on three public benchmark datasets for trajectory prediction, where the agents cover pedestrians, cyclists and on-road vehicles. The experimental results demonstrate that our model achieves better performance than various baseline approaches in terms of prediction accuracy.
摘要:围绕动态障碍物的环境和准确的轨迹预测的有效了解,是智能移动系统不可缺少的(如自动驾驶汽车和社交机器人),以实现安全和高品质的规划,当他们在高度互动导航和拥挤的场景。由于频繁的互动和场景变化存在不确定性,需要为预测系统,以使不同的实体关系的推理,并提供未来轨迹的每个代理的分布。在本文中,我们提出了一个通用的生成神经系统(称为社会WaGDAT)多剂轨迹预测,其通过将关系感性的偏见与动态图表示使得向前迈进了一步,以显式交互模型,并利用双方的轨迹和场景语境信息。我们还采用了适用于车辆轨迹预测不但可以确保物理可行性的高效运动约束层,而且提高了模型的性能。三个公共基准数据集的轨迹预测,其中药剂包括行人,骑自行车和在道路上的车辆所提出的系统进行评估。实验结果表明,我们的模型获得更好的性能比基线各种在预测准确度方面接近。
Jiachen Li, Hengbo Ma, Zhihao Zhang, Masayoshi Tomizuka
Abstract: Effective understanding of the environment and accurate trajectory prediction of surrounding dynamic obstacles are indispensable for intelligent mobile systems (like autonomous vehicles and social robots) to achieve safe and high-quality planning when they navigate in highly interactive and crowded scenarios. Due to the existence of frequent interactions and uncertainty in the scene evolution, it is desired for the prediction system to enable relational reasoning on different entities and provide a distribution of future trajectories for each agent. In this paper, we propose a generic generative neural system (called Social-WaGDAT) for multi-agent trajectory prediction, which makes a step forward to explicit interaction modeling by incorporating relational inductive biases with a dynamic graph representation and leverages both trajectory and scene context information. We also employ an efficient kinematic constraint layer applied to vehicle trajectory prediction which not only ensures physical feasibility but also enhances model performance. The proposed system is evaluated on three public benchmark datasets for trajectory prediction, where the agents cover pedestrians, cyclists and on-road vehicles. The experimental results demonstrate that our model achieves better performance than various baseline approaches in terms of prediction accuracy.
摘要:围绕动态障碍物的环境和准确的轨迹预测的有效了解,是智能移动系统不可缺少的(如自动驾驶汽车和社交机器人),以实现安全和高品质的规划,当他们在高度互动导航和拥挤的场景。由于频繁的互动和场景变化存在不确定性,需要为预测系统,以使不同的实体关系的推理,并提供未来轨迹的每个代理的分布。在本文中,我们提出了一个通用的生成神经系统(称为社会WaGDAT)多剂轨迹预测,其通过将关系感性的偏见与动态图表示使得向前迈进了一步,以显式交互模型,并利用双方的轨迹和场景语境信息。我们还采用了适用于车辆轨迹预测不但可以确保物理可行性的高效运动约束层,而且提高了模型的性能。三个公共基准数据集的轨迹预测,其中药剂包括行人,骑自行车和在道路上的车辆所提出的系统进行评估。实验结果表明,我们的模型获得更好的性能比基线各种在预测准确度方面接近。
44. Spectrum Translation for Cross-Spectral Ocular Matching [PDF] 返回目录
Kevin Hernandez Diaz, Fernando Alonso-Fernandez, Josef Bigun
Abstract: Cross-spectral verification remains a big issue in biometrics, especially for the ocular area due to differences in the reflected features in the images depending on the region and spectrum used. In this paper, we investigate the use of Conditional Adversarial Networks for spectrum translation between near infra-red and visual light images for ocular biometrics. We analyze the transformation based on the overall visual quality of the transformed images and the accuracy drop of the identification system when trained with opposing data. We use the PolyU database and propose two different systems for biometric verification, the first one based on Siamese Networks trained with Softmax and Cross-Entropy loss, and the second one a Triplet Loss network. We achieved an EER of 1\% when using a Triplet Loss network trained for NIR and finding the Euclidean distance between the real NIR images and the fake ones translated from the visible spectrum. We also outperform previous results using baseline algorithms.
摘要:互谱验证仍然是生物识别技术的一个大问题,尤其是对眼部区域由于在取决于所使用的区域和光谱图像的反射特性的差异。在本文中,我们探讨眼部生物识别近红外和可见光图像之间的频谱转换使用条件对抗性网络。我们分析基于变换图像的整体视觉质量,并且当与反对数据训练识别系统的准确度下降的转变。我们用理大数据库,并提出了生物特征识别两个不同的系统,第一个基于网络的连体与SOFTMAX和交叉熵损失的训练,而第二个三重损失的网络。我们使用训练NIR三重损失的网络,并找到真正的NIR图像和假的来自于可见光谱转换之间的欧氏距离时获得的1 \%的能效比。我们也优于使用基线算法以前的结果。
Kevin Hernandez Diaz, Fernando Alonso-Fernandez, Josef Bigun
Abstract: Cross-spectral verification remains a big issue in biometrics, especially for the ocular area due to differences in the reflected features in the images depending on the region and spectrum used. In this paper, we investigate the use of Conditional Adversarial Networks for spectrum translation between near infra-red and visual light images for ocular biometrics. We analyze the transformation based on the overall visual quality of the transformed images and the accuracy drop of the identification system when trained with opposing data. We use the PolyU database and propose two different systems for biometric verification, the first one based on Siamese Networks trained with Softmax and Cross-Entropy loss, and the second one a Triplet Loss network. We achieved an EER of 1\% when using a Triplet Loss network trained for NIR and finding the Euclidean distance between the real NIR images and the fake ones translated from the visible spectrum. We also outperform previous results using baseline algorithms.
摘要:互谱验证仍然是生物识别技术的一个大问题,尤其是对眼部区域由于在取决于所使用的区域和光谱图像的反射特性的差异。在本文中,我们探讨眼部生物识别近红外和可见光图像之间的频谱转换使用条件对抗性网络。我们分析基于变换图像的整体视觉质量,并且当与反对数据训练识别系统的准确度下降的转变。我们用理大数据库,并提出了生物特征识别两个不同的系统,第一个基于网络的连体与SOFTMAX和交叉熵损失的训练,而第二个三重损失的网络。我们使用训练NIR三重损失的网络,并找到真正的NIR图像和假的来自于可见光谱转换之间的欧氏距离时获得的1 \%的能效比。我们也优于使用基线算法以前的结果。
45. Seeing Around Corners with Edge-Resolved Transient Imaging [PDF] 返回目录
Joshua Rapp, Charles Saunders, Julián Tachella, John Murray-Bruce, Yoann Altmann, Jean-Yves Tourneret, Stephen McLaughlin, Robin M. A. Dawson, Franco N. C. Wong, Vivek K Goyal
Abstract: Non-line-of-sight (NLOS) imaging is a rapidly growing field seeking to form images of objects outside the field of view, with potential applications in search and rescue, reconnaissance, and even medical imaging. The critical challenge of NLOS imaging is that diffuse reflections scatter light in all directions, resulting in weak signals and a loss of directional information. To address this problem, we propose a method for seeing around corners that derives angular resolution from vertical edges and longitudinal resolution from the temporal response to a pulsed light source. We introduce an acquisition strategy, scene response model, and reconstruction algorithm that enable the formation of 2.5-dimensional representations -- a plan view plus heights -- and a 180$^{\circ}$ field of view (FOV) for large-scale scenes. Our experiments demonstrate accurate reconstructions of hidden rooms up to 3 meters in each dimension.
摘要:非视距视距(NLOS)成像是一个快速增长的领域寻求形成物体的图像的视野之外,在搜救,侦察,甚至是医疗成像应用潜力。 NLOS成像的关键的挑战是,漫反射在所有方向上散射光,从而导致弱信号和方向信息的损失。为了解决这个问题,我们提出了看到周围导出从垂直边缘,并从以一个脉冲光源的时间响应纵向分辨率的角分辨率的角的方法。我们引入一个采集策略,场景响应模型,和重建算法,使2.5维表示的形成 - 的平面图加高度 - 和180 $ ^ {\ CIRC}视场(FOV)的$字段适用于大大规模的场面。我们的实验证明隐藏客房精确重建达在每个维度3米。
Joshua Rapp, Charles Saunders, Julián Tachella, John Murray-Bruce, Yoann Altmann, Jean-Yves Tourneret, Stephen McLaughlin, Robin M. A. Dawson, Franco N. C. Wong, Vivek K Goyal
Abstract: Non-line-of-sight (NLOS) imaging is a rapidly growing field seeking to form images of objects outside the field of view, with potential applications in search and rescue, reconnaissance, and even medical imaging. The critical challenge of NLOS imaging is that diffuse reflections scatter light in all directions, resulting in weak signals and a loss of directional information. To address this problem, we propose a method for seeing around corners that derives angular resolution from vertical edges and longitudinal resolution from the temporal response to a pulsed light source. We introduce an acquisition strategy, scene response model, and reconstruction algorithm that enable the formation of 2.5-dimensional representations -- a plan view plus heights -- and a 180$^{\circ}$ field of view (FOV) for large-scale scenes. Our experiments demonstrate accurate reconstructions of hidden rooms up to 3 meters in each dimension.
摘要:非视距视距(NLOS)成像是一个快速增长的领域寻求形成物体的图像的视野之外,在搜救,侦察,甚至是医疗成像应用潜力。 NLOS成像的关键的挑战是,漫反射在所有方向上散射光,从而导致弱信号和方向信息的损失。为了解决这个问题,我们提出了看到周围导出从垂直边缘,并从以一个脉冲光源的时间响应纵向分辨率的角分辨率的角的方法。我们引入一个采集策略,场景响应模型,和重建算法,使2.5维表示的形成 - 的平面图加高度 - 和180 $ ^ {\ CIRC}视场(FOV)的$字段适用于大大规模的场面。我们的实验证明隐藏客房精确重建达在每个维度3米。
46. Query-Efficient Physical Hard-Label Attacks on Deep Learning Visual Classification [PDF] 返回目录
Ryan Feng, Jiefeng Chen, Nelson Manohar, Earlence Fernandes, Somesh Jha, Atul Prakash
Abstract: We present Survival-OPT, a physical adversarial example algorithm in the black-box hard-label setting where the attacker only has access to the model prediction class label. Assuming such limited access to the model is more relevant for settings such as proprietary cyber-physical and cloud systems than the whitebox setting assumed by prior work. By leveraging the properties of physical attacks, we create a novel approach based on the survivability of perturbations corresponding to physical transformations. Through simply querying the model for hard-label predictions, we optimize perturbations to survive in many different physical conditions and show that adversarial examples remain a security risk to cyber-physical systems (CPSs) even in the hard-label threat model. We show that Survival-OPT is query-efficient and robust: using fewer than 200K queries, we successfully attack a stop sign to be misclassified as a speed limit 30 km/hr sign in 98.5% of video frames in a drive-by setting. Survival-OPT also outperforms our baseline combination of existing hard-label and physical approaches, which required over 10x more queries for less robust results.
摘要:我们目前生存-OPT,在黑盒硬标签的设置下,攻击者只能访问模型预测类别标签物理对抗性的示例算法。该模型假定,例如有限的访问是用于设置,例如不是由以前的工作假定白盒设置专有网络物理和云系统更相关。通过利用物理攻击的特性,我们创建基于对应于物理转换扰动的生存能力的新方法。通过简单的查询硬标签的预测模型,我们优化扰动在许多不同的物理条件和显示的生存是对抗性的例子仍然是一个安全隐患,即使在硬标签威胁模型的网络物理系统(CPS的)。我们证明了生存-OPT是查询效率和稳健的:使用不到200K的查询,我们成功地攻击停止的迹象在驱动设置被错误归类为一个速度下限的30公里/小时的标志在视频帧的98.5%。生存-OPT也优于我们的基准现有的硬标签和物理的方法,这需要对不太可靠的结果超过10倍的查询组合。
Ryan Feng, Jiefeng Chen, Nelson Manohar, Earlence Fernandes, Somesh Jha, Atul Prakash
Abstract: We present Survival-OPT, a physical adversarial example algorithm in the black-box hard-label setting where the attacker only has access to the model prediction class label. Assuming such limited access to the model is more relevant for settings such as proprietary cyber-physical and cloud systems than the whitebox setting assumed by prior work. By leveraging the properties of physical attacks, we create a novel approach based on the survivability of perturbations corresponding to physical transformations. Through simply querying the model for hard-label predictions, we optimize perturbations to survive in many different physical conditions and show that adversarial examples remain a security risk to cyber-physical systems (CPSs) even in the hard-label threat model. We show that Survival-OPT is query-efficient and robust: using fewer than 200K queries, we successfully attack a stop sign to be misclassified as a speed limit 30 km/hr sign in 98.5% of video frames in a drive-by setting. Survival-OPT also outperforms our baseline combination of existing hard-label and physical approaches, which required over 10x more queries for less robust results.
摘要:我们目前生存-OPT,在黑盒硬标签的设置下,攻击者只能访问模型预测类别标签物理对抗性的示例算法。该模型假定,例如有限的访问是用于设置,例如不是由以前的工作假定白盒设置专有网络物理和云系统更相关。通过利用物理攻击的特性,我们创建基于对应于物理转换扰动的生存能力的新方法。通过简单的查询硬标签的预测模型,我们优化扰动在许多不同的物理条件和显示的生存是对抗性的例子仍然是一个安全隐患,即使在硬标签威胁模型的网络物理系统(CPS的)。我们证明了生存-OPT是查询效率和稳健的:使用不到200K的查询,我们成功地攻击停止的迹象在驱动设置被错误归类为一个速度下限的30公里/小时的标志在视频帧的98.5%。生存-OPT也优于我们的基准现有的硬标签和物理的方法,这需要对不太可靠的结果超过10倍的查询组合。
47. PCSGAN: Perceptual Cyclic-Synthesized Generative Adversarial Networks for Thermal and NIR to Visible Image Transformation [PDF] 返回目录
Kancharagunta Kishan Babu, Shiv Ram Dubey
Abstract: In many real world scenarios, it is difficult to capture the images in the visible light spectrum (VIS) due to bad lighting conditions. However, the images can be captured in such scenarios using Near-Infrared (NIR) and Thermal (THM) cameras. The NIR and THM images contain the limited details. Thus, there is a need to transform the images from THM/NIR to VIS for better understanding. However, it is non-trivial task due to the large domain discrepancies and lack of abundant datasets. Nowadays, Generative Adversarial Network (GAN) is able to transform the images from one domain to another domain. Most of the available GAN based methods use the combination of the adversarial and the pixel-wise losses (like L1 or L2) as the objective function for training. The quality of transformed images in case of THM/NIR to VIS transformation is still not up to the mark using such objective function. Thus, better objective functions are needed to improve the quality, fine details and realism of the transformed images. A new model for THM/NIR to VIS image transformation called Perceptual Cyclic-Synthesized Generative Adversarial Network (PCSGAN) is introduced to address these issues. The PCSGAN uses the combination of the perceptual (i.e., feature based) losses along with the pixel-wise and the adversarial losses. Both the quantitative and qualitative measures are used to judge the performance of the PCSGAN model over the WHU-IIP face and the RGB-NIR scene datasets. The proposed PCSGAN outperforms the state-of-the-art image transformation models, including Pix2pix, DualGAN, CycleGAN, PS2GAN, and PAN in terms of the SSIM, MSE, PSNR and LPIPS evaluation measures. The code is available at: \url{this https URL}.
摘要:在许多现实世界的场景,很难捕捉可见光光谱(VIS)的图像由于恶劣的照明条件。然而,该图像可在使用近红外(NIR)和热(THM)摄像机这样的场景被捕获。该NIR和THM图像包含有限的详细信息。因此,有必要从THM / NIR到VIS变换图像更好地理解。然而,这是不平凡的任务,由于大域差异和缺乏丰富的数据集。如今,剖成对抗性网络(GAN)是能够从一个域变换图像到另一个域。大多数可用的GaN系的方法使用对抗和逐像素损失(如L1或L2)作为用于训练的目标函数的组合。变换图像在THM / NIR到VIS转化的情况下的质量仍然没有达到使用这样的目标函数的标志。因此,需要更好的目标函数,以提高质量,精致的细节和逼真的变换图像。对于THM / NIR新模式,VIS图像变换称为感知循环合成剖成对抗性网络(PCSGAN)引入来解决这些问题。所述PCSGAN使用感知的组合(即,特征为基础)与逐像素和对抗损失沿损失。无论是定量和定性指标来判断PCSGAN模型在WHU-IIP面和RGB-NIR现场数据集的性能。所提出的PCSGAN优于国家的最先进的图像变换模型,包括Pix2pix,DualGAN,CycleGAN,PS2GAN和PAN在SSIM,MSE,PSNR的条款和LPIPS评估措施。该代码可在:\ {URL这HTTPS URL}。
Kancharagunta Kishan Babu, Shiv Ram Dubey
Abstract: In many real world scenarios, it is difficult to capture the images in the visible light spectrum (VIS) due to bad lighting conditions. However, the images can be captured in such scenarios using Near-Infrared (NIR) and Thermal (THM) cameras. The NIR and THM images contain the limited details. Thus, there is a need to transform the images from THM/NIR to VIS for better understanding. However, it is non-trivial task due to the large domain discrepancies and lack of abundant datasets. Nowadays, Generative Adversarial Network (GAN) is able to transform the images from one domain to another domain. Most of the available GAN based methods use the combination of the adversarial and the pixel-wise losses (like L1 or L2) as the objective function for training. The quality of transformed images in case of THM/NIR to VIS transformation is still not up to the mark using such objective function. Thus, better objective functions are needed to improve the quality, fine details and realism of the transformed images. A new model for THM/NIR to VIS image transformation called Perceptual Cyclic-Synthesized Generative Adversarial Network (PCSGAN) is introduced to address these issues. The PCSGAN uses the combination of the perceptual (i.e., feature based) losses along with the pixel-wise and the adversarial losses. Both the quantitative and qualitative measures are used to judge the performance of the PCSGAN model over the WHU-IIP face and the RGB-NIR scene datasets. The proposed PCSGAN outperforms the state-of-the-art image transformation models, including Pix2pix, DualGAN, CycleGAN, PS2GAN, and PAN in terms of the SSIM, MSE, PSNR and LPIPS evaluation measures. The code is available at: \url{this https URL}.
摘要:在许多现实世界的场景,很难捕捉可见光光谱(VIS)的图像由于恶劣的照明条件。然而,该图像可在使用近红外(NIR)和热(THM)摄像机这样的场景被捕获。该NIR和THM图像包含有限的详细信息。因此,有必要从THM / NIR到VIS变换图像更好地理解。然而,这是不平凡的任务,由于大域差异和缺乏丰富的数据集。如今,剖成对抗性网络(GAN)是能够从一个域变换图像到另一个域。大多数可用的GaN系的方法使用对抗和逐像素损失(如L1或L2)作为用于训练的目标函数的组合。变换图像在THM / NIR到VIS转化的情况下的质量仍然没有达到使用这样的目标函数的标志。因此,需要更好的目标函数,以提高质量,精致的细节和逼真的变换图像。对于THM / NIR新模式,VIS图像变换称为感知循环合成剖成对抗性网络(PCSGAN)引入来解决这些问题。所述PCSGAN使用感知的组合(即,特征为基础)与逐像素和对抗损失沿损失。无论是定量和定性指标来判断PCSGAN模型在WHU-IIP面和RGB-NIR现场数据集的性能。所提出的PCSGAN优于国家的最先进的图像变换模型,包括Pix2pix,DualGAN,CycleGAN,PS2GAN和PAN在SSIM,MSE,PSNR的条款和LPIPS评估措施。该代码可在:\ {URL这HTTPS URL}。
48. Large-scale biometry with interpretable neural network regression on UK Biobank body MRI [PDF] 返回目录
Taro Langner, Håkan Ahlström, Joel Kullberg
Abstract: The UK Biobank study has successfully imaged more than 32,000 volunteer participants with neck-to-knee body MRI. Each scan is linked to extensive metadata, providing a comprehensive survey of imaged anatomy and related health states. Despite its potential for research, this vast amount of data presents a challenge to established methods of evaluation, which often rely on manual input. To date, the range of reference values for cardiovascular and metabolic risk factors is therefore incomplete. In this work, neural networks were trained for regression to infer various biological metrics from the neck-to-knee body MRI automatically. The approach requires no manual intervention or ground truth segmentations for training. The examined fields span 64 variables derived from anthropometric measurements, dual-energy X-ray absorptiometry (DXA), atlas-based segmentations, and dedicated liver scans. The standardized framework achieved a close fit to the target values (median R^2 > 0.97) in 7-fold cross-validation with the ResNet50. Interpretation of aggregated saliency maps suggests that the network correctly targets specific body regions and limbs, and learned to emulate different modalities. On several body composition metrics, the quality of the predictions is within the range of variability observed between established gold standard techniques.
摘要:英国生物库研究已经成功地成像与颈部到膝盖的人体MRI 32000名多名志愿者参加。每个扫描链接到大量的元数据,提供成像解剖学及相关健康状况的全面调查。尽管其研究潜力,这个庞大的数据量的礼物,以评估建立的方法,这往往依靠手工输入一个挑战。迄今为止,用于治疗心血管和代谢风险因素的参考值的范围,因此是不完整的。在这项工作中,神经网络是从颈部到膝盖的身体MRI自动训练回归推断各种生物指标。该方法需要对培训没有人工干预或地面实况分割。所检查的领域跨越从人体测量,双能X射线吸收法(DXA),基于图谱-分割,和专用肝脏扫描得到64个变量。标准化框架实现了紧密配合到所述目标的值(中值R ^ 2> 0.97)在7倍交叉验证与ResNet50。聚集的显着性的解释映射表明,网络正确针对特定的身体部位和四肢,并学会了模仿不同的方式。在几个身体组成指标,预测的质量是建立金标准技术之间观察到的变异性的范围内。
Taro Langner, Håkan Ahlström, Joel Kullberg
Abstract: The UK Biobank study has successfully imaged more than 32,000 volunteer participants with neck-to-knee body MRI. Each scan is linked to extensive metadata, providing a comprehensive survey of imaged anatomy and related health states. Despite its potential for research, this vast amount of data presents a challenge to established methods of evaluation, which often rely on manual input. To date, the range of reference values for cardiovascular and metabolic risk factors is therefore incomplete. In this work, neural networks were trained for regression to infer various biological metrics from the neck-to-knee body MRI automatically. The approach requires no manual intervention or ground truth segmentations for training. The examined fields span 64 variables derived from anthropometric measurements, dual-energy X-ray absorptiometry (DXA), atlas-based segmentations, and dedicated liver scans. The standardized framework achieved a close fit to the target values (median R^2 > 0.97) in 7-fold cross-validation with the ResNet50. Interpretation of aggregated saliency maps suggests that the network correctly targets specific body regions and limbs, and learned to emulate different modalities. On several body composition metrics, the quality of the predictions is within the range of variability observed between established gold standard techniques.
摘要:英国生物库研究已经成功地成像与颈部到膝盖的人体MRI 32000名多名志愿者参加。每个扫描链接到大量的元数据,提供成像解剖学及相关健康状况的全面调查。尽管其研究潜力,这个庞大的数据量的礼物,以评估建立的方法,这往往依靠手工输入一个挑战。迄今为止,用于治疗心血管和代谢风险因素的参考值的范围,因此是不完整的。在这项工作中,神经网络是从颈部到膝盖的身体MRI自动训练回归推断各种生物指标。该方法需要对培训没有人工干预或地面实况分割。所检查的领域跨越从人体测量,双能X射线吸收法(DXA),基于图谱-分割,和专用肝脏扫描得到64个变量。标准化框架实现了紧密配合到所述目标的值(中值R ^ 2> 0.97)在7倍交叉验证与ResNet50。聚集的显着性的解释映射表明,网络正确针对特定的身体部位和四肢,并学会了模仿不同的方式。在几个身体组成指标,预测的质量是建立金标准技术之间观察到的变异性的范围内。
49. Class-Imbalanced Semi-Supervised Learning [PDF] 返回目录
Minsung Hyun, Jisoo Jeong, Nojun Kwak
Abstract: Semi-Supervised Learning (SSL) has achieved great success in overcoming the difficulties of labeling and making full use of unlabeled data. However, SSL has a limited assumption that the numbers of samples in different classes are balanced, and many SSL algorithms show lower performance for the datasets with the imbalanced class distribution. In this paper, we introduce a task of class-imbalanced semi-supervised learning (CISSL), which refers to semi-supervised learning with class-imbalanced data. In doing so, we consider class imbalance in both labeled and unlabeled sets. First, we analyze existing SSL methods in imbalanced environments and examine how the class imbalance affects SSL methods. Then we propose Suppressed Consistency Loss (SCL), a regularization method robust to class imbalance. Our method shows better performance than the conventional methods in the CISSL environment. In particular, the more severe the class imbalance and the smaller the size of the labeled data, the better our method performs.
摘要:半监督学习(SSL)在克服标签的困难,充分利用未标记数据的取得了巨大成功。然而,SSL具有有限的假设,在不同类型的样品的数量是平衡的,许多SSL算法显示与不平衡类分布的数据集较低的性能。在本文中,我们介绍了类不平衡半监督学习(CISSL),它指的是半监督学习与类不平衡数据的任务。在此过程中,我们考虑在这两个标记和未标记套类不平衡。首先,我们分析了不平衡的环境中现有的SSL方法和检验类失衡是如何影响SSL的方法。然后,我们建议禁止一致性损失(SCL),正则化方法稳健类不平衡。我们的方法显示出比CISSL环境的传统方法更好的性能。特别地,更严重的类不平衡和标记数据的大小越小越好我们的方法进行。
Minsung Hyun, Jisoo Jeong, Nojun Kwak
Abstract: Semi-Supervised Learning (SSL) has achieved great success in overcoming the difficulties of labeling and making full use of unlabeled data. However, SSL has a limited assumption that the numbers of samples in different classes are balanced, and many SSL algorithms show lower performance for the datasets with the imbalanced class distribution. In this paper, we introduce a task of class-imbalanced semi-supervised learning (CISSL), which refers to semi-supervised learning with class-imbalanced data. In doing so, we consider class imbalance in both labeled and unlabeled sets. First, we analyze existing SSL methods in imbalanced environments and examine how the class imbalance affects SSL methods. Then we propose Suppressed Consistency Loss (SCL), a regularization method robust to class imbalance. Our method shows better performance than the conventional methods in the CISSL environment. In particular, the more severe the class imbalance and the smaller the size of the labeled data, the better our method performs.
摘要:半监督学习(SSL)在克服标签的困难,充分利用未标记数据的取得了巨大成功。然而,SSL具有有限的假设,在不同类型的样品的数量是平衡的,许多SSL算法显示与不平衡类分布的数据集较低的性能。在本文中,我们介绍了类不平衡半监督学习(CISSL),它指的是半监督学习与类不平衡数据的任务。在此过程中,我们考虑在这两个标记和未标记套类不平衡。首先,我们分析了不平衡的环境中现有的SSL方法和检验类失衡是如何影响SSL的方法。然后,我们建议禁止一致性损失(SCL),正则化方法稳健类不平衡。我们的方法显示出比CISSL环境的传统方法更好的性能。特别地,更严重的类不平衡和标记数据的大小越小越好我们的方法进行。
50. Reinforcement learning for the manipulation of eye tracking data [PDF] 返回目录
Wolfgang Fuhl
Abstract: In this paper, we present an approach based on reinforcement learning for eye tracking data manipulation. It is based on two opposing agents, where one tries to classify the data correctly and the second agent looks for patterns in the data, which get manipulated to hide specific information. We show that our approach is successfully applicable to preserve the privacy of a subject. In addition, our approach allows to evaluate the importance of temporal, as well as spatial, information of eye tracking data for specific classification goals. In general, this approach can also be used for stimuli manipulation, making it interesting for gaze guidance. For this purpose, this work provides the theoretical basis, which is why we have also integrated a section on how to apply this method for gaze guidance.
摘要:本文提出了一种基于强化学习的眼动追踪数据操作的方法。它是基于两个对立的代理,其中一个尝试将数据正确分类和第二剂会在数据当中去操纵隐藏特定信息的模式。我们表明,我们的做法是成功适用于保护受试者的隐私。此外,我们的方法可以评估的具体分类目标眼动追踪数据的时间,以及空间,信息的重要性。在一般情况下,这种方法也可以用于刺激操纵,使得它有趣的凝视指导。为此,这项工作提供了理论依据,这就是为什么我们还集成了对如何申请视线引导这种方法的部分。
Wolfgang Fuhl
Abstract: In this paper, we present an approach based on reinforcement learning for eye tracking data manipulation. It is based on two opposing agents, where one tries to classify the data correctly and the second agent looks for patterns in the data, which get manipulated to hide specific information. We show that our approach is successfully applicable to preserve the privacy of a subject. In addition, our approach allows to evaluate the importance of temporal, as well as spatial, information of eye tracking data for specific classification goals. In general, this approach can also be used for stimuli manipulation, making it interesting for gaze guidance. For this purpose, this work provides the theoretical basis, which is why we have also integrated a section on how to apply this method for gaze guidance.
摘要:本文提出了一种基于强化学习的眼动追踪数据操作的方法。它是基于两个对立的代理,其中一个尝试将数据正确分类和第二剂会在数据当中去操纵隐藏特定信息的模式。我们表明,我们的做法是成功适用于保护受试者的隐私。此外,我们的方法可以评估的具体分类目标眼动追踪数据的时间,以及空间,信息的重要性。在一般情况下,这种方法也可以用于刺激操纵,使得它有趣的凝视指导。为此,这项工作提供了理论依据,这就是为什么我们还集成了对如何申请视线引导这种方法的部分。
51. Unraveling Meta-Learning: Understanding Feature Representations for Few-Shot Tasks [PDF] 返回目录
Micah Goldblum, Steven Reich, Liam Fowl, Renkun Ni, Valeriia Cherepanova, Tom Goldstein
Abstract: Meta-learning algorithms produce feature extractors which achieve state-of-the-art performance on few-shot classification. While the literature is rich with meta-learning methods, little is known about why the resulting feature extractors perform so well. We develop a better understanding of the underlying mechanics of meta-learning and the difference between models trained using meta-learning and models which are trained classically. In doing so, we develop several hypotheses for why meta-learned models perform better. In addition to visualizations, we design several regularizers inspired by our hypotheses which improve performance on few-shot classification.
摘要:元学习算法产生其实现上为数不多的镜头分类的国家的最先进的性能特征提取。虽然文学是丰富的元学习方法,鲜为人知的是,为什么得到的特征提取如此上佳表现。我们更好地了解元学习和模型之间的差异的底层机制的使用已训练元学习这些经典训练模式。在此过程中,我们开发了为什么元学模型进行更好的几种假说。除了可视化,我们设计我们假设这改善为数不多的镜头分类性能的启发几个regularizers。
Micah Goldblum, Steven Reich, Liam Fowl, Renkun Ni, Valeriia Cherepanova, Tom Goldstein
Abstract: Meta-learning algorithms produce feature extractors which achieve state-of-the-art performance on few-shot classification. While the literature is rich with meta-learning methods, little is known about why the resulting feature extractors perform so well. We develop a better understanding of the underlying mechanics of meta-learning and the difference between models trained using meta-learning and models which are trained classically. In doing so, we develop several hypotheses for why meta-learned models perform better. In addition to visualizations, we design several regularizers inspired by our hypotheses which improve performance on few-shot classification.
摘要:元学习算法产生其实现上为数不多的镜头分类的国家的最先进的性能特征提取。虽然文学是丰富的元学习方法,鲜为人知的是,为什么得到的特征提取如此上佳表现。我们更好地了解元学习和模型之间的差异的底层机制的使用已训练元学习这些经典训练模式。在此过程中,我们开发了为什么元学模型进行更好的几种假说。除了可视化,我们设计我们假设这改善为数不多的镜头分类性能的启发几个regularizers。
52. Gaussian Smoothen Semantic Features (GSSF) -- Exploring the Linguistic Aspects of Visual Captioning in Indian Languages (Bengali) Using MSCOCO Framework [PDF] 返回目录
Chiranjib Sur
Abstract: In this work, we have introduced Gaussian Smoothen Semantic Features (GSSF) for Better Semantic Selection for Indian regional language-based image captioning and introduced a procedure where we used the existing translation and English crowd-sourced sentences for training. We have shown that this architecture is a promising alternative source, where there is a crunch in resources. Our main contribution of this work is the development of deep learning architectures for the Bengali language (is the fifth widely spoken language in the world) with a completely different grammar and language attributes. We have shown that these are working well for complex applications like language generation from image contexts and can diversify the representation through introducing constraints, more extensive features, and unique feature spaces. We also established that we could achieve absolute precision and diversity when we use smoothened semantic tensor with the traditional LSTM and feature decomposition networks. With better learning architecture, we succeeded in establishing an automated algorithm and assessment procedure that can help in the evaluation of competent applications without the requirement for expertise and human intervention.
摘要:在这项工作中,我们已经介绍了高斯平滑语义特征(GSSF)为更好的语义选择对印度地方语言基础的图像和字幕介绍我们用于训练现有的翻译和英语人群来源的句子的过程。我们已经证明这种体系结构是一个有前途的替代来源,那里是在资源危机。我们这项工作的主要贡献是为孟加拉语深度学习架构(是世界第五广泛的语言)使用完全不同的语法和语言特性的发展。我们已经表明,这些从图像背景复杂的应用程序,如语言生成运作良好,可以通过引入约束,更丰富的功能,以及独特的功能空间多样化的表现。我们还建立了,当我们使用平滑的语义张量与传统LSTM和功能分解的网络,我们可以实现绝对精度和多样性。有了更好的学习结构,我们成功地建立自动算法和评估程序,在主管应用的评价可以帮助没有专业知识和人为干预的要求。
Chiranjib Sur
Abstract: In this work, we have introduced Gaussian Smoothen Semantic Features (GSSF) for Better Semantic Selection for Indian regional language-based image captioning and introduced a procedure where we used the existing translation and English crowd-sourced sentences for training. We have shown that this architecture is a promising alternative source, where there is a crunch in resources. Our main contribution of this work is the development of deep learning architectures for the Bengali language (is the fifth widely spoken language in the world) with a completely different grammar and language attributes. We have shown that these are working well for complex applications like language generation from image contexts and can diversify the representation through introducing constraints, more extensive features, and unique feature spaces. We also established that we could achieve absolute precision and diversity when we use smoothened semantic tensor with the traditional LSTM and feature decomposition networks. With better learning architecture, we succeeded in establishing an automated algorithm and assessment procedure that can help in the evaluation of competent applications without the requirement for expertise and human intervention.
摘要:在这项工作中,我们已经介绍了高斯平滑语义特征(GSSF)为更好的语义选择对印度地方语言基础的图像和字幕介绍我们用于训练现有的翻译和英语人群来源的句子的过程。我们已经证明这种体系结构是一个有前途的替代来源,那里是在资源危机。我们这项工作的主要贡献是为孟加拉语深度学习架构(是世界第五广泛的语言)使用完全不同的语法和语言特性的发展。我们已经表明,这些从图像背景复杂的应用程序,如语言生成运作良好,可以通过引入约束,更丰富的功能,以及独特的功能空间多样化的表现。我们还建立了,当我们使用平滑的语义张量与传统LSTM和功能分解的网络,我们可以实现绝对精度和多样性。有了更好的学习结构,我们成功地建立自动算法和评估程序,在主管应用的评价可以帮助没有专业知识和人为干预的要求。
53. Coresets for the Nearest-Neighbor Rule [PDF] 返回目录
Alejandro Flores Velazco, David M. Mount
Abstract: The problem of nearest-neighbor condensation deals with finding a subset R from a set of labeled points P such that for every point p in R the nearest-neighbor of p in R has the same label as p. This is motivated by applications in classification, where the nearest-neighbor rule assigns to an unlabeled query point the label of its nearest-neighbor in the point set. In this context, condensation aims to reduce the size of the set needed to classify new points. However, finding such subsets of minimum cardinality is NP-hard, and most research has focused on practical heuristics without performance guarantees. Additionally, the use of exact nearest-neighbors is always assumed, ignoring the effect of condensation in the classification accuracy when nearest-neighbors are computed approximately. In this paper, we address these shortcomings by proposing new approximation-sensitive criteria for the nearest-neighbor condensation problem, along with practical algorithms with provable performance guarantees. We characterize sufficient conditions to guarantee correct classification of unlabeled points using approximate nearest-neighbor queries on these subsets, which introduces the notion of coresets for classification with the nearest-neighbor rule. Moreover, we prove that it is NP-hard to compute subsets with these characteristics, whose cardinality approximates that of the minimum cardinality subset. Additionally, we propose new algorithms for computing such subsets, with tight approximation factors in general metrics, and improved factors for doubling metrics and l_p metrics with p >= 2. Finally, we show an alternative implementation scheme that reduces the worst-case time complexity of one of these algorithms, becoming the first truly subquadratic approximation algorithm for the nearest-neighbor condensation problem.
摘要:最近邻冷凝涉及从一组标记的点P使得对R中的每个点p中的R p的近邻具有相同的标签为P的找到一个子集R本问题。这是通过分类,其中最近邻规则受让人向无查询点的最近邻居的标签点集中的应用程序的动机。在此背景下,冷凝旨在减少所需的分类新点集的大小。然而,寻找最小基数的这种子集是NP难,大部分研究都集中在实用的启发式没有性能保证。此外,使用精确的近邻的总是假设,忽略缩合而当近邻近似计算的分类精度的影响。在本文中,我们通过提出新的逼近敏感的标准,最近邻凝结问题,与可证明的性能保证的实用算法一起克服这些缺点。我们描述了充分条件,以保证使用近似最近邻查询在这些子集,它引入了coresets的与最近邻规则分类的概念未标记点的正确分类。此外,我们证明它是NP-难以计算的子集具有这些特征,其基数近似于最小基数子集。此外,我们提出了新的算法,用于计算这种子集,紧张因素近似一般指标,和改进的因素有p倍增度量和L_P度量> = 2。最后,我们表明,减少了最坏情况下的时间复杂度的可选的实施方案对这些算法之一,成为最近邻结露问题的第一个真正的次二次近似算法。
Alejandro Flores Velazco, David M. Mount
Abstract: The problem of nearest-neighbor condensation deals with finding a subset R from a set of labeled points P such that for every point p in R the nearest-neighbor of p in R has the same label as p. This is motivated by applications in classification, where the nearest-neighbor rule assigns to an unlabeled query point the label of its nearest-neighbor in the point set. In this context, condensation aims to reduce the size of the set needed to classify new points. However, finding such subsets of minimum cardinality is NP-hard, and most research has focused on practical heuristics without performance guarantees. Additionally, the use of exact nearest-neighbors is always assumed, ignoring the effect of condensation in the classification accuracy when nearest-neighbors are computed approximately. In this paper, we address these shortcomings by proposing new approximation-sensitive criteria for the nearest-neighbor condensation problem, along with practical algorithms with provable performance guarantees. We characterize sufficient conditions to guarantee correct classification of unlabeled points using approximate nearest-neighbor queries on these subsets, which introduces the notion of coresets for classification with the nearest-neighbor rule. Moreover, we prove that it is NP-hard to compute subsets with these characteristics, whose cardinality approximates that of the minimum cardinality subset. Additionally, we propose new algorithms for computing such subsets, with tight approximation factors in general metrics, and improved factors for doubling metrics and l_p metrics with p >= 2. Finally, we show an alternative implementation scheme that reduces the worst-case time complexity of one of these algorithms, becoming the first truly subquadratic approximation algorithm for the nearest-neighbor condensation problem.
摘要:最近邻冷凝涉及从一组标记的点P使得对R中的每个点p中的R p的近邻具有相同的标签为P的找到一个子集R本问题。这是通过分类,其中最近邻规则受让人向无查询点的最近邻居的标签点集中的应用程序的动机。在此背景下,冷凝旨在减少所需的分类新点集的大小。然而,寻找最小基数的这种子集是NP难,大部分研究都集中在实用的启发式没有性能保证。此外,使用精确的近邻的总是假设,忽略缩合而当近邻近似计算的分类精度的影响。在本文中,我们通过提出新的逼近敏感的标准,最近邻凝结问题,与可证明的性能保证的实用算法一起克服这些缺点。我们描述了充分条件,以保证使用近似最近邻查询在这些子集,它引入了coresets的与最近邻规则分类的概念未标记点的正确分类。此外,我们证明它是NP-难以计算的子集具有这些特征,其基数近似于最小基数子集。此外,我们提出了新的算法,用于计算这种子集,紧张因素近似一般指标,和改进的因素有p倍增度量和L_P度量> = 2。最后,我们表明,减少了最坏情况下的时间复杂度的可选的实施方案对这些算法之一,成为最近邻结露问题的第一个真正的次二次近似算法。
54. Hold me tight! Influence of discriminative features on deep network boundaries [PDF] 返回目录
Guillermo Ortiz-Jimenez, Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard
Abstract: Important insights towards the explainability of neural networks and their properties reside in the formation of their decision boundaries. In this work, we borrow tools from the field of adversarial robustness and propose a new framework that permits to relate the features of the dataset with the distance of data samples to the decision boundary along specific directions. We demonstrate that the inductive bias of deep learning has the tendency to generate classification functions that are invariant along non-discriminative directions of the dataset. More surprisingly, we further show that training on small perturbations of the data samples are sufficient to completely change the decision boundary. This is actually the characteristic exploited by the so-called adversarial training to produce robust classifiers. Our general framework can be used to reveal the effect of specific dataset features on the macroscopic properties of deep models and to develop a better understanding of the successes and limitations of deep learning.
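The distance-to-boundary measurements that underpin this kind of framework can be illustrated with a simple bisection probe along a chosen direction: walk from a sample until the predicted label flips, then bisect. The sketch below is not the authors' code; it uses a scikit-learn classifier on toy data, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def boundary_distance(clf, x0, direction, t_max=10.0, tol=1e-4):
    """Distance from x0 to clf's decision boundary along `direction`.

    Finds the smallest t in (0, t_max] where the predicted label of
    x0 + t*d differs from that of x0, via bisection; returns np.inf
    if the label never flips within t_max.
    """
    d = direction / np.linalg.norm(direction)
    y0 = clf.predict(x0[None, :])[0]
    if clf.predict((x0 + t_max * d)[None, :])[0] == y0:
        return np.inf  # no flip along this direction within t_max
    lo, hi = 0.0, t_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if clf.predict((x0 + mid * d)[None, :])[0] == y0:
            lo = mid
        else:
            hi = mid
    return hi

# Toy usage: 2D blobs; probe the boundary along a discriminative direction.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
clf = LogisticRegression().fit(X, y)
print(boundary_distance(clf, X[0], np.array([1.0, 1.0])))
```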
55. 3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans [PDF] 返回目录
Antoni Rosinol, Arjun Gupta, Marcus Abate, Jingnan Shi, Luca Carlone
Abstract: We present a unified representation for actionable spatial perception: 3D Dynamic Scene Graphs. Scene graphs are directed graphs where nodes represent entities in the scene (e.g. objects, walls, rooms), and edges represent relations (e.g. inclusion, adjacency) among nodes. Dynamic scene graphs (DSGs) extend this notion to represent dynamic scenes with moving agents (e.g. humans, robots), and to include actionable information that supports planning and decision-making (e.g. spatio-temporal relations, topology at different levels of abstraction). Our second contribution is to provide the first fully automatic Spatial PerceptIon eNgine (SPIN) to build a DSG from visual-inertial data. We integrate state-of-the-art techniques for object and human detection and pose estimation, and we describe how to robustly infer object, robot, and human nodes in crowded scenes. To the best of our knowledge, this is the first paper that reconciles visual-inertial SLAM and dense human mesh tracking. Moreover, we provide algorithms to obtain hierarchical representations of indoor environments (e.g. places, structures, rooms) and their relations. Our third contribution is to demonstrate the proposed spatial perception engine in a photo-realistic Unity-based simulator, where we assess its robustness and expressiveness. Finally, we discuss the implications of our proposal on modern robotics applications. 3D Dynamic Scene Graphs can have a profound impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction. A video abstract is available at this https URL
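As a rough illustration of the data structure (not SPIN's actual schema), a layered directed scene graph might be sketched as follows; the node layers and relation labels are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    layer: str              # e.g. "object", "room", "place", "agent"
    attributes: dict = field(default_factory=dict)  # pose, bounding box, ...

@dataclass
class Edge:
    source: str
    target: str
    relation: str           # e.g. "contains", "adjacent_to", "holds"

class DynamicSceneGraph:
    """Minimal layered directed graph; illustrative sketch only."""
    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node):
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, dst: str, relation: str):
        self.edges.append(Edge(src, dst, relation))

    def neighbors(self, node_id: str):
        return [e.target for e in self.edges if e.source == node_id]

# Toy usage: a chair inside a room, with a human agent nearby.
g = DynamicSceneGraph()
g.add_node(Node("room_1", "room"))
g.add_node(Node("chair_3", "object", {"class": "chair"}))
g.add_node(Node("human_0", "agent", {"pose": [1.0, 2.0, 0.0]}))
g.add_edge("room_1", "chair_3", "contains")
g.add_edge("human_0", "chair_3", "adjacent_to")
print(g.neighbors("room_1"))
```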
56. Learning representations of irregular particle-detector geometry with distance-weighted graph networks [PDF] 返回目录
Shah Rukh Qasim, Jan Kieseler, Yutaro Iiyama, Maurizio Pierini
Abstract: We explore the use of graph networks to deal with irregular-geometry detectors in the context of particle reconstruction. Thanks to their representation-learning capabilities, graph networks can exploit the full detector granularity, while natively managing the event sparsity and arbitrarily complex detector geometries. We introduce two distance-weighted graph network architectures, dubbed GarNet and GravNet layers, and apply them to a typical particle reconstruction task. The performance of the new architectures is evaluated on a data set of simulated particle interactions on a toy model of a highly granular calorimeter, loosely inspired by the endcap calorimeter to be installed in the CMS detector for the High-Luminosity LHC phase. We study the clustering of energy depositions, which is the basis for calorimetric particle reconstruction, and provide a quantitative comparison to alternative approaches. The proposed algorithms provide an interesting alternative to existing methods, offering equally performant or less resource-demanding solutions with fewer underlying assumptions about the detector geometry and, consequently, the potential to generalize to other detectors.
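The core idea of such distance-weighted layers — find each node's neighbors in a learned latent space and aggregate their messages with weights that decay with latent distance — can be sketched as a simplified, untrained numpy forward pass. Random weights below stand in for learned parameters, and the exact GarNet/GravNet formulations differ in detail.

```python
import numpy as np

def gravnet_like_forward(features, w_latent, w_feat, k=4):
    """Simplified forward pass of a distance-weighted graph layer.

    features : (N, F) node features (e.g. detector hits)
    w_latent : (F, S) projection into the latent space used for neighbor search
    w_feat   : (F, H) projection of the features to be exchanged
    """
    latent = features @ w_latent                  # (N, S) latent coordinates
    msgs = features @ w_feat                      # (N, H) messages
    # Pairwise squared distances in the latent space; exclude self.
    d2 = ((latent[:, None, :] - latent[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]          # (N, k) neighbor indices
    w = np.exp(-d2[np.arange(len(d2))[:, None], nbrs])  # distance weights
    # Weighted mean of neighbor messages, concatenated with own features.
    agg = (w[..., None] * msgs[nbrs]).sum(1) / w.sum(1, keepdims=True)
    return np.concatenate([features, agg], axis=1)

# Toy usage: 10 hits with 5 input features each.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 5))
out = gravnet_like_forward(x, rng.normal(size=(5, 3)), rng.normal(size=(5, 8)))
print(out.shape)  # (10, 13)
```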
57. Adaptive Kernel Estimation of the Spectral Density with Boundary Kernel Analysis [PDF] 返回目录
Alexander Sidorenko, Kurt S. Riedel
Abstract: A hybrid estimator of the log-spectral density of a stationary time series is proposed. First, a multiple taper estimate is computed, followed by kernel smoothing of the log-multitaper estimate. This procedure reduces the expected mean square error by a factor of $({\pi^2 \over 4})^{0.8}$ relative to simply smoothing the log tapered periodogram. The optimal number of tapers is $O(N^{8/15})$. A data-adaptive implementation of a variable-bandwidth kernel smoother is given. When the spectral density is discontinuous, one-sided smoothing estimates are used.
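A minimal sketch of the two-stage procedure follows, assuming scipy's DPSS window routine and a fixed-bandwidth Gaussian smoother; the paper's variable-bandwidth, boundary-corrected smoother is considerably more involved.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_log_spectrum(x, nw=4.0, n_tapers=7):
    """Average-of-eigenspectra multitaper estimate of the log-spectral density."""
    n = len(x)
    tapers = dpss(n, nw, n_tapers)                  # (n_tapers, n) DPSS tapers
    eigenspectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    return np.log(eigenspectra.mean(axis=0))

def gaussian_smooth(y, bandwidth=5.0):
    """Fixed-bandwidth Gaussian kernel smoother, renormalized at the boundaries."""
    idx = np.arange(len(y))
    w = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / bandwidth) ** 2)
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

# Toy usage: an AR(1) process, whose true spectrum is smooth.
rng = np.random.default_rng(0)
x = np.zeros(1024)
for t in range(1, 1024):
    x[t] = 0.9 * x[t - 1] + rng.normal()
log_s = gaussian_smooth(multitaper_log_spectrum(x))
print(log_s.shape)  # (513,)
```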