Contents
5. An Improved Classification Model for Igbo Text Using N-Gram And K-Nearest Neighbour Approaches [PDF] Abstract
10. Assessing Human Translations from French to Bambara for Machine Learning: a Pilot Study [PDF] Abstract
17. Improvement of electronic Governance and mobile Governance in Multilingual Countries with Digital Etymology using Sanskrit Grammar [PDF] Abstract
Abstracts
1. Sign Language Translation with Transformers [PDF] Back to Contents
Kayo Yin
Abstract: Sign Language Translation (SLT) first uses a Sign Language Recognition (SLR) system to extract sign language glosses from videos. Then, a translation system generates spoken language translations from the sign language glosses. Though SLT has attracted interest recently, little research has focused on the translation system. This paper focuses on the translation system and improves performance by utilizing Transformer networks. We report a wide range of experimental results for various Transformer setups and introduce the use of Spatial-Temporal Multi-Cue (STMC) networks in an end-to-end SLT system with Transformer. We perform experiments on RWTH-PHOENIX-Weather 2014T, a challenging SLT benchmark dataset of German sign language, and on ASLG-PC12, a dataset involving American Sign Language (ASL) recently used in gloss-to-text translation. On the RWTH-PHOENIX-Weather 2014T dataset, our methodology improves on the current state-of-the-art by over 5 BLEU-4 points on ground-truth glosses and by over 7 points when an STMC network is used to predict the glosses. On the ASLG-PC12 corpus, we report an improvement of over 16 BLEU-4 points. Our findings also demonstrate that end-to-end translation on predicted glosses provides even better performance than translation on ground-truth glosses. This shows potential for further improvement in SLT, either by jointly training the SLR and translation systems or by revising the gloss annotation system.
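BLEU-4, the metric quoted in these results, is the geometric mean of modified 1- to 4-gram precisions scaled by a brevity penalty. A minimal stdlib sketch with toy tokens and a simplistic smoothing floor (not the paper's evaluation pipeline, which would use a standardized scoring script):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(reference, hypothesis):
    """Sentence-level BLEU-4: clipped n-gram precisions for n = 1..4,
    combined geometrically and scaled by a brevity penalty."""
    precisions = []
    for n in range(1, 5):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # floor avoids log(0)
    # Brevity penalty discourages overly short hypotheses
    bp = 1.0 if len(hypothesis) >= len(reference) else math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

# Toy gloss-to-text pair (illustrative tokens, not PHOENIX data)
ref = "the weather will be sunny tomorrow".split()
hyp = "the weather is sunny tomorrow".split()
score = bleu4(ref, hyp)
```

A perfect match scores 1.0; real toolkits differ mainly in tokenization and smoothing choices.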
2. Deep Learning Approach for Enhanced Cyber Threat Indicators in Twitter Stream [PDF] Back to Contents
Simran K, Prathiksha Balakrishna, Vinayakumar R, Soman KP
Abstract: In recent times, the amount of Cyber Security text data shared via social media resources, mainly Twitter, has increased. An accurate analysis of this data can help to develop a situational awareness framework for cyber threats. This work proposes a deep learning based approach for tweet data analysis. To convert the tweets into numerical representations, various text representations are employed. These features are fed into a deep learning architecture for optimal feature extraction as well as classification. Various hyperparameter tuning approaches are used to identify the optimal text representation method as well as the optimal network parameters and structures for the deep learning models. For comparative analysis, a classical text representation method with a classical machine learning algorithm is employed. From a detailed analysis of the experiments, we found that the deep learning architecture with advanced text representation methods performed better than the classical text representation and classical machine learning algorithms. The primary reason is that the advanced text representation methods can learn the sequential properties that exist in textual data, and the deep learning architectures learn optimal features while reducing the feature size.
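The classical text representation used as the comparison baseline can be made concrete; a minimal TF-IDF sketch over toy tweets (illustrative strings, not the actual Twitter stream):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF vectors over whitespace tokens: a classical bag-of-words
    representation of the kind used as the comparison baseline."""
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc.split()))
    n = len(docs)
    vectors = []
    for doc in docs:
        tokens = doc.split()
        tf = Counter(tokens)
        vectors.append({w: (c / len(tokens)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return vectors

# Toy tweets standing in for the Twitter stream (illustrative strings only)
vecs = tfidf(["new malware campaign active",
              "phishing campaign reported",
              "campaign over"])
```

Terms that occur in every document get a weight of zero, while rarer, more discriminative terms get positive weights — the property a downstream classifier exploits.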
3. Deep Learning Approach for Intelligent Named Entity Recognition of Cyber Security [PDF] Back to Contents
Simran K, Sriram S, Vinayakumar R, Soman KP
Abstract: In recent years, the amount of Cyber Security data generated in the form of unstructured texts, for example, social media resources, blogs, and articles, has increased exceptionally. Named Entity Recognition (NER) is an initial step towards converting this unstructured data into structured data that can be used by many applications. The existing NER methods for Cyber Security data are based on rules and linguistic characteristics. A Deep Learning (DL) based approach embedded with Conditional Random Fields (CRFs) is proposed in this paper. Several DL architectures are evaluated to find the optimal one. The combination of a Bidirectional Gated Recurrent Unit (Bi-GRU), a Convolutional Neural Network (CNN), and a CRF performed better than various other DL frameworks on a publicly available benchmark dataset. This may be because the bidirectional structures preserve features related to both the following and preceding words in a sequence.
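The CRF layer on top of the Bi-GRU-CNN encoder chooses the globally best tag sequence rather than tagging each token independently; its decoding step is Viterbi search. A minimal sketch over hand-picked emission/transition scores (toy numbers, two tags only):

```python
def viterbi(emissions, transitions):
    """Highest-scoring tag sequence for a linear-chain CRF.

    emissions: per-token dict tag -> emission score (produced by the
    Bi-GRU-CNN encoder in the real system; hand-picked numbers here).
    transitions: dict (prev_tag, tag) -> transition score learned by the CRF.
    """
    tags = list(emissions[0])
    best = {t: (emissions[0][t], [t]) for t in tags}   # best path ending in t
    for em in emissions[1:]:
        best = {t: max((best[p][0] + transitions.get((p, t), 0.0) + em[t],
                        best[p][1] + [t])
                       for p in tags)
                for t in tags}
    return max(best.values())[1]

# Two tags: "B" marks the start of a security entity, "O" is outside
emissions = [{"B": 2.0, "O": 0.5}, {"B": 0.1, "O": 1.5}]
transitions = {("B", "B"): -1.0, ("B", "O"): 0.5, ("O", "B"): 0.0, ("O", "O"): 0.2}
```

The transition scores let the model penalize implausible tag pairs (here, two entity starts in a row), which per-token classification cannot do.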
4. Parasitic Neural Network for Zero-Shot Relation Extraction [PDF] Back to Contents
Shengbin Jia, Shijia E, Yang Xiang
Abstract: Conventional relation extraction methods can only identify a limited set of relation classes and cannot recognize unseen relation types that have no pre-labeled training data. In this paper, we explore zero-shot relation extraction to overcome this challenge. The only requisite information about unseen types is the name of their labels. We propose a Parasitic Neural Network (PNN) that can learn a mapping between the general feature representations of text samples and the distributions of unseen types in a shared semantic space. Experimental results show that our model significantly outperforms others on the unseen relation extraction task, achieving an improvement of more than 20% without any manual annotations or additional resources.
5. An Improved Classification Model for Igbo Text Using N-Gram And K-Nearest Neighbour Approaches [PDF] Back to Contents
Nkechi Ifeanyi-Reuben, Chidiebere Ugwu
Abstract: This paper presents an improved classification model for Igbo text using N-gram and K-Nearest Neighbour approaches. The N-gram model was used for text representation, and the classification was carried out on the text using the K-Nearest Neighbour model. An object-oriented design methodology is used for the work, implemented in the Python programming language with tools from the Natural Language Toolkit (NLTK). The performance of the Igbo text classification system is measured by computing the precision, recall and F1-measure of the results obtained on unigram-, bigram- and trigram-represented text. Classification on bigram-represented text has the highest degree of exactness (precision); the results obtained with the three N-gram models have the same level of completeness (recall), while the trigram model has the lowest precision. This shows that classification on bigram-represented Igbo text outperforms unigram- and trigram-represented text. Therefore, the bigram text representation model is highly recommended for any intelligent text-based system in the Igbo language.
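The pipeline described above, bigram features plus K-Nearest Neighbour voting, can be sketched in a few lines of stdlib Python. Placeholder English sentences stand in for the Igbo corpus, and cosine similarity is assumed as the distance measure:

```python
import math
from collections import Counter

def ngram_features(text, n=2):
    """N-gram counts as the text representation (bigrams by default)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cosine(a, b):
    num = sum(a[k] * b[k] for k in a.keys() & b.keys())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def knn_classify(train, text, k=3):
    """Majority vote among the k training texts most similar to `text`."""
    q = ngram_features(text)
    neighbours = sorted(train, key=lambda tl: cosine(ngram_features(tl[0]), q),
                        reverse=True)[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# Placeholder English sentences stand in for Igbo training text
train = [("the market opens early", "trade"),
         ("the market closes late", "trade"),
         ("rains fall in july", "weather"),
         ("rains stop in august", "weather")]
```

Swapping `n=2` for 1 or 3 reproduces the unigram/trigram comparison the paper reports.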
6. Enriching Consumer Health Vocabulary Using Enhanced GloVe Word Embedding [PDF] Back to Contents
Mohammed Ibrahim, Susan Gauch, Omar Salman, Mohammed Alqahatani
Abstract: The Open-Access and Collaborative Consumer Health Vocabulary (OAC CHV, or CHV for short) is a collection of medical terms written in plain English. It provides a list of simple, easy, and clear terms that laymen prefer to use rather than the equivalent professional medical terms. The National Library of Medicine (NLM) has integrated and mapped the CHV terms to its Unified Medical Language System (UMLS). These CHV terms map to 56,000 professional concepts in the UMLS. We found that about 48% of these laymen's terms are still jargon and match the professional terms in the UMLS. In this paper, we present an enhanced word embedding technique that generates new CHV terms from consumer-generated text. We downloaded our corpus from a healthcare social media site and evaluated our new method, which applies iterative feedback to word embeddings using ground truth built from the existing CHV terms. Our feedback algorithm outperformed unmodified GloVe, and new CHV terms were detected.
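In such an embedding-based approach, candidate lay synonyms are typically the nearest neighbours of a professional term in vector space. A toy sketch with invented 2-d vectors standing in for trained GloVe embeddings:

```python
import math

# Invented 2-d vectors standing in for trained GloVe embeddings
VECTORS = {
    "myocardial_infarction": (0.90, 0.10),
    "heart_attack":          (0.85, 0.20),
    "hypertension":          (0.60, 0.40),
    "glucose":               (0.10, 0.90),
}

def cos(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.hypot(*u) * math.hypot(*v))

def nearest(term, k=2):
    """Candidate lay synonyms = the k closest terms in embedding space."""
    q = VECTORS[term]
    others = [w for w in VECTORS if w != term]
    return sorted(others, key=lambda w: cos(VECTORS[w], q), reverse=True)[:k]
```

With real GloVe vectors the neighbour list is what the iterative feedback step would filter against the existing CHV ground truth.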
7. A Swiss German Dictionary: Variation in Speech and Writing [PDF] Back to Contents
Larissa Schmidt, Lucy Linder, Sandra Djambazovska, Alexandros Lazaridis, Tanja Samardžić, Claudiu Musat
Abstract: We introduce a dictionary containing forms of common words in various Swiss German dialects normalized into High German. As Swiss German is, for now, a predominantly spoken language, there is significant variation in the written forms, even between speakers of the same dialect. To alleviate the uncertainty associated with this diversity, we complement the pairs of Swiss German and High German words with Swiss German phonetic transcriptions (SAMPA). This dictionary thus becomes the first resource to combine large-scale spontaneous translation with phonetic transcriptions. Moreover, we control for the regional distribution and ensure the equal representation of the major Swiss dialects. The coupling of the phonetic and written Swiss German forms is powerful. We show that they are sufficient to train a Transformer-based phoneme-to-grapheme model that generates credible novel Swiss German writings. In addition, we show that the inverse mapping, from graphemes to phonemes, can be modeled with a Transformer trained on the novel dictionary. This generation of pronunciations for previously unknown words is key to training extensible automatic speech recognition (ASR) systems, which are key beneficiaries of this dictionary.
8. Automatic Extraction of Bengali Root Verbs using Paninian Grammar [PDF] Back to Contents
Arijit Das, Tapas Halder, Diganta Saha
Abstract: In this research work, we propose an algorithm based on a supervised learning methodology to extract the root forms of Bengali verbs using the grammatical rules proposed by Panini [1] in the Ashtadhyayi. This methodology can be applied to languages derived from Sanskrit. The proposed system is based on the tense, person and morphological inflections of the verbs to find their root forms. The work has been executed in two phases: first, the surface-level (inflected) forms of the verbs are classified into a number of groups of similar tense and person, using a standard pattern available in the Bengali language. Next, a set of rules is applied to extract the root form from the surface-level forms of a verb. The system has been tested on 10,000 verbs collected from the Bengali text corpus developed in the TDIL project of the Govt. of India. The output achieves an accuracy of 98%, as verified by a linguistic expert. Root verb identification is a key step in semantic searching, multi-sentence search query processing, understanding the meaning of a language, word sense disambiguation, sentence classification, and so on.
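Rule-based suffix stripping of the kind described can be sketched as a longest-match table lookup. The suffixes below are illustrative Roman transliterations, not the actual Paninian rules the paper derives:

```python
# Illustrative tense/person suffixes in Roman transliteration; the real system
# derives its stripping rules from Panini's grammar, not this toy table.
SUFFIXES = ["chhilam", "chhi", "chhe", "lam", "be", "lo"]

def root_form(verb):
    """Strip the longest matching inflection suffix to recover the root form."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if verb.endswith(suffix):
            return verb[: -len(suffix)]
    return verb

print(root_form("korchhi"))  # root of "korchhi" under the toy rules: "kor"
```

Grouping surface forms by tense and person first, as the paper's phase one does, keeps each rule table small and unambiguous.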
9. A Clustering Framework for Lexical Normalization of Roman Urdu [PDF] Back to Contents
Abdul Rafae Khan, Asim Karim, Hassan Sajjad, Faisal Kamiran, Jia Xu
Abstract: Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm UrduPhone, a string matching component, a feature-based similarity function, and a clustering algorithm Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script.
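A phonetic algorithm like UrduPhone maps spelling variants of the same word to one key, so that variants cluster together. A toy Soundex-style sketch (the character groups are invented for illustration, not UrduPhone's actual mapping):

```python
def phonetic_key(word):
    """Collapse look-alike Roman Urdu spellings to one key (Soundex-style).
    The character groups below are invented, not UrduPhone's."""
    groups = {**dict.fromkeys("aeiou", ""),   # vowels are dropped
              **dict.fromkeys("kq", "k"),     # interchangeable consonants
              **dict.fromkeys("zs", "s"),
              **dict.fromkeys("wv", "v")}
    key, prev = "", None
    for ch in word.lower():
        code = groups.get(ch, ch)
        if code and code != prev:   # skip dropped chars and adjacent repeats
            key += code
        prev = code
    return key
```

Clustering then only needs to group words sharing a key, with the string-matching component and Lex-Var handling the variation the phonetic key alone cannot capture.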
10. Assessing Human Translations from French to Bambara for Machine Learning: a Pilot Study [PDF] Back to Contents
Michael Leventhal, Allahsera Tapo, Sarah Luger, Marcos Zampieri, Christopher M. Homan
Abstract: We present novel methods for assessing the quality of human-translated aligned texts for learning machine translation models of under-resourced languages. Malian university students translated French texts, producing either written or oral translations to Bambara. Our results suggest that similar quality can be obtained from either written or spoken translations for certain kinds of texts. They also suggest specific instructions that human translators should be given in order to improve the quality of their work.
11. Multilingual Stance Detection: The Catalonia Independence Corpus [PDF] Back to Contents
Elena Zotova, Rodrigo Agerri, Manuel Nuñez, German Rigau
Abstract: Stance detection aims to determine the attitude of a given text with respect to a specific topic or claim. While stance detection has been fairly well researched in recent years, most of the work has focused on English. This is mainly due to the relative lack of annotated data in other languages. The TW-10 Referendum Dataset released at IberEval 2018 is a previous effort to provide multilingual stance-annotated data in Catalan and Spanish. Unfortunately, the TW-10 Catalan subset is extremely imbalanced. This paper addresses these issues by presenting a new multilingual dataset for stance detection in Twitter for the Catalan and Spanish languages, with the aim of facilitating research on stance detection in multilingual and cross-lingual settings. The dataset is annotated with stance towards one topic, namely, the independence of Catalonia. We also provide a semi-automatic method to annotate the dataset based on a categorization of Twitter users. We experiment on the new corpus with a number of supervised approaches, including linear classifiers and deep learning methods. Comparison of our new corpus with the TW-10 dataset shows both the benefits and potential of a well-balanced corpus for multilingual and cross-lingual research on stance detection. Finally, we establish new state-of-the-art results on the TW-10 dataset for both Catalan and Spanish.
摘要:姿态检测的目的是确定给定文本的态度,相对于一个特定的主题或要求。虽然姿态检测已经在过去几年中一直相当好研究,大部分的工作都集中在英语。这主要是由于相对缺乏其他语言注释的数据。在TW-10公投公布的数据集在IberEval 2018是之前的努力提供在加泰罗尼亚语和西班牙语多种语言的姿态标注的数据。不幸的是,TW-10加泰罗尼亚子集极不平衡。本文通过在Twitter上呈现的姿态检测一个新的多语言数据集的加泰罗尼亚语和西班牙语,在多语言和跨语言设置,方便上的立场检测研究的目的是解决这些问题。该数据集是对一个话题,即加泰罗尼亚的独立注释与立场。我们还提供了一个半自动的方法诠释基于Twitter用户的分类数据集。我们尝试与一些监管办法,包括线性分类和深入学习方法的新语料库。我们与与TW-1O数据集将同时显示在姿态检测多语言和跨语言研究的一个很好的平衡语料库的好处和潜在的新文集的比较。最后,我们对TW-10数据集建立国家的最先进的新成果,既为加泰罗尼亚语和西班牙语。
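A linear bag-of-words baseline of the kind the abstract mentions can be sketched in a few lines. The Naive Bayes classifier below, the FAVOR/AGAINST labels, and the toy Spanish-like tokens are illustrative assumptions, not the paper's exact setup:

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """Train a multinomial Naive Bayes stance classifier on (tokens, label) pairs."""
    counts = defaultdict(Counter)   # label -> word counts
    priors = Counter()              # label -> document counts
    vocab = set()
    for tokens, label in docs:
        priors[label] += 1
        counts[label].update(tokens)
        vocab.update(tokens)
    return counts, priors, vocab

def predict(tokens, counts, priors, vocab):
    best, best_lp = None, -math.inf
    n_docs = sum(priors.values())
    for label in priors:
        total = sum(counts[label].values())
        lp = math.log(priors[label] / n_docs)
        for t in tokens:
            # Laplace smoothing over the shared vocabulary
            lp += math.log((counts[label][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = [  # hypothetical tweets, tokenized
    (["independencia", "si"], "FAVOR"),
    (["contra", "la", "independencia"], "AGAINST"),
    (["viva", "la", "independencia"], "FAVOR"),
]
print(predict(["independencia", "si", "si"], *train_nb(train)))
```

In practice the linear classifiers in the paper would use TF-IDF features, and the deep learning methods would replace this scoring entirely; the sketch only shows the shape of a supervised stance baseline.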
12. Give your Text Representation Models some Love: the Case for Basque [PDF] 返回目录
Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, Eneko Agirre
Abstract: Word embeddings and pre-trained language models allow us to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal because, for many languages, the models have been trained on smaller (or lower-quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging, and NER. This work sets a new state-of-the-art in those tasks for Basque. All benchmarks and models used in this work are publicly available.
13. Deep Entity Matching with Pre-Trained Language Models [PDF] 返回目录
Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan
Abstract: We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer-based language models. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or ALBERT pre-trained on large text corpora already significantly improves the matching quality and outperforms the previous state-of-the-art (SOTA) by up to 19% in F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long, so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA data augmentation technique for text to EM, augmenting the training data with (difficult) examples. This way, Ditto is forced to learn "harder" to improve the model's matching capability. The optimizations we developed further boost Ditto's performance by up to 8.5%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the amount of labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
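Casting EM as sequence-pair classification starts by serializing each record into one token sequence so that a record pair can be fed to a Transformer like any other sentence pair. The sketch below shows a serialization in the spirit of Ditto's approach; the `[COL]`/`[VAL]`/`[SEP]` markers are an assumption here, and the Transformer classifier itself is omitted since it needs a pre-trained model:

```python
def serialize(record):
    """Flatten a record's attribute/value pairs into one token sequence."""
    return " ".join(f"[COL] {col} [VAL] {val}" for col, val in record.items())

def serialize_pair(left, right):
    # The two serialized records are packed into one input, so a
    # sequence-pair classifier can predict match / no-match.
    return serialize(left) + " [SEP] " + serialize(right)

a = {"name": "Apple Inc.", "city": "Cupertino"}
b = {"name": "Apple", "city": "Cupertino"}
print(serialize_pair(a, b))
```

The resulting string would then be tokenized and classified by a fine-tuned BERT-style model with a binary head.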
14. More Grounded Image Captioning by Distilling Image-Text Matching Model [PDF] 返回目录
Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, Hanwang Zhang
Abstract: Visual attention not only improves the performance of image captioners, but also serves as a visual interpretation to qualitatively measure caption rationality and model transparency. Specifically, we expect that a captioner can fix its attentive gaze on the correct objects while generating the corresponding words. This ability is also known as grounded image captioning. However, the grounding accuracy of existing captioners is far from satisfactory. To improve the grounding accuracy while retaining the captioning quality, it is expensive to collect word-region alignments as strong supervision. To this end, we propose a Part-of-Speech (POS) enhanced image-text matching model (SCAN \cite{lee2018stacked}): POS-SCAN, as effective knowledge distillation for more grounded image captioning. The benefits are two-fold: 1) given a sentence and an image, POS-SCAN can ground the objects more accurately than SCAN; 2) POS-SCAN serves as a word-region alignment regularization for the captioner's visual attention module. With benchmark experimental results, we demonstrate that conventional image captioners equipped with POS-SCAN can significantly improve grounding accuracy without strong supervision. Last but not least, we explore the indispensable Self-Critical Sequence Training (SCST) \cite{Rennie_2017_CVPR} in the context of grounded image captioning and show that the image-text matching score can serve as a reward for more grounded captioning \footnote{this https URL}.
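The final point — using the image-text matching score as an SCST reward — amounts to mixing that score into the usual self-critical advantage (sampled reward minus greedy-decoding baseline). A minimal numeric sketch; the mixing weight `lam` and the exact additive form are assumptions, not the paper's stated formula:

```python
def scst_reward(cider_sampled, match_sampled, cider_greedy, match_greedy, lam=0.5):
    """Self-critical advantage: reward of the sampled caption minus the
    greedy baseline, with the image-text matching score mixed in via lam."""
    r_sample = cider_sampled + lam * match_sampled
    r_base = cider_greedy + lam * match_greedy
    return r_sample - r_base

# A sampled caption that grounds better (higher matching score) receives a
# positive advantage even when its CIDEr ties with the greedy baseline.
print(scst_reward(1.0, 0.8, 1.0, 0.6))
```

The advantage would then scale the policy-gradient update for the sampled caption, as in standard SCST.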
15. Adversarial Transfer Learning for Punctuation Restoration [PDF] 返回目录
Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengkun Tian, Cunhang Fan
Abstract: Previous studies demonstrate that word embeddings and part-of-speech (POS) tags are helpful for punctuation restoration tasks. However, two drawbacks remain. One is that word embeddings are pre-trained with unidirectional language modeling objectives, so they contain only left-to-right context information. The other is that POS tags are provided by an external POS tagger, which increases computation cost, and incorrectly predicted tags may hurt the restoration of punctuation marks during decoding. This paper proposes adversarial transfer learning to address these problems. A pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is used to initialize a punctuation model, so the transferred model parameters carry both left-to-right and right-to-left representations. Furthermore, adversarial multi-task learning is introduced to learn task-invariant knowledge for punctuation prediction. We use an extra POS tagging task to help train the punctuation prediction task, and adversarial training is utilized to prevent the shared parameters from containing task-specific information. We use only the punctuation prediction task to restore marks during the decoding stage; therefore, no extra computation is needed and no incorrect tags are introduced by a POS tagger. Experiments are conducted on the IWSLT2011 dataset. The results demonstrate that the punctuation prediction models obtain further performance improvement with task-invariant knowledge from the POS tagging task. Our best model outperforms the previous state-of-the-art model trained only with lexical features by up to 9.2% absolute overall F_1-score on the test set.
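Adversarial multi-task training of this kind is commonly implemented with a gradient reversal layer between the shared encoder and the auxiliary-task discriminator: the forward pass is the identity, while the backward pass flips (and scales) the gradient, pushing the shared parameters to discard task-specific information. The abstract does not specify the mechanism, so the layer below, with its manual backward pass and scale `lam`, is only a sketch of the standard construction:

```python
class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the
    backward pass, so the shared encoder learns task-invariant features."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad):
        # Flip and scale the gradient flowing back from the discriminator.
        return [-self.lam * g for g in grad]

grl = GradientReversal(lam=0.5)
print(grl.forward([0.2, -0.1]))
print(grl.backward([1.0, -2.0]))
```

In an autograd framework the same effect is obtained with a custom function whose backward negates the incoming gradient.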
16. AM-MobileNet1D: A Portable Model for Speaker Recognition [PDF] 返回目录
João Antônio Chagas Nunes, David Macêdo, Cleber Zanchettin
Abstract: Speaker recognition and speaker identification are challenging tasks with essential applications such as automation, authentication, and security. Deep learning approaches like SincNet and AM-SincNet have presented great results on these tasks. Their promising performance has taken these models to real-world applications that are becoming fundamentally end-user driven and mostly mobile. Mobile computation requires applications with reduced storage size that are not processing- and memory-intensive and that consume energy efficiently. Deep learning approaches, in contrast, are usually demanding in energy, storage, processing power, and memory. To address this demand, we propose a portable model called Additive Margin MobileNet1D (AM-MobileNet1D) for speaker identification on mobile devices. We evaluated the proposed approach on the TIMIT and MIT datasets, obtaining performances equivalent to or better than the baseline methods. Additionally, the proposed model takes only 11.6 megabytes of disk storage, against 91.2 for the SincNet and AM-SincNet architectures, making the model seven times faster with eight times fewer parameters.
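The "Additive Margin" in the model name refers to the additive-margin softmax family of objectives (as in AM-SincNet): class logits are cosine similarities, a margin `m` is subtracted from the target-class logit, and everything is scaled by `s` before the cross-entropy. A NumPy sketch; the values of `s` and `m` are assumptions, not the paper's settings:

```python
import numpy as np

def am_softmax_loss(emb, weights, label, s=30.0, m=0.35):
    """Additive-margin softmax: cosine logits, with the target-class
    logit reduced by margin m before scaling by s."""
    e = emb / np.linalg.norm(emb)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = w @ e                    # cosine similarity to each class centre
    logits = s * cos
    logits[label] -= s * m         # additive margin on the true class only
    logits -= logits.max()         # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[label])

rng = np.random.default_rng(0)
emb = rng.normal(size=8)           # toy utterance embedding
weights = rng.normal(size=(4, 8))  # toy class (speaker) weight vectors
# The margin makes the objective strictly harder than plain softmax (m=0),
# which is what forces larger inter-speaker angular separation.
print(am_softmax_loss(emb, weights, label=2, m=0.35) >
      am_softmax_loss(emb, weights, label=2, m=0.0))
```

The margin only changes training; at test time classification still uses the plain cosine scores.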
17. Improvement of electronic Governance and mobile Governance in Multilingual Countries with Digital Etymology using Sanskrit Grammar [PDF] 返回目录
Arijit Das, Diganta Saha
Abstract: With the huge improvement of digital connectivity (WiFi, 3G, 4G) and digital devices, internet access has now reached the remotest corners. Rural people can easily access the web or apps from PDAs, laptops, smartphones, etc. This is an opportunity for governments to reach citizens in large numbers, get their feedback, and involve them in policy decisions through e-governance without deploying huge manpower, material, or resources. But the governments of multilingual countries face many problems in the successful implementation of Government-to-Citizen (G2C) and Citizen-to-Government (C2G) governance, as rural people tend to prefer interacting in their native languages. Presenting an equal experience over the web or an app to speakers of different language groups is a real challenge. In this research we have sorted out the problems faced by Indo-Aryan-speaking netizens, which are in general also applicable to other language families or subgroups. We then propose probable solutions using etymology, which is used to correlate words through their root forms. In the 5th century BC, Panini wrote the Astadhyayi, in which he laid down sutras, or rules, describing how a word changes according to person, tense, gender, number, etc. This book was later also followed in Western countries to derive grammars of comparatively newer languages. We have trained our system for automatic root extraction from the surface-level or morphed form of words using Paninian grammatical rules. We have tested our system on over 10,000 Bengali verbs and extracted the root form with 98% accuracy. We are now working to extend the program to successfully lemmatize words of any language and correlate them by applying those rule sets in an artificial neural network.
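The abstract does not list the actual Paninian rules, so the following is only a toy illustration of rule-based root extraction by longest-suffix matching; the suffix table and the transliterated verb forms are entirely hypothetical:

```python
# Hypothetical table of inflectional endings; the real system derives such
# rules from Paninian (Astadhyayi) grammar rather than a hand-written list.
SUFFIX_RULES = ["itechhi", "itechhe", "ilam", "ibe", "chhi", "lo"]

def extract_root(surface):
    """Strip the longest matching inflectional suffix to expose the root."""
    for suf in sorted(SUFFIX_RULES, key=len, reverse=True):
        if surface.endswith(suf) and len(surface) > len(suf):
            return surface[: -len(suf)]
    return surface  # already a bare root, or no rule applies

# "koritechhi" is a made-up transliterated inflected form of the root "kor".
print(extract_root("koritechhi"))
```

Correlating words across languages, as the abstract describes, would then operate on these extracted roots rather than the surface forms.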
18. Unification-based Reconstruction of Explanations for Science Questions [PDF] 返回目录
Marco Valentino, Mokanarangan Thayaparan, André Freitas
Abstract: The paper presents a framework to reconstruct explanations for multiple-choice science questions through explanation-centred corpora. Building upon the notion of unification in science, the framework ranks explanatory facts with respect to a question and candidate answer by leveraging a combination of two different scores: (a) a Relevance Score (RS) that represents the extent to which a given fact is specific to the question; (b) a Unification Score (US) that takes into account the explanatory power of a fact, determined according to its frequency in explanations for similar questions. An extensive evaluation of the framework is performed on the Worldtree corpus, adopting IR weighting schemes for its implementation. The following findings are presented: (1) the proposed approach achieves competitive results when compared to state-of-the-art Transformers, while possessing the property of being scalable to large explanatory knowledge bases; (2) the combined model significantly outperforms IR baselines (+7.8/8.4 MAP), confirming the complementary aspects of the relevance and unification scores; (3) the constructed explanations can support downstream models for answer prediction, improving the accuracy of BERT for multiple-choice QA on both ARC easy (+6.92%) and challenge (+15.69%) questions.
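One way to read the scoring scheme: the Unification Score of a fact counts how often it appears in explanations for the most similar training questions, and is then combined with the Relevance Score. A toy sketch; the word-overlap relevance, the weight `lam`, the neighbourhood size `k`, and the mini explanation bank are all illustrative assumptions rather than the paper's IR weighting schemes:

```python
def overlap(a, b):
    """Toy relevance: Jaccard word overlap between two texts."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def unification_score(fact, question, bank, k=2):
    """Frequency of `fact` in explanations of the k most similar questions."""
    ranked = sorted(bank, key=lambda qe: overlap(qe[0], question), reverse=True)
    return sum(fact in expl for _, expl in ranked[:k]) / k

def combined(fact, question, bank, lam=0.5):
    return lam * overlap(fact, question) + (1 - lam) * unification_score(fact, question, bank)

bank = [  # (training question, explanation facts) -- toy data
    ("why does metal feel cold", ["metal is a thermal conductor"]),
    ("why does a pan get hot", ["metal is a thermal conductor"]),
    ("why is the sky blue", ["light scatters in the atmosphere"]),
]
q = "why does a metal spoon feel cold"
print(combined("metal is a thermal conductor", q, bank))
```

A fact that recurs across explanations of similar questions (high unifying power) is boosted even when its direct lexical overlap with the query is modest, which is the complementarity the paper's finding (2) reports.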
19. Information Leakage in Embedding Models [PDF] 返回目录
Congzheng Song, Ananth Raghunathan
Abstract: Embeddings are functions that map raw input data to low-dimensional vector representations, while preserving important semantic information about the inputs. Pre-training embeddings on a large amount of unlabeled data and fine-tuning them for downstream tasks is now a de facto standard in achieving state-of-the-art learning in many domains. We demonstrate that embeddings, in addition to encoding generic semantics, often also present a vector that leaks sensitive information about the input data. We develop three classes of attacks to systematically study information that might be leaked by embeddings. First, embedding vectors can be inverted to partially recover some of the input data. As an example, we show that our attacks on popular sentence embeddings recover between 50% and 70% of the input words (F1 scores of 0.5-0.7). Second, embeddings may reveal sensitive attributes inherent in inputs and independent of the underlying semantic task at hand. Attributes such as the authorship of text can be easily extracted by training an inference model on just a handful of labeled embedding vectors. Third, embedding models leak a moderate amount of membership information for infrequent training data inputs. We extensively evaluate our attacks on various state-of-the-art embedding models in the text domain. We also propose and evaluate defenses that can prevent the leakage to some extent at a minor cost in utility.
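The first attack class — inverting an embedding to recover input words — can be illustrated on a toy additive bag-of-words embedding, where a greedy search over the vocabulary recovers the word set from the sum vector. The real attacks target learned sentence encoders and train an inversion model, so this is only a sketch; the vocabulary and the "sensitive" words are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "launch", "is", "delayed", "secret", "merger", "press", "q3"]
E = rng.normal(size=(len(vocab), 256))  # toy word-embedding table

def embed(words):
    """Toy sentence embedding: sum of word vectors (word order is lost)."""
    return sum(E[vocab.index(w)] for w in words)

def invert(target, k):
    """Greedy inversion: repeatedly pick the vocab vector best aligned with
    the residual, recovering the word *set* behind a sum embedding."""
    residual, recovered = target.copy(), set()
    for _ in range(k):
        scores = E @ residual          # alignment of each word vector
        idx = int(np.argmax(scores))
        recovered.add(vocab[idx])
        residual -= E[idx]             # explain away the chosen word
    return recovered

# An attacker holding only the vector still learns which words were inside.
leak = embed(["secret", "merger", "delayed"])
print(invert(leak, k=3))
```

With high-dimensional random word vectors the members of the sum dominate the alignment scores, which is why even this naive greedy procedure succeeds; it conveys the intuition behind the paper's 50-70% word-recovery rates on real sentence embeddings.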