Contents
1. Distributional semantic modeling: a revised technique to train term/word vector space models applying the ontology-related approach [PDF] Abstract
2. Quality of Word Embeddings on Sentiment Analysis Tasks [PDF] Abstract
3. On the Role of Conceptualization in Commonsense Knowledge Graph Construction [PDF] Abstract
4. Practical Annotation Strategies for Question Answering Datasets [PDF] Abstract
5. Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing? [PDF] Abstract
6. Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning [PDF] Abstract
7. Sensitive Data Detection and Classification in Spanish Clinical Text: Experiments with BERT [PDF] Abstract
8. Improving Neural Named Entity Recognition with Gazetteers [PDF] Abstract
9. Parsing Thai Social Data: A New Challenge for Thai NLP [PDF] Abstract
10. A Corpus for Detecting High-Context Medical Conditions in Intensive Care Patient Notes Focusing on Frequently Readmitted Patients [PDF] Abstract
11. A Framework for the Computational Linguistic Analysis of Dehumanization [PDF] Abstract
12. S-APIR: News-based Business Sentiment Index [PDF] Abstract
13. EmpTransfo: A Multi-head Transformer Architecture for Creating Empathetic Dialog Systems [PDF] Abstract
14. Automatic Compilation of Resources for Academic Writing and Evaluating with Informal Word Identification and Paraphrasing System [PDF] Abstract
15. Neural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in Danish [PDF] Abstract
16. What the [MASK]? Making Sense of Language-Specific BERT Models [PDF] Abstract
17. Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation [PDF] Abstract
18. Transfer Learning for Information Extraction with Limited Data [PDF] Abstract
19. End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification [PDF] Abstract
Abstracts
1. Distributional semantic modeling: a revised technique to train term/word vector space models applying the ontology-related approach [PDF] Back to contents
Oleksandr Palagin, Vitalii Velychko, Kyrylo Malakhov, Oleksandr Shchurov
Abstract: We design a new technique for distributional semantic modeling with a neural network-based approach to learn distributed term representations (or term embeddings) - term vector space models as a result - inspired by the recent ontology-related approach (using different types of contextual knowledge such as syntactic knowledge, terminological knowledge, semantic knowledge, etc.) to the identification of terms (term extraction) and relations between them (relation extraction), called semantic pre-processing technology (SPT). Our method relies on automatic term extraction from natural language texts and the subsequent formation of problem-oriented or application-oriented (and deeply annotated) text corpora in which the fundamental entity is the term (including non-compositional and compositional terms). This gives us the opportunity to change over from distributed word representations (word embeddings) to distributed term representations (term embeddings). This transition makes it possible to generate more accurate semantic maps of different subject domains (and of relations between input terms - useful for exploring clusters and oppositions, or for testing hypotheses about them). The semantic map can be represented as a graph using Vec2graph, a Python library for visualizing word embeddings (term embeddings in our case) as dynamic and interactive graphs. The Vec2graph library coupled with term embeddings will not only improve accuracy in solving standard NLP tasks, but also update the conventional concept of automated ontology development. The main practical result of our work is a development kit (a set of toolkits exposed as web service APIs and a web application) that provides all the routines necessary for the basic linguistic pre-processing and the semantic pre-processing of natural language texts in Ukrainian for the future training of term vector space models.
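To make the pipeline above more concrete, here is a minimal sketch, under assumptions, of training term embeddings on a term-segmented corpus with gensim and materializing a small "semantic map" as a nearest-neighbour graph with networkx. The toy corpus, the underscore-joined multiword terms, and the graph construction are illustrative only; the paper's own toolkit and the Vec2graph library expose their own APIs, which are not reproduced here.

```python
# Sketch only: term embeddings plus a toy "semantic map" built as a similarity graph.
# Assumes a corpus that is already term-segmented (multiword terms joined with "_").
from gensim.models import Word2Vec
import networkx as nx

sentences = [
    ["ontology", "knowledge_base", "semantic_preprocessing"],
    ["term_extraction", "relation_extraction", "ontology"],
]  # placeholder for a real term-annotated corpus

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

def semantic_map(model, seed_term, topn=5):
    """Build a star-shaped graph of a term and its nearest neighbours."""
    graph = nx.Graph()
    for neighbour, weight in model.wv.most_similar(seed_term, topn=topn):
        graph.add_edge(seed_term, neighbour, weight=round(weight, 3))
    return graph

print(semantic_map(model, "ontology").edges(data=True))
```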
2. Quality of Word Embeddings on Sentiment Analysis Tasks [PDF] Back to contents
Erion Çano, Maurizio Morisio
Abstract: Word embeddings, or distributed representations of words, are used in various applications such as machine translation, sentiment analysis, and topic identification. The quality of word embeddings and the performance of their applications depend on several factors such as the training method, corpus size, and relevance. In this study we compare the performance of a dozen pretrained word-embedding models on lyrics sentiment analysis and movie review polarity tasks. According to our results, the Twitter Tweets model is the best on lyrics sentiment analysis, whereas Google News and Common Crawl are the top performers on movie polarity analysis. GloVe-trained models slightly outperform those trained with Skipgram. Also, factors such as topic relevance and corpus size significantly impact the quality of the models. When medium or large-sized text sets are available, obtaining word embeddings from the same training dataset is usually the best choice.
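As a rough illustration of this kind of comparison (not the paper's exact protocol), one can represent a document by the average of its word vectors and train a linear classifier on top; the embedding file path and the tiny corpus below are placeholders.

```python
# Sketch: score one pretrained embedding model on a polarity task
# by averaging word vectors and training a linear classifier.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression

vectors = KeyedVectors.load_word2vec_format("embeddings.txt")  # placeholder path

def doc_vector(tokens, kv):
    """Average the vectors of in-vocabulary tokens (zeros if none are known)."""
    hits = [kv[t] for t in tokens if t in kv]
    return np.mean(hits, axis=0) if hits else np.zeros(kv.vector_size)

texts = [["great", "song"], ["boring", "lyrics"],
         ["love", "this", "melody"], ["awful", "track"]]   # placeholder corpus
labels = [1, 0, 1, 0]
X = np.vstack([doc_vector(t, vectors) for t in texts])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```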
3. On the Role of Conceptualization in Commonsense Knowledge Graph Construction [PDF] Back to contents
Mutian He, Yangqiu Song, Kun Xu, Yu Dong
Abstract: Commonsense knowledge graphs (CKGs) like Atomic and ASER differ substantially from conventional KGs: they consist of a much larger number of nodes formed by loosely structured text. While this enables them to handle highly diverse natural-language queries about commonsense, it also poses unique challenges for automatic KG construction methods. Besides identifying relations absent from the KG between existing nodes, the methods are also expected to explore absent nodes represented by texts, in which different real-world things or entities may appear. To deal with the innumerable entities involved with commonsense in the real world, we introduce conceptualization to CKG construction methods, i.e., we view entities mentioned in texts as instances of specific concepts, or vice versa. We build synthetic triples by conceptualization, and further formulate the task as triple classification, handled by a discriminative model with knowledge transferred from pretrained language models and fine-tuned by negative sampling. Experiments demonstrate that our methods can effectively identify plausible triples and expand the KG with triples of both new nodes and edges of high diversity and novelty.
4. Practical Annotation Strategies for Question Answering Datasets [PDF] Back to contents
Bernhard Kratzwald, Xiang Yue, Huan Sun, Stefan Feuerriegel
Abstract: Annotating datasets for question answering (QA) tasks is very costly, as it requires intensive manual labor and often domain-specific knowledge. Yet strategies for annotating QA datasets in a cost-effective manner are scarce. To provide a remedy for practitioners, our objective is to develop heuristic rules for annotating a subset of questions, so that the annotation cost is reduced while maintaining both in- and out-of-domain performance. For this, we conduct a large-scale analysis in order to derive practical recommendations. First, we demonstrate experimentally that more training samples contribute often only to a higher in-domain test-set performance, but do not help the model in generalizing to unseen datasets. Second, we develop a model-guided annotation strategy: it makes a recommendation with regard to which subset of samples should be annotated. Its effectiveness is demonstrated in a case study based on domain customization of QA to a clinical setting. Here, remarkably, annotating a stratified subset with only 1.2% of the original training set achieves 97.7% of the performance as if the complete dataset was annotated. Hence, the labeling effort can be reduced immensely. Altogether, our work fulfills a demand in practice when labeling budgets are limited and where thus recommendations are needed for annotating QA datasets more cost-effectively.
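As a toy illustration of the stratified-subset idea (the grouping key here is an invented stand-in, not the authors' heuristic), one might stratify an unlabelled question pool by a cheap surface property and sample a fixed annotation budget proportionally:

```python
# Sketch: pick a small annotation budget as a stratified sample of the question pool.
# Stratifying by the question's first token is an illustrative choice, not the paper's rule.
import random
from collections import defaultdict

def stratified_subset(questions, budget, seed=0):
    random.seed(seed)
    groups = defaultdict(list)
    for q in questions:
        groups[q.split()[0].lower()].append(q)   # crude stratum: "what", "who", "when", ...
    quota = max(1, budget // len(groups))
    subset = []
    for stratum in groups.values():
        subset.extend(random.sample(stratum, min(quota, len(stratum))))
    return subset[:budget]

pool = ["What causes sepsis?", "Who discovered penicillin?", "What is tachycardia?",
        "When was insulin first used?", "Who treats arrhythmia?"]
print(stratified_subset(pool, budget=3))
```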
5. Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing? [PDF] Back to contents
Yu Zhang, Zhenghua Li, Houquan Zhou, Min Zhang
Abstract: In the pre-deep-learning era, part-of-speech tags have been considered indispensable ingredients for feature engineering in dependency parsing due to their important role in alleviating the data sparseness of purely lexical features, and quite a few works focus on joint tagging and parsing models to avoid error propagation. In contrast, recent studies suggest that POS tagging becomes much less important or even useless for neural parsing, especially when using character-based word representations such as CharLSTM. Yet there is still no full and systematic investigation of this interesting issue, either empirically or linguistically. To answer this, we design four typical multi-task learning frameworks (i.e., Share-Loose, Share-Tight, Stack-Discrete, Stack-Hidden) for joint tagging and parsing based on the state-of-the-art biaffine parser. Considering that it is much cheaper to annotate POS tags than parse trees, we also investigate the utilization of large-scale heterogeneous POS-tag data. We conduct experiments on both English and Chinese datasets, and the results clearly show that POS tagging (both homogeneous and heterogeneous) can still significantly improve parsing performance when using the Stack-Hidden joint framework. We conduct detailed analysis and gain more insights from the linguistic aspect.
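A toy PyTorch sketch of the Stack-Hidden idea is given below, under the assumption that "stacking" means feeding the tagger's output distribution, computed from a shared encoder, into the parser's biaffine arc scorer; the single-LSTM encoder and the dimensions are simplifications of the paper's biaffine parser.

```python
# Toy sketch of a Stack-Hidden joint model: a shared encoder, a tagging head,
# and a biaffine arc scorer that consumes the tagger's output distribution.
import torch
import torch.nn as nn

class JointTaggerParser(nn.Module):
    def __init__(self, vocab_size, n_tags, d_emb=100, d_hid=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hid, batch_first=True, bidirectional=True)
        self.tag_ffn = nn.Linear(2 * d_hid, n_tags)
        self.dep_mlp = nn.Linear(2 * d_hid + n_tags, d_hid)
        self.head_mlp = nn.Linear(2 * d_hid + n_tags, d_hid)
        self.biaffine = nn.Parameter(torch.randn(d_hid, d_hid) * 0.01)

    def forward(self, word_ids):
        h, _ = self.encoder(self.emb(word_ids))                    # (B, T, 2*d_hid)
        tag_logits = self.tag_ffn(h)                               # (B, T, n_tags)
        stacked = torch.cat([h, torch.softmax(tag_logits, -1)], dim=-1)
        dep, head = self.dep_mlp(stacked), self.head_mlp(stacked)  # (B, T, d_hid)
        arc_scores = dep @ self.biaffine @ head.transpose(1, 2)    # (B, T, T)
        return tag_logits, arc_scores

model = JointTaggerParser(vocab_size=1000, n_tags=17)
tags, arcs = model(torch.randint(0, 1000, (2, 6)))
print(tags.shape, arcs.shape)   # torch.Size([2, 6, 17]) torch.Size([2, 6, 6])
```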
6. Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning [PDF] Back to contents
Stig-Arne Grönroos, Sami Virpioja, Mikko Kurimo
Abstract: Data-driven segmentation of words into subword units has been used in various natural language processing applications such as automatic speech recognition and statistical machine translation for almost 20 years. Recently it has become more widely adopted, as models based on deep neural networks often benefit from subword units even for morphologically simpler languages. In this paper, we discuss and compare training algorithms for a unigram subword model, based on the Expectation Maximization algorithm and lexicon pruning. Using English, Finnish, North Sami, and Turkish data sets, we show that this approach is able to find better solutions to the optimization problem defined by the Morfessor Baseline model than its original recursive training algorithm. The improved optimization also leads to higher morphological segmentation accuracy when compared to a linguistic gold standard. We publish implementations of the new algorithms in the widely-used Morfessor software package.
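To make the unigram subword model concrete, the following toy sketch Viterbi-segments a word under a fixed subword lexicon with hand-picked unigram probabilities; the EM re-estimation and lexicon pruning that Morfessor EM+Prune performs, as well as Morfessor's actual API, are deliberately left out.

```python
# Sketch: Viterbi segmentation of a word under a unigram subword lexicon.
# A real Morfessor EM+Prune model re-estimates these probabilities with EM
# and prunes the lexicon; here the lexicon is a hand-made toy.
import math

lexicon = {"un": 0.1, "break": 0.2, "able": 0.2, "b": 0.01, "reak": 0.01,
           "u": 0.01, "n": 0.01, "a": 0.01, "ble": 0.02}

def viterbi_segment(word, lexicon):
    # best[i] = (log-probability, segmentation) of the best analysis of word[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in lexicon and best[start][1] is not None:
                score = best[start][0] + math.log(lexicon[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

print(viterbi_segment("unbreakable", lexicon))  # e.g. ['un', 'break', 'able']
```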
7. Sensitive Data Detection and Classification in Spanish Clinical Text: Experiments with BERT [PDF] Back to contents
Aitor García-Pablos, Naiara Perez, Montse Cuadros
Abstract: Massive digital data processing provides a wide range of opportunities and benefits, but at the cost of endangering personal data privacy. Anonymisation consists in removing or replacing sensitive information from data, enabling its exploitation for different purposes while preserving the privacy of individuals. Over the years, many automatic anonymisation systems have been proposed; however, depending on the type of data, the target language, or the availability of training documents, the task still remains challenging. The emergence of novel deep-learning models during the last two years has brought large improvements to the state of the art in the field of Natural Language Processing. These advancements have been most noticeably led by BERT, a model proposed by Google in 2018, and the shared language models pre-trained on millions of documents. In this paper, we use a BERT-based sequence labelling model to conduct a series of anonymisation experiments on several clinical datasets in Spanish. We also compare BERT to other algorithms. The experiments show that a simple BERT-based model with general-domain pre-training obtains highly competitive results without any domain-specific feature engineering.
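A minimal sketch of a BERT-style token-classification setup with the HuggingFace transformers library is shown below; the multilingual checkpoint, the invented label set, and the example sentence are stand-ins, since the paper fine-tunes and evaluates its own models on annotated Spanish clinical corpora.

```python
# Sketch: BERT-based sequence labelling for de-identification.
# The checkpoint and label set are placeholders; the classification head is freshly
# initialised and would still have to be fine-tuned on an annotated clinical corpus.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE", "B-HOSPITAL", "I-HOSPITAL"]  # assumed label set
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

enc = tokenizer("La paciente María García ingresó el 3 de marzo.", return_tensors="pt")
logits = model(**enc).logits          # (1, seq_len, num_labels): one label score per subword
print(logits.shape)
```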
8. Improving Neural Named Entity Recognition with Gazetteers [PDF] Back to contents
Chan Hee Song, Dawn Lawrie, Tim Finin, James Mayfield
Abstract: The goal of this work is to improve the performance of a neural named entity recognition system by adding input features that indicate a word is part of a name included in a gazetteer. This article describes how to generate gazetteers from the Wikidata knowledge graph as well as how to integrate the information into a neural NER system. Experiments reveal that the approach yields performance gains in two distinct languages: a high-resource, word-based language, English, and a high-resource, character-based language, Chinese. Experiments were also performed in a low-resource language, Russian, on a newly annotated Russian NER corpus from Reddit tagged with four core types and twelve extended types. This article reports a baseline score. It is a longer version of a paper in the 33rd FLAIRS conference (Song et al. 2020).
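To illustrate the kind of input feature being added, the sketch below marks tokens that fall inside a gazetteer entry using a greedy longest-match scan; the tiny gazetteer is invented for the example, whereas the paper derives its gazetteers from Wikidata.

```python
# Sketch: binary gazetteer features for a token sequence via greedy longest match.
# The toy gazetteer stands in for one generated from the Wikidata knowledge graph.
gazetteer = {("new", "york"), ("new", "york", "city"), ("london",)}
max_len = max(len(entry) for entry in gazetteer)

def gazetteer_flags(tokens):
    flags = [0] * len(tokens)
    i = 0
    while i < len(tokens):
        match = 0
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i:i + span]) in gazetteer:
                match = span
                break
        if match:
            for j in range(i, i + match):
                flags[j] = 1
            i += match
        else:
            i += 1
    return flags

tokens = "She moved from London to New York City".split()
print(list(zip(tokens, gazetteer_flags(tokens))))
```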
9. Parsing Thai Social Data: A New Challenge for Thai NLP [PDF] Back to contents
Sattaya Singkul, Borirat Khampingyot, Nattasit Maharattamalai, Supawat Taerungruang, Tawunrat Chalothorn
Abstract: Dependency parsing (DP) is a task that analyzes text for syntactic structure and relationships between words. DP is widely used to improve natural language processing (NLP) applications in many languages such as English. Previous works on DP are generally applicable to formally written languages. However, they do not apply to informal languages such as the ones used in social networks. Therefore, DP has to be researched and explored with such social network data. In this paper, we explore and identify a DP model that is suitable for Thai social network data. After that, we identify the appropriate linguistic unit to use as input. The results showed that the transition-based model, the improved Elkared dependency parser, outperformed the others with a UAS of 81.42%.
10. A Corpus for Detecting High-Context Medical Conditions in Intensive Care Patient Notes Focusing on Frequently Readmitted Patients [PDF] Back to contents
Edward T. Moseley, Joy T. Wu, Jonathan Welt, John Foote, Patrick D. Tyler, David W. Grant, Eric T. Carlson, Sebastian Gehrmann, Franck Dernoncourt, Leo Anthony Celi
Abstract: A crucial step within secondary analysis of electronic health records (EHRs) is to identify the patient cohort under investigation. While EHRs contain medical billing codes that aim to represent the conditions and treatments patients may have, much of the information is only present in the patient notes. Therefore, it is critical to develop robust algorithms to infer patients' conditions and treatments from their written notes. In this paper, we introduce a dataset for patient phenotyping, a task that is defined as the identification of whether a patient has a given medical condition (also referred to as clinical indication or phenotype) based on their patient note. Nursing Progress Notes and Discharge Summaries from the Intensive Care Unit of a large tertiary care hospital were manually annotated for the presence of several high-context phenotypes relevant to treatment and risk of re-hospitalization. This dataset contains 1102 Discharge Summaries and 1000 Nursing Progress Notes. Each Discharge Summary and Progress Note has been annotated by at least two expert human annotators (one clinical researcher and one resident physician). Annotated phenotypes include treatment non-adherence, chronic pain, advanced/metastatic cancer, as well as 10 other phenotypes. This dataset can be utilized for academic and industrial research in medicine and computer science, particularly within the field of medical natural language processing.
11. A Framework for the Computational Linguistic Analysis of Dehumanization [PDF] Back to contents
Julia Mendelsohn, Yulia Tsvetkov, Dan Jurafsky
Abstract: Dehumanization is a pernicious psychological process that often leads to extreme intergroup bias, hate speech, and violence aimed at targeted social groups. Despite these serious consequences and the wealth of available data, dehumanization has not yet been computationally studied on a large scale. Drawing upon social psychology research, we create a computational linguistic framework for analyzing dehumanizing language by identifying linguistic correlates of salient components of dehumanization. We then apply this framework to analyze discussions of LGBTQ people in the New York Times from 1986 to 2015. Overall, we find increasingly humanizing descriptions of LGBTQ people over time. However, we find that the label homosexual has emerged to be much more strongly associated with dehumanizing attitudes than other labels, such as gay. Our proposed techniques highlight processes of linguistic variation and change in discourses surrounding marginalized groups. Furthermore, the ability to analyze dehumanizing language at a large scale has implications for automatically detecting and understanding media bias as well as abusive language online.
12. S-APIR: News-based Business Sentiment Index [PDF] Back to contents
Kazuhiro Seki, Yusuke Ikuta
Abstract: This paper describes our work on developing a new business sentiment index using daily newspaper articles. We adopt a recurrent neural network (RNN) with Gated Recurrent Units to predict the business sentiment of a given text. An RNN is initially trained on Economy Watchers Survey and then fine-tuned on news texts for domain adaptation. Also, a one-class support vector machine is applied to filter out texts deemed irrelevant to business sentiment. Moreover, we propose a simple approach to temporally analyzing how much and when any given factor influences the predicted business sentiment. The validity and utility of the proposed approaches are empirically demonstrated through a series of experiments on Nikkei Newspaper articles published from 2013 to 2018.
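A sketch of the filtering step under simple assumptions: fit a one-class SVM on texts known to be business-related and use it to flag unrelated texts. The TF-IDF features and the toy sentences are illustrative; in the paper the surviving texts are then scored by the GRU-based sentiment model.

```python
# Sketch: filter out texts unrelated to business sentiment with a one-class SVM
# over TF-IDF features (the real system scores the surviving texts with a GRU RNN).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

business_texts = [
    "Retail sales improved as consumer spending recovered.",
    "Factory orders weakened amid slowing exports.",
    "Hiring at small firms picked up this quarter.",
]
new_texts = ["Corporate earnings beat expectations this quarter.",
             "The team won the championship final."]

vectorizer = TfidfVectorizer().fit(business_texts)
detector = OneClassSVM(kernel="linear", nu=0.1).fit(vectorizer.transform(business_texts))
print(detector.predict(vectorizer.transform(new_texts)))  # +1 = keep, -1 = filter out
```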
13. EmpTransfo: A Multi-head Transformer Architecture for Creating Empathetic Dialog Systems [PDF] Back to contents
Rohola Zandie, Mohammad H. Mahoor
Abstract: Understanding emotions and responding accordingly is one of the biggest challenges of dialog systems. This paper presents EmpTransfo, a multi-head Transformer architecture for creating an empathetic dialog system. EmpTransfo utilizes state-of-the-art pre-trained models (e.g., OpenAI-GPT) for language generation, though models with different sizes can be used. We show that utilizing the history of emotions and other metadata can improve the quality of generated conversations by the dialog system. Our experimental results using a challenging language corpus show that the proposed approach outperforms other models in terms of Hit@1 and PPL (Perplexity).
14. Automatic Compilation of Resources for Academic Writing and Evaluating with Informal Word Identification and Paraphrasing System [PDF] Back to contents
Seid Muhie Yimam, Gopalakrishnan Venkatesh, John Sie Yuen Lee, Chris Biemann
Abstract: We present the first approach to automatically building resources for academic writing. The aim is to build a writing aid system that automatically edits a text so that it better adheres to the academic style of writing. On top of existing academic resources, such as the Corpus of Contemporary American English (COCA) academic Word List, the New Academic Word List, and the Academic Collocation List, we also explore how to dynamically build such resources that would be used to automatically identify informal or non-academic words or phrases. The resources are compiled using different generic approaches that can be extended for different domains and languages. We describe the evaluation of resources with a system implementation. The system consists of an informal word identification (IWI), academic candidate paraphrase generation, and paraphrase ranking components. To generate candidates and rank them in context, we have used the PPDB and WordNet paraphrase resources. We use the Concepts in Context (CoInCO) "All-Words" lexical substitution dataset both for the informal word identification and paraphrase generation experiments. Our informal word identification component achieves an F-1 score of 82%, significantly outperforming a stratified classifier baseline. The main contribution of this work is a domain-independent methodology to build targeted resources for writing aids.
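For the WordNet ingredient of candidate paraphrase generation, a minimal NLTK sketch is shown below; the in-context ranking and the academic-style filtering described above are omitted, and the example word is an assumption.

```python
# Sketch: WordNet-based candidate paraphrases for a (presumed informal) word.
# Ranking candidates in context and filtering for academic style are omitted.
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def candidate_paraphrases(word, pos=wordnet.VERB):
    candidates = set()
    for synset in wordnet.synsets(word, pos=pos):
        for lemma in synset.lemma_names():
            if lemma.lower() != word:
                candidates.add(lemma.replace("_", " "))
    return sorted(candidates)

print(candidate_paraphrases("get"))   # e.g. "acquire", "obtain", ...
```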
15. Neural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in Danish [PDF] Back to contents
Barbara Plank
Abstract: Named Entity Recognition (NER) has greatly advanced by the introduction of deep neural architectures. However, the success of these methods depends on large amounts of training data. The scarcity of publicly-available human-labeled datasets has resulted in limited evaluation of existing NER systems, as is the case for Danish. This paper studies the effectiveness of cross-lingual transfer for Danish, evaluates its complementarity to limited gold data, and sheds light on performance of Danish NER.
16. What the [MASK]? Making Sense of Language-Specific BERT Models [PDF] Back to contents
Debora Nozza, Federico Bianchi, Dirk Hovy
Abstract: Recently, Natural Language Processing (NLP) has witnessed impressive progress in many areas, due to the advent of novel, pretrained contextual representation models. In particular, Devlin et al. (2019) proposed a model, called BERT (Bidirectional Encoder Representations from Transformers), which enables researchers to obtain state-of-the-art performance on numerous NLP tasks by fine-tuning the representations on their data set and task, without the need for developing and training highly specific architectures. The authors also released multilingual BERT (mBERT), a model trained on a corpus of 104 languages, which can serve as a universal language model. This model obtained impressive results on a zero-shot cross-lingual natural inference task. Driven by the potential of BERT models, the NLP community has started to investigate and generate an abundant number of BERT models that are trained on a particular language and tested on a specific data domain and task. This allows us to evaluate the true potential of mBERT as a universal language model, by comparing it to the performance of these more specific models. This paper presents the current state of the art in language-specific BERT models, providing an overall picture with respect to different dimensions (i.e. architectures, data domains, and tasks). Our aim is to provide an immediate and straightforward overview of the commonalities and differences between language-specific BERT models and mBERT. We also provide an interactive and constantly updated website that can be used to explore the information we have collected, at this https URL.
17. Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation [PDF] Back to contents
Mitchell A. Gordon, Kevin Duh
Abstract: We explore best practices for training small, memory efficient machine translation models with sequence-level knowledge distillation in the domain adaptation setting. While both domain adaptation and knowledge distillation are widely-used, their interaction remains little understood. Our large-scale empirical results in machine translation (on three language pairs with three domains each) suggest distilling twice for best performance: once using general-domain data and again using in-domain data with an adapted teacher.
18. Transfer Learning for Information Extraction with Limited Data [PDF] Back to contents
Minh-Tien Nguyen, Viet-Anh Phan, Le Thai Linh, Nguyen Hong Son, Le Tien Dung, Miku Hirano, Hajime Hotta
Abstract: This paper presents a practical approach to fine-grained information extraction. Drawing on the authors' extensive experience in applying information extraction to business process automation in practice, we identify two fundamental technical challenges: (i) the availability of labeled data is usually limited, and (ii) highly detailed classification is required. The main idea of our proposal is to leverage the concept of transfer learning, which is to reuse the pre-trained models of deep neural networks, combined with common statistical classifiers to determine the class of each extracted term. To do that, we first exploit BERT to deal with the limited training data in real scenarios, then stack BERT with Convolutional Neural Networks to learn hidden representations for classification. To validate our approach, we applied our model to an actual case of document processing, namely the process of competitive bidding for government projects in Japan. We used 100 documents for training and testing and confirmed that the model can extract fine-grained named entities with a detailed level of information preciseness specialized to the targeted business process, such as the department name of application receivers.
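A compact PyTorch sketch of the "stack a CNN on BERT" idea: BERT hidden states pass through a 1-D convolution and max pooling before a linear classifier. The checkpoint name, dimensions, and example sentence are illustrative assumptions; the paper's exact architecture and fine-tuning regime may differ.

```python
# Sketch: a convolutional classifier stacked on BERT hidden states.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertCNNClassifier(nn.Module):
    def __init__(self, n_classes, checkpoint="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)
        hidden = self.bert.config.hidden_size
        self.conv = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, **enc):
        states = self.bert(**enc).last_hidden_state             # (B, T, hidden)
        feats = torch.relu(self.conv(states.transpose(1, 2)))   # (B, 128, T)
        pooled = feats.max(dim=2).values                         # (B, 128)
        return self.classifier(pooled)

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertCNNClassifier(n_classes=5)   # 5 is an arbitrary example label count
enc = tok("支払期限は契約締結後30日以内とする。", return_tensors="pt")
print(model(**enc).shape)   # torch.Size([1, 5])
```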
19. End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification [PDF] Back to contents
Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu
Abstract: The most common approach to speaker diarization is clustering of speaker embeddings. However, the clustering-based approach has a number of problems; i.e., (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting their speaker embedding models to real audio recordings with speaker overlaps. To solve these problems, we propose the End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, we formulate the speaker diarization problem as a multi-label classification problem and introduce a permutation-free objective function to directly minimize diarization errors. Besides its end-to-end simplicity, the EEND method can explicitly handle speaker overlaps during training and inference. Just by feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations. We evaluated our method on simulated speech mixtures and real conversation datasets. The results showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method, while it correctly handled speaker overlaps. We explored the neural network architecture for the EEND method, and found that the self-attention-based neural network was the key to achieving excellent performance. In contrast to conditioning the network only on its previous and next hidden states, as is done using bidirectional long short-term memory (BLSTM), self-attention is directly conditioned on all the frames. By visualizing the attention weights, we show that self-attention captures global speaker characteristics in addition to local speech activity dynamics, making it especially suitable for dealing with the speaker diarization problem.
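The permutation-free objective can be made concrete with a small PyTorch function: binary cross-entropy is computed against every permutation of the reference speaker labels and the minimum is taken. The toy shapes are assumptions; the actual EEND implementation also handles padding and uses self-attention encoders.

```python
# Sketch: permutation-invariant BCE loss for multi-label diarization targets.
from itertools import permutations
import torch
import torch.nn.functional as F

def pit_bce_loss(logits, labels):
    """logits, labels: (T, S) frame-by-speaker tensors; labels are 0/1 speech activity."""
    n_speakers = labels.shape[1]
    losses = []
    for perm in permutations(range(n_speakers)):
        permuted = labels[:, list(perm)]
        losses.append(F.binary_cross_entropy_with_logits(logits, permuted))
    return torch.stack(losses).min()

logits = torch.randn(10, 2)                    # 10 frames, 2 speakers (toy values)
labels = torch.tensor([[1., 0.]] * 6 + [[1., 1.]] * 4)   # overlap in the last 4 frames
print(pit_bce_loss(logits, labels))
```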