Contents
1. Generating similes <effortlessly> like a Pro: A Style Transfer Approach for Simile Generation [PDF] Abstract
2. Principal Components of the Meaning [PDF] Abstract
3. FarsTail: A Persian Natural Language Inference Dataset [PDF] Abstract
4. Document-level Neural Machine Translation with Document Embeddings [PDF] Abstract
5. The birth of Romanian BERT [PDF] Abstract
6. RECON: Relation Extraction using Knowledge Graph Context in a Graph Neural Network [PDF] Abstract
7. Dr. Summarize: Global Summarization of Medical Dialogue by Exploiting Local Structures [PDF] Abstract
8. Hierarchical GPT with Congruent Transformers for Multi-Sentence Language Models [PDF] Abstract
9. fastHan: A BERT-based Joint Many-Task Toolkit for Chinese NLP [PDF] Abstract
10. Unsupervised Parallel Corpus Mining on Web Data [PDF] Abstract
11. NEU at WNUT-2020 Task 2: Data Augmentation To Tell BERT That Death Is Not Necessarily Informative [PDF] Abstract
12. Small but Mighty: New Benchmarks for Split and Rephrase [PDF] Abstract
13. Generation-Augmented Retrieval for Open-domain Question Answering [PDF] Abstract
14. Structured Attention for Unsupervised Dialogue Structure Induction [PDF] Abstract
15. PhenoTagger: A Hybrid Method for Phenotype Concept Recognition using Human Phenotype Ontology [PDF] Abstract
16. A Study of Genetic Algorithms for Hyperparameter Optimization of Neural Networks in Machine Translation [PDF] Abstract
17. Image Captioning with Attention for Smart Local Tourism using EfficientNet [PDF] Abstract
18. SciBERT-based Semantification of Bioassays in the Open Research Knowledge Graph [PDF] Abstract
19. Towards Full-line Code Completion with Neural Language Models [PDF] Abstract
20. MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [PDF] Abstract
Abstracts
1. Generating similes <effortlessly> like a Pro: A Style Transfer Approach for Simile Generation [PDF] Back to Contents
Tuhin Chakrabarty, Smaranda Muresan, Nanyun Peng
Abstract: Literary tropes, from poetry to stories, are at the crux of human imagination and communication. Figurative language such as similes goes beyond plain expressions to give readers new insights and inspirations. In this paper, we tackle the problem of simile generation. Generating a simile requires a proper understanding of how to effectively map properties between two concepts. To this end, we first propose a method to automatically construct a parallel corpus by transforming a large number of similes collected from Reddit into their literal counterparts using structured common-sense knowledge. We then propose to fine-tune a pretrained sequence-to-sequence model, BART (Lewis et al., 2019), on the literal-simile pairs to gain generalizability, so that we can generate novel similes given a literal sentence. Experiments show that our approach generates 88% novel similes that do not share properties with the training data. Human evaluation on an independent set of literal statements shows that, when compared pairwise, our model generates similes better than two literary experts 37% of the time (averaging 32.6% and 41.3% for the two humans), and better than three baseline systems, including a recent metaphor generation model, 71% of the time (averaging 82%, 63%, and 68% for the three baselines). The simile in the title was generated by our best model (input: "Generating similes effortlessly"; output: "Generating similes like a Pro"). We also show how replacing literal sentences with similes from our best model in machine-generated stories improves evocativeness and leads to better acceptance by human judges.
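For intuition, here is a minimal sketch of the fine-tuning recipe the abstract describes, using Hugging Face's BART; the two toy literal-simile pairs and the facebook/bart-base checkpoint are illustrative stand-ins, not the authors' corpus or released model.

```python
# Minimal sketch: fine-tune BART on (literal, simile) pairs, then generate.
# The two training pairs below are illustrative, not from the paper's corpus.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

pairs = [
    ("The snow covered the field completely.",
     "The snow covered the field like a white blanket."),
    ("He ran very fast.",
     "He ran like the wind."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for literal, simile in pairs:
    batch = tokenizer(literal, return_tensors="pt")
    labels = tokenizer(simile, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss  # cross-entropy on simile tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
inputs = tokenizer("Generating similes effortlessly.", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_length=32)[0],
                       skip_special_tokens=True))
```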
2. Principal Components of the Meaning [PDF] Back to Contents
Neslihan Suzen, Alexander Gorban, Jeremy Levesley, Evgeny Mirkes
Abstract: In this paper we argue that (lexical) meaning in science can be represented in a 13-dimensional Meaning Space. This space is constructed using principal component analysis (singular value decomposition) on the matrix of word-category relative information gains, where the categories are those used by the Web of Science, and the words are taken from a reduced word set drawn from texts in the Web of Science. We show that this reduced word set plausibly represents all texts in the corpus, so that the principal component analysis has some objective meaning with respect to the corpus. We argue that 13 dimensions are adequate to describe the meaning of scientific texts, and hypothesise about the qualitative meaning of the principal components.
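The construction is easy to picture numerically: take a words-by-categories matrix of relative information gains, center it, and keep the leading singular directions. In the sketch below the random matrix stands in for the real Web of Science statistics.

```python
# Sketch: build a low-dimensional "Meaning Space" from a word-by-category
# relative-information-gain matrix via truncated SVD (PCA on the matrix).
# The random matrix is a stand-in for the real Web of Science statistics.
import numpy as np

rng = np.random.default_rng(0)
gains = rng.random((5000, 100))       # 5000 words x 100 subject categories

gains -= gains.mean(axis=0)           # center, as in PCA
U, S, Vt = np.linalg.svd(gains, full_matrices=False)

k = 13                                # dimensionality argued for in the paper
word_coords = U[:, :k] * S[:k]        # word coordinates in the Meaning Space
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(f"variance captured by {k} components: {explained:.2%}")
```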
3. FarsTail: A Persian Natural Language Inference Dataset [PDF] Back to Contents
Hossein Amirkhani, Mohammad Azari Jafari, Azadeh Amirak, Zohreh Pourjafari, Soroush Faridan Jahromi, Zeinab Kouhkan
Abstract: Natural language inference (NLI) is known as one of the central tasks in natural language processing (NLP), encapsulating many fundamental aspects of language understanding. With the considerable achievements of data-hungry deep learning methods in NLP tasks, a great amount of effort has been devoted to developing more diverse datasets for different languages. In this paper, we present a new dataset for the NLI task in the Persian language, also known as Farsi, which is one of the dominant languages in the Middle East. This dataset, named FarsTail, includes 10,367 samples provided both in the Persian language and in an indexed format useful for non-Persian researchers. The samples are generated from 3,539 multiple-choice questions with the least amount of annotator intervention, in a way similar to the SciTail dataset. A carefully designed multi-step process is adopted to ensure the quality of the dataset. We also present the results of traditional and state-of-the-art methods on FarsTail, including different embedding methods such as word2vec, fastText, ELMo, BERT, and LASER, as well as different modeling approaches such as DecompAtt, ESIM, HBMP, ULMFiT, and a cross-lingual transfer approach, to provide a solid baseline for future research. The best test accuracy obtained is 78.13%, which shows that there is considerable room for improving current methods before they are useful for real-world NLP applications in different languages. The dataset is available at this https URL.
4. Document-level Neural Machine Translation with Document Embeddings [PDF] Back to Contents
Shu Jiang, Hai Zhao, Zuchao Li, Bao-Liang Lu
Abstract: Standard neural machine translation (NMT) operates on the assumption that sentence translation is independent of document-level context. Most existing document-level NMT methods settle for a smattering of brief document-level information, while this work focuses on exploiting detailed document-level context in terms of multiple forms of document embeddings, which is capable of sufficiently modeling deeper and richer document-level context. The proposed document-aware NMT is implemented to enhance the Transformer baseline by introducing both global and local document-level clues on the source end. Experiments show that the proposed method significantly improves translation performance over strong baselines and other related studies.
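One simple way to realize a global document-level clue is to pool an embedding over the whole document and add it to every source-token embedding before encoding. The pooling and injection choices below are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: inject a global document embedding into the source side of an
# encoder. Mean-pooling over all sentences in the document is an illustrative
# choice, not necessarily the paper's exact formulation.
import torch
import torch.nn as nn

d_model, vocab = 512, 32000
tok_emb = nn.Embedding(vocab, d_model)
doc_proj = nn.Linear(d_model, d_model)

doc = torch.randint(0, vocab, (8, 20))       # 8 sentences x 20 tokens
src = doc[0].unsqueeze(0)                    # sentence being translated

doc_vec = tok_emb(doc).mean(dim=(0, 1))      # global document-level clue
x = tok_emb(src) + doc_proj(doc_vec)         # broadcast onto each source token
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), 2)
memory = encoder(x)                          # document-aware source states
print(memory.shape)                          # torch.Size([1, 20, 512])
```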
5. The birth of Romanian BERT [PDF] Back to Contents
Stefan Daniel Dumitrescu, Andrei-Marius Avram, Sampo Pyysalo
Abstract: Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets. We open source not only the model itself, but also a repository that contains information on how to obtain the corpus, fine-tune and use this model in production (with practical examples), and how to fully replicate the evaluation process.
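Using the released checkpoint follows the standard transformers pattern. The hub identifier below is the one commonly associated with this paper, but treat it as an assumption and verify it against the authors' repository.

```python
# Sketch: load Romanian BERT and embed a sentence. The hub id below is the
# commonly cited one for this paper; verify it against the authors' repo.
import torch
from transformers import AutoModel, AutoTokenizer

name = "dumitrescustefan/bert-base-romanian-cased-v1"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("Acesta este un test.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
print(hidden.shape)
```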
6. RECON: Relation Extraction using Knowledge Graph Context in a Graph Neural Network [PDF] Back to Contents
Anson Bastos, Abhishek Nadgeri, Kuldeep Singh, Isaiah Onando Mulang', Saeedeh Shekarpour, Johannes Hoffart
Abstract: In this paper, we present a novel method named RECON that automatically identifies relations in a sentence (sentential relation extraction) and aligns them to a knowledge graph (KG). RECON uses a graph neural network to learn representations of both the sentence and the facts stored in a KG, improving the overall extraction quality. These facts, including entity attributes (label, alias, description, instance-of) and factual triples, have not been collectively used in state-of-the-art methods. We evaluate the effect of various forms of representing the KG context on the performance of RECON. The empirical evaluation on two standard relation extraction datasets shows that RECON significantly outperforms all state-of-the-art methods on the NYT Freebase and Wikidata datasets. RECON reports an 87.23 F1 score (vs. an 82.29 baseline) on the Wikidata dataset, whereas on NYT Freebase the reported values are 87.5 (P@10) and 74.1 (P@30), compared to the previous baseline scores of 81.3 (P@10) and 63.1 (P@30).
7. Dr. Summarize: Global Summarization of Medical Dialogue by Exploiting Local Structures [PDF] Back to Contents
Anirudh Joshi, Namit Katariya, Xavier Amatriain, Anitha Kannan
Abstract: Understanding a medical conversation between a patient and a physician poses a unique natural language understanding challenge, since it combines elements of standard open-ended conversation with very domain-specific elements that require expertise and medical knowledge. Summarization of medical conversations is a particularly important aspect of medical conversation understanding, since it addresses a very real need in medical practice: capturing the most important aspects of a medical encounter so that they can be used for medical decision making and subsequent follow-ups. In this paper we present a novel approach to medical conversation summarization that leverages the unique and independent local structures created when gathering a patient's medical history. Our approach is a variation of the pointer generator network where we introduce a penalty on the generator distribution, and we explicitly model negations. The model also captures important properties of medical conversations, such as medical knowledge coming from standardized medical ontologies, better than when those concepts are introduced explicitly. Through evaluation by doctors, we show that our approach is preferred over the baseline pointer generator model on twice as many summaries, and captures most or all of the information in 80% of the conversations, making it a realistic alternative to costly manual summarization by medical experts.
8. Hierarchical GPT with Congruent Transformers for Multi-Sentence Language Models [PDF] Back to Contents
Jihyeon Roh, Huiseong Gim, Soo-Young Lee
Abstract: We report a GPT-based multi-sentence language model for dialogue generation and document understanding. First, we propose a hierarchical GPT which consists of three blocks, i.e., a sentence encoding block, a sentence generating block, and a sentence decoding block. The sentence encoding and decoding blocks are basically the encoder-decoder blocks of the standard Transformer, which work on each sentence independently. The sentence generating block is inserted between the encoding and decoding blocks, and generates the next sentence embedding vector from the previous sentence embedding vectors. We believe this mirrors the way humans make conversation and understand paragraphs and documents. Since each sentence may consist of fewer words, the sentence encoding and decoding Transformers can use much smaller dimensional embedding vectors. Secondly, we note that the attention in Transformers utilizes the inner-product similarity measure. Therefore, to compare two vectors in the same space, we set the transform matrices for queries and keys to be the same. Otherwise, the similarity concept is incongruent. We report experimental results to show that these two modifications increase language model performance for tasks with multiple sentences.
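The "congruent" change is concrete enough to sketch: queries and keys share a single projection matrix, so the inner-product similarity compares vectors living in the same space. A minimal single-head version, with assumed dimensions:

```python
# Sketch: single-head attention where queries and keys share one projection
# (W_q == W_k), so inner products compare vectors in the same space.
import math
import torch
import torch.nn as nn

class CongruentAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.qk = nn.Linear(d_model, d_model, bias=False)  # shared for Q and K
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.scale = math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.qk(x)                     # same transform ...
        k = self.qk(x)                     # ... for keys, hence "congruent"
        scores = q @ k.transpose(-2, -1) / self.scale
        return scores.softmax(dim=-1) @ self.v(x)

x = torch.randn(1, 10, 64)                 # (batch, sentence length, d_model)
print(CongruentAttention(64)(x).shape)     # torch.Size([1, 10, 64])
```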
9. fastHan: A BERT-based Joint Many-Task Toolkit for Chinese NLP [PDF] Back to Contents
Zhichao Geng, Hang Yan, Xipeng Qiu, Xuanjing Huang
Abstract: We present fastHan, an open-source toolkit for four basic tasks in Chinese natural language processing: Chinese word segmentation, Part-of-Speech tagging, named entity recognition, and dependency parsing. The kernel of fastHan is a joint many-task model based on a pruned BERT, which uses the first 8 layers of BERT. We also provide a 4-layer base version of the model compressed from the 8-layer model. The joint model is trained and evaluated on 13 corpora across the four tasks, yielding near state-of-the-art (SOTA) performance in the dependency parsing task and SOTA performance in the other three tasks. In addition to its small size and excellent performance, fastHan is also very user-friendly. Implemented as a Python package, fastHan allows users to easily download and use it. Users can get what they want with one line of code, even if they have little knowledge of deep learning. The project is released on GitHub.
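The one-line usage the abstract advertises looks roughly like this; the constructor and target arguments follow the project's README and may differ across fastHan versions.

```python
# Sketch of the one-line usage the abstract describes; argument names follow
# the project's README and may differ across fastHan versions.
from fastHan import FastHan

model = FastHan(model_type="base")          # or "large" for the 8-layer model
sentence = "郭靖是金庸笔下的一名男主角。"
print(model(sentence, target="CWS"))        # Chinese word segmentation
print(model(sentence, target="POS"))        # Part-of-Speech tagging
print(model(sentence, target="NER"))        # named entity recognition
print(model(sentence, target="Parsing"))    # dependency parsing
```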
10. Unsupervised Parallel Corpus Mining on Web Data [PDF] Back to Contents
Guokun Lai, Zihang Dai, Yiming Yang
Abstract: With a large amount of parallel data, neural machine translation systems are able to deliver human-level performance for sentence-level translation. However, it is costly to have humans label a large amount of parallel data. In contrast, there is a large-scale parallel corpus created by humans on the Internet. The major difficulty in utilizing it is how to filter it out of noisy website environments. Current parallel data mining methods all require labeled parallel data as the training source. In this paper, we present a pipeline to mine the parallel corpus from the Internet in an unsupervised manner. On the widely used WMT'14 English-French and WMT'16 English-German benchmarks, the machine translator trained with the data extracted by our pipeline achieves performance very close to the supervised results. On the WMT'16 English-Romanian and Romanian-English benchmarks, our system produces new state-of-the-art results, 39.81 and 38.95 BLEU scores, even when compared with supervised approaches.
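The filtering step at the heart of such pipelines can be sketched generically: score candidate sentence pairs in a shared multilingual embedding space and keep the confident ones. This illustrates the idea only and is not the authors' exact pipeline.

```python
# Generic sketch of the mining/filtering step: embed both sides of each
# candidate pair in a shared multilingual space and keep high-similarity
# pairs. The random embedder is a toy stand-in for a multilingual encoder.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mine(candidates, embed, threshold=0.8):
    """Return (src, tgt) pairs whose embeddings are close enough."""
    return [(s, t) for s, t in candidates
            if cosine(embed(s), embed(t)) >= threshold]

rng = np.random.default_rng(0)
_cache: dict = {}
def embed(sentence: str) -> np.ndarray:       # toy stand-in embedder
    return _cache.setdefault(sentence, rng.standard_normal(16))

candidates = [("the cat sleeps", "le chat dort"),
              ("markets rallied today", "le chat dort")]
print(mine(candidates, embed, threshold=0.3))
```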
11. NEU at WNUT-2020 Task 2: Data Augmentation To Tell BERT That Death Is Not Necessarily Informative [PDF] Back to Contents
Kumud Chauhan
Abstract: Millions of people around the world are sharing COVID-19 related information on social media platforms. Since not all the information shared on social media is useful, a machine learning system to identify informative posts can help users find relevant information. In this paper, we present a BERT classifier system for W-NUT2020 Shared Task 2: Identification of Informative COVID-19 English Tweets. Further, we show that BERT exploits some easy signals to identify informative tweets, and adding simple patterns to uninformative tweets drastically degrades BERT performance. In particular, simply adding "10 deaths" to tweets in the dev set reduces the BERT F1 score from 92.63 to 7.28. We also propose a simple data augmentation technique that helps in improving the robustness and generalization ability of the BERT classifier.
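The pattern-injection idea is simple to reproduce: append death-count phrases to uninformative tweets so the classifier can no longer use them as a shortcut. The phrase templates below are illustrative, not the exact ones from the paper.

```python
# Sketch of the pattern-injection idea: append death-count phrases to
# UNINFORMATIVE tweets so "deaths" stops being an easy shortcut for BERT.
# The phrase templates are illustrative, not the paper's exact ones.
import random

PATTERNS = ["10 deaths", "{} deaths reported", "death toll rises to {}"]

def augment(tweet: str, label: str) -> str:
    if label != "UNINFORMATIVE":
        return tweet
    pattern = random.choice(PATTERNS).format(random.randint(1, 500))
    return f"{tweet} {pattern}"

random.seed(0)
print(augment("Stay home and watch movies today.", "UNINFORMATIVE"))
```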
12. Small but Mighty: New Benchmarks for Split and Rephrase [PDF] Back to Contents
Li Zhang, Huaiyu Zhu, Siddhartha Brahma, Yunyao Li
Abstract: Split and Rephrase is a text simplification task of rewriting a complex sentence into simpler ones. As a relatively new task, it is paramount to ensure the soundness of its evaluation benchmark and metric. We find that the widely used benchmark dataset universally contains easily exploitable syntactic cues caused by its automatic generation process. Taking advantage of such cues, we show that even a simple rule-based model can perform on par with the state-of-the-art model. To remedy such limitations, we collect and release two crowdsourced benchmark datasets. We not only make sure that they contain significantly more diverse syntax, but also carefully control for their quality according to a well-defined set of criteria. While no satisfactory automatic metric exists, we apply fine-grained manual evaluation based on these criteria using crowdsourcing, showing that our datasets better represent the task and are significantly more challenging for the models.
13. Generation-Augmented Retrieval for Open-domain Question Answering [PDF] Back to Contents
Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, Weizhu Chen
Abstract: Conventional sparse retrieval methods such as TF-IDF and BM25 are simple and efficient, but rely solely on lexical overlap and fail to conduct semantic matching. Recent dense retrieval methods learn latent representations to tackle the lexical mismatch problem, while being more computationally expensive and sometimes insufficient for exact matching, as they embed the entire text sequence into a single vector with limited capacity. In this paper, we present Generation-Augmented Retrieval (GAR), a query expansion method that augments a query with relevant contexts through text generation. We demonstrate on open-domain question answering (QA) that the generated contexts significantly enrich the semantics of the queries, and thus GAR with sparse representations (BM25) achieves comparable or better performance than the current state-of-the-art dense method DPR (Karpukhin et al., 2020). We show that generating various contexts for a query is beneficial, as fusing their results consistently yields better retrieval accuracy. Moreover, GAR achieves state-of-the-art extractive QA performance on the Natural Questions and TriviaQA datasets when equipped with an extractive reader.
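The retrieval side of GAR is straightforward to sketch: generate a context for the query with a seq2seq model, append it to the query, and retrieve with BM25. The untuned t5-small generator and two-document corpus below are purely illustrative; the paper trains dedicated generators for this purpose.

```python
# Sketch of Generation-Augmented Retrieval: expand the query with generated
# context, then retrieve with BM25. The model choice (untuned T5) and the
# tiny corpus are illustrative, not the paper's trained generators.
from rank_bm25 import BM25Okapi
from transformers import T5ForConditionalGeneration, T5Tokenizer

corpus = ["the eiffel tower is in paris france",
          "the louvre museum houses the mona lisa"]
bm25 = BM25Okapi([doc.split() for doc in corpus])

tok = T5Tokenizer.from_pretrained("t5-small")
gen = T5ForConditionalGeneration.from_pretrained("t5-small")

query = "where is the eiffel tower"
ids = gen.generate(**tok(query, return_tensors="pt"), max_length=16)
context = tok.decode(ids[0], skip_special_tokens=True)

augmented = (query + " " + context).split()   # GAR: query + generated context
print(bm25.get_top_n(augmented, corpus, n=1))
```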
14. Structured Attention for Unsupervised Dialogue Structure Induction [PDF] Back to Contents
Liang Qiu, Yizhou Zhao, Weiyan Shi, Yuan Liang, Feng Shi, Tao Yuan, Zhou Yu, Song-Chun Zhu
Abstract: Inducing a meaningful structural representation from one or a set of dialogues is a crucial but challenging task in computational linguistics. Advancement made in this area is critical for dialogue system design and discourse analysis. It can also be extended to solve grammatical inference. In this work, we propose to incorporate structured attention layers into a Variational Recurrent Neural Network (VRNN) model with discrete latent states to learn dialogue structure in an unsupervised fashion. Compared to a vanilla VRNN, structured attention enables a model to focus on different parts of the source sentence embeddings while enforcing a structural inductive bias. Experiments show that on two-party dialogue datasets, VRNN with structured attention learns semantic structures that are similar to templates used to generate this dialogue corpus. While on multi-party dialogue datasets, our model learns an interactive structure demonstrating its capability of distinguishing speakers or addressees, automatically disentangling dialogues without explicit human annotation.
15. PhenoTagger: A Hybrid Method for Phenotype Concept Recognition using Human Phenotype Ontology [PDF] Back to Contents
Ling Luo, Shankai Yan, Po-Ting Lai, Daniel Veltri, Andrew Oler, Sandhya Xirasagar, Rajarshi Ghosh, Morgan Similuk, Peter N. Robinson, Zhiyong Lu
Abstract: Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify biomedical concepts, which can recognize more unseen concept synonyms by automatic feature learning. However, most methods require large corpora of manually annotated data for model training, which is difficult to obtain due to the high cost of human annotation. In this paper, we propose PhenoTagger, a hybrid method that combines both dictionary and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary. Then, the dictionary and biomedical literature are used to automatically build a weakly-supervised training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase into a corresponding concept label. Finally, the dictionary and machine learning-based prediction results are combined for improved performance. Our method is validated with two HPO corpora, and the results show that PhenoTagger compares favorably to state-of-the-art methods. In addition, to demonstrate the generalizability of our method, we retrained PhenoTagger using the disease ontology MEDIC for disease concept recognition to investigate the effect of training on different ontologies. Experimental results on the NCBI disease corpus show that PhenoTagger without requiring manually annotated training data achieves competitive performance as compared with state-of-the-art supervised methods.
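The hybrid decision can be pictured as a dictionary-first merge with a model fallback; the merge rule and confidence threshold below are illustrative assumptions rather than PhenoTagger's exact logic.

```python
# Sketch of the hybrid decision: take exact dictionary hits as-is and fall
# back to the neural classifier above a confidence threshold. The merge rule
# and threshold are illustrative assumptions, not PhenoTagger's exact logic.
def tag(phrase, dictionary, classify, threshold=0.9):
    """Return an HPO concept id for `phrase`, or None."""
    if phrase.lower() in dictionary:            # high-precision dictionary path
        return dictionary[phrase.lower()]
    concept, score = classify(phrase)           # recall from the neural model
    return concept if score >= threshold else None

hpo_dict = {"short stature": "HP:0004322"}
fake_model = lambda p: ("HP:0001250", 0.95)     # stand-in for the classifier
print(tag("short stature", hpo_dict, fake_model))       # dictionary hit
print(tag("recurrent seizures", hpo_dict, fake_model))  # model fallback
```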
16. A Study of Genetic Algorithms for Hyperparameter Optimization of Neural Networks in Machine Translation [PDF] Back to Contents
Keshav Ganapathy
Abstract: With neural networks having demonstrated their versatility and benefits, the need for their optimal performance is as prevalent as ever. A defining characteristic, hyperparameters, can greatly affect its performance. Thus engineers go through a process, tuning, to identify and implement optimal hyperparameters. That being said, excess amounts of manual effort are required for tuning network architectures, training configurations, and preprocessing settings such as Byte Pair Encoding (BPE). In this study, we propose an automatic tuning method modeled after Darwin's Survival of the Fittest Theory via a Genetic Algorithm (GA). Research results show that the proposed method, a GA, outperforms a random selection of hyperparameters.
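A bare-bones version of such a GA loop is shown below; the fitness function is a cheap stand-in for what would, in the paper's setting, be validation BLEU after training an NMT model with the candidate hyperparameters.

```python
# Bare-bones genetic algorithm over hyperparameters. The fitness function is
# a stand-in; in the paper's setting it would be validation BLEU after
# training an NMT model with the candidate configuration.
import random

SPACE = {"lr": [1e-4, 3e-4, 1e-3],
         "layers": [2, 4, 6],
         "bpe": [8000, 16000, 32000]}

def fitness(ind):            # stand-in for "train NMT model, measure BLEU"
    return (-abs(ind["lr"] - 3e-4) * 1e4 + ind["layers"]
            - abs(ind["bpe"] - 16000) / 8000)

def random_ind():
    return {k: random.choice(v) for k, v in SPACE.items()}

def crossover(a, b):         # child takes each gene from either parent
    return {k: random.choice([a[k], b[k]]) for k in SPACE}

def mutate(ind, rate=0.2):   # occasionally resample a gene
    return {k: (random.choice(v) if random.random() < rate else ind[k])
            for k, v in SPACE.items()}

random.seed(0)
pop = [random_ind() for _ in range(10)]
for generation in range(5):
    pop.sort(key=fitness, reverse=True)        # survival of the fittest
    parents = pop[:4]
    children = [mutate(crossover(*random.sample(parents, 2)))
                for _ in range(6)]
    pop = parents + children
print(max(pop, key=fitness))
```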
17. Image Captioning with Attention for Smart Local Tourism using EfficientNet [PDF] Back to Contents
Dhomas Hatta Fudholi, Yurio Windiatmoko, Nurdi Afrianto, Prastyo Eko Susanto, Magfirah Suyuti, Ahmad Fathan Hidayatullah, Ridho Rahmadi
Abstract: Smart systems have been massively developed to help humans in various tasks. Deep Learning technologies push even further in creating accurate assistant systems, thanks to the explosion of data lakes. One task of such smart systems is to disseminate the information users need. This is crucial in the tourism sector to promote local tourism destinations. In this research, we design a model for local-tourism-specific image captioning, which will later support the development of AI-powered systems that assist various users. The model is developed using a visual Attention mechanism and uses the state-of-the-art feature extractor architecture EfficientNet. A local tourism dataset is collected and used in the research, along with two different kinds of captions: captions that describe the image literally, and captions that represent human logical responses when seeing the image. This is done to make the captioning model more humane when implemented in the assistance system. We compared the performance of two models using EfficientNet architectures (B0 and B4) with the well-known VGG16 and InceptionV3. The best BLEU scores we obtain are 73.39 and 24.51 for the training set and the validation set respectively, using EfficientNetB0. The captioning results using the developed model show that it can produce logical captions for local tourism-related images.
18. SciBERT-based Semantification of Bioassays in the Open Research Knowledge Graph [PDF] Back to Contents
Marco Anteghini, Jennifer D'Souza, Vitor A. P. Martins dos Santos, Sören Auer
Abstract: As a novel contribution to the problem of semantifying biological assays, in this paper, we propose a neural-network-based approach to automatically semantify, thereby structure, unstructured bioassay text descriptions. Experimental evaluations, to this end, show promise as the neural-based semantification significantly outperforms a naive frequency-based baseline approach. Specifically, the neural method attains 72% F1 versus 47% F1 from the frequency-based method.
19. Towards Full-line Code Completion with Neural Language Models [PDF] Back to Contents
Wenhan Wang, Sijie Shen, Ge Li, Zhi Jin
Abstract: A code completion system suggests future code elements to developers given a partially-complete code snippet. Code completion is one of the most useful features in Integrated Development Environments (IDEs). Currently, most code completion techniques predict a single token at a time. In this paper, we take a further step and discuss the probability of directly completing a whole line of code instead of a single token. We believe suggesting longer code sequences can further improve the efficiency of developers. Recently neural language models have been adopted as a preferred approach for code completion, and we believe these models can still be applied to full-line code completion with a few improvements. We conduct our experiments on two real-world python corpora and evaluate existing neural models based on source code tokens or syntactical actions. The results show that neural language models can achieve acceptable results on our tasks, with significant room for improvements.
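The decoding loop for full-line completion is easy to sketch: extend a code prefix token by token until a newline appears. The generic gpt2 checkpoint here is a stand-in; the paper evaluates models trained on Python corpora.

```python
# Sketch of full-line completion: greedily extend a code prefix token by
# token until a newline. Generic GPT-2 is a stand-in for the Python-trained
# models evaluated in the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

prefix = "def add(a, b):\n    return"
ids = tok(prefix, return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                          # cap the line length
        next_id = lm(ids).logits[0, -1].argmax()
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        if "\n" in tok.decode([next_id.item()]): # stop at end of line
            break
print(tok.decode(ids[0]))
```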
20. MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [PDF] Back to Contents
Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang
Abstract: While progress has been made on the visual question answering leaderboards, models often utilize spurious correlations and priors in datasets under the i.i.d. setting. As such, evaluation on out-of-distribution (OOD) test samples has emerged as a proxy for generalization. In this paper, we present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input, to improve OOD generalization, such as the VQA-CP challenge. Under this paradigm, models utilize a consistency-constrained training objective to understand the effect of semantic changes in input (question-image pair) on the output (answer). Unlike existing methods on VQA-CP, MUTANT does not rely on knowledge about the nature of train and test answer distributions. MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement. Our work opens up avenues for the use of semantic input mutations for OOD generalization in question answering.