Table of Contents
1. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages [PDF]
2. Key Phrase Classification in Complex Assignments [PDF]
3. CompLex --- A New Corpus for Lexical Complexity Prediction from Likert Scale Data [PDF]
4. TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding [PDF]
5. Synonymous Generalization in Sequence-to-Sequence Recurrent Networks [PDF]
6. Word Sense Disambiguation for 158 Languages using Word Embeddings Only [PDF]
7. Text Similarity Using Word Embeddings to Classify Misinformation [PDF]
8. LSCP: Enhanced Large Scale Colloquial Persian Language Understanding [PDF]
9. A Survey on Contextual Embeddings [PDF]
10. A Machine Learning Application for Raising WASH Awareness in the Times of Covid-19 Pandemic [PDF]
11. Exploring Gaussian mixture model framework for speaker adaptation of deep neural network acoustic models [PDF]
12. Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0 [PDF]
13. Counterfactual Samples Synthesizing for Robust Visual Question Answering [PDF]
14. Semantically-Enriched Search Engine for Geoportals: A Case Study with ArcGIS Online [PDF]
15. DAN: Dual-View Representation Learning for Adapting Stance Classifiers to New Domains [PDF]

Abstracts
1. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages [PDF]
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, Christopher D. Manning
Abstract: We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionalities to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at this https URL.
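As an illustration of the pipeline described above, here is a minimal usage sketch following Stanza's documented pipeline interface; processor names and model availability may differ across versions:

```python
import stanza

# Download the English models once, then build a neural pipeline
# covering the stages listed in the abstract.
stanza.download('en')
nlp = stanza.Pipeline(lang='en',
                      processors='tokenize,mwt,pos,lemma,depparse,ner')

doc = nlp("Stanza was trained on 112 datasets covering 66 languages.")

# Tokenization, lemmatization, POS tagging, and dependency parsing.
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos, word.deprel)

# Named entity recognition.
for ent in doc.ents:
    print(ent.text, ent.type)
```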
2. Key Phrase Classification in Complex Assignments [PDF]
Manikandan Ravikiran
Abstract: Complex assignments typically consist of open-ended questions with large and diverse content, in the context of both classroom and online graduate programs. With the sheer scale of these programs comes a variety of problems in peer and expert feedback, including rogue reviews. As such, with the hope of identifying the important content needed for review, in this work we present a first study of key phrase classification, with a detailed empirical comparison of traditional and recent language modeling approaches. From this study, we find that the task of classifying key phrases is ambiguous even at a human level, producing a Cohen's kappa of 0.77 on a new data set. Both pretrained language models and simple TFIDF SVM classifiers produce similar results, with the former scoring on average 0.6 F1 points higher than the latter. We finally derive practical advice from our extensive empirical and model-interpretability results for those interested in key phrase classification from educational reports in the future.
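The TFIDF SVM baseline mentioned above can be sketched in a few lines of scikit-learn; this is a generic illustration with invented toy data, not the authors' exact configuration:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Toy stand-ins for the (non-public) assignment data.
phrases = ["gradient descent convergence proof", "see attached file",
           "ablation over learning rates", "thanks for the feedback"]
labels = [1, 0, 1, 0]  # 1 = key phrase, 0 = not

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(phrases, labels)

preds = clf.predict(phrases)
print(f1_score(labels, preds))
```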
3. CompLex --- A New Corpus for Lexical Complexity Prediction from Likert Scale Data [PDF]
Matthew Shardlow, Michael Cooper, Marcos Zampieri
Abstract: Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification. This task is commonly referred to as Complex Word Identification (CWI). With a few exceptions, previous studies have approached the task as a binary classification task in which systems predict a complexity value (complex vs. non-complex) for a set of target words in a text. This choice is motivated by the fact that all CWI datasets compiled so far have been annotated using a binary annotation scheme. Our paper addresses this limitation by presenting the first English dataset for continuous lexical complexity prediction. We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts. This resulted in a corpus of 9,476 sentences each annotated by around 7 annotators.
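One natural way to turn the 5-point Likert ratings into a continuous complexity value is to average the annotators' ratings and rescale to [0, 1]; the exact mapping below is an assumption, since the abstract does not specify it:

```python
def continuous_complexity(ratings):
    """Map 5-point Likert ratings (1 = very easy ... 5 = very hard)
    from several annotators to a single score in [0, 1]."""
    mean = sum(ratings) / len(ratings)
    return (mean - 1) / 4

# Around 7 annotators per sentence, as in the corpus description.
print(continuous_complexity([2, 3, 3, 2, 4, 3, 3]))  # ~0.46
```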
4. TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding [PDF]
Zhiheng Huang, Peng Xu, Davis Liang, Ajay Mishra, Bing Xiang
Abstract: Bidirectional Encoder Representations from Transformers (BERT) has recently achieved state-of-the-art performance on a broad range of NLP tasks including sentence classification, machine translation, and question answering. The BERT model architecture is derived primarily from the transformer. Prior to the transformer era, bidirectional Long Short-Term Memory (BLSTM) has been the dominant modeling architecture for neural machine translation and question answering. In this paper, we investigate how these two modeling techniques can be combined to create a more powerful model architecture. We propose a new architecture denoted as Transformer with BLSTM (TRANS-BLSTM) which has a BLSTM layer integrated to each transformer block, leading to a joint modeling framework for transformer and BLSTM. We show that TRANS-BLSTM models consistently lead to improvements in accuracy compared to BERT baselines in GLUE and SQuAD 1.1 experiments. Our TRANS-BLSTM model obtains an F1 score of 94.01% on the SQuAD 1.1 development dataset, which is comparable to the state-of-the-art result.
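Below is a minimal PyTorch sketch of one plausible reading of "a BLSTM layer integrated to each transformer block"; the dimensions, normalization, and residual placement are assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TransBLSTMBlock(nn.Module):
    """Transformer self-attention block with an added bidirectional
    LSTM branch, as one way to realize the joint modeling idea."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.blstm = nn.LSTM(d_model, d_model // 2, bidirectional=True,
                             batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        blstm_out, _ = self.blstm(x)       # (batch, seq, d_model)
        return self.norm2(x + blstm_out)

x = torch.randn(2, 16, 768)                # (batch, seq, hidden)
print(TransBLSTMBlock()(x).shape)          # torch.Size([2, 16, 768])
```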
5. Synonymous Generalization in Sequence-to-Sequence Recurrent Networks [PDF]
Ning Shi
Abstract: When learning a language, people can quickly expand their understanding of unknown content by using compositional skills, such as combining the two words "go" and "fast" into the new phrase "go fast." In recent work by Lake and Baroni (2017), modern Sequence-to-Sequence (seq2seq) Recurrent Neural Networks (RNNs) were shown to make powerful zero-shot generalizations in specifically controlled experiments. However, an analysis of this strong generalization property and its precise requirements is still missing. This paper explores this positive result in detail and defines the pattern as synonymous generalization: the ability to recognize an unknown sequence by decomposing the difference between it and a known sequence into corresponding existing synonyms. To better investigate it, I introduce a new environment called Colorful Extended Cleanup World (CECW), which consists of complex commands paired with logical expressions. While demonstrating that sequential RNNs can perform synonymous generalization on foreign commands, I identify their prerequisites for success. I also propose a data augmentation method, successfully verified on the Geoquery (GEO) dataset, as a novel application of synonymous generalization to real cases.
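The data augmentation idea (rewriting known sequences through existing synonyms) can be illustrated with a small sketch; the synonym table and command here are invented for illustration:

```python
from itertools import product

SYNONYMS = {"go": ["walk", "move"], "fast": ["quickly", "rapidly"]}

def augment(command):
    """Generate synonymous variants of a command by swapping each
    word for its known synonyms (keeping the original word too)."""
    options = [[w] + SYNONYMS.get(w, []) for w in command.split()]
    return [" ".join(words) for words in product(*options)]

print(augment("go fast"))
# ['go fast', 'go quickly', ..., 'move rapidly']  (9 variants)
```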
6. Word Sense Disambiguation for 158 Languages using Word Embeddings Only [PDF]
Varvara Logacheva, Denis Teslenko, Artem Shelmanov, Steffen Remus, Dmitry Ustalov, Andrey Kutuzov, Ekaterina Artemova, Chris Biemann, Simone Paolo Ponzetto, Alexander Panchenko
Abstract: Disambiguation of word senses in context is easy for humans, but is a major challenge for automatic approaches. Sophisticated supervised and knowledge-based models were developed to solve this task. However, (i) the inherent Zipfian distribution of supervised training instances for a given word and/or (ii) the quality of linguistic knowledge representations motivate the development of completely unsupervised and knowledge-free approaches to word sense disambiguation (WSD). They are particularly useful for under-resourced languages which do not have any resources for building either supervised and/or knowledge-based models. In this paper, we present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory, which can be used for disambiguation in context. We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings by Grave et al. (2018), enabling WSD in these languages. Models and system are available online.
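In outline, a sense inventory can be induced by clustering the ego-network of a word's nearest neighbours in the embedding space, and a context is then matched against the induced sense vectors. Here is a simplified sketch using gensim and scikit-learn; the clustering method and parameters are assumptions, not the authors' exact procedure:

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import AgglomerativeClustering

# Path and format of the pretrained vectors are assumptions.
kv = KeyedVectors.load_word2vec_format("fasttext.vec")

def induce_senses(word, topn=50, n_senses=2):
    """Cluster the word's nearest neighbours; each cluster centroid
    acts as one sense vector."""
    neighbours = [w for w, _ in kv.most_similar(word, topn=topn)]
    vecs = np.stack([kv[w] for w in neighbours])
    labels = AgglomerativeClustering(n_clusters=n_senses).fit_predict(vecs)
    return [vecs[labels == s].mean(axis=0) for s in range(n_senses)]

def disambiguate(word, context_words, senses):
    """Pick the sense whose vector is closest to the averaged context."""
    ctx = np.mean([kv[w] for w in context_words if w in kv], axis=0)
    sims = [ctx @ s / (np.linalg.norm(ctx) * np.linalg.norm(s))
            for s in senses]
    return int(np.argmax(sims))
```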
7. Text Similarity Using Word Embeddings to Classify Misinformation [PDF]
Caio Almeida, Débora Santos
Abstract: Fake news is a growing problem in the last years, especially during elections. It's hard work to identify what is true and what is false among all the user generated content that circulates every day. Technology can help with that work and optimize the fact-checking process. In this work, we address the challenge of finding similar content in order to be able to suggest to a fact-checker articles that could have been verified before and thus avoid that the same information is verified more than once. This is especially important in collaborative approaches to fact-checking where members of large teams will not know what content others have already fact-checked.
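The core retrieval step (suggesting previously fact-checked articles by similarity) reduces to nearest-neighbour search over text vectors; a generic numpy sketch, assuming claims are represented by averaged word embeddings:

```python
import numpy as np

def embed(text, vectors):
    """Average the word vectors of the tokens found in `vectors`;
    tokens missing from the vocabulary are skipped."""
    vecs = [vectors[w] for w in text.lower().split() if w in vectors]
    return np.mean(vecs, axis=0)

def most_similar_checked(claim, checked_claims, vectors, k=3):
    """Return the k fact-checked claims closest to the new claim
    by cosine similarity."""
    q = embed(claim, vectors)
    q /= np.linalg.norm(q)
    scored = []
    for c in checked_claims:
        v = embed(c, vectors)
        scored.append((float(q @ v / np.linalg.norm(v)), c))
    return sorted(scored, reverse=True)[:k]
```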
8. LSCP: Enhanced Large Scale Colloquial Persian Language Understanding [PDF]
Hadi Abdi Khojasteh, Ebrahim Ansari, Mahdi Bohlouli
Abstract: Language recognition has been significantly advanced in recent years by means of modern machine learning methods, such as deep learning, and benchmarks with rich annotations. However, research is still limited for low-resource formal languages. This leaves a significant gap in describing the colloquial language, especially for low-resourced ones such as Persian. In order to target this gap for low-resource languages, we propose a "Large Scale Colloquial Persian Dataset" (LSCP). LSCP is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. This encompasses the recognition of multiple semantic aspects in human-level sentences, naturally captured from real-world sentences. We believe that further investigation and processing, as well as the application of novel algorithms and methods, can strengthen computerized understanding and processing of low-resource languages. The proposed corpus consists of 120M sentences resulting from 27M tweets, annotated with parse trees, part-of-speech tags, sentiment polarity, and translations in five different languages.
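Concretely, each annotated sentence in such a corpus bundles several layers of information; the record layout below is hypothetical (field names and language codes are illustrative, not the dataset's actual schema):

```python
# Hypothetical shape of one LSCP record; all keys are illustrative.
record = {
    "sentence": "...",                      # colloquial Persian text
    "parse_tree": "(ROOT (S ...))",         # syntactic parse
    "pos_tags": ["NOUN", "VERB", "..."],    # part-of-speech tags
    "sentiment": "positive",                # sentiment polarity
    "translations": {                       # five target languages
        "en": "...", "de": "...", "cs": "...", "it": "...", "hi": "...",
    },
}
```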
9. A Survey on Contextual Embeddings [PDF]
Qi Liu, Matt J. Kusner, Phil Blunsom
Abstract: Contextual embeddings, such as ELMo and BERT, move beyond global word representations like Word2Vec and achieve ground-breaking performance on a wide range of natural language processing tasks. Contextual embeddings assign each word a representation based on its context, thereby capturing uses of words across varied contexts and encoding knowledge that transfers across languages. In this survey, we review existing contextual embedding models, cross-lingual polyglot pre-training, the application of contextual embeddings in downstream tasks, model compression, and model analyses.
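The defining property (the same surface word receiving different vectors in different contexts) is easy to demonstrate with the transformers library; a generic sketch using a BERT checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    """Contextual vector of the first occurrence of `word`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

a = vector_for("He sat by the river bank.", "bank")
b = vector_for("She deposited cash at the bank.", "bank")
print(torch.cosine_similarity(a, b, dim=0))  # < 1: context-dependent
```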
10. A Machine Learning Application for Raising WASH Awareness in the Times of Covid-19 Pandemic [PDF]
Rohan Pandey, Vaibhav Gautam, Kanav Bhagat, Tavpritesh Sethi
Abstract: A proactive approach to raise awareness while preventing misinformation is a modern-day challenge in all domains including healthcare. Such awareness and sensitization approaches to prevention and containment are important components of a strong healthcare system, especially in the times of outbreaks such as the ongoing Covid-19 pandemic. However, there is a fine balance between continuous awareness-raising by providing new information and the risk of misinformation. In this work, we address this gap by creating a life-long learning application that delivers authentic information to users in Hindi, the most widely used local language in India. It does this by matching sources of verified and authentic information such as the WHO reports against daily news by using machine learning and natural language processing. It delivers the narrated content in Hindi by using state-of-the-art text to speech engines. Finally, the approach allows user input for continuous improvement of news feed relevance on a daily basis. We demonstrate a focused application of this approach for Water, Sanitation, Hygiene as it is critical in the containment of the currently raging Covid-19 pandemic through the WashKaro android application. Thirteen combinations of pre-processing strategies, word-embeddings, and similarity metrics were evaluated by eight human users via calculation of agreement statistics. The best performing combination achieved a Cohen's Kappa of 0.54 and was deployed in the WashKaro application back-end. Interventional studies for evaluating the effectiveness of the WashKaro application for preventing WASH-related diseases are planned to be carried out in the Mohalla clinics that provided 3.5 Million consults in 2019 in Delhi, India. Additionally, the application also features human-curated and vetted information to reach out to the community as audio-visual content in local languages.
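The evaluation summarized above relies on Cohen's Kappa, a standard chance-corrected agreement statistic; a small sketch with scikit-learn on invented ratings:

```python
from sklearn.metrics import cohen_kappa_score

# Invented relevance judgments from two raters over the same items
# (1 = news item matched the right WHO guidance, 0 = it did not).
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# Kappa corrects raw agreement for agreement expected by chance.
print(cohen_kappa_score(rater_a, rater_b))
```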
11. Exploring Gaussian mixture model framework for speaker adaptation of deep neural network acoustic models [PDF]
Natalia Tomashenko, Yuri Khokhlov, Yannick Esteve
Abstract: In this paper we investigate the GMM-derived (GMMD) features for adaptation of deep neural network (DNN) acoustic models. The adaptation of the DNN trained on GMMD features is done through the maximum a posteriori (MAP) adaptation of the auxiliary GMM model used for GMMD feature extraction. We explore fusion of the adapted GMMD features with conventional features, such as bottleneck and MFCC features, in two different neural network architectures: DNN and time-delay neural network (TDNN). We analyze and compare different types of adaptation techniques such as i-vectors and feature-space adaptation techniques based on maximum likelihood linear regression (fMLLR) with the proposed adaptation approach, and explore their complementarity using various types of fusion such as feature level, posterior level, lattice level and others in order to discover the best possible way of combination. Experimental results on the TED-LIUM corpus show that the proposed adaptation technique can be effectively integrated into DNN and TDNN setups at different levels and provide additional gain in recognition performance: up to 6% of relative word error rate reduction (WERR) over the strong feature-space adaptation techniques based on maximum likelihood linear regression (fMLLR) speaker adapted DNN baseline, and up to 18% of relative WERR in comparison with a speaker independent (SI) DNN baseline model, trained on conventional features. For TDNN models the proposed approach achieves up to 26% of relative WERR in comparison with a SI baseline, and up 13% in comparison with the model adapted by using i-vectors. The analysis of the adapted GMMD features from various points of view demonstrates their effectiveness at different levels.
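For reference, the MAP mean update that underlies this kind of GMM adaptation takes the standard Gauvain-Lee form, where $\tau$ is the prior weight and $\gamma_k(t)$ the occupancy probability of mixture component $k$ at frame $t$ (this is the textbook update, not a detail specific to this paper):

$$\hat{\mu}_k = \frac{\tau\,\mu_k + \sum_t \gamma_k(t)\,x_t}{\tau + \sum_t \gamma_k(t)}$$

The adapted mean interpolates between the speaker-independent mean $\mu_k$ and the statistics of the adaptation data: with little data the prior dominates, and with more data the estimate moves toward the observed frames.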
12. Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0 [PDF]
Zack Hodari, Catherine Lai, Simon King
Abstract: In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but no analysis has been performed on what these representations capture, or if they can generate meaningfully-distinct variants of an utterance. We present a phrase-level variational autoencoder with a multi-modal prior, using the mode centres as "intonation codes". Our evaluation establishes which intonation codes are perceptually distinct, finding that the intonation codes from our multi-modal latent model were significantly more distinct than a baseline using k-means clustering. We carry out a follow-up qualitative study to determine what information the codes are carrying. Most commonly, listeners commented on the intonation codes having a statement or question style. However, many other affect-related styles were also reported, including: emotional, uncertain, surprised, sarcastic, passive aggressive, and upset.
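The k-means baseline against which the intonation codes are compared can be sketched directly; representing each phrase by a fixed-length F0 contour is an assumption made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assume each phrase is summarized by a fixed-length F0 contour
# (e.g. interpolated and resampled to 20 points).
f0_contours = np.random.rand(500, 20)     # placeholder data

kmeans = KMeans(n_clusters=8, n_init=10).fit(f0_contours)
codes = kmeans.predict(f0_contours)       # one "intonation code" per phrase
centres = kmeans.cluster_centers_         # analogous to the mode centres
```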
13. Counterfactual Samples Synthesizing for Robust Visual Question Answering [PDF]
Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, Yueting Zhuang
Abstract: Although Visual Question Answering (VQA) has realized impressive progress over the last few years, today's VQA models tend to capture superficial linguistic correlations in the train set and fail to generalize to test sets with different QA distributions. To reduce these language biases, several recent works introduce an auxiliary question-only model to regularize the training of the targeted VQA model, and achieve dominating performance on VQA-CP. However, due to the complexity of design, current methods are unable to equip the ensemble-based models with two indispensable characteristics of an ideal VQA model: 1) visual-explainable: the model should rely on the right visual regions when making decisions; 2) question-sensitive: the model should be sensitive to the linguistic variations in the question. To this end, we propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme. CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions, and assigning different ground-truth answers. After training with the complementary samples (i.e., the original and generated samples), the VQA models are forced to focus on all critical objects and words, which significantly improves both the visual-explainable and question-sensitive abilities. In return, the performance of these models is further boosted. Extensive ablations have shown the effectiveness of CSS. In particular, by building on top of the model LMH, we achieve a record-breaking performance of 58.95% on VQA-CP v2, a gain of 6.5%.
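The question-side half of the synthesis scheme (mask critical words and reassign the ground-truth answer) can be sketched in a few lines; the masking policy below is a simplified assumption, not the paper's exact procedure:

```python
def synthesize_counterfactual(question, critical_words, mask="[MASK]"):
    """Mask the question words the model most relies on, yielding a
    counterfactual sample whose original answer no longer applies."""
    tokens = [mask if w.lower() in critical_words else w
              for w in question.split()]
    return " ".join(tokens)

q = "What color is the banana on the table?"
print(synthesize_counterfactual(q, {"color", "banana"}))
# "What [MASK] is the [MASK] on the table?" -> paired with a
# different ground-truth answer than the original sample.
```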
14. Semantically-Enriched Search Engine for Geoportals: A Case Study with ArcGIS Online [PDF]
Gengchen Mai, Krzysztof Janowicz, Sathya Prasad, Meilin Shi, Ling Cai, Rui Zhu, Blake Regalia, Ni Lao
Abstract: Many geoportals such as ArcGIS Online are established with the goal of improving geospatial data reusability and achieving intelligent knowledge discovery. However, according to previous research, most existing geoportals adopt Lucene-based techniques to achieve their core search functionality, which have a limited ability to capture the user's search intentions. To better understand a user's search intention, query expansion can be used to enrich the user's query by adding semantically similar terms. In the context of geoportals and geographic information retrieval, we advocate the idea of semantically enriching a user's query from both geospatial and thematic perspectives. In the geospatial aspect, we propose to enrich a query by using both place partonomy and distance decay. In terms of the thematic aspect, concept expansion and embedding-based document similarity are used to infer the implicit information hidden in a user's query. This semantic query expansion framework is implemented as a semantically-enriched search engine using ArcGIS Online as a case study. A benchmark dataset is constructed to evaluate the proposed framework. Our evaluation results show that the proposed semantic query expansion framework is very effective in capturing a user's search intention and significantly outperforms a well-established baseline (Lucene's practical scoring function), with more than 3.0 increments in DCG@K (K=3, 5, 10).
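DCG@K, the metric reported above, is standard; a small numpy implementation for reference:

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results:
    sum of rel_i / log2(i + 1) for ranks i = 1..k."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

# Graded relevance of the top results returned for one query.
print(dcg_at_k([3, 2, 0, 1, 2], k=3))  # 3/log2(2) + 2/log2(3) + 0
```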
15. DAN: Dual-View Representation Learning for Adapting Stance Classifiers to New Domains [PDF]
Chang Xu, Cecile Paris, Surya Nepal, Ross Sparks, Chong Long, Yafang Wang
Abstract: We address the issue of having a limited number of annotations for stance classification in a new domain, by adapting out-of-domain classifiers with domain adaptation. Existing approaches often align different domains in a single, global feature space (or view), which may fail to fully capture the richness of the languages used for expressing stances, leading to reduced adaptability on stance data. In this paper, we identify two major types of stance expressions that are linguistically distinct, and we propose a tailored dual-view adaptation network (DAN) to adapt these expressions across domains. The proposed model first learns a separate view for domain transfer in each expression channel and then selects the best adapted parts of both views for optimal transfer. We find that the learned view features can be more easily aligned and more stance-discriminative in either or both views, leading to more transferable overall features after combining the views. Results from extensive experiments show that our method can enhance the state-of-the-art single-view methods in matching stance data across different domains, and that it consistently improves those methods on various adaptation tasks.
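Below is a minimal PyTorch sketch of one way to "select the best adapted parts of both views" with a learned element-wise gate; the actual DAN selection mechanism is more involved, so this is illustrative only:

```python
import torch
import torch.nn as nn

class GatedViewCombiner(nn.Module):
    """Combine two view representations with an element-wise gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, view1, view2):
        g = torch.sigmoid(self.gate(torch.cat([view1, view2], dim=-1)))
        return g * view1 + (1 - g) * view2   # per-dimension selection

h1, h2 = torch.randn(4, 128), torch.randn(4, 128)
print(GatedViewCombiner(128)(h1, h2).shape)  # torch.Size([4, 128])
```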