Table of Contents

1. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering [PDF] Abstract
2. Sequential Domain Adaptation through Elastic Weight Consolidation for Sentiment Analysis [PDF] Abstract
3. Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset [PDF] Abstract
4. Bidirectional Encoder Representations from Transformers (BERT): A sentiment analysis odyssey [PDF] Abstract
5. NLNDE: The Neither-Language-Nor-Domain-Experts' Way of Spanish Medical Document De-Identification [PDF] Abstract
6. NLNDE: Enhancing Neural Sequence Taggers with Attention and Noisy Channel for Robust Pharmacological Entity Detection [PDF] Abstract
7. Project PIAF: Building a Native French Question-Answering Dataset [PDF] Abstract
8. IIE-NLP-NUT at SemEval-2020 Task 4: Guiding PLM with Prompt Template Reconstruction Strategy for ComVE [PDF] Abstract
9. Fact-based Text Editing [PDF] Abstract
10. Can We Achieve More with Less? Exploring Data Augmentation for Toxic Comment Classification [PDF] Abstract
11. Facts as Experts: Adaptable and Interpretable Neural Memory over Symbolic Knowledge [PDF] Abstract
12. Lightme: Analysing Language in Internet Support Groups for Mental Health [PDF] Abstract
13. Relevance-guided Supervision for OpenQA with ColBERT [PDF] Abstract
14. Data Augmenting Contrastive Learning of Speech Representations in the Time Domain [PDF] Abstract
15. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval [PDF] Abstract
16. Legends: Folklore on Reddit [PDF] Abstract
17. Computing Conceptual Distances between Breast Cancer Screening Guidelines: An Implementation of a Near-Peer Epistemic Model of Medical Disagreement [PDF] Abstract

Abstracts
1. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering [PDF] Back to Contents
Gautier Izacard, Edouard Grave
Abstract: Generative models for open domain question answering have proven to be competitive, without resorting to external knowledge. While promising, this approach requires using models with billions of parameters, which are expensive to train and query. In this paper, we investigate how much these models can benefit from retrieving text passages, potentially containing evidence. We obtain state-of-the-art results on the Natural Questions and TriviaQA open benchmarks. Interestingly, we observe that the performance of this method significantly improves when increasing the number of retrieved passages. This is evidence that generative models are good at aggregating and combining evidence from multiple passages.
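A minimal sketch of the retrieval-augmented generation setup described above, under the assumption of a public Hugging Face T5 checkpoint and hand-picked passages. It simply concatenates the question with the retrieved passages, whereas the paper's Fusion-in-Decoder model encodes each (question, passage) pair independently and fuses them in the decoder, so this is an illustration of the idea rather than the authors' architecture.

```python
# Simplified retrieval-augmented generation sketch (not the paper's exact model).
# Model name and passage list are illustrative assumptions.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

question = "Where was Marie Curie born?"
retrieved_passages = [  # in practice these come from a retriever such as BM25 or DPR
    "Maria Sklodowska, later known as Marie Curie, was born in Warsaw.",
    "Marie Curie received the Nobel Prize in Physics in 1903.",
]

# Simplified fusion: concatenate question and passages into one input string.
source = "question: " + question + " " + " ".join(
    f"context: {p}" for p in retrieved_passages
)
inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_length=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```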
2. Sequential Domain Adaptation through Elastic Weight Consolidation for Sentiment Analysis [PDF] Back to Contents
Avinash Madasu, Vijjini Anvesh Rao
Abstract: Elastic Weight Consolidation (EWC) is a technique used in overcoming catastrophic forgetting between successive tasks trained on a neural network. We use this phenomenon of information sharing between tasks for domain adaptation. Training data for tasks such as sentiment analysis (SA) may not be fairly represented across multiple domains. Domain Adaptation (DA) aims to build algorithms that leverage information from source domains to facilitate performance on an unseen target domain. We propose a model-independent framework - Sequential Domain Adaptation (SDA). SDA draws on EWC for training on successive source domains to move towards a general domain solution, thereby solving the problem of domain adaptation. We test SDA on convolutional, recurrent, and attention-based architectures. Our experiments show that the proposed framework enables simple architectures such as CNNs to outperform complex state-of-the-art models in domain adaptation of SA. We further observe that a harder-first Anti-Curriculum ordering of source domains leads to maximum performance.
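The EWC regularizer that SDA builds on has a standard closed form: the squared drift of each parameter from the value learned on the previous domain, weighted by its diagonal Fisher information. Below is a hedged PyTorch sketch (not the authors' code); `old_params`, `fisher_diag`, and the toy model are illustrative assumptions.

```python
# Elastic Weight Consolidation penalty: (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2
import torch
import torch.nn as nn


def ewc_penalty(model, old_params, fisher_diag, lam=0.4):
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        if name in old_params:
            penalty = penalty + (fisher_diag[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty


# Toy usage: a linear "model" whose parameters have drifted from the previous domain.
model = nn.Linear(4, 2)
old_params = {n: p.detach().clone() + 0.1 for n, p in model.named_parameters()}
fisher_diag = {n: torch.ones_like(p) for n, p in model.named_parameters()}
print(ewc_penalty(model, old_params, fisher_diag))
# During training on the next source domain: loss = task_loss + ewc_penalty(...)
```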
3. Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset [PDF] Back to Contents
Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, Keith Hall
Abstract: This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text. Keywords: romanization, transliteration, South Asian languages
4. Bidirectional Encoder Representations from Transformers (BERT): A sentiment analysis odyssey [PDF] Back to Contents
Shivaji Alaparthi, Manit Mishra
Abstract: The purpose of the study is to investigate the relative effectiveness of four different sentiment analysis techniques: (1) unsupervised lexicon-based model using Sent WordNet; (2) traditional supervised machine learning model using logistic regression; (3) supervised deep learning model using Long Short-Term Memory (LSTM); and, (4) advanced supervised deep learning models using Bidirectional Encoder Representations from Transformers (BERT). We use publicly available labeled corpora of 50,000 movie reviews originally posted on internet movie database (IMDB) for analysis using Sent WordNet lexicon, logistic regression, LSTM, and BERT. The first three models were run on CPU based system whereas BERT was run on GPU based system. The sentiment classification performance was evaluated based on accuracy, precision, recall, and F1 score. The study puts forth two key insights: (1) relative efficacy of four highly advanced and widely used sentiment analysis techniques; (2) undisputed superiority of pre-trained advanced supervised deep learning BERT model in sentiment analysis from text data. This study provides professionals in analytics industry and academicians working on text analysis key insight regarding comparative classification performance evaluation of key sentiment analysis techniques, including the recently developed BERT. This is the first research endeavor to compare the advanced pre-trained supervised deep learning model of BERT vis-à-vis other sentiment analysis models of LSTM, logistic regression, and Sent WordNet.
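For readers unfamiliar with the fourth approach, here is a minimal, assumption-based sketch of BERT-style sentiment scoring with the Hugging Face `transformers` API. The checkpoint name is illustrative and, unlike the study, no fine-tuning on the 50,000-review IMDB corpus is shown, so the output probabilities are only meaningful after such training.

```python
# Sketch: a pre-trained encoder with a binary classification head scoring a movie review.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

review = "A beautifully shot film with a story that falls completely flat."
inputs = tokenizer(review, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits       # head is randomly initialized until fine-tuned
probs = torch.softmax(logits, dim=-1)
print({"negative": probs[0, 0].item(), "positive": probs[0, 1].item()})
```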
5. NLNDE: The Neither-Language-Nor-Domain-Experts' Way of Spanish Medical Document De-Identification [PDF] Back to Contents
Lukas Lange, Heike Adel, Jannik Strötgen
Abstract: Natural language processing has huge potential in the medical domain which recently led to a lot of research in this field. However, a prerequisite of secure processing of medical documents, e.g., patient notes and clinical trials, is the proper de-identification of privacy-sensitive information. In this paper, we describe our NLNDE system, with which we participated in the MEDDOCAN competition, the medical document anonymization task of IberLEF 2019. We address the task of detecting and classifying protected health information from Spanish data as a sequence-labeling problem and investigate different embedding methods for our neural network. Despite dealing in a non-standard language and domain setting, the NLNDE system achieves promising results in the competition.
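The sequence-labeling framing can be illustrated with a generic BiLSTM tagger over BIO-style protected-health-information tags. This sketch is not the NLNDE architecture or its embedding setup; the tag set and dimensions are invented for illustration.

```python
# Generic BiLSTM token tagger for BIO de-identification labels (illustrative only).
import torch
import torch.nn as nn

TAGS = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE"]  # toy PHI tag set


class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128, num_tags=len(TAGS)):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))
        return self.out(h)  # per-token tag logits


tagger = BiLSTMTagger(vocab_size=10000)
dummy_batch = torch.randint(0, 10000, (2, 12))  # 2 sentences, 12 tokens each
print(tagger(dummy_batch).shape)                # torch.Size([2, 12, 5])
```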
6. NLNDE: Enhancing Neural Sequence Taggers with Attention and Noisy Channel for Robust Pharmacological Entity Detection [PDF] Back to Contents
Lukas Lange, Heike Adel, Jannik Strötgen
Abstract: Named entity recognition has been extensively studied on English news texts. However, the transfer to other domains and languages is still a challenging problem. In this paper, we describe the system with which we participated in the first subtrack of the PharmaCoNER competition of the BioNLP Open Shared Tasks 2019. Aiming at pharmacological entity detection in Spanish texts, the task provides a non-standard domain and language setting. However, we propose an architecture that requires neither language nor domain expertise. We treat the task as a sequence labeling task and experiment with attention-based embedding selection and the training on automatically annotated data to further improve our system's performance. Our system achieves promising results, especially by combining the different techniques, and reaches up to 88.6% F1 in the competition.
7. Project PIAF: Building a Native French Question-Answering Dataset [PDF] Back to Contents
Rachel Keraron, Guillaume Lancrenon, Mathilde Bras, Frédéric Allary, Gilles Moyse, Thomas Scialom, Edmundo-Pavel Soriano-Morales, Jacopo Staiano
Abstract: Motivated by the lack of data for non-English languages, in particular for the evaluation of downstream tasks such as Question Answering, we present a participatory effort to collect a native French Question Answering Dataset. Furthermore, we describe and publicly release the annotation tool developed for our collection effort, along with the data obtained and preliminary baselines.
8. IIE-NLP-NUT at SemEval-2020 Task 4: Guiding PLM with Prompt Template Reconstruction Strategy for ComVE [PDF] Back to Contents
Luxi Xing, Yuqiang Xie, Yue Hu, Wei Peng
Abstract: This paper introduces our systems for the first two subtasks of SemEval Task4: Commonsense Validation and Explanation. To clarify the intention for judgment and inject contrastive information for selection, we propose the input reconstruction strategy with prompt templates. Specifically, we formalize the subtasks into the multiple-choice question answering format and construct the input with the prompt templates, then, the final prediction of question answering is considered as the result of subtasks. Experimental results show that our approaches achieve significant performance compared with the baseline systems. Our approaches secure the third rank on both official test sets of the first two subtasks with an accuracy of 96.4 and an accuracy of 94.3 respectively.
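A sketch of the prompt-template idea under stated assumptions: each candidate option is written into a natural-language template, scored by a pre-trained LM with a (to-be-fine-tuned) scoring head, and the highest-scoring option is predicted. The template wording, checkpoint, and single-logit head are illustrative choices, not the paper's exact configuration.

```python
# Multiple-choice scoring via prompt templates (illustrative sketch).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=1)

statement = "He put an elephant into the fridge."
options = [
    "An elephant is much bigger than a fridge.",
    "Elephants are usually gray.",
    "A fridge is used to keep food cold.",
]

scores = []
for option in options:
    # Prompt template turning the instance into a natural-language query.
    prompt = f"'{statement}' is against common sense because {option}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        scores.append(model(**inputs).logits.squeeze().item())

print("predicted option:", int(torch.tensor(scores).argmax()))  # meaningful after fine-tuning
```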
9. Fact-based Text Editing [PDF] Back to Contents
Hayate Iso, Chao Qiao, Hang Li
Abstract: We propose a novel text editing task, referred to as \textit{fact-based text editing}, in which the goal is to revise a given document to better describe the facts in a knowledge base (e.g., several triples). The task is important in practice because reflecting the truth is a common requirement in text editing. First, we propose a method for automatically generating a dataset for research on fact-based text editing, where each instance consists of a draft text, a revised text, and several facts represented in triples. We apply the method into two public table-to-text datasets, obtaining two new datasets consisting of 233k and 37k instances, respectively. Next, we propose a new neural network architecture for fact-based text editing, called \textsc{FactEditor}, which edits a draft text by referring to given facts using a buffer, a stream, and a memory. A straightforward approach to address the problem would be to employ an encoder-decoder model. Our experimental results on the two datasets show that \textsc{FactEditor} outperforms the encoder-decoder approach in terms of fidelity and fluency. The results also show that \textsc{FactEditor} conducts inference faster than the encoder-decoder approach.
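To make the task format concrete, here is an invented toy instance of the kind the abstract describes: a set of facts as triples, a draft text that misstates or omits some of them, and the revised text a system should produce. This example is not taken from the released datasets.

```python
# Invented toy instance illustrating the fact-based text editing format.
instance = {
    "triples": [
        ("Baymax", "creator", "Duncan Rouleau"),
        ("Baymax", "series", "Big Hero 6"),
    ],
    "draft": "Baymax is a character in Big Hero 6.",
    "revised": "Baymax, created by Duncan Rouleau, is a character in Big Hero 6.",
}
print(instance["revised"])
```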
10. Can We Achieve More with Less? Exploring Data Augmentation for Toxic Comment Classification [PDF] Back to Contents
Chetanya Rastogi, Nikka Mofid, Fang-I Hsiao
Abstract: This paper tackles one of the greatest limitations in Machine Learning: Data Scarcity. Specifically, we explore whether high accuracy classifiers can be built from small datasets, utilizing a combination of data augmentation techniques and machine learning algorithms. In this paper, we experiment with Easy Data Augmentation (EDA) and Backtranslation, as well as with three popular learning algorithms, Logistic Regression, Support Vector Machine (SVM), and Bidirectional Long Short-Term Memory Network (Bi-LSTM). For our experimentation, we utilize the Wikipedia Toxic Comments dataset so that in the process of exploring the benefits of data augmentation, we can develop a model to detect and classify toxic speech in comments to help fight back against cyberbullying and online harassment. Ultimately, we found that data augmentation techniques can be used to significantly boost the performance of classifiers and are an excellent strategy to combat lack of data in NLP problems.
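Two of the four EDA operations (random swap and random deletion) are easy to show in a self-contained sketch; synonym replacement, random insertion, and backtranslation require external resources and are omitted here.

```python
# Two Easy Data Augmentation operations on a token list.
import random


def random_swap(words, n_swaps=1):
    words = words.copy()
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p=0.1):
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]  # never return an empty sentence


sentence = "this comment is not toxic at all".split()
print(" ".join(random_swap(sentence)))
print(" ".join(random_deletion(sentence)))
```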
11. Facts as Experts: Adaptable and Interpretable Neural Memory over Symbolic Knowledge [PDF] Back to Contents
Pat Verga, Haitian Sun, Livio Baldini Soares, William W. Cohen
Abstract: Massive language models are the core of modern NLP modeling and have been shown to encode impressive amounts of commonsense and factual information. However, that knowledge exists only within the latent parameters of the model, inaccessible to inspection and interpretation, and even worse, factual information memorized from the training corpora is likely to become stale as the world changes. Knowledge stored as parameters will also inevitably exhibit all of the biases inherent in the source materials. To address these problems, we develop a neural language model that includes an explicit interface between symbolically interpretable factual information and subsymbolic neural knowledge. We show that this model dramatically improves performance on two knowledge-intensive question-answering tasks. More interestingly, the model can be updated without re-training by manipulating its symbolic representations. In particular this model allows us to add new facts and overwrite existing ones in ways that are not possible for earlier models.
12. Lightme: Analysing Language in Internet Support Groups for Mental Health [PDF] Back to Contents
Gabriela Ferraro, Brendan Loo-Gee, Shenjia Ji, Luis Salvador-Carulla
Abstract: Background: Assisting moderators to triage harmful posts in Internet Support Groups is relevant to ensure its safe use. Automated text classification methods analysing the language expressed in posts of online forums is a promising solution. Methods: Natural Language Processing and Machine Learning technologies were used to build a triage post classifier using a dataset from this http URL mental health forum for young people. Results: When comparing with the state-of-the-art, a solution mainly based on features from lexical resources, received the best classification performance for the crisis posts (52\%), which is the most severe class. Six salient linguistic characteristics were found when analysing the crisis post; 1) posts expressing hopelessness, 2) short posts expressing concise negative emotional responses, 3) long posts expressing variations of emotions, 4) posts expressing dissatisfaction with available health services, 5) posts utilising storytelling, and 6) posts expressing users seeking advice from peers during a crisis. Conclusion: It is possible to build a competitive triage classifier using features derived only from the textual content of the post. Further research needs to be done in order to translate our quantitative and qualitative findings into features, as it may improve overall performance.
13. Relevance-guided Supervision for OpenQA with ColBERT [PDF] Back to Contents
Omar Khattab, Christopher Potts, Matei Zaharia
Abstract: Systems for Open-Domain Question Answering (OpenQA) generally depend on a retriever for finding candidate passages in a large corpus and a reader for extracting answers from those passages. In much recent work, the retriever is a learned component that uses coarse-grained vector representations of questions and passages. We argue that this modeling choice is insufficiently expressive for dealing with the complexity of natural language questions. To address this, we define ColBERT-QA, which adapts the scalable neural retrieval model ColBERT to OpenQA. ColBERT creates fine-grained interactions between questions and passages. We propose a weak supervision strategy that iteratively uses ColBERT to create its own training data. This greatly improves OpenQA retrieval on both Natural Questions and TriviaQA, and the resulting end-to-end OpenQA system attains state-of-the-art performance on both of those datasets.
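ColBERT's fine-grained interaction is the late-interaction "MaxSim" score: each query token embedding is matched against its most similar passage token embedding and the maxima are summed. A small PyTorch sketch follows, with random tensors standing in for real BERT-based token embeddings.

```python
# ColBERT-style late interaction (MaxSim) scoring sketch.
import torch


def maxsim_score(query_emb, passage_emb):
    """query_emb: (num_q_tokens, dim), passage_emb: (num_p_tokens, dim)."""
    sim = query_emb @ passage_emb.T     # token-level similarity matrix
    return sim.max(dim=1).values.sum()  # best passage match per query token, summed

q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
p = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(q, p).item())
```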
14. Data Augmenting Contrastive Learning of Speech Representations in the Time Domain [PDF] Back to Contents
Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux
Abstract: Contrastive Predictive Coding (CPC), based on predicting future segments of speech based on past segments is emerging as a powerful algorithm for representation learning of speech signal. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library and find that applying augmentation in the past is generally more efficient and yields better performances than other methods. We find that a combination of pitch modification, additive noise and reverberation substantially increase the performance of CPC (relative improvement of 18-22%), beating the reference Libri-light results with 600 times less data. Using an out-of-domain dataset, time-domain data augmentation can push CPC to be on par with the state of the art on the Zero Speech Benchmark 2017. We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification tasks by a factor of 12-15% relative.
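A generic sketch of two time-domain augmentations (additive noise at a target SNR and a random temporal shift) applied to a raw waveform. This illustrates the family of transformations studied, but it is not the WavAugment API and omits pitch modification and reverberation.

```python
# Generic time-domain waveform augmentations (illustrative, not WavAugment).
import torch


def add_noise(wav, snr_db=15.0):
    noise = torch.randn_like(wav)
    signal_power = wav.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return wav + scale * noise


def random_shift(wav, max_shift=1600):  # up to 0.1 s at 16 kHz
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    return torch.roll(wav, shifts=shift, dims=-1)


waveform = torch.randn(16000)  # 1 second of fake 16 kHz audio
augmented = random_shift(add_noise(waveform))
print(augmented.shape)
```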
15. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval [PDF] Back to Contents
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, Arnold Overwijk
Abstract: Conducting text retrieval in a dense learned representation space has many intriguing advantages over sparse retrieval. Yet the effectiveness of dense retrieval (DR) often requires combination with sparse retrieval. In this paper, we identify that the main bottleneck is in the training mechanisms, where the negative instances used in training are not representative of the irrelevant documents in testing. This paper presents Approximate nearest neighbor Negative Contrastive Estimation (ANCE), a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is parallelly updated with the learning process to select more realistic negative training instances. This fundamentally resolves the discrepancy between the data distribution used in the training and testing of DR. In our experiments, ANCE boosts the BERT-Siamese DR model to outperform all competitive dense and sparse retrieval baselines. It nearly matches the accuracy of sparse-retrieval-and-BERT-reranking using dot-product in the ANCE-learned representation space and provides almost 100x speed-up.
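The ANN-based negative mining that ANCE describes can be sketched with FAISS: encode the corpus, build an index that is refreshed periodically during training, and take top-retrieved non-relevant passages as hard negatives. Random vectors stand in for learned embeddings and the relevance labels are toy values; this is not the authors' training code.

```python
# ANN hard-negative mining sketch with a flat inner-product FAISS index.
import numpy as np
import faiss

dim, num_passages = 128, 10000
passage_embs = np.random.randn(num_passages, dim).astype("float32")
query_embs = np.random.randn(32, dim).astype("float32")

index = faiss.IndexFlatIP(dim)   # in ANCE this index is rebuilt in parallel
index.add(passage_embs)          # as the encoder keeps training

_, retrieved_ids = index.search(query_embs, 20)

positives = {0: {42}}            # toy relevance labels for query 0
hard_negatives = [pid for pid in retrieved_ids[0] if pid not in positives[0]]
print(hard_negatives[:5])
```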
16. Legends: Folklore on Reddit [PDF] Back to Contents
Caitrin Armstrong, Derek Ruths
Abstract: In this paper we introduce Reddit legends, a collection of venerated old posts that have become famous on Reddit. To establish the utility of Reddit legends for both computational science/HCI and folkloristics, we investigate two main questions: (1) whether they can be considered folklore, i.e. if they have consistent form, cultural significance, and undergo spontaneous transmission, and (2) whether they can be studied in a systematic manner. Through several subtasks, including the creation of a typology, an analysis of references to Reddit legends, and an examination of some of the textual characteristics of referencing behaviour, we show that Reddit legends can indeed be considered as folklore and that they are amendable to systematic text-based approaches. We discuss how these results will enable future analyses of folklore on Reddit, including tracking subreddit-wide and individual-user behaviour, and the relationship of this behaviour to other cultural markers.
17. Computing Conceptual Distances between Breast Cancer Screening Guidelines: An Implementation of a Near-Peer Epistemic Model of Medical Disagreement [PDF] Back to Contents
Hossein Hematialam, Luciana Garbayo, Seethalakshmi Gopalakrishnan, Wlodek Zadrozny
Abstract: Using natural language processing tools, we investigate the differences of recommendations in medical guidelines for the same decision problem -- breast cancer screening. We show that these differences arise from knowledge brought to the problem by different medical societies, as reflected in the conceptual vocabularies used by the different groups of authors. The computational models we build and analyze agree with the near-peer epistemic model of expert disagreement proposed by Garbayo. Even though the article is a case study focused on one set of guidelines, the proposed methodology is broadly applicable. In addition to proposing a novel graph-based similarity model for comparing collections of documents, we perform an extensive analysis of the model performance. In a series of a few dozen experiments, in three broad categories, we show, at a very high statistical significance level of 3-4 standard deviations for our best models, that the high similarity between expert annotated model and our concept based, automatically created, computational models is not accidental. Our best model achieves roughly 70% similarity. We also describe possible extensions of this work.
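As a toy illustration of "conceptual distance" (explicitly not the paper's graph-based similarity model), one can compare the concept vocabularies of two guidelines with a Jaccard distance; the concept sets below are invented examples.

```python
# Toy concept-overlap distance between two guidelines' vocabularies.
def jaccard_distance(concepts_a, concepts_b):
    a, b = set(concepts_a), set(concepts_b)
    return 1.0 - len(a & b) / len(a | b)


guideline_a = {"mammography", "age 40", "annual screening", "shared decision making"}
guideline_b = {"mammography", "age 50", "biennial screening", "risk assessment"}
print(round(jaccard_distance(guideline_a, guideline_b), 3))  # higher = more disagreement
```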