Contents
3. ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention [PDF] Abstract
5. Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [PDF] Abstract
9. How LSTM Encodes Syntax: Exploring Context Vectors and Semi-Quantization on Natural Text [PDF] Abstract
10. Detecting White Supremacist Hate Speech using Domain Specific Word Embedding with Deep Learning and BERT [PDF] Abstract
11. "Did you really mean what you said?" : Sarcasm Detection in Hindi-English Code-Mixed Data using Bilingual Word Embeddings [PDF] Abstract
13. Phonemer at WNUT-2020 Task 2: Sequence Classification Using COVID Twitter BERT and Bagging Ensemble Technique based on Plurality Voting [PDF] Abstract
14. Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT [PDF] Abstract
16. Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models [PDF] Abstract
22. Multi-document Summarization with Maximal Marginal Relevance-guided Reinforcement Learning [PDF] Abstract
30. Multi-label Classification of Common Bengali Handwritten Graphemes: Dataset and Challenge [PDF] Abstract

Abstracts
1. Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking [PDF] Back to Contents
Michael Sejr Schlichtkrull, Nicola De Cao, Ivan Titov
Abstract: Graph neural networks (GNNs) have become a popular approach to integrating structural inductive biases into NLP models. However, there has been little work on interpreting them, and specifically on understanding which parts of the graphs (e.g. syntactic trees or co-reference structures) contribute to a prediction. In this work, we introduce a post-hoc method for interpreting the predictions of GNNs which identifies unnecessary edges. Given a trained GNN model, we learn a simple classifier that, for every edge in every layer, predicts if that edge can be dropped. We demonstrate that such a classifier can be trained in a fully differentiable fashion, employing stochastic gates and encouraging sparsity through the expected $L_0$ norm. We use our technique as an attribution method to analyze GNN models for two tasks -- question answering and semantic role labeling -- providing insights into the information flow in these models. We show that we can drop a large proportion of edges without deteriorating the performance of the model, while we can analyse the remaining edges for interpreting model predictions.
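To make the sparsity objective concrete, here is a minimal sketch (plain Python/NumPy, hypothetical values) of an expected-$L_0$ penalty over per-edge stochastic gates; the paper's actual method parameterizes the gates with a Hard Concrete distribution and learns them jointly with the task loss, which this simplification omits.

import numpy as np

def expected_l0_penalty(edge_logits):
    # Each edge has a stochastic gate; sigmoid(logit) is the probability the gate is open.
    keep_prob = 1.0 / (1.0 + np.exp(-edge_logits))
    # The expected L0 norm is the expected number of edges that are kept;
    # adding it to the task loss encourages the classifier to drop edges.
    return keep_prob.sum()

# Hypothetical gate logits for five edges in one GNN layer.
logits = np.array([2.0, -3.0, 0.5, -1.5, 4.0])
print(expected_l0_penalty(logits))  # roughly 2.7 edges expected to survive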
2. Understanding tables with intermediate pre-training [PDF] Back to Contents
Julian Martin Eisenschlos, Syrine Krichine, Thomas Müller
Abstract: Table entailment, the binary classification task of finding if a sentence is supported or refuted by the content of a table, requires parsing language and table structure as well as numerical and discrete reasoning. While there is extensive work on textual entailment, table entailment is less well studied. We adapt TAPAS (Herzig et al., 2020), a table-based BERT model, to recognize entailment. Motivated by the benefits of data augmentation, we create a balanced dataset of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning. This new data is not only useful for table entailment, but also for SQA (Iyyer et al., 2017), a sequential table QA task. To be able to use long examples as input of BERT models, we evaluate table pruning techniques as a pre-processing step to drastically improve the training and prediction efficiency at a moderate drop in accuracy. The different methods set the new state-of-the-art on the TabFact (Chen et al., 2020) and SQA datasets.
3. ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention [PDF] Back to Contents
Jose Manuel Gomez-Perez, Raul Ortega
Abstract: Textbook Question Answering is a complex task in the intersection of Machine Comprehension and Visual Question Answering that requires reasoning with multimodal information from text and diagrams. For the first time, this paper taps on the potential of transformer language models and bottom-up and top-down attention to tackle the language and visual understanding challenges this task entails. Rather than training a language-visual transformer from scratch we rely on pre-trained transformers, fine-tuning and ensembling. We add bottom-up and top-down attention to identify regions of interest corresponding to diagram constituents and their relationships, improving the selection of relevant visual information for each question and answer options. Our system ISAAQ reports unprecedented success in all TQA question types, with accuracies of 81.36%, 71.11% and 55.12% on true/false, text-only and diagram multiple choice questions. ISAAQ also demonstrates its broad applicability, obtaining state-of-the-art results in other demanding datasets.
4. LiveQA: A Question Answering Dataset over Sports Live [PDF] Back to Contents
Qianying Liu, Sicong Jiang, Yizhong Wang, Sujian Li
Abstract: In this paper, we introduce LiveQA, a new question answering dataset constructed from play-by-play live broadcast. It contains 117k multiple-choice questions written by human commentators for over 1,670 NBA games, which are collected from the Chinese Hupu (this https URL) website. Derived from the characteristics of sports games, LiveQA can potentially test the reasoning ability across timeline-based live broadcasts, which is challenging compared to the existing datasets. In LiveQA, the questions require understanding the timeline, tracking events or doing mathematical computations. Our preliminary experiments show that the dataset introduces a challenging problem for question answering models, and a strong baseline model only achieves the accuracy of 53.1\% and cannot beat the dominant option rule. We release the code and data of this paper for future research.
5. Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [PDF] Back to Contents
Daniel Deutsch, Tania Bedrax-Weiss, Dan Roth
Abstract: Recently, there has been growing interest in using question-answering (QA) models to evaluate the content quality of summaries. While previous work has shown initial promising results in this direction, their experimentation has been limited, leading to a poor understanding of the utility of QA in evaluating summary content. In this work, we perform an extensive evaluation of a QA-based metric for summary content quality, calculating its performance with today's state-of-the-art models as well as estimating its potential upper-bound performance. We analyze a proposed metric, QAEval, which is more widely applicable than previous work. We show that QAEval already achieves state-of-the-art performance at scoring summarization systems, beating all other metrics including the gold-standard Pyramid Method, while its performance on individual summaries is at best competitive to other automatic metrics. Through a careful analysis of each component of QAEval, we identify the performance bottlenecks and estimate that with human-level performance, QAEval's summary-level results have the potential to approach that of the Pyramid Method.
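As a rough illustration of the QA-based scoring idea (not necessarily the exact QAEval pipeline), answers extracted from the summary are typically compared against reference answers with token-level F1; a minimal sketch:

from collections import Counter

def token_f1(predicted_answer: str, reference_answer: str) -> float:
    pred = predicted_answer.lower().split()
    ref = reference_answer.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A summary-level score could then average token_f1 over all generated question-answer pairs.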
6. Evaluating Multilingual BERT for Estonian [PDF] Back to Contents
Claudia Kittask, Kirill Milintsevich, Kairit Sirts
Abstract: Recently, large pre-trained language models, such as BERT, have reached state-of-the-art performance in many natural language processing tasks, but for many languages, including Estonian, BERT models are not yet available. However, there exist several multilingual BERT models that can handle multiple languages simultaneously and that have been trained also on Estonian data. In this paper, we evaluate four multilingual models---multilingual BERT, multilingual distilled BERT, XLM and XLM-RoBERTa---on several NLP tasks including POS and morphological tagging, NER and text classification. Our aim is to establish a comparison between these multilingual BERT models and the existing baseline neural models for these tasks. Our results show that multilingual BERT models can generalise well on different Estonian NLP tasks outperforming all baselines models for POS and morphological tagging and text classification, and reaching the comparable level with the best baseline for NER, with XLM-RoBERTa achieving the highest results compared with other multilingual models.
7. A Survey on Explainability in Machine Reading Comprehension [PDF] Back to Contents
Mokanarangan Thayaparan, Marco Valentino, André Freitas
Abstract: This paper presents a systematic review of benchmarks and approaches for explainability in Machine Reading Comprehension (MRC). We present how the representation and inference challenges evolved and the steps which were taken to tackle these challenges. We also present the evaluation methodologies to assess the performance of explainable systems. In addition, we identify persisting open research questions and highlight critical directions for future work.
8. Citation Sentiment Changes Analysis [PDF] Back to Contents
Haixia Liu
Abstract: Metrics for measuring the citation sentiment changes were introduced. Citation sentiment changes can be observed from global citation sentiment sequences (GCSSs). With respect to a cited paper, the citation sentiment sequences were analysed across a collection of citing papers ordered by the published time. For analysing GCSSs, Eddy Dissipation Rate (EDR) was adopted, with the hypothesis that the GCSSs pattern differences can be spotted by EDR based method. Preliminary evidence showed that EDR based method holds the potential for analysing a publication's impact in a time series fashion.
9. How LSTM Encodes Syntax: Exploring Context Vectors and Semi-Quantization on Natural Text [PDF] Back to Contents
Chihiro Shibata, Kei Uchiumi, Daichi Mochihashi
Abstract: Long Short-Term Memory recurrent neural network (LSTM) is widely used and known to capture informative long-term syntactic dependencies. However, how such information are reflected in its internal vectors for natural text has not yet been sufficiently investigated. We analyze them by learning a language model where syntactic structures are implicitly given. We empirically show that the context update vectors, i.e. outputs of internal gates, are approximately quantized to binary or ternary values to help the language model to count the depth of nesting accurately, as Suzgun et al. (2019) recently show for synthetic Dyck languages. For some dimensions in the context vector, we show that their activations are highly correlated with the depth of phrase structures, such as VP and NP. Moreover, with an $L_1$ regularization, we also found that it can accurately predict whether a word is inside a phrase structure or not from a small number of components of the context vector. Even for the case of learning from raw text, context vectors are shown to still correlate well with the phrase structures. Finally, we show that natural clusters of the functional words and the part of speeches that trigger phrases are represented in a small but principal subspace of the context-update vector of LSTM.
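For reference, the standard LSTM cell equations whose gate activations ("context update vectors") are analyzed in the paper (notation may differ slightly from the authors'):

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$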
10. Detecting White Supremacist Hate Speech using Domain Specific Word Embedding with Deep Learning and BERT [PDF] Back to Contents
Hind Saleh Alatawi, Areej Maatog Alhothali, Kawthar Mustafa Moria
Abstract: White supremacists embrace a radical ideology that considers white people superior to people of other races. The critical influence of these groups is no longer limited to social media; they also have a significant effect on society in many ways by promoting racial hatred and violence. White supremacist hate speech is one of the most recently observed harmful content on social media.Traditional channels of reporting hate speech have proved inadequate due to the tremendous explosion of information, and therefore, it is necessary to find an automatic way to detect such speech in a timely manner. This research investigates the viability of automatically detecting white supremacist hate speech on Twitter by using deep learning and natural language processing techniques. Through our experiments, we used two approaches, the first approach is by using domain-specific embeddings which are extracted from white supremacist corpus in order to catch the meaning of this white supremacist slang with bidirectional Long Short-Term Memory (LSTM) deep learning model, this approach reached a 0.74890 F1-score. The second approach is by using the one of the most recent language model which is BERT, BERT model provides the state of the art of most NLP tasks. It reached to a 0.79605 F1-score. Both approaches are tested on a balanced dataset given that our experiments were based on textual data only. The dataset was combined from dataset created from Twitter and a Stormfront dataset compiled from that white supremacist forum.
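A minimal sketch of the first approach's architecture, a bidirectional LSTM binary classifier over word embeddings, written with Keras and hypothetical dimensions; the paper's exact hyperparameters and embedding initialization are not reproduced here.

import tensorflow as tf

model = tf.keras.Sequential([
    # In the paper's first approach this layer would be initialized from
    # Word2Vec/FastText embeddings trained on the white-supremacist corpus.
    tf.keras.layers.Embedding(input_dim=50_000, output_dim=300),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # hate vs. non-hate
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])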
11. "Did you really mean what you said?" : Sarcasm Detection in Hindi-English Code-Mixed Data using Bilingual Word Embeddings [PDF] 返回目录
Akshita Aggarwal, Anshul Wadhawan, Anshima Chaudhary, Kavita Maurya
Abstract: With the increased use of social media platforms by people across the world, many new interesting NLP problems have come into existence. One such being the detection of sarcasm in the social media texts. We present a corpus of tweets for training custom word embeddings and a Hinglish dataset labelled for sarcasm detection. We propose a deep learning based approach to address the issue of sarcasm detection in Hindi-English code mixed tweets using bilingual word embeddings derived from FastText and Word2Vec approaches. We experimented with various deep learning models, including CNNs, LSTMs, Bi-directional LSTMs (with and without attention). We were able to outperform all state-of-the-art performances with our deep learning models, with attention based Bi-directional LSTMs giving the best performance exhibiting an accuracy of 78.49%.
12. CoLAKE: Contextualized Language and Knowledge Embedding [PDF] Back to Contents
Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuanjing Huang, Zheng Zhang
Abstract: With the emerging branch of incorporating factual knowledge into pre-trained language models such as BERT, most existing models consider shallow, static, and separately pre-trained entity embeddings, which limits the performance gains of these models. Few works explore the potential of deep contextualized knowledge representation when injecting knowledge. In this paper, we propose the Contextualized Language and Knowledge Embedding (CoLAKE), which jointly learns contextualized representation for both language and knowledge with the extended MLM objective. Instead of injecting only entity embeddings, CoLAKE extracts the knowledge context of an entity from large-scale knowledge bases. To handle the heterogeneity of knowledge context and language context, we integrate them in a unified data structure, word-knowledge graph (WK graph). CoLAKE is pre-trained on large-scale WK graphs with the modified Transformer encoder. We conduct experiments on knowledge-driven tasks, knowledge probing tasks, and language understanding tasks. Experimental results show that CoLAKE outperforms previous counterparts on most of the tasks. Besides, CoLAKE achieves surprisingly high performance on our synthetic task called word-knowledge graph completion, which shows the superiority of simultaneously contextualizing language and knowledge representation.
13. Phonemer at WNUT-2020 Task 2: Sequence Classification Using COVID Twitter BERT and Bagging Ensemble Technique based on Plurality Voting [PDF] Back to Contents
Anshul Wadhawan
Abstract: This paper presents the approach that we employed to tackle the EMNLP WNUT-2020 Shared Task 2 : Identification of informative COVID-19 English Tweets. The task is to develop a system that automatically identifies whether an English Tweet related to the novel coronavirus (COVID-19) is informative or not. We solve the task in three stages. The first stage involves pre-processing the dataset by filtering only relevant information. This is followed by experimenting with multiple deep learning models like CNNs, RNNs and Transformer based models. In the last stage, we propose an ensemble of the best model trained on different subsets of the provided dataset. Our final approach achieved an F1-score of 0.9037 and we were ranked sixth overall with F1-score as the evaluation criteria.
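A minimal sketch of the plurality-voting step that combines the bagged models' predictions (label names are hypothetical):

from collections import Counter

def plurality_vote(predictions_per_model):
    """predictions_per_model: one list of predicted labels per bagged model."""
    ensembled = []
    for labels in zip(*predictions_per_model):  # predictions for one tweet across all models
        ensembled.append(Counter(labels).most_common(1)[0][0])
    return ensembled

votes = [["INFORMATIVE", "UNINFORMATIVE"],
         ["INFORMATIVE", "INFORMATIVE"],
         ["UNINFORMATIVE", "INFORMATIVE"]]
print(plurality_vote(votes))  # ['INFORMATIVE', 'INFORMATIVE']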
14. Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT [PDF] Back to Contents
Ehsan Doostmohammadi, Minoo Nassajian, Adel Rahimi
Abstract: Words are properly segmented in the Persian writing system; in practice, however, these writing rules are often neglected, resulting in single words being written disjointedly and multiple words written without any white spaces between them. This paper addresses the problems of word segmentation and zero-width non-joiner (ZWNJ) recognition in Persian, which we approach jointly as a sequence labeling problem. We achieve a macro-averaged F1-score of 92.40% on a carefully collected corpus of 500 sentences with a high level of difficulty.
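For clarity, macro-averaged F1 computes F1 per label and then averages, so rare labels count as much as frequent ones; a minimal sketch:

def macro_f1(y_true, y_pred, labels):
    per_label = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        per_label.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(per_label) / len(per_label)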
15. WeChat Neural Machine Translation Systems for WMT20 [PDF] Back to Contents
Fandong Meng, Jianhao Yan, Yijin Liu, Yuan Gao, Xianfeng Zeng, Qinsong Zeng, Peng Li, Ming Chen, Jie Zhou, Sifan Liu, Hao Zhou
Abstract: We participate in the WMT 2020 shared news translation task on Chinese to English. Our system is based on the Transformer (Vaswani et al., 2017a) with effective variants and the DTMT (Meng and Zhang, 2019) architecture. In our experiments, we employ data selection, several synthetic data generation approaches (i.e., back-translation, knowledge distillation, and iterative in-domain knowledge transfer), advanced finetuning approaches and self-bleu based model ensemble. Our constrained Chinese to English system achieves 36.9 case-sensitive BLEU score, which is the highest among all submissions.
16. Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models [PDF] Back to Contents
Thai Binh Nguyen, Quang Minh Nguyen, Thi Thu Hien Nguyen, Quoc Truong Do, Chi Mai Luong
Abstract: Studies on the Named Entity Recognition (NER) task have shown outstanding results that reach human parity on input texts with correct text formattings, such as with proper punctuation and capitalization. However, such conditions are not available in applications where the input is speech, because the text is generated from a speech recognition system (ASR), and that the system does not consider the text formatting. In this paper, we (1) presented the first Vietnamese speech dataset for NER task, and (2) the first pre-trained public large-scale monolingual language model for Vietnamese that achieved the new state-of-the-art for the Vietnamese NER task by 1.3% absolute F1 score comparing to the latest study. And finally, (3) we proposed a new pipeline for NER task from speech that overcomes the text formatting problem by introducing a text capitalization and punctuation recovery model (CaPu) into the pipeline. The model takes input text from an ASR system and performs two tasks at the same time, producing proper text formatting that helps to improve NER performance. Experimental results indicated that the CaPu model helps to improve by nearly 4% of F1-score.
17. A Compare Aggregate Transformer for Understanding Document-grounded Dialogue [PDF] Back to Contents
Longxuan Ma, Weinan Zhang, Runxin Sun, Ting Liu
Abstract: Unstructured documents serving as external knowledge of the dialogues help to generate more informative responses. Previous research focused on knowledge selection (KS) in the document with dialogue. However, dialogue history that is not related to the current dialogue may introduce noise in the KS processing. In this paper, we propose a Compare Aggregate Transformer (CAT) to jointly denoise the dialogue context and aggregate the document information for response generation. We designed two different comparison mechanisms to reduce noise (before and during decoding). In addition, we propose two metrics for evaluating document utilization efficiency based on word overlap. Experimental results on the CMUDoG dataset show that the proposed CAT model outperforms the state-of-the-art approach and strong baselines.
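The abstract does not spell out the two utilization metrics; as an illustrative assumption, a word-overlap measure of how much of a generated response is grounded in the document could look like:

def document_utilization(response: str, document: str) -> float:
    resp_tokens = set(response.lower().split())
    doc_tokens = set(document.lower().split())
    if not resp_tokens:
        return 0.0
    # Fraction of response words that also occur in the grounding document.
    return len(resp_tokens & doc_tokens) / len(resp_tokens)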
18. Examining the rhetorical capacities of neural language models [PDF] Back to Contents
Zining Zhu, Chuer Pan, Mohamed Abdalla, Frank Rudzicz
Abstract: Recently, neural language models (LMs) have demonstrated impressive abilities in generating high-quality discourse. While many recent papers have analyzed the syntactic aspects encoded in LMs, there has been no analysis to date of the inter-sentential, rhetorical knowledge. In this paper, we propose a method that quantitatively evaluates the rhetorical capacities of neural LMs. We examine the capacities of neural LMs understanding the rhetoric of discourse by evaluating their abilities to encode a set of linguistic features derived from Rhetorical Structure Theory (RST). Our experiments show that BERT-based LMs outperform other Transformer LMs, revealing the richer discourse knowledge in their intermediate layer representations. In addition, GPT-2 and XLNet apparently encode less rhetorical knowledge, and we suggest an explanation drawing from linguistic philosophy. Our method shows an avenue towards quantifying the rhetorical capacities of neural LMs.
19. Learning from Mistakes: Combining Ontologies via Self-Training for Dialogue Generation [PDF] Back to Contents
Lena Reed, Vrindavan Harrison, Shereen Oraby, Dilek Hakkani-Tur, Marilyn Walker
Abstract: Natural language generators (NLGs) for task-oriented dialogue typically take a meaning representation (MR) as input. They are trained end-to-end with a corpus of MR/utterance pairs, where the MRs cover a specific set of dialogue acts and domain attributes. Creation of such datasets is labor-intensive and time-consuming. Therefore, dialogue systems for new domain ontologies would benefit from using data for pre-existing ontologies. Here we explore, for the first time, whether it is possible to train an NLG for a new larger ontology using existing training sets for the restaurant domain, where each set is based on a different ontology. We create a new, larger combined ontology, and then train an NLG to produce utterances covering it. For example, if one dataset has attributes for family-friendly and rating information, and the other has attributes for decor and service, our aim is an NLG for the combined ontology that can produce utterances that realize values for family-friendly, rating, decor and service. Initial experiments with a baseline neural sequence-to-sequence model show that this task is surprisingly challenging. We then develop a novel self-training method that identifies (errorful) model outputs, automatically constructs a corrected MR input to form a new (MR, utterance) training pair, and then repeatedly adds these new instances back into the training data. We then test the resulting model on a new test set. The result is a self-trained model whose performance is an absolute 75.4% improvement over the baseline model. We also report a human qualitative evaluation of the final model showing that it achieves high naturalness, semantic coherence and grammaticality
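A schematic of the self-training loop described above; model.fit, model.generate, and parse_slots are hypothetical placeholders, not the authors' API.

def self_train(model, parse_slots, train_pairs, unlabeled_mrs, rounds=3):
    # model: hypothetical NLG with .fit(pairs) and .generate(mr);
    # parse_slots: maps an utterance back to the MR it actually realizes.
    for _ in range(rounds):
        model.fit(train_pairs)
        for mr in unlabeled_mrs:
            utterance = model.generate(mr)          # output may realize wrong or missing slot values
            corrected_mr = parse_slots(utterance)   # build the corrected (MR, utterance) pair
            train_pairs.append((corrected_mr, utterance))
    return model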
20. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models [PDF] Back to Contents
Nikita Nangia, Clara Vania, Rasika Bhalerao, Samuel R. Bowman
Abstract: Pretrained language models, especially masked language models (MLMs) have seen success across many NLP tasks. However, there is ample evidence that they use the cultural biases that are undoubtedly present in the corpora they are trained on, implicitly creating harm with biased representations. To measure some forms of social bias in language models against protected demographic groups in the US, we introduce the Crowdsourced Stereotype Pairs benchmark (CrowS-Pairs). CrowS-Pairs has 1508 examples that cover stereotypes dealing with nine types of bias, like race, religion, and age. In CrowS-Pairs a model is presented with two sentences: one that is more stereotyping and another that is less stereotyping. The data focuses on stereotypes about historically disadvantaged groups and contrasts them with advantaged groups. We find that all three of the widely-used MLMs we evaluate substantially favor sentences that express stereotypes in every category in CrowS-Pairs. As work on building less biased models advances, this dataset can be used as a benchmark to evaluate progress.
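For context, masked language models are commonly scored on a sentence with a pseudo-log-likelihood; CrowS-Pairs compares such scores for the more- and less-stereotyping sentence of each pair (the paper's variant scores only the tokens shared by the two sentences, conditioning on the differing ones):

$$\mathrm{score}(S) = \sum_{w \in U} \log P_{\mathrm{MLM}}\big(w \mid S_{\setminus w}\big), \quad U = \text{the set of tokens being scored}$$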
摘要:预训练的语言模型,特别是掩盖语言模型(多层次营销)已经看到在许多NLP任务中取得成功。然而,有充分证据表明,他们使用无疑是存在的,他们被训练上,隐式偏交涉创建伤害语料库的文化偏见。为了测量而美国保护的人口群体的语言模型的一些形式的社会偏见,我们引进了众包刻板印象双基准(乌鸦对)。乌鸦,双有1508个例,覆盖定型处理九种类型的偏见,例如种族,宗教和年龄。乌鸦-对的模型呈现两句话:一个是比较定型,另一个是定型少。数据集中在关于历史上的弱势群体与强势群体的对比他们的成见。我们发现,所有这三个我们大致评估广泛使用的MLM的青睐表达的乌鸦,每双定型类的句子。作为建设更加客观模型的发展工作,该数据集可以作为一个基准来评估进展。
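A simplified way to probe an MLM for such preferences is to compare pseudo-log-likelihoods of the two sentences in a pair. The sketch below uses Hugging Face Transformers and masks one token at a time; note that the actual CrowS-Pairs metric conditions only on the tokens shared by both sentences, so this is an approximation, and the placeholder sentences are hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log P(token | rest of sentence), masking each token in turn."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# The model "prefers" the more stereotyping sentence if it scores it higher.
sent_more = "..."   # placeholder: the more stereotyping sentence of a pair
sent_less = "..."   # placeholder: the less stereotyping sentence
prefers_stereotype = pseudo_log_likelihood(sent_more) > pseudo_log_likelihood(sent_less)
```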
21. Interactive Re-Fitting as a Technique for Improving Word Embeddings [PDF] 返回目录
James Powell, Kari Sentz
Abstract: Word embeddings are a fixed, distributional representation of the context of words in a corpus learned from word co-occurrences. While word embeddings have proven to have many practical uses in natural language processing tasks, they reflect the attributes of the corpus upon which they are trained. Recent work has demonstrated that post-processing of word embeddings to apply information found in lexical dictionaries can improve their quality. We build on this post-processing technique by making it interactive. Our approach makes it possible for humans to adjust portions of a word embedding space by moving sets of words closer to one another. One motivating use case for this capability is to enable users to identify and reduce the presence of bias in word embeddings. Our approach allows users to trigger selective post-processing as they interact with and assess potential bias in word embeddings.
摘要:Word中的嵌入是的话,从字共现学语料库背景下的一个固定的,分布式的表示。虽然字的嵌入已经被证明在自然语言处理的任务很多实际应用,它们反映在他们被训练语料库的属性。最近的研究表明,应用信息的嵌入字的后期处理,词汇字典中找到能够提高自身的素质。我们建立在通过使这种互动的后处理技术。我们的方法使得有可能对人类来调整字通过移动套的话彼此靠近嵌入空间的部分。一个激励的情况下使用该功能是让用户识别和减少偏见的字嵌入物的存在。我们的方法允许用户选择性地触发后处理,因为它们相互作用并评估潜在的偏见字的嵌入。
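As one concrete (hypothetical) instance of such an interactive adjustment, the sketch below blends each vector in a user-selected word set toward the set's centroid; the paper's actual update rule may differ.

```python
import numpy as np

def pull_words_together(emb: dict, words: list, alpha: float = 0.3) -> dict:
    """Move a user-selected set of words toward their common centroid.

    A minimal sketch of interactive re-fitting: each selected vector is
    blended with the group centroid by a user-controlled strength alpha.
    """
    centroid = np.mean([emb[w] for w in words], axis=0)
    for w in words:
        emb[w] = (1 - alpha) * emb[w] + alpha * centroid
    return emb
```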
22. Multi-document Summarization with Maximal Marginal Relevance-guided Reinforcement Learning [PDF] 返回目录
Yuning Mao, Yanru Qu, Yiqing Xie, Xiang Ren, Jiawei Han
Abstract: While neural sequence learning methods have made significant progress in single-document summarization (SDS), they produce unsatisfactory results on multi-document summarization (MDS). We observe two major challenges when adapting SDS advances to MDS: (1) MDS involves a larger search space and yet more limited training data, setting obstacles for neural methods to learn adequate representations; (2) MDS needs to resolve higher information redundancy among the source documents, which SDS methods are less effective at handling. To close the gap, we present RL-MMR, Maximal Marginal Relevance-guided Reinforcement Learning for MDS, which unifies advanced neural SDS methods and statistical measures used in classical MDS. RL-MMR casts MMR guidance on fewer promising candidates, which restrains the search space and thus leads to better representation learning. Additionally, the explicit redundancy measure in MMR helps the neural representation of the summary to better capture redundancy. Extensive experiments demonstrate that RL-MMR achieves state-of-the-art performance on benchmark MDS datasets. In particular, we show the benefits of incorporating MMR into end-to-end learning when adapting SDS to MDS, in terms of both learning effectiveness and efficiency.
摘要:虽然神经序列学习方法在单文档文摘(SDS)取得了显著的进步,他们生产的多文档文摘(MDS)令人满意的结果。适应SDS进步到MDS时,我们观察到两个主要的挑战:(1)MDS包括更大的搜索空间,但较为有限的训练数据,神经的方法来学习足够的交涉设置障碍; (2)MDS需要解析源文件,这些文件SDS方法是不太有效的处理中更高的信息的冗余。为了缩小差距,我们现在RL-MMR,最大边缘相关引导强化学习的MDS,统一了先进的神经SDS方法和古典MDS使用统计的措施。 RL-MMR蒙上较少希望的候选MMR指导,这限制了搜索空间,从而带来更好的代表性学习。此外,在MMR明确的冗余措施有助于总结,以更好地捕捉冗余的神经表示。大量的实验表明,RL-MMR实现对基准MDS数据集的国家的最先进的性能。尤其是,我们显示了结合MMR到终端到终端的学习中既学习成效和效率方面适应SDS到MDS时的好处。
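For reference, the classical MMR criterion that RL-MMR builds on selects, at each step, the candidate maximizing λ·relevance minus (1−λ)·redundancy with already-selected items. Below is a plain (non-RL) selection sketch, with `relevance` and `similarity` as caller-supplied functions; it illustrates the statistical measure only, not the full RL-MMR model.

```python
def mmr_select(candidates, relevance, similarity, k=5, lam=0.7):
    """Greedy Maximal Marginal Relevance selection.

    relevance(c): salience of candidate c w.r.t. the documents/query.
    similarity(a, b): redundancy between two candidates.
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance(c) - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```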
23. AbuseAnalyzer: Abuse Detection, Severity and Target Prediction for Gab Posts [PDF] 返回目录
Mohit Chandra, Ashwin Pathak, Eesha Dutta, Paryul Jain, Manish Gupta, Manish Shrivastava, Ponnurangam Kumaraguru
Abstract: While the widespread popularity of online social media platforms has made information dissemination faster, it has also resulted in widespread online abuse of different types, such as hate speech, offensive language, and sexist and racist opinions. Detection and curtailment of such abusive content is critical for avoiding its psychological impact on victim communities, and thereby preventing hate crimes. Previous works have focused on classifying user posts into various forms of abusive behavior, but there has hardly been any focus on estimating the severity of abuse or its target. In this paper, we present a first-of-its-kind dataset of 7,601 posts from Gab that looks at online abuse from the perspective of the presence of abuse, its severity, and the target of abusive behavior. We also propose a system to address these tasks, obtaining an accuracy of ~80% for abuse presence detection, ~82% for abuse target detection, and ~64% for abuse severity detection.
摘要:虽然网络社交媒体平台的广泛普及使得信息传播速度更快,这也导致了不同类型,如仇恨言论,攻击性语言,性别歧视和种族主义的意见,等检测和滥用等内容的缩减的普遍滥用网络是至关重要的为避免对受害社区的心理影响,从而防止仇恨犯罪。以前的作品都集中在用户的职位分类成各种形式的虐待行为。但是,几乎没有受到任何注重估计滥用和目标的严重程度。在本文中,我们提出了一个第一的那种数据集从加布7601个员额不受虐待,严重程度和虐待行为对象的存在的角度网上滥用看起来。我们还提出了一个系统来处理这些任务,获得的〜80%的准确度滥用存在〜82%,滥用目标检测,并〜64%为严重滥用检测。
24. Linguistic Structure Guided Context Modeling for Referring Image Segmentation [PDF] 返回目录
Tianrui Hui, Si Liu, Shaofei Huang, Guanbin Li, Sansi Yu, Faxi Zhang, Jizhong Han
Abstract: Referring image segmentation aims to predict the foreground mask of the object referred to by a natural language sentence. The multimodal context of the sentence is crucial for distinguishing the referent from the background. Existing methods either insufficiently or redundantly model the multimodal context. To tackle this problem, we propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction and implement this scheme as a novel Linguistic Structure guided Context Modeling (LSCM) module. Our LSCM module builds a Dependency Parsing Tree suppressed Word Graph (DPT-WG), which guides all the words to include valid multimodal context of the sentence while excluding disturbing ones, through three steps over the multimodal feature: gathering, constrained propagation, and distributing. Extensive experiments on four benchmarks demonstrate that our method outperforms all previous state-of-the-art approaches.
摘要:参照图像分割目标来预测由自然语言句子称为对象的前景掩码。这句话的背景下多式联运是至关重要的所指对象从背景中区分。现有的方法要么不足或冗余建模多式联运这个HTTP URL解决这个问题,我们提出了“收集,繁殖,分发”计划,通过跨模式的互动多情境建模和实现这一计划,作为一种新型的语言结构引导上下文建模(LSCM )模块。我们的LSCM模块建立一个依存句法分析树抑制引导所有的字以包括句子的有效多峰上下文,同时通过三个步骤在所述多模态特征排除干扰的,即,收集,约束传播和分配词图(DPT-WG)。四个基准大量实验表明,我们的方法优于以前所有的国家的最艺术。
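As a generic illustration of propagation over a dependency-parse word graph (not the paper's exact DPT-WG module, which additionally gathers and distributes the visual feature), one can row-normalize the tree's adjacency and repeatedly average neighbor features:

```python
import numpy as np

def propagate_over_dependency_graph(word_feats, edges, steps=3):
    """Constrained propagation sketch over a dependency-tree word graph.

    word_feats: (n_words, d) array of word features.
    edges: list of (head, dependent) index pairs from the dependency parse.
    """
    n = word_feats.shape[0]
    adj = np.eye(n)                              # self-loops
    for h, d in edges:
        adj[h, d] = adj[d, h] = 1.0              # undirected tree edges
    adj = adj / adj.sum(axis=1, keepdims=True)   # row-normalize
    feats = word_feats
    for _ in range(steps):
        feats = adj @ feats                      # average neighbors' features
    return feats
```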
25. Referring Image Segmentation via Cross-Modal Progressive Comprehension [PDF] 返回目录
Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong Han, Luoqi Liu, Bo Li
Abstract: Referring image segmentation aims at segmenting the foreground masks of the entities that match the description given in a natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities, but usually fail to explore informative words of the expression to align features from the two modalities well for accurately identifying the referred entity. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task. Concretely, the CMPC module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, relational words are adopted to highlight the correct entity and suppress other irrelevant ones through multimodal graph reasoning. In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels under the guidance of textual information. In this way, features from multiple levels can communicate with each other and be refined based on the textual context. We conduct extensive experiments on four popular referring segmentation benchmarks and achieve new state-of-the-art performance.
摘要:在分割,可以很好匹配在自然语言表达式给出的描述中的实体的前景掩模参照图像分割目标。先前的方法解决使用视觉和语言模态之间的隐含特征交互和融合此问题,但通常无法探索到阱对准特征从所述两个模态的表达的信息字准确地识别称为实体。在本文中,我们提出了一个跨模态的进步理解(CMPC)模块和文本制导功能Exchange(TGFE)模块,有效地解决了具有挑战性的任务。具体而言,CMPC模块采用第一实体,属性词感知到的所有可能受表达被认为是相关实体。然后,关系词采用突出的多图形推理正确的实体以及其他抑制那些无关。除了CMPC模块,我们进一步利用一个简单而有效的TGFE模块,从不同的层面的理由多的功能与文本信息指导相结合。通过这种方式,从多层次特征可以相互通信和基于文本上下文加以改进。我们对四大流行指分割基准进行了广泛的实验,实现国家的最先进的新表演。
26. AMUSED: An Annotation Framework of Multi-modal Social Media Data [PDF] 返回目录
Gautam Kishore Shahi
Abstract: In this paper, we present a semi-automated framework called AMUSED for gathering multi-modal annotated data from multiple social media platforms. The framework is designed to mitigate the issues of collecting and annotating social media data by cohesively combining machine and human effort in the data collection process. From a given list of articles from professional news media or blogs, AMUSED detects links to social media posts within the news articles and then downloads the content of each post from the respective social media platform to gather details about that specific post. The framework is capable of fetching the annotated data from multiple platforms such as Twitter, YouTube, and Reddit. The framework aims to reduce the workload and the problems of annotating data from social media platforms. AMUSED can be applied in multiple application domains; as a use case, we have implemented the framework for collecting COVID-19 misinformation data from different social media platforms.
摘要:在本文中,我们提出了一个半自动化框架调用逗乐来自多个社交媒体平台收集的多模态注解数据。该框架旨在缓解收集并通过在数据收集过程团结一致结合机器和人的注释社交媒体数据的问题。从专业的新闻媒体或博客文章的一个给定的名单,逗得检测环节,从新闻文章的社交媒体帖子,然后下载从各自的社交媒体平台,同一职位的内容,以收集有关具体职位的详细信息。该框架是能够取从多个平台如Twitter,YouTube上,书签交易的注释的数据的。该框架旨在减少工作量和问题,从社会化媒体平台上的数据注解后面。发笑可以在多个应用程序域被应用,作为使用情况下,我们已经实现了用于从不同的社交媒体平台收集COVID-19误传数据的框架。
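The link-detection step described above can be approximated by scanning an article's HTML for URLs whose host belongs to a known social platform. The host list below is illustrative, not taken from the paper.

```python
import re
from urllib.parse import urlparse

SOCIAL_HOSTS = {"twitter.com", "youtube.com", "youtu.be", "reddit.com"}

def extract_social_links(html: str) -> list:
    """Pull candidate social-media post links out of a news article's HTML."""
    links = []
    for url in re.findall(r'https?://[^\s"\'<>]+', html):
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host in SOCIAL_HOSTS:
            links.append(url)
    return links
```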
27. A survey on natural language processing (nlp) and applications in insurance [PDF] 返回目录
Antoine Ly, Benno Uthayasooriyar, Tingting Wang
Abstract: Text is the most widely used means of communication today. This data is abundant but nevertheless complex to exploit within algorithms. For years, scientists have been trying to implement different techniques that enable computers to replicate some mechanisms of human reading. During the past five years, research has disrupted the capacity of algorithms to unleash the value of text data, and it brings many opportunities for the insurance industry today. Understanding those methods and, above all, knowing how to apply them is a major challenge and the key to unleashing the value of text data that have been stored for many years. Processing language with computers brings many new opportunities, especially in the insurance sector, where reports are central to the information used by insurers. SCOR's Data Analytics team has been working on the implementation of innovative tools and products that enable the use of the latest research on text analysis. Understanding text mining techniques in insurance enhances the monitoring of underwritten risks and of many processes that ultimately benefit policyholders. This article sets out to explain the opportunities that Natural Language Processing (NLP) provides to insurance. It details the different methods used in practice today and traces back their history. We also illustrate the implementation of certain methods using open-source libraries and Python code that we have developed to facilitate the use of these techniques. After giving a general overview of the evolution of text mining during the past few years, we describe how to conduct a full text-mining study and share some examples of serving such models in insurance products or services. Finally, we explain in more detail every step of a Natural Language Processing study, so that the reader can gain a deep understanding of the implementation.
摘要:文字是当今通信的最广泛使用的手段。这个数据是丰富,但仍然复杂算法中的漏洞。多年来,科学家们一直在努力执行不同的技术使计算机可以复制人阅读的一些机制。在过去的五年中,研究打乱的算法来释放文本数据的价值的能力。它带来了今天,对于保险industry.Understanding这些方法,首先,知道如何应用它们很多机会是一个重大的挑战和关键,释放已存放多年的文本数据的价值。随着计算机处理语言带来了许多新的机会,特别是在保险业,其中报告是由保险公司使用的信息中心。 SCOR的数据分析团队一直致力于创新的工具和产品,让使用文本分析的最新研究成果的落实。在保险的理解文本挖掘技术,增强了包销的风险,很多流程,最终受益policyholders.This文章监控提出解释机会,自然语言处理(NLP)是提供保险。它详细介绍今天在实践中的痕迹使用不同的方法支持他们的故事。我们也说明了使用开源库和Python代码,我们已经开发了方便的使用这些techniques.After在过去几年给文本挖掘演变的总体概述的某些方法的实施,我们分享如何进行与文本挖掘和分享一些例子充分研究服务于那些模型到保险产品或服务。最后,我们在更多的细节每一步对构成自然语言处理研究,以确保读者能够对执行情况的深刻理解解释。
28. RRF102: Meeting the TREC-COVID Challenge with a 100+ Runs Ensemble [PDF] 返回目录
Michael Bendersky, Honglei Zhuang, Ji Ma, Shuguang Han, Keith Hall, Ryan McDonald
Abstract: In this paper, we report the results of our participation in the TREC-COVID challenge. To meet the challenge of building a search engine for a rapidly evolving biomedical collection, we propose a simple yet effective weighted hierarchical rank fusion approach that ensembles together 102 runs from (a) lexical and semantic retrieval systems, (b) pre-trained and fine-tuned BERT rankers, and (c) relevance feedback runs. Our ablation studies demonstrate the contributions of each of these systems to the overall ensemble. The submitted ensemble runs achieved state-of-the-art performance in rounds 4 and 5 of the TREC-COVID challenge.
摘要:在本文中,我们报道了我们在TREC-COVID挑战参与的结果。为了满足建立一个搜索引擎快速发展的生物医学收集的挑战,我们提出了一个简单而有效的加权系列等级融合的方法,即歌舞团一起从(一)词汇和语义检索系统102点运行时,(B)预先训练和精细-tuned BERT rankers,和(c)相关反馈运行。我们的消融研究表明这些系统的整体集成的贡献。提交的整体运行实现了轮4和TREC-COVID挑战5国家的最先进的性能。
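The abstract does not give the exact fusion formula; a common concrete instantiation of weighted rank fusion is weighted reciprocal rank fusion, sketched below under that assumption. The paper applies fusion hierarchically over groups of runs, whereas this sketch fuses all runs in one flat pass.

```python
from collections import defaultdict

def weighted_rrf(runs, weights=None, k=60):
    """Weighted reciprocal rank fusion over multiple ranked runs.

    runs: list of ranked lists of doc ids (best first).
    weights: optional per-run weights (defaults to uniform).
    """
    weights = weights or [1.0] * len(runs)
    scores = defaultdict(float)
    for run, w in zip(runs, weights):
        for rank, doc in enumerate(run, start=1):
            scores[doc] += w / (k + rank)   # higher ranks contribute more
    return sorted(scores, key=scores.get, reverse=True)
```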
29. Dual Attention Model for Citation Recommendation [PDF] 返回目录
Yang Zhang, Qiang Ma
Abstract: With the number of academic articles increasing exponentially, discovering and citing comprehensive and appropriate resources has become a non-trivial task. Conventional citation recommender methods suffer from severe information loss. For example, they do not consider the section on which a user is working, the relatedness between words, or the importance of words. These shortcomings make such methods insufficient for recommending adequate citations when working on manuscripts. In this study, we propose a novel approach called the dual attention model for citation recommendation (DACR) to recommend citations during manuscript preparation. Our method considers three dimensions of information: contextual words, structural contexts, and the section on which a user is working. The core of the proposed model is composed of self-attention and additive attention, where the former aims to capture the relatedness between input information, and the latter aims to learn the importance of inputs. Experiments on real-world datasets demonstrate the effectiveness of the proposed approach.
摘要:基于指数增加一些学术文章,发现和引用全面和适当的资源已成为一个不平凡的任务。传统的引文推荐方法从严重的信息遭受损失。例如,他们不考虑在其上用户工作的部分,字与字之间的关联性,或词的重要性。这些缺点使得这种方法不足以对手稿工作时,建议适当引用。在这项研究中,我们提出了一个要求引用推荐(DACR)兼顾模型的新方法稿件准备过程中,建议引用。我们的方法考虑的信息的三个方面:上下文单词,结构的上下文,并且在其中用户正在工作的部分。该模型的核心是由自我关注和添加剂的关注,其中前者旨在捕捉输入信息,而后者的目标之间的关联性学习投入的重要性。现实世界的数据集实验验证了该方法的有效性。
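A generic additive-attention block of the kind named in the abstract (learning per-input importance weights and pooling the inputs accordingly) can be sketched as follows; this is a standard formulation, not the authors' exact DACR architecture.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive attention that learns per-input importance weights."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.proj = nn.Linear(dim, hidden)
        self.score = nn.Linear(hidden, 1, bias=False)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, n_inputs, dim) -> importance-weighted sum: (batch, dim)
        weights = torch.softmax(self.score(torch.tanh(self.proj(inputs))), dim=1)
        return (weights * inputs).sum(dim=1)
```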
30. Multi-label Classification of Common Bengali Handwritten Graphemes: Dataset and Challenge [PDF] 返回目录
Samiul Alam, Tahsin Reasat, Asif Shahriyar Sushmit, Sadi Mohammad Siddiquee, Fuad Rahman, Mahady Hasan, Ahmed Imtiaz Humayun
Abstract: Latin has historically led the state-of-the-art in handwritten optical character recognition (OCR) research. Adapting existing systems from Latin to alpha-syllabary languages is particularly challenging due to the sharp contrast between their orthographies. The segmentation of graphical constituents corresponding to characters becomes significantly harder due to a cursive writing system and the frequent use of diacritics in the alpha-syllabary family of languages. We propose a labeling scheme based on graphemes (linguistic segments of word formation) that makes segmentation inside alpha-syllabary words linear, and we present the first dataset of Bengali handwritten graphemes that are commonly used in an everyday context. The dataset is open-sourced as part of the Bengali.AI Handwritten Grapheme Classification Challenge on Kaggle to benchmark vision algorithms for multi-label grapheme classification. From the competition proceedings, we see that deep learning methods can generalize to a large span of uncommon graphemes even when they are absent during training.
摘要:拉美历史上带动了国家的最先进的手写光学字符识别(OCR)的研究。适应从拉丁现有系统的α-音节语言是特别具有挑战性的,因为它们的正字法之间的鲜明对比。对应的字符图形成分的分割变得草书书写系统和语言的字母音节家庭频繁使用变音符号的显著硬所致。我们建议,使分割内部的α-音节字线性和呈现常用于日常上下文中使用孟加拉语手写字形的第一数据集基于字形(构词语言链段)的标记方案。该数据集是开源的这个HTTP URL手写字形分类挑战上Kaggle的一部分,以基准视觉算法的多标签分类字形。从竞争诉讼,我们看到深的学习方法可以推广到大跨度罕见字形的,甚至当他们在训练中缺席。
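A minimal multi-label setup for this task uses one shared image encoder with a separate classification head per grapheme component. The component names and class counts below are illustrative assumptions, not taken from the paper, and the tiny encoder stands in for whatever backbone a competitor would actually use.

```python
import torch
import torch.nn as nn

class GraphemeClassifier(nn.Module):
    """Multi-label grapheme classifier: shared encoder, one head per component."""

    def __init__(self, n_root=168, n_vowel=11, n_consonant=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.root = nn.Linear(32, n_root)            # grapheme root
        self.vowel = nn.Linear(32, n_vowel)          # vowel diacritic
        self.consonant = nn.Linear(32, n_consonant)  # consonant diacritic

    def forward(self, x):  # x: (batch, 1, H, W) grayscale handwriting image
        h = self.encoder(x)
        return self.root(h), self.vowel(h), self.consonant(h)
```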
注:中文为机器翻译结果!封面为论文标题词云图!