
[arXiv Papers] Computation and Language 2020-04-27

Contents

1. Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [PDF] Abstract
2. Lite Transformer with Long-Short Range Attention [PDF] Abstract
3. Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [PDF] Abstract
4. Event-QA: A Dataset for Event-Centric Question Answering over Knowledge Graphs [PDF] Abstract
5. On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [PDF] Abstract
6. FLAT: Chinese NER Using Flat-Lattice Transformer [PDF] Abstract
7. Exploring Explainable Selection to Control Abstractive Generation [PDF] Abstract
8. ST$^2$: Small-data Text Style Transfer via Multi-task Meta-Learning [PDF] Abstract
9. Coach: A Coarse-to-Fine Approach for Cross-domain Slot Filling [PDF] Abstract
10. Residual Energy-Based Models for Text Generation [PDF] Abstract
11. GCAN: Graph-aware Co-Attention Networks for Explainable Fake News Detection on Social Media [PDF] Abstract
12. Learning the grammar of prescription: recurrent neural network grammars for medication information extraction in clinical texts [PDF] Abstract
13. Customization and modifications of SignWriting by LIS users [PDF] Abstract
14. Probabilistically Masked Language Model Capable of Autoregressive Generation in Arbitrary Word Order [PDF] Abstract
15. G-DAUG: Generative Data Augmentation for Commonsense Reasoning [PDF] Abstract
16. UHH-LT & LT2 at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer Networks for Offensive Language Detection [PDF] Abstract
17. Multiple Segmentations of Thai Sentences for Neural Machine Translation [PDF] Abstract
18. A Tool for Facilitating OCR Postediting in Historical Documents [PDF] Abstract
19. A Gamma-Poisson Mixture Topic Model for Short Text [PDF] Abstract
20. Transliteration of Judeo-Arabic Texts into Arabic Script Using Recurrent Neural Networks [PDF] Abstract
21. Deep Sentiment Classification and Topic Discovery on Novel Coronavirus or COVID-19 Online Discussions: NLP Using LSTM Recurrent Neural Network Approach [PDF] Abstract
22. Social Interactions or Business Transactions? What customer reviews disclose about Airbnb marketplace [PDF] Abstract
23. Characterising User Content on a Multi-lingual Social Network [PDF] Abstract
24. Upgrading the Newsroom: An Automated Image Selection System for News Articles [PDF] Abstract
25. End-to-end speech-to-dialog-act recognition [PDF] Abstract

Abstracts

1. Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [PDF] Back to Contents
  Alexander R. Fabbri, Patrick Ng, Zhiguo Wang, Ramesh Nallapati, Bing Xiang
Abstract: Question Answering (QA) is in increasing demand as the amount of information available online and the desire for quick access to this content grows. A common approach to QA has been to fine-tune a pretrained language model on a task-specific labeled dataset. This paradigm, however, relies on scarce, and costly to obtain, large-scale human-labeled data. We propose an unsupervised approach to training QA models with generated pseudo-training data. We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance by allowing the model to learn more complex context-question relationships. Training a QA model on this data gives a relative improvement over a previous unsupervised model in F1 score on the SQuAD dataset by about 14%, and 20% when the answer is a named entity, achieving state-of-the-art performance on SQuAD for unsupervised QA.
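As a rough illustration of the template idea, the sketch below builds a question by swapping the answer entity in a retrieved sentence for a wh-word; the entity-type table, function name, and example data are assumptions for illustration, not the paper's actual templates or retrieval pipeline.

```python
# Hypothetical sketch of template-based question generation from a retrieved sentence.
WH_BY_ENTITY_TYPE = {"PERSON": "Who", "GPE": "Where", "DATE": "When", "ORG": "What organization"}

def make_question(retrieved_sentence: str, answer: str, answer_type: str) -> str:
    """Replace the answer span in a retrieved sentence (not the original
    context sentence) with a wh-word to form a pseudo-question."""
    wh = WH_BY_ENTITY_TYPE.get(answer_type, "What")
    return retrieved_sentence.replace(answer, wh, 1).rstrip(". ") + "?"

# The generated question is paired with the original context passage and the
# answer span to form pseudo-training data for the QA model.
print(make_question("Marie Curie won the Nobel Prize in Physics in 1903.",
                    "Marie Curie", "PERSON"))
# -> "Who won the Nobel Prize in Physics in 1903?"
```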

2. Lite Transformer with Long-Short Range Attention [PDF] Back to Contents
  Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, Song Han
Abstract: Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires enormous amount of computations to achieve high performance, which makes it not suitable for mobile applications that are tightly constrained by the hardware resources and battery. In this paper, we present an efficient mobile NLP architecture, Lite Transformer to facilitate deploying mobile NLP applications on edge devices. The key primitive is the Long-Short Range Attention (LSRA), where one group of heads specializes in the local context modeling (by convolution) while another group specializes in the long-distance relationship modeling (by attention). Such specialization brings consistent improvement over the vanilla transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling. Under constrained resources (500M/100M MACs), Lite Transformer outperforms transformer on WMT'14 English-French by 1.2/1.7 BLEU, respectively. Lite Transformer reduces the computation of transformer base model by 2.5x with 0.3 BLEU score degradation. Combining with pruning and quantization, we further compressed the model size of Lite Transformer by 18.2x. For language modeling, Lite Transformer achieves 1.8 lower perplexity than the transformer at around 500M MACs. Notably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 higher BLEU for the mobile NLP setting without the costly architecture search that requires more than 250 GPU years. Code has been made available at this https URL.
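A minimal PyTorch sketch of the long-short range split is given below, assuming the common reading that the input channels are divided between a self-attention branch (global context) and a convolution branch (local context); the original model uses lightweight/dynamic convolutions and specific sizes, so this is illustrative only.

```python
import torch
import torch.nn as nn

class LSRA(nn.Module):
    """Sketch of Long-Short Range Attention: half the channels go to
    self-attention (long range), half to a depthwise conv (short range)."""
    def __init__(self, d_model: int, n_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        assert d_model % 2 == 0
        half = d_model // 2
        self.attn = nn.MultiheadAttention(half, n_heads, batch_first=True)
        self.conv = nn.Conv1d(half, half, kernel_size,
                              padding=kernel_size // 2, groups=half)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, time, d_model)
        a, c = x.chunk(2, dim=-1)
        a, _ = self.attn(a, a, a)                          # global branch
        c = self.conv(c.transpose(1, 2)).transpose(1, 2)   # local branch
        return self.out(torch.cat([a, c], dim=-1))

y = LSRA(d_model=64)(torch.randn(2, 10, 64))               # -> (2, 10, 64)
```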

3. Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [PDF] Back to Contents
  Biao Zhang, Philip Williams, Ivan Titov, Rico Sennrich
Abstract: Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations. In this paper, we explore ways to improve them. We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics, and overcome this bottleneck via language-specific components and deepening NMT architectures. We identify the off-target translation issue (i.e. translating into a wrong target language) as the major source of the inferior zero-shot performance, and propose random online backtranslation to enforce the translation of unseen training language pairs. Experiments on OPUS-100 (a novel multilingual dataset with 100 languages) show that our approach substantially narrows the performance gap with bilingual models in both one-to-many and many-to-many settings, and improves zero-shot performance by ~10 BLEU, approaching conventional pivot-based methods.
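The sketch below illustrates one plausible reading of random online backtranslation for zero-shot directions; the batch format and the `model.translate` call are placeholders, not the authors' implementation.

```python
import random

LANGS = ["en", "de", "fr", "zh"]

def robt_augment(batch, model, langs=LANGS):
    """For each (src, tgt, tgt_lang) pair, back-translate the target side into a
    randomly chosen other language with the current model, and add the synthetic
    pair so unseen (pivot -> tgt_lang) directions receive training signal."""
    augmented = list(batch)
    for src, tgt, tgt_lang in batch:
        pivot = random.choice([l for l in langs if l != tgt_lang])
        synthetic_src = model.translate(tgt, to_lang=pivot)   # placeholder API
        augmented.append((synthetic_src, tgt, tgt_lang))
    return augmented
```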

4. Event-QA: A Dataset for Event-Centric Question Answering over Knowledge Graphs [PDF] Back to Contents
  Tarcísio Souza Costa, Simon Gottschalk, Elena Demidova
Abstract: Semantic Question Answering (QA) is the key technology to facilitate intuitive user access to semantic information stored in knowledge graphs. Whereas most of the existing QA systems and datasets focus on entity-centric questions, very little is known about the performance of these systems in the context of events. As new event-centric knowledge graphs emerge, datasets for such questions gain importance. In this paper we present the Event-QA dataset for answering event-centric questions over knowledge graphs. Event-QA contains 1000 semantic queries and the corresponding English, German and Portuguese verbalisations for EventKG - a recently proposed event-centric knowledge graph with over 1 million events.

5. On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [PDF] Back to Contents
  Biao Zhang, Ivan Titov, Rico Sennrich
Abstract: Sequence-to-sequence models usually transfer all encoder outputs to the decoder for generation. In this work, by contrast, we hypothesize that these encoder outputs can be compressed to shorten the sequence delivered for decoding. We take Transformer as the testbed and introduce a layer of stochastic gates in-between the encoder and the decoder. The gates are regularized using the expected value of the sparsity-inducing L0 penalty, resulting in completely masking-out a subset of encoder outputs. In other words, via joint training, the L0DROP layer forces Transformer to route information through a subset of its encoder states. We investigate the effects of this sparsification on two machine translation and two summarization tasks. Experiments show that, depending on the task, around 40-70% of source encodings can be pruned without significantly compromising quality. The decrease of the output length endows L0DROP with the potential of improving decoding efficiency, where it yields a speedup of up to 1.65x on document summarization tasks against the standard Transformer. We analyze the L0DROP behaviour and observe that it exhibits systematic preferences for pruning certain word types, e.g., function words and punctuation get pruned most. Inspired by these observations, we explore the feasibility of specifying rule-based patterns that mask out encoder outputs based on information such as part-of-speech tags, word frequency and word position.
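A simplified sketch of such a gate layer is shown below, using the standard hard-concrete parameterization commonly used for expected-L0 regularization (Louizos et al.); the paper's exact gate design and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class L0Gate(nn.Module):
    """Stochastic gates over encoder states with an expected-L0 penalty."""
    def __init__(self, d_model: int, beta: float = 0.5,
                 gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.logit = nn.Linear(d_model, 1)       # per-position gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, enc: torch.Tensor):        # enc: (batch, time, d_model)
        log_alpha = self.logit(enc).squeeze(-1)
        if self.training:                        # sample a hard-concrete gate
            u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / self.beta)
        else:
            s = torch.sigmoid(log_alpha / self.beta)
        z = (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)
        # Expected number of open gates: added to the loss as the sparsity penalty.
        l0 = torch.sigmoid(log_alpha - self.beta *
                           torch.log(torch.tensor(-self.gamma / self.zeta))).sum()
        return enc * z.unsqueeze(-1), l0         # masked encoder states + penalty

gated, penalty = L0Gate(d_model=512)(torch.randn(2, 7, 512))
```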

6. FLAT: Chinese NER Using Flat-Lattice Transformer [PDF] Back to Contents
  Xiaonan Li, Hang Yan, Xipeng Qiu, Xuanjing Huang
Abstract: Recently, the character-word lattice structure has been proved to be effective for Chinese named entity recognition (NER) by incorporating the word information. However, since the lattice structure is complex and dynamic, most existing lattice-based models are hard to fully utilize the parallel computation of GPUs and usually have a low inference-speed. In this paper, we propose FLAT: Flat-LAttice Transformer for Chinese NER, which converts the lattice structure into a flat structure consisting of spans. Each span corresponds to a character or latent word and its position in the original lattice. With the power of Transformer and well-designed position encoding, FLAT can fully leverage the lattice information and has an excellent parallelization ability. Experiments on four datasets show FLAT outperforms other lexicon-based models in performance and efficiency.
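The core data transformation, flattening a character-word lattice into spans with head and tail positions that a Transformer can attend over, can be sketched as follows; the toy lexicon and matching loop are illustrative placeholders.

```python
LEXICON = {"重庆", "重庆人", "人和药店", "药店"}   # toy lexicon for illustration

def flatten_lattice(sentence: str, lexicon=LEXICON):
    """Return (token, head, tail) spans: characters plus matched lexicon words,
    all sharing one position axis so self-attention can relate them."""
    spans = [(ch, i, i) for i, ch in enumerate(sentence)]
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence) + 1):
            if sentence[i:j] in lexicon:
                spans.append((sentence[i:j], i, j - 1))
    return spans

print(flatten_lattice("重庆人和药店"))
# characters ('重', 0, 0) ... ('店', 5, 5) plus words such as ('重庆', 0, 1) and ('药店', 4, 5)
```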

7. Exploring Explainable Selection to Control Abstractive Generation [PDF] Back to Contents
  Wang Haonan, Gao Yang, Bai Yu, Mirella Lapata, Huang Heyan
Abstract: It is a big challenge to model long-range input for document summarization. In this paper, we target using a select and generate paradigm to enhance the capability of selecting explainable contents (i.e., interpret the selection given its semantics, novelty, relevance) and then guiding to control the abstract generation. Specifically, a newly designed pair-wise extractor is proposed to capture the sentence pair interactions and their centrality. Furthermore, the generator is hybrid with the selected content and is jointly integrated with a pointer distribution that is derived from a sentence deployment's attention. The abstract generation can be controlled by an explainable mask matrix that determines to what extent the content can be included in the summary. Encoders are adaptable with both Transformer-based and BERT-based configurations. Overall, both results based on ROUGE metrics and human evaluation gain outperformance over several state-of-the-art models on two benchmark CNN/DailyMail and NYT datasets.

8. ST$^2$: Small-data Text Style Transfer via Multi-task Meta-Learning [PDF] Back to Contents
  Xiwen Chen, Kenny Q. Zhu
Abstract: Text style transfer aims to paraphrase a sentence in one style into another style while preserving content. Due to lack of parallel training data, state-of-art methods are unsupervised and rely on large datasets that share content. Furthermore, existing methods have been applied on very limited categories of styles such as positive/negative and formal/informal. In this work, we develop a meta-learning framework to transfer between any kind of text styles, including personal writing styles that are more fine-grained, share less content and have much smaller training data. While state-of-art models fail in the few-shot style transfer task, our framework effectively utilizes information from other styles to improve both language fluency and style transfer accuracy.

9. Coach: A Coarse-to-Fine Approach for Cross-domain Slot Filling [PDF] Back to Contents
  Zihan Liu, Genta Indra Winata, Peng Xu, Pascale Fung
Abstract: As an essential task in task-oriented dialog systems, slot filling requires extensive training data in a certain domain. However, such data are not always available. Hence, cross-domain slot filling has naturally arisen to cope with this data scarcity problem. In this paper, we propose a Coarse-to-fine approach (Coach) for cross-domain slot filling. Our model first learns the general pattern of slot entities by detecting whether the tokens are slot entities or not. It then predicts the specific types for the slot entities. In addition, we propose a template regularization approach to improve the adaptation robustness by regularizing the representation of utterances based on utterance templates. Experimental results show that our model significantly outperforms state-of-the-art approaches in slot filling. Furthermore, our model can also be applied to the cross-domain named entity recognition task, and it achieves better adaptation performance than other existing baselines. The code is available at this https URL.

10. Residual Energy-Based Models for Text Generation [PDF] Back to Contents
  Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, Marc'Aurelio Ranzato
Abstract: Text generation is ubiquitous in many NLP tasks, from summarization, to dialogue and machine translation. The dominant parametric approach is based on locally normalized models which predict one word at a time. While these work remarkably well, they are plagued by exposure bias due to the greedy nature of the generation process. In this work, we investigate un-normalized energy-based models (EBMs) which operate not at the token but at the sequence level. In order to make training tractable, we first work in the residual of a pretrained locally normalized language model and second we train using noise contrastive estimation. Furthermore, since the EBM works at the sequence level, we can leverage pretrained bi-directional contextual representations, such as BERT and RoBERTa. Our experiments on two large language modeling datasets show that residual EBMs yield lower perplexity compared to locally normalized baselines. Moreover, generation via importance sampling is very efficient and of higher quality than the baseline models according to human evaluation.
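Generation from such a residual model can be sketched as importance sampling: draw candidate continuations from the base LM and resample one with weight proportional to exp(-E(x)). In the sketch below, `lm_sample` and `energy` are hypothetical stand-ins for the pretrained LM and the learned energy network.

```python
import torch

def generate_residual_ebm(prefix, lm_sample, energy, n_candidates: int = 16):
    # Proposal: i.i.d. continuations from the locally normalized base LM.
    candidates = [lm_sample(prefix) for _ in range(n_candidates)]
    # Self-normalized importance weights: the LM factor cancels because the LM
    # is also the proposal, leaving weights proportional to exp(-E(x)).
    weights = torch.softmax(-torch.tensor([energy(c) for c in candidates]), dim=0)
    idx = torch.multinomial(weights, num_samples=1).item()
    return candidates[idx]
```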

11. GCAN: Graph-aware Co-Attention Networks for Explainable Fake News Detection on Social Media [PDF] Back to Contents
  Yi-Ju Lu, Cheng-Te Li
Abstract: This paper solves the fake news detection problem under a more realistic scenario on social media. Given the source short-text tweet and the corresponding sequence of retweet users without text comments, we aim at predicting whether the source tweet is fake or not, and generating explanation by highlighting the evidences on suspicious retweeters and the words they concern. We develop a novel neural network-based model, Graph-aware Co-Attention Networks (GCAN), to achieve the goal. Extensive experiments conducted on real tweet datasets exhibit that GCAN can significantly outperform state-of-the-art methods by 16% in accuracy on average. In addition, the case studies also show that GCAN can produce reasonable explanations.

12. Learning the grammar of prescription: recurrent neural network grammars for medication information extraction in clinical texts [PDF] Back to Contents
  Ivan Lerner, Jordan Jouffroy, Anita Burgun, Antoine Neuraz
Abstract: In this study, we evaluated the RNNG, a neural top-down transition based parser, for medication information extraction in clinical texts. We evaluated this model on a French clinical corpus. The task was to extract the name of a drug (or class of drug), as well as fields informing its administration: frequency, dosage, duration, condition and route of administration. We compared the RNNG model that jointly identify entities and their relations with separate BiLSTMs models for entities and relations as baselines. We call seq-BiLSTMs the baseline models for relations extraction that takes as extra-input the output of the BiLSTMs for entities. RNNG outperforms seq-BiLSTM for identifying relations, with on average 88.5% [87.2-89.8] versus 84.6 [83.1-86.1] F-measure. However, RNNG is weaker than the baseline BiLSTM on detecting entities, with on average 82.4 [80.8-83.8] versus 84.1 [82.7-85.6] % F-measure. RNNG trained only for detecting relations is weaker than RNNG with the joint modelling objective, 87.4 [85.8-88.8] versus 88.5% [87.2-89.8]. The performance of RNNG on relations can be explained both by the model architecture, which provides shortcut between distant parts of the sentence, and the joint modelling objective which allow the RNNG to learn richer representations. RNNG is efficient for modeling relations between entities in medical texts and its performances are close to those of a BiLSTM for entity detection.

13. Customization and modifications of SignWriting by LIS users [PDF] Back to Contents
  Claudia S. Bianchini, Fabrizio Borgia, Margherita Castelli
Abstract: Historically, the various sign languages (SL) have not developed an own writing system; nevertheless, some systems exist, among which the SignWriting (SW) is a powerful and flexible one. In this paper, we present the mechanisms adopted by signers of the Italian Sign Language (LIS), expert users of SW, to modify the standard SW glyphs and increase their writing skills and/or represent peculiar linguistic phenomena. We identify these glyphs and show which characteristics make them "acceptable" by the expert community. Eventually, we analyze the potentialities of these glyphs in hand writing and in computer-assisted writing, focusing on SWift, a software designed to allow the electronic writing-down of user-modified glyphs.

14. Probabilistically Masked Language Model Capable of Autoregressive Generation in Arbitrary Word Order [PDF] Back to Contents
  Yi Liao, Xin Jiang, Qun Liu
Abstract: Masked language model and autoregressive language model are two types of language models. While pretrained masked language models such as BERT overwhelm the line of natural language understanding (NLU) tasks, autoregressive language models such as GPT are especially capable in natural language generation (NLG). In this paper, we propose a probabilistic masking scheme for the masked language model, which we call probabilistically masked language model (PMLM). We implement a specific PMLM with a uniform prior distribution on the masking ratio named u-PMLM. We prove that u-PMLM is equivalent to an autoregressive permutated language model. One main advantage of the model is that it supports text generation in arbitrary order with surprisingly good quality, which could potentially enable new applications over traditional unidirectional generation. Besides, the pretrained u-PMLM also outperforms BERT on a set of downstream NLU tasks.
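The masking scheme itself is simple to sketch: sample a masking ratio from a uniform prior, then mask that fraction of positions. The token ids and the [MASK] id below are placeholders, not the paper's vocabulary.

```python
import random

MASK_ID = 103   # hypothetical [MASK] token id

def probabilistic_mask(token_ids):
    """u-PMLM-style masking: the masking ratio is drawn from Uniform(0, 1)."""
    ratio = random.uniform(0.0, 1.0)
    n_mask = max(1, round(ratio * len(token_ids)))
    positions = random.sample(range(len(token_ids)), n_mask)
    masked = [MASK_ID if i in positions else t for i, t in enumerate(token_ids)]
    return masked, positions      # the original tokens at `positions` are the targets

masked, positions = probabilistic_mask([7, 12, 95, 4, 88, 23])
```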

15. G-DAUG: Generative Data Augmentation for Commonsense Reasoning [PDF] Back to Contents
  Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, Doug Downey
Abstract: Recent advances in commonsense reasoning depend on large-scale human-annotated training data to achieve peak performance. However, manual curation of training examples is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit on. We investigate G-DAUG, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting. Our approach generates synthetic examples using pretrained language models, and selects the most informative and diverse set of examples for data augmentation. In experiments with multiple commonsense reasoning benchmarks, G-DAUG consistently outperforms existing data augmentation methods based on back-translation, and establishes a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA. Further, in addition to improvements in in-distribution accuracy, G-DAUG-augmented training also enhances out-of-distribution generalization, showing greater robustness against adversarial or perturbed examples. Our analysis demonstrates that G-DAUG produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance. Our findings encourage future research toward generative data augmentation to enhance both in-distribution learning and out-of-distribution generalization.

16. UHH-LT & LT2 at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer Networks for Offensive Language Detection [PDF] Back to Contents
  Gregor Wiedemann, Seid Muhie Yimam, Chris Biemann
Abstract: Fine-tuning of pre-trained transformer networks such as BERT yield state-of-the-art results for text classification tasks. Typically, fine-tuning is performed on task-specific training datasets in a supervised manner. One can also fine-tune in unsupervised manner beforehand by further pre-training the masked language modeling (MLM) task. Hereby, in-domain data for unsupervised MLM resembling the actual classification target dataset allows for domain adaptation of the model. In this paper, we compare current pre-trained transformer networks with and without MLM fine-tuning on their performance for offensive language detection. Two different ensembles of our best performing classifiers rank 1st and 2nd out of 85 teams participating in the SemEval 2020 Shared Task 12 for the English language.

17. Multiple Segmentations of Thai Sentences for Neural Machine Translation [PDF] Back to Contents
  Alberto Poncelas, Wichaya Pidchamook, Chao-Hong Liu, James Hadley, Andy Way
Abstract: Thai is a low-resource language, so it is often the case that data is not available in sufficient quantities to train an Neural Machine Translation (NMT) model which perform to a high level of quality. In addition, the Thai script does not use white spaces to delimit the boundaries between words, which adds more complexity when building sequence to sequence models. In this work, we explore how to augment a set of English--Thai parallel data by replicating sentence-pairs with different word segmentation methods on Thai, as training data for NMT model training. Using different merge operations of Byte Pair Encoding, different segmentations of Thai sentences can be obtained. The experiments show that combining these datasets, performance is improved for NMT models trained with a dataset that has been split using a supervised splitting tool.
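One way to realize this, sketched below under the assumption that SentencePiece BPE with different vocabulary sizes stands in for the paper's segmentation toolkit and merge-operation settings, is to segment the same Thai corpus several times and concatenate the copies as training data. The file paths are assumed for illustration.

```python
import sentencepiece as spm

for vocab_size in (4000, 8000, 16000):           # different numbers of merges
    spm.SentencePieceTrainer.train(
        input="train.th",                        # assumed path to the Thai side
        model_prefix=f"th_bpe_{vocab_size}",
        vocab_size=vocab_size,
        model_type="bpe",
    )
    sp = spm.SentencePieceProcessor(model_file=f"th_bpe_{vocab_size}.model")
    with open("train.th", encoding="utf-8") as src, \
         open(f"train.bpe{vocab_size}.th", "w", encoding="utf-8") as out:
        for line in src:
            out.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
# The differently segmented copies (paired with the same English side) are then
# concatenated into one NMT training corpus.
```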

18. A Tool for Facilitating OCR Postediting in Historical Documents [PDF] Back to Contents
  Alberto Poncelas, Mohammad Aboomar, Jan Buts, James Hadley, Andy Way
Abstract: Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary ,1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
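The suggestion mechanism can be roughly sketched as follows: flag word forms missing from the reference vocabulary, generate near matches, and keep the candidate whose substitution scores best under a language model. The vocabulary, candidate generator, and `lm_score` below are simplified placeholders rather than the tool's components.

```python
from difflib import get_close_matches

VOCAB = {"regulating", "the", "trade", "and", "employing", "poor", "kingdom"}

def lm_score(words):
    # Placeholder for a real language-model score of the corrected sentence.
    return sum(1.0 for w in words if w in VOCAB)

def postedit(tokens, vocab=VOCAB):
    corrected = list(tokens)
    for i, tok in enumerate(tokens):
        if tok.lower() not in vocab:                        # assumed OCR error
            candidates = get_close_matches(tok.lower(), vocab, n=3, cutoff=0.7)
            if candidates:
                corrected[i] = max(candidates, key=lambda c: lm_score(
                    corrected[:i] + [c] + corrected[i + 1:]))
    return corrected

print(postedit(["Regulatmg", "the", "Tradc", "and", "Employing", "the", "Poor"]))
# -> ['regulating', 'the', 'trade', 'and', 'Employing', 'the', 'Poor']
```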

19. A Gamma-Poisson Mixture Topic Model for Short Text [PDF] Back to Contents
  Jocelyn Mazarura, Alta de Waal, Pieter de Villiers
Abstract: Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative distribution to describe the probability of count data. For topic modelling, the Poisson distribution describes the number of occurrences of a word in documents of fixed length. The Poisson distribution has been successfully applied in text classification, but its application to topic modelling is not well documented, specifically in the context of a generative probabilistic model. Furthermore, the few Poisson topic models in literature are admixture models, making the assumption that a document is generated from a mixture of topics. In this study, we focus on short text. Many studies have shown that the simpler assumption of a mixture model fits short text better. With mixture models, as opposed to admixture models, the generative assumption is that a document is generated from a single topic. One topic model, which makes this one-topic-per-document assumption, is the Dirichlet-multinomial mixture model. The main contributions of this work are a new Gamma-Poisson mixture model, as well as a collapsed Gibbs sampler for the model. The benefit of the collapsed Gibbs sampler derivation is that the model is able to automatically select the number of topics contained in the corpus. The results show that the Gamma-Poisson mixture model performs better than the Dirichlet-multinomial mixture model at selecting the number of topics in labelled corpora. Furthermore, the Gamma-Poisson mixture produces better topic coherence scores than the Dirichlet-multinomial mixture model, thus making it a viable option for the challenging task of topic modelling of short text.
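Under the mixture (one topic per document) assumption described here, a plausible generative story for a Gamma-Poisson mixture can be written as below; the exact priors and hyperparameters are assumptions, not necessarily the paper's.

```latex
\begin{align*}
\lambda_{k,v} &\sim \mathrm{Gamma}(\alpha, \beta)      && \text{Poisson rate of word } v \text{ in topic } k \\
\theta        &\sim \mathrm{Dirichlet}(\gamma)         && \text{corpus-level topic proportions} \\
z_d           &\sim \mathrm{Categorical}(\theta)       && \text{the single topic of document } d \\
n_{d,v}       &\sim \mathrm{Poisson}(\lambda_{z_d, v}) && \text{count of word } v \text{ in document } d
\end{align*}
```

The Gamma-Poisson conjugacy is what makes a collapsed Gibbs sampler convenient here: the rates can be integrated out analytically when resampling each document's topic assignment.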

20. Transliteration of Judeo-Arabic Texts into Arabic Script Using Recurrent Neural Networks [PDF] Back to Contents
  Nachum Dershowitz, Ori Terner
Abstract: Many of the great Jewish works of the Middle Ages were written in Judeo-Arabic, a Jewish branch of the Arabic language family that incorporates the Hebrew script as its writing system. In this work we are trying to train a model that will automatically transliterate Judeo-Arabic into Arabic script; thus we aspire to enable Arabic readers to access those writings. We adopt a recurrent neural network (RNN) approach to the problem, applying connectionist temporal classification loss to deal with unequal input/output lengths. This choice obligates adjustments, termed doubling, in the training data to avoid input sequences that are shorter than their corresponding outputs. We also utilize a pretraining stage with a different loss function to help the network converge. Furthermore, since only a single source of parallel text was available for training, we examine the possibility of generating data synthetically from other Arabic original text from the time in question, leveraging the fact that, though the convention for mapping applied by the Judeo-Arabic author has a one-to-many relation from Judeo-Arabic to Arabic, its reverse (from Arabic to Judeo-Arabic) is a proper function. By this we attempt to train a model that has the capability to memorize words in the output language, and that also utilizes the context for distinguishing ambiguities in the transliteration. We examine this ability by testing on shuffled data that lacks context. We obtain an improvement over the baseline results (9.5% error), achieving 2% error with our system. On the shuffled test data, the error rises to 2.5%.

21. Deep Sentiment Classification and Topic Discovery on Novel Coronavirus or COVID-19 Online Discussions: NLP Using LSTM Recurrent Neural Network Approach [PDF] Back to Contents
  Hamed Jelodar, Yongli Wang, Rita Orji, Hucheng Huang
Abstract: Internet forums and public social media, such as online healthcare forums, provide a convenient channel for users (people/patients) concerned about health issues to discuss and share information with each other. In late December 2019, an outbreak of a novel coronavirus (infection from which results in the disease named COVID-19) was reported, and, due to the rapid spread of the virus in other parts of the world, the World Health Organization declared a state of emergency. In this paper, we used automated extraction of COVID-19 related discussions from social media and a natural language process (NLP) method based on topic modeling to uncover various issues related to COVID-19 from public opinions. Moreover, we also investigate how to use LSTM recurrent neural network for sentiment classification of COVID-19 comments. Our findings shed light on the importance of using public opinions and suitable computational techniques to understand issues surrounding COVID-19 and to guide related decision-making.
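A generic LSTM sentiment classifier of the kind described can be sketched as follows; the vocabulary size, dimensions, and three-way sentiment labels are assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class LSTMSentiment(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=128, hidden=256, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):                 # (batch, time) integer ids
        _, (h_n, _) = self.lstm(self.emb(token_ids))
        return self.fc(h_n[-1])                   # logits over sentiment classes

logits = LSTMSentiment()(torch.randint(0, 20000, (4, 50)))   # -> (4, 3)
```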

22. Social Interactions or Business Transactions? What customer reviews disclose about Airbnb marketplace [PDF] Back to Contents
  Giovanni Quattrone, Antonino Nocera, Licia Capra, Daniele Quercia
Abstract: Airbnb is one of the most successful examples of sharing economy marketplaces. With rapid and global market penetration, understanding its attractiveness and evolving growth opportunities is key to plan business decision making. There is an ongoing debate, for example, about whether Airbnb is a hospitality service that fosters social exchanges between hosts and guests, as the sharing economy manifesto originally stated, or whether it is (or is evolving into being) a purely business transaction platform, the way hotels have traditionally operated. To answer these questions, we propose a novel market analysis approach that exploits customers' reviews. Key to the approach is a method that combines thematic analysis and machine learning to inductively develop a custom dictionary for guests' reviews. Based on this dictionary, we then use quantitative linguistic analysis on a corpus of 3.2 million reviews collected in 6 different cities, and illustrate how to answer a variety of market research questions, at fine levels of temporal, thematic, user and spatial granularity, such as (i) how the business vs social dichotomy is evolving over the years, (ii) what exact words within such top-level categories are evolving, (iii) whether such trends vary across different user segments and (iv) in different neighbourhoods.

23. Characterising User Content on a Multi-lingual Social Network [PDF] Back to Contents
  Pushkal Agarwal, Kiran Garimella, Sagar Joglekar, Nishanth Sastry, Gareth Tyson
Abstract: Social media has been on the vanguard of political information diffusion in the 21st century. Most studies that look into disinformation, political influence and fake-news focus on mainstream social media platforms. This has inevitably made English an important factor in our current understanding of political activity on social media. As a result, there has only been a limited number of studies into a large portion of the world, including the largest, multilingual and multi-cultural democracy: India. In this paper we present our characterisation of a multilingual social network in India called ShareChat. We collect an exhaustive dataset across 72 weeks before and during the Indian general elections of 2019, across 14 languages. We investigate the cross lingual dynamics by clustering visually similar images together, and exploring how they move across language barriers. We find that Telugu, Malayalam, Tamil and Kannada languages tend to be dominant in soliciting political images (often referred to as memes), and posts from Hindi have the largest cross-lingual diffusion across ShareChat (as well as images containing text in English). In the case of images containing text that cross language barriers, we see that language translation is used to widen the accessibility. That said, we find cases where the same image is associated with very different text (and therefore meanings). This initial characterisation paves the way for more advanced pipelines to understand the dynamics of fake and political content in a multi-lingual and non-textual setting.

24. Upgrading the Newsroom: An Automated Image Selection System for News Articles [PDF] Back to Contents
  Fangyu Liu, Rémi Lebret, Didier Orel, Philippe Sordet, Karl Aberer
Abstract: We propose an automated image selection system to assist photo editors in selecting suitable images for news articles. The system fuses multiple textual sources extracted from news articles and accepts multilingual inputs. It is equipped with char-level word embeddings to help both modeling morphologically rich languages, e.g. German, and transferring knowledge across nearby languages. The text encoder adopts a hierarchical self-attention mechanism to attend more to both keywords within a piece of text and informative components of a news article. We extensively experiment with our system on a large-scale text-image database containing multimodal multilingual news articles collected from Swiss local news media websites. The system is compared with multiple baselines with ablation studies and is shown to beat existing text-image retrieval methods in a weakly-supervised learning setting. Besides, we also offer insights on the advantage of using multiple textual sources and multilingual data.

25. End-to-end speech-to-dialog-act recognition [PDF] Back to Contents
  Viet-Trung Dang, Tianyu Zhao, Sei Ueno, Hirofumi Inaguma, Tatsuya Kawahara
Abstract: Spoken language understanding, which extracts intents and/or semantic concepts in utterances, is conventionally formulated as a post-processing of automatic speech recognition. It is usually trained with oracle transcripts, but needs to deal with errors by ASR. Moreover, there are acoustic features which are related with intents but not represented with the transcripts. In this paper, we present an end-to-end model which directly converts speech into dialog acts without the deterministic transcription process. In the proposed model, the dialog act recognition network is conjunct with an acoustic-to-word ASR model at its latent layer before the softmax layer, which provides a distributed representation of word-level ASR decoding information. Then, the entire network is fine-tuned in an end-to-end manner. This allows for stable training as well as robustness against ASR errors. The model is further extended to conduct DA segmentation jointly. Evaluations with the Switchboard corpus demonstrate that the proposed method significantly improves dialog act recognition accuracy from the conventional pipeline framework.
