
[arXiv Papers] Computation and Language 2020-02-12

Contents

1. The Rumour Mill: Making Misinformation Spread Visible and Tangible [PDF] Abstract
2. ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning [PDF] Abstract
3. Learning Coupled Policies for Simultaneous Machine Translation [PDF] Abstract
4. Non-Autoregressive Neural Dialogue Generation [PDF] Abstract
5. Performance Comparison of Crowdworkers and NLP Tools on Named-Entity Recognition and Sentiment Analysis of Political Tweets [PDF] Abstract
6. Training with Streaming Annotation [PDF] Abstract
7. Automatic Discourse Segmentation: an evaluation in French [PDF] Abstract
8. An experiment exploring the theoretical and methodological challenges in developing a semi-automated approach to analysis of small-N qualitative data [PDF] Abstract
9. HGAT: Hierarchical Graph Attention Network for Fake News Detection [PDF] Abstract
10. Convolutional Neural Networks and a Transfer Learning Strategy to Classify Parkinson's Disease from Speech in Three Different Languages [PDF] Abstract
11. Adversarial Filters of Dataset Biases [PDF] Abstract

Abstracts

1. The Rumour Mill: Making Misinformation Spread Visible and Tangible [PDF] Back to Contents
  Nanna Inie, Jeanette Falk Olesen, Leon Derczynski
Abstract: The spread of misinformation presents a technological and social threat to society. With the advance of AI-based language models, automatically generated texts have become difficult to identify and easy to create at scale. We present the "Rumour Mill", a playful art piece, designed as a commentary on the spread of rumours and automatically-generated misinformation. The mill is a tabletop interactive machine, which invites a user to experience the process of creating believable text by interacting with different tangible controls on the mill. The user manipulates visible parameters to adjust the genre and type of an automatically generated text rumour. The Rumour Mill is a physical demonstration of the state of NLP technology and its ability to generate and manipulate natural language text, and of the act of starting and spreading rumours.

2. ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning [PDF] Back to Contents
  Weihao Yu, Zihang Jiang, Yanfei Dong, Jiashi Feng
Abstract: Recent powerful pre-trained language models have achieved remarkable performance on most of the popular datasets for reading comprehension. It is time to introduce more challenging datasets to push the development of this field towards more comprehensive reasoning of text. In this paper, we introduce a new Reading Comprehension dataset requiring logical reasoning (ReClor), extracted from standardized graduate admission examinations. As earlier studies suggest, human-annotated datasets usually contain biases, which are often exploited by models to achieve high accuracy without truly understanding the text. In order to comprehensively evaluate the logical reasoning ability of models on ReClor, we propose to identify biased data points and separate them into an EASY set, leaving the rest as a HARD set. Empirical results show that state-of-the-art models have an outstanding ability to capture the biases contained in the dataset, achieving high accuracy on the EASY set. However, they struggle on the HARD set, performing close to random guessing, indicating that more research is needed to genuinely enhance the logical reasoning ability of current models.
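
To make the bias-identification step concrete, here is a minimal Python sketch. It assumes one common recipe, marking a data point as EASY when a probe that sees only the answer options (no context or question) still predicts correctly; `option_only_model` and the example fields are hypothetical, and ReClor's exact procedure may differ.

```python
# Hypothetical sketch of an EASY/HARD split: an example is treated as
# biased (EASY) if an option-only probe answers it without reading the
# context. `option_only_model` and the example fields are assumptions.

def split_easy_hard(dataset, option_only_model):
    easy, hard = [], []
    for example in dataset:
        pred = option_only_model.predict(example["options"])  # no context/question
        (easy if pred == example["label"] else hard).append(example)
    return easy, hard
```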

3. Learning Coupled Policies for Simultaneous Machine Translation [PDF] Back to Contents
  Philip Arthur, Trevor Cohn, Gholamreza Haffari
Abstract: In simultaneous machine translation, the system needs to incrementally generate the output translation before the input sentence ends. This is a coupled decision process consisting of a programmer and an interpreter: the programmer's policy decides when to WRITE the next output or READ the next input, and the interpreter's policy decides which word to write. We present an imitation learning (IL) approach to efficiently learn effective coupled programmer-interpreter policies. To enable IL, we present an algorithmic oracle that produces oracle READ/WRITE actions for training bilingual sentence pairs using the notion of word alignments. We attribute the effectiveness of the learned coupled policies to (i) scheduled sampling addressing the coupled exposure bias, and (ii) the quality of oracle actions, which capture enough information from the partial input before writing the output. Experiments show our method outperforms strong baselines in terms of translation quality and delay when translating from German/Arabic/Czech/Bulgarian/Romanian to English.
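
The algorithmic oracle can be illustrated with a short sketch. The construction below is an assumption about how word alignments could drive READ/WRITE decisions (read until every source word aligned to the next target word is available, then write); the paper's exact oracle may differ.

```python
# Hypothetical alignment-based oracle: for each target word, READ source
# tokens until all aligned source positions are available, then WRITE.

def oracle_actions(alignments, src_len, tgt_len):
    """alignments maps a target index to the set of aligned source indices."""
    actions, num_read = [], 0
    for t in range(tgt_len):
        needed = max(alignments.get(t, set()), default=-1) + 1
        while num_read < min(needed, src_len):
            actions.append("READ")
            num_read += 1
        actions.append("WRITE")
    return actions

# Target word 0 aligns to source 0; word 1 aligns to sources 1 and 2.
print(oracle_actions({0: {0}, 1: {1, 2}}, src_len=3, tgt_len=2))
# ['READ', 'WRITE', 'READ', 'READ', 'WRITE']
```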

4. Non-Autoregressive Neural Dialogue Generation [PDF] Back to Contents
  Qinghong Han, Yuxian Meng, Fei Wu, Jiwei Li
Abstract: Maximum Mutual Information (MMI), which models the bidirectional dependency between responses ($y$) and contexts ($x$), i.e., the forward probability $\log p(y|x)$ and the backward probability $\log p(x|y)$, has been widely used as the objective in the sequence-to-sequence (Seq2Seq) model to address the dull-response issue in open-domain dialog generation. Unfortunately, under the framework of the Seq2Seq model, direct decoding from $\log p(y|x) + \log p(x|y)$ is infeasible, since the second part (i.e., $p(x|y)$) requires the completion of target generation before it can be computed, and the search space for $y$ is enormous. Empirically, an N-best list is first generated given $p(y|x)$, and $p(x|y)$ is then used to rerank the N-best list, which inevitably results in non-globally-optimal solutions. In this paper, we propose to use a non-autoregressive (non-AR) generation model to address this non-global-optimality issue. Since target tokens are generated independently in non-AR generation, $p(x|y)$ for each target word can be computed as soon as it is generated, without waiting for the completion of the whole sequence. This naturally resolves the non-global-optimality issue in decoding. Experimental results demonstrate that the proposed non-AR strategy produces more diverse, coherent, and appropriate responses, yielding substantive gains in BLEU scores and in human evaluations.
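
For contrast, the conventional two-stage pipeline the paper argues against can be sketched in a few lines; `forward_score` and `backward_score` are placeholder model calls, not a real API.

```python
# Hypothetical sketch of N-best MMI reranking: generate candidates from
# p(y|x), then rerank by log p(y|x) + lambda * log p(x|y). Both scoring
# functions are placeholders for trained forward/backward models.

def mmi_rerank(context, nbest, forward_score, backward_score, lam=0.5):
    def mmi(response):
        return forward_score(context, response) + lam * backward_score(response, context)
    return max(nbest, key=mmi)
```

The non-AR model removes this second stage: because each target token is produced independently, the backward term can be folded into decoding token by token.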

5. Performance Comparison of Crowdworkers and NLP Tools on Named-Entity Recognition and Sentiment Analysis of Political Tweets [PDF] Back to Contents
  Mona Jalal, Kate K. Mays, Lei Guo, Margrit Betke
Abstract: We report results of a comparison of the accuracy of crowdworkers and seven Natural Language Processing (NLP) toolkits in solving two important NLP tasks, named-entity recognition (NER) and entity-level sentiment (ELS) analysis. We focus on a challenging dataset of 1,000 political tweets that were collected during the U.S. presidential primary election in February 2016. Each tweet refers to at least one of four presidential candidates, i.e., four named entities. The ground truth, established by experts in political communication, has entity-level sentiment information for each candidate mentioned in the tweet. We tested several commercial and open-source tools. Our experiments show that, for our dataset of political tweets, the most accurate NER system, Google Cloud NL, performed almost on par with crowdworkers, but the most accurate ELS analysis system, TensiStrength, fell short of crowdworker accuracy by a large margin of more than 30 percentage points.

6. Training with Streaming Annotation [PDF] Back to Contents
  Tongtao Zhang, Heng Ji, Shih-Fu Chang, Marjorie Freedman
Abstract: In this paper, we address a practical scenario where training data is released in a sequence of small-scale batches and annotation in earlier phases has lower quality than in later ones. To tackle this situation, we utilize a pre-trained transformer network to preserve and integrate the most salient document information from the earlier batches while focusing on the annotation (presumably of higher quality) from the current batch. Using event extraction as a case study, we demonstrate experimentally that our proposed framework outperforms conventional approaches (with improvements ranging from 3.6 to 14.9% absolute F-score), especially when there is more noise in the early annotation; our approach also saves 19.1% of the time required by the best conventional method.
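
A minimal sketch of the streaming setting (the training loop and `fine_tune` are assumptions, not the authors' code):

```python
# Hypothetical streaming-annotation loop: a pre-trained transformer is
# fine-tuned batch by batch, so earlier, noisier annotation influences the
# model only through the weights it left behind, while each phase focuses
# on the current (presumably cleaner) batch.

def train_on_stream(pretrained_model, annotation_batches, fine_tune):
    model = pretrained_model
    for batch in annotation_batches:     # batches arrive in release order
        model = fine_tune(model, batch)  # focus on the newest annotation
    return model
```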

7. Automatic Discourse Segmentation: an evaluation in French [PDF] Back to Contents
  Rémy Saksik, Alejandro Molina-Villegas, Andréa Carneiro Linhares, Juan-Manuel Torres-Moreno
Abstract: In this article, we describe several discourse segmentation methods as well as a preliminary evaluation of the segmentation quality. Although our experiments were carried out on documents in French, we have developed three discourse segmentation models based solely on resources simultaneously available in several languages: marker lists and statistical POS labeling. We have also carried out automatic evaluations of these systems against the Annodis corpus, a manually annotated reference. The results obtained are very encouraging.
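
As an illustration of the marker-list idea, here is a toy segmenter (the marker set and tokenisation are illustrative only; the authors' models also use statistical POS labeling):

```python
# Toy marker-based discourse segmenter: a new segment starts at each known
# discourse marker. The marker list here is a tiny illustrative sample.

MARKERS = {"mais", "donc", "cependant", "ensuite"}

def segment(sentence):
    segments, current = [], []
    for token in sentence.lower().split():
        if token in MARKERS and current:
            segments.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        segments.append(" ".join(current))
    return segments

print(segment("Il pleut donc nous restons ici"))
# ['il pleut', 'donc nous restons ici']
```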

8. An experiment exploring the theoretical and methodological challenges in developing a semi-automated approach to analysis of small-N qualitative data [PDF] Back to Contents
  Sandro Tsang
Abstract: This paper experiments with designing a semi-automated qualitative data analysis (QDA) algorithm to analyse 20 transcripts using freeware. Text mining (TM) and QDA were guided by frequency and association measures, because these statistics remain robust when the sample size is small. The refined TM algorithm split the text into units of various sizes based on a manually revised dictionary. This lemmatisation approach may reflect the context of the text better than uniformly tokenising it into a single unit size. TM results were used for initial coding. Code repacking was guided by association measures and external data to implement a general inductive QDA approach. The information retrieved by TM and QDA was depicted in subgraphs for comparison. The analyses were completed in 6-7 days. Both algorithms retrieved contextually consistent and relevant information; however, the QDA algorithm retrieved more specific information than TM alone. The QDA algorithm does not strictly comply with the conventions of TM or of QDA, but it constitutes a more efficient, systematic, and transparent text analysis approach than a conventional QDA approach. Scaling up QDA to reliably discover knowledge from text was precisely the research purpose. This paper also sheds light on the relations between information technologies, theory, and methodologies.
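
The guiding statistics can be made concrete; the sketch below computes document frequencies and pointwise mutual information (PMI), one standard association measure, though the paper may use different measures.

```python
import math
from collections import Counter
from itertools import combinations

# Document frequencies and PMI as example frequency/association measures.

def doc_frequencies(docs):
    return Counter(tok for doc in docs for tok in set(doc))

def pmi(pair_count, count_a, count_b, n_docs):
    return math.log((pair_count * n_docs) / (count_a * count_b))

docs = [["care", "cost"], ["care", "staff"], ["care", "cost", "staff"]]
df = doc_frequencies(docs)
pairs = Counter(p for doc in docs for p in combinations(sorted(set(doc)), 2))
print(pmi(pairs[("care", "cost")], df["care"], df["cost"], len(docs)))  # 0.0
```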

9. HGAT: Hierarchical Graph Attention Network for Fake News Detection [PDF] Back to Contents
  Yuxiang Ren, Jiawei Zhang
Abstract: The explosive growth of fake news has eroded the credibility of media and governments. Fake news detection has become an urgent task. News articles, along with other related components such as news creators and news subjects, can be modeled as a heterogeneous information network (HIN for short). In this paper, we focus on studying the HIN-based fake news detection problem. We propose a novel fake news detection framework, namely the Hierarchical Graph Attention Network (HGAT), which employs a novel hierarchical attention mechanism to detect fake news by classifying news article nodes in the HIN. This method can effectively learn information from different types of related nodes through node-level and schema-level attention. Experiments with real-world fake news data show that our model can outperform text-based models and other network-based models. The experiments also demonstrate the expandability and potential of HGAT for heterogeneous graph representation learning in the future.
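
The two attention levels can be sketched as follows (shapes and projections are simplified assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

# Hypothetical two-level attention: node-level attention pools neighbours
# within each node type; schema-level attention then weights the per-type
# summaries. `att` is any scoring module, e.g. a small nn.Linear.

def node_level(h_center, h_neighbors, att):
    # h_center: (d,), h_neighbors: (n, d)
    pair = torch.cat([h_center.expand_as(h_neighbors), h_neighbors], dim=-1)
    alpha = F.softmax(att(pair).squeeze(-1), dim=0)        # (n,)
    return (alpha.unsqueeze(-1) * h_neighbors).sum(dim=0)  # (d,)

def schema_level(type_summaries, att):
    # type_summaries: (T, d), one pooled vector per node type
    beta = F.softmax(att(type_summaries).squeeze(-1), dim=0)
    return (beta.unsqueeze(-1) * type_summaries).sum(dim=0)
```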

10. Convolutional Neural Networks and a Transfer Learning Strategy to Classify Parkinson's Disease from Speech in Three Different Languages [PDF] Back to Contents
  J. C. Vásquez-Correa, T. Arias-Vergara, C. D. Rios-Urrego, M. Schuster, J. Rusz, J. R. Orozco-Arroyave, E. Nöth
Abstract: Parkinson's disease patients develop different speech impairments that affect their communication capabilities. Automatic assessment of patients' speech enables the development of computer-aided tools to support the diagnosis and the evaluation of disease severity. This paper introduces a methodology to classify Parkinson's disease from speech in three different languages: Spanish, German, and Czech. The proposed approach considers convolutional neural networks trained with time-frequency representations and a transfer learning strategy among the three languages. The transfer learning scheme aims to improve the accuracy of the models when the weights of the neural network are initialized with utterances from a language different from the one used for the test set. The results suggest that the proposed strategy improves the accuracy of the models by up to 8% when the base model used to initialize the classifier's weights is robust enough. In addition, the results obtained after transfer learning are in most cases more balanced in terms of specificity and sensitivity than those trained without the transfer learning strategy.
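
A minimal transfer-learning sketch for this cross-language setup (the architecture, file name, and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn

# Hypothetical cross-language transfer: a CNN over time-frequency inputs is
# trained on a base language, and its weights initialize the model that is
# then fine-tuned on the target language.

class SpeechCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1, 16, 3), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(16, 2)   # PD vs. healthy control
    def forward(self, x):            # x: (batch, 1, freq, time) spectrograms
        return self.fc(self.conv(x).flatten(1))

base = SpeechCNN()
# In practice the base weights would be loaded from the source language:
# base.load_state_dict(torch.load("spanish_base.pt"))
target = SpeechCNN()
target.load_state_dict(base.state_dict())  # initialize from the base model
optimizer = torch.optim.Adam(target.parameters(), lr=1e-4)  # fine-tune target
```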

11. Adversarial Filters of Dataset Biases [PDF] Back to Contents
  Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew E. Peters, Ashish Sabharwal, Yejin Choi
Abstract: Large neural models have demonstrated human-level performance on language and vision benchmarks such as ImageNet and Stanford Natural Language Inference (SNLI). Yet, their performance degrades considerably when tested on adversarial or out-of-distribution samples. This raises the question of whether these models have learned to solve a dataset rather than the underlying task by overfitting on spurious dataset biases. We investigate one recently proposed approach, AFLite, which adversarially filters such dataset biases, as a means to mitigate the prevalent overestimation of machine performance. We provide a theoretical understanding for AFLite, by situating it in the generalized framework for optimum bias reduction. Our experiments show that as a result of the substantial reduction of these biases, models trained on the filtered datasets yield better generalization to out-of-distribution tasks, especially when the benchmarks used for training are over-populated with biased samples. We show that AFLite is broadly applicable to a variety of both real and synthetic datasets for reduction of measurable dataset biases and provide extensive supporting analyses. Finally, filtering results in a large drop in model performance (e.g., from 92% to 63% for SNLI), while human performance still remains high. Our work thus shows that such filtered datasets can pose new research challenges for robust generalization by serving as upgraded benchmarks.
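
The filtering loop can be sketched as follows; the probe, hyperparameters, and stopping rule are assumptions here, so consult the paper for the exact procedure.

```python
import random
from collections import Counter

# Hypothetical AFLite-style filtering: instances that cheap linear probes
# classify correctly too often across random partitions are treated as bias
# carriers and removed. `train_probe` (returning an object with .predict)
# and all hyperparameters are placeholders; see the paper for details.

def adversarial_filter(dataset, train_probe, n_partitions=64,
                       cutoff=0.75, slice_size=1000):
    data = list(dataset)
    while True:
        hits, seen = Counter(), Counter()
        for _ in range(n_partitions):
            random.shuffle(data)
            half = len(data) // 2
            probe = train_probe(data[:half])       # probe on one random half
            for x in data[half:]:                  # evaluate on the other
                seen[id(x)] += 1
                hits[id(x)] += int(probe.predict(x) == x["label"])
        score = lambda x: hits[id(x)] / max(seen[id(x)], 1)
        flagged = sorted((x for x in data if score(x) > cutoff),
                         key=score, reverse=True)[:slice_size]
        if not flagged:
            return data
        flagged_ids = {id(x) for x in flagged}
        data = [x for x in data if id(x) not in flagged_ids]
```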
