Contents
4. Interpretability Analysis for Named Entity Recognition to Understand System Predictions and How They Can Improve [PDF]
5. Self-Training for Unsupervised Neural Machine Translation in Unbalanced Training Data Scenarios [PDF]
9. Recommendation Chart of Domains for Cross-Domain Sentiment Analysis: Findings of A 20 Domain Study [PDF]
10. A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming [PDF]
16. Conversation Learner -- A Machine Teaching Tool for Building Dialog Managers for Task-Oriented Dialog Systems [PDF]
17. Severing the Edge Between Before and After: Neural Architectures for Temporal Ordering of Events [PDF]
23. Re-conceptualising the Language Game Paradigm in the Framework of Multi-Agent Reinforcement Learning [PDF]
Abstracts
1. Translation Artifacts in Cross-lingual Transfer Learning [PDF]
Mikel Artetxe, Gorka Labaka, Eneko Agirre
Abstract: Both human and machine translation play a central role in cross-lingual transfer learning: many multilingual datasets have been created through professional translation services, and using machine translation to translate either the test set or the training set is a widely used transfer technique. In this paper, we show that such a translation process can introduce subtle artifacts that have a notable impact on existing cross-lingual models. For instance, in natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them, which current models are highly sensitive to. We show that some previous findings in cross-lingual transfer learning need to be reconsidered in the light of this phenomenon. Based on the gained insights, we also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
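The lexical-overlap artifact is easy to make concrete. Below is a minimal sketch of one way to measure it; the whitespace tokenization and the `lexical_overlap` helper are illustrative assumptions, not the authors' exact statistic.

```python
# Illustrative only: a simple lexical-overlap statistic of the kind the
# abstract describes; not the authors' exact measure.
def lexical_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the premise."""
    p_tokens = set(premise.lower().split())
    h_tokens = hypothesis.lower().split()
    if not h_tokens:
        return 0.0
    return sum(t in p_tokens for t in h_tokens) / len(h_tokens)

# Translating premise and hypothesis independently can lower this value even
# when meaning is preserved, and NLI models may be sensitive to that drop.
print(lexical_overlap("The cat sat on the mat.", "The cat is on the mat."))
```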
2. BLEURT: Learning Robust Metrics for Text Generation [PDF]
Thibault Sellam, Dipanjan Das, Ankur P. Parikh
Abstract: Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.
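For context, scoring with BLEURT looks roughly like the following, assuming the google-research `bleurt` package and a locally downloaded checkpoint (the checkpoint path below is a placeholder):

```python
from bleurt import score  # pip-installable from the google-research/bleurt repo

scorer = score.BleurtScorer("BLEURT-20")  # placeholder checkpoint directory
scores = scorer.score(
    references=["The cat sat on the mat."],
    candidates=["A cat was sitting on the mat."],
)
print(scores)  # one learned quality score per (reference, candidate) pair
```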
3. Global Public Health Surveillance using Media Reports: Redesigning GPHIN [PDF]
Dave Carter, Marta Stojanovic, Philip Hachey, Kevin Fournier, Simon Rodier, Yunli Wang, Berry de Bruijn
Abstract: Global public health surveillance relies on reporting structures and transmission of trustworthy health reports. But in practice, these processes may not always be fast enough, or are hindered by procedural, technical, or political barriers. GPHIN, the Global Public Health Intelligence Network, was designed in the late 1990s to scour mainstream news for health events, as that travels faster and more freely. This paper outlines the next generation of GPHIN, which went live in 2017, and reports on design decisions underpinning its new functions and innovations.
4. Interpretability Analysis for Named Entity Recognition to Understand System Predictions and How They Can Improve [PDF]
Oshin Agarwal, Yinfei Yang, Byron C. Wallace, Ani Nenkova
Abstract: Named Entity Recognition systems achieve remarkable performance on domains such as English news. It is natural to ask: What are these models actually learning to achieve this? Are they merely memorizing the names themselves? Or are they capable of interpreting the text and inferring the correct entity type from the linguistic context? We examine these questions by contrasting the performance of several variants of LSTM-CRF architectures for named entity recognition, some of which are provided only representations of the context as features. We also perform similar experiments for BERT. We find that context representations do contribute to system performance, but that the main factor driving high performance is learning the name tokens themselves. We enlist human annotators to evaluate the feasibility of inferring entity types from the context alone and find that, while people are also unable to infer the entity type for the majority of the errors made by the context-only system, there is some room for improvement. A system should be able to recognize any name in a predictive context correctly, and our experiments indicate that current systems may be further improved by such capability.
5. Self-Training for Unsupervised Neural Machine Translation in Unbalanced Training Data Scenarios [PDF]
Haipeng Sun, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao
Abstract: Unsupervised neural machine translation (UNMT) that relies solely on massive monolingual corpora has achieved remarkable results in several translation tasks. However, in real-world scenarios, massive monolingual corpora do not exist for some extremely low-resource languages such as Estonian, and UNMT systems usually perform poorly when there is not an adequate training corpus for one language. In this paper, we first define and analyze the unbalanced training data scenario for UNMT. Based on this scenario, we propose UNMT self-training mechanisms to train a robust UNMT system and improve its performance in this case. Experimental results on several language pairs show that the proposed methods substantially outperform conventional UNMT systems.
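A schematic of the self-training idea, with the translation and training routines passed in as hypothetical callables; the paper's actual pseudo-data filtering and scheduling may differ:

```python
# Sketch: grow pseudo-parallel data from the richer monolingual side, then
# retrain. `translate` and `train_unmt` are hypothetical placeholders.
def self_train(model, mono_rich, mono_poor, translate, train_unmt, rounds=3):
    for _ in range(rounds):
        # Translate the high-resource monolingual corpus into the
        # low-resource language to create pseudo-parallel pairs.
        pseudo_parallel = [(s, translate(model, s)) for s in mono_rich]
        # Retrain on the pseudo-parallel pairs plus the scarce monolingual data.
        model = train_unmt(model, pseudo_parallel, mono_poor)
    return model
```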
6. Reducing Gender Bias in Neural Machine Translation as a Domain Adaptation Problem [PDF]
Danielle Saunders, Bill Byrne
Abstract: Training data for NLP tasks often exhibits gender bias in that fewer sentences refer to women than to men. In Neural Machine Translation (NMT) gender bias has been shown to reduce translation quality, particularly when the target language has grammatical gender. The recent WinoMT challenge set allows us to measure this effect directly (Stanovsky et al., 2019). Ideally we would reduce system bias by simply debiasing all data prior to training, but achieving this effectively is itself a challenge. Rather than attempt to create a 'balanced' dataset, we use transfer learning on a small set of trusted, gender-balanced examples. This approach gives strong and consistent improvements in gender debiasing with much less computational cost than training from scratch. A known pitfall of transfer learning on new domains is 'catastrophic forgetting', which we address both in adaptation and in inference. During adaptation we show that Elastic Weight Consolidation allows a performance trade-off between general translation quality and bias reduction. During inference we propose a lattice-rescoring scheme which outperforms all systems evaluated in Stanovsky et al. (2019) on WinoMT with no degradation of general test set BLEU, and we show this scheme can be applied to remove gender bias in the output of 'black box' online commercial MT systems. We demonstrate our approach translating from English into three languages with varied linguistic properties and data availability.
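For readers unfamiliar with Elastic Weight Consolidation (Kirkpatrick et al., 2017), the standard objective penalizes movement of parameters that mattered for the original task; in its usual form (the paper may use a variant):

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{debias}}(\theta) + \sum_i \frac{\lambda}{2} F_i \,(\theta_i - \theta_i^{*})^2$$

where $\mathcal{L}_{\text{debias}}$ is the loss on the small gender-balanced adaptation set, $\theta^{*}$ are the parameters of the general-domain model, $F_i$ is the diagonal Fisher information estimating each parameter's importance, and $\lambda$ sets the trade-off between bias reduction and general translation quality that the abstract describes.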
7. MuTual: A Dataset for Multi-Turn Dialogue Reasoning [PDF]
Leyang Cui, Yu Wu, Shujie Liu, Yue Zhang, Ming Zhou
Abstract: Non-task oriented dialogue systems have achieved great success in recent years due to largely accessible conversation data and the development of deep learning techniques. Given a context, current systems are able to yield a relevant and fluent response, but sometimes make logical mistakes because of weak reasoning capabilities. To facilitate the conversation reasoning research, we introduce MuTual, a novel dataset for Multi-Turn dialogue Reasoning, consisting of 8,860 manually annotated dialogues based on Chinese student English listening comprehension exams. Compared to previous benchmarks for non-task oriented dialogue systems, MuTual is much more challenging since it requires a model that can handle various reasoning problems. Empirical results show that state-of-the-art methods only reach 71%, which is far behind the human performance of 94%, indicating that there is ample room for improving reasoning ability. MuTual is available at this https URL.
8. Injecting Numerical Reasoning Skills into Language Models [PDF]
Mor Geva, Ankit Gupta, Jonathan Berant
Abstract: Large pre-trained language models (LMs) are known to encode substantial amounts of linguistic information. However, high-level reasoning skills, such as numerical reasoning, are difficult to learn from a language-modeling objective only. Consequently, existing models for numerical reasoning have used specialized architectures with limited flexibility. In this work, we show that numerical reasoning is amenable to automatic data generation, and thus one can inject this skill into pre-trained LMs, by generating large amounts of data, and training in a multi-task setup. We show that pre-training our model, GenBERT, on this data, dramatically improves performance on DROP (49.3 $\rightarrow$ 72.3 F1), reaching performance that matches state-of-the-art models of comparable size, while using a simple and general-purpose encoder-decoder architecture. Moreover, GenBERT generalizes well to math word problem datasets, while maintaining high performance on standard RC tasks. Our approach provides a general recipe for injecting skills into large pre-trained LMs, whenever the skill is amenable to automatic data augmentation.
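As a toy illustration of what automatic data generation for numerical reasoning can look like (the template and fields are assumptions, not the paper's generation grammar):

```python
import random

def make_example(rng: random.Random) -> dict:
    # One hypothetical template; the paper generates millions of examples
    # across many numeric operations.
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    context = f"The home team scored {a} points and the visitors scored {b}."
    return {
        "context": context,
        "question": "How many points were scored in total?",
        "answer": str(a + b),
    }

rng = random.Random(0)
synthetic = [make_example(rng) for _ in range(1000)]
print(synthetic[0])
```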
9. Recommendation Chart of Domains for Cross-Domain Sentiment Analysis: Findings of A 20 Domain Study [PDF]
Akash Sheoran, Diptesh Kanojia, Aditya Joshi, Pushpak Bhattacharyya
Abstract: Cross-domain sentiment analysis (CDSA) helps to address the problem of data scarcity in scenarios where labelled data for a domain (known as the target domain) is unavailable or insufficient. However, the decision to choose a domain (known as the source domain) to leverage from is, at best, intuitive. In this paper, we investigate text similarity metrics to facilitate source domain selection for CDSA. We report results on 20 domains (all possible pairs) using 11 similarity metrics. Specifically, we compare CDSA performance with these metrics for different domain-pairs to enable the selection of a suitable source domain, given a target domain. These metrics include two novel metrics for evaluating domain adaptability to help source domain selection of labelled data and utilize word and sentence-based embeddings as metrics for unlabelled data. The goal of our experiments is a recommendation chart that gives the K best source domains for CDSA for a given target domain. We show that the best K source domains returned by our similarity metrics have a precision of over 50%, for varying values of K.
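A minimal sketch of metric-based source-domain selection, assuming one averaged TF-IDF vector per domain and cosine similarity (the paper evaluates 11 metrics, including embedding-based ones):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_source_domains(target_texts, source_domains):
    """source_domains: dict mapping domain name -> list of unlabelled texts."""
    names = list(source_domains)
    corpus = [" ".join(target_texts)] + [" ".join(source_domains[n]) for n in names]
    X = TfidfVectorizer().fit_transform(corpus).toarray()
    target, sources = X[0], X[1:]
    sims = sources @ target / (
        np.linalg.norm(sources, axis=1) * np.linalg.norm(target) + 1e-12
    )
    # Best K candidate source domains for CDSA come first.
    return sorted(zip(names, sims), key=lambda p: -p[1])
```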
10. A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming [PDF]
Elvys Linhares Pontes, Stéphane Huet, Juan-Manuel Torres-Moreno, Thiago G. da Silva, Andréa Carneiro Linhares
Abstract: Multi-Sentence Compression (MSC) aims to generate a short sentence with the key information from a cluster of similar sentences. MSC enables summarization and question-answering systems to generate outputs combining fully formed sentences from one or several documents. This paper describes an Integer Linear Programming method for MSC using a vertex-labeled graph to select different keywords, with the goal of generating more informative sentences while maintaining their grammaticality. Our system is of good quality and outperforms the state of the art for evaluations conducted on news datasets in three languages: French, Portuguese and Spanish. We conducted both automatic and manual evaluations to determine the informativeness and the grammaticality of compressions for each dataset. In additional tests, which take advantage of the fact that the length of compressions can be modulated, we still improve ROUGE scores with shorter output sentences.
11. PANDORA Talks: Personality and Demographics on Reddit [PDF]
Matej Gjurković, Mladen Karan, Iva Vukojević, Mihaela Bošnjak, Jan Šnajder
Abstract: Personality and demographics are important variables in social sciences, while in NLP they can aid in interpretability and removal of societal biases. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first large-scale dataset of Reddit comments labeled with three personality models (including the well-established Big 5 model) and demographics (age, gender, and location) for more than 10k users. We showcase the usefulness of this dataset on three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.
12. Improving Readability for Automatic Speech Recognition Transcription [PDF]
Junwei Liao, Sefik Emre Eskimez, Liyang Lu, Yu Shi, Ming Gong, Linjun Shou, Hong Qu, Michael Zeng
Abstract: Modern Automatic Speech Recognition (ASR) systems can achieve high performance in terms of recognition accuracy. However, a perfectly accurate transcript still can be challenging to read due to grammatical errors, disfluency, and other errata common in spoken communication. Many downstream tasks and human readers rely on the output of the ASR system; therefore, errors introduced by the speaker and ASR system alike will be propagated to the next task in the pipeline. In this work, we propose a novel NLP task called ASR post-processing for readability (APR) that aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker. In addition, we describe a method to address the lack of task-specific data by synthesizing examples for the APR task using the datasets collected for Grammatical Error Correction (GEC) followed by text-to-speech (TTS) and ASR. Furthermore, we propose metrics borrowed from similar tasks to evaluate performance on the APR task. We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method. Our results suggest that finetuned models improve the performance on the APR task significantly, hinting at the potential benefits of using APR systems. We hope that the read, understand, and rewrite approach of our work can serve as a basis that many NLP tasks and human readers can benefit from.
13. On optimal transformer depth for low-resource language translation [PDF]
Elan van Biljon, Arnu Pretorius, Julia Kreutzer
Abstract: Transformers have shown great promise as an approach to Neural Machine Translation (NMT) for low-resource languages. However, at the same time, transformer models remain difficult to optimize and require careful tuning of hyper-parameters to be useful in this setting. Many NMT toolkits come with a set of default hyper-parameters, which researchers and practitioners often adopt for the sake of convenience and avoiding tuning. These configurations, however, have been optimized for large-scale machine translation data sets with several millions of parallel sentences for European languages like English and French. In this work, we find that the current trend in the field to use very large models is detrimental for low-resource languages, since it makes training more difficult and hurts overall performance, confirming previous observations. We see our work as complementary to the Masakhane project ("Masakhane" means "We Build Together" in isiZulu.) In this spirit, low-resource NMT systems are now being built by the community who needs them the most. However, many in the community still have very limited access to the type of computational resources required for building extremely large models promoted by industrial research. Therefore, by showing that transformer models perform well (and often best) at low-to-moderate depth, we hope to convince fellow researchers to devote less computational resources, as well as time, to exploring overly large models during the development of these systems.
14. Calibrating Structured Output Predictors for Natural Language Processing [PDF]
Abhyuday Jagannatha, Hong Yu
Abstract: We address the problem of calibrating prediction confidence for output entities of interest in natural language processing (NLP) applications. It is important that NLP applications such as named entity recognition and question answering produce calibrated confidence scores for their predictions, especially if the system is to be deployed in a safety-critical domain such as healthcare. However, the output space of such structured prediction models is often too large to adapt binary or multi-class calibration methods directly. In this study, we propose a general calibration scheme for output entities of interest in neural-network based structured prediction models. Our proposed method can be used with any binary class calibration scheme and a neural network model. Additionally, we show that our calibration method can also be used as an uncertainty-aware, entity-specific decoding step to improve the performance of the underlying model at no additional training cost or data requirements. We show that our method outperforms current calibration techniques for named-entity-recognition, part-of-speech and question answering. We also improve our model's performance from our decoding step across several tasks and benchmark datasets. Our method improves the calibration and model performance on out-of-domain test scenarios as well.
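To see the binary building block such a scheme can plug in, here is plain Platt scaling of raw entity confidence scores against held-out correctness labels; the paper's method is more general and entity-specific, and the numbers below are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

raw_scores = np.array([0.2, 0.4, 0.6, 0.8, 0.9, 0.95]).reshape(-1, 1)
correct = np.array([0, 0, 1, 0, 1, 1])  # was the predicted entity right?

# Fit a logistic map from raw model scores to calibrated probabilities.
calibrator = LogisticRegression().fit(raw_scores, correct)
print(calibrator.predict_proba([[0.9]])[:, 1])  # calibrated confidence
```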
15. Pruning and Sparsemax Methods for Hierarchical Attention Networks [PDF]
João G. Ribeiro, Frederico S. Felisberto, Isabel C. Neto
Abstract: This paper introduces and evaluates two novel Hierarchical Attention Network models [Yang et al., 2016] - i) Hierarchical Pruned Attention Networks, which remove the irrelevant words and sentences from the classification process in order to reduce potential noise in the document classification accuracy and ii) Hierarchical Sparsemax Attention Networks, which replace the Softmax function used in the attention mechanism with the Sparsemax [Martins and Astudillo, 2016], capable of better handling importance distributions where a lot of words or sentences have very low probabilities. Our empirical evaluation on the IMDB Review dataset for sentiment analysis shows both approaches to be able to match the results obtained by the current state-of-the-art (without, however, any significant benefits). All our source code is made available at https://github.com/jmribeiro/dsl-project.
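Sparsemax itself is compact enough to state exactly (Martins and Astudillo, 2016); a minimal NumPy version:

```python
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    z_sorted = np.sort(z)[::-1]          # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum  # how many items stay nonzero
    k_z = k[support][-1]
    tau = (cumsum[k_z - 1] - 1.0) / k_z  # threshold
    return np.maximum(z - tau, 0.0)      # probabilities summing to 1

print(sparsemax(np.array([2.0, 1.0, -1.0])))  # -> [1. 0. 0.]
```

Unlike softmax, which gives every word or sentence some nonzero weight, sparsemax can assign exactly zero to low-scoring items, which is the property these attention variants exploit.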
16. Conversation Learner -- A Machine Teaching Tool for Building Dialog Managers for Task-Oriented Dialog Systems [PDF]
Swadheen Shukla, Lars Liden, Shahin Shayandeh, Eslam Kamal, Jinchao Li, Matt Mazzola, Thomas Park, Baolin Peng, Jianfeng Gao
Abstract: Traditionally, industry solutions for building a task-oriented dialog system have relied on helping dialog authors define rule-based dialog managers, represented as dialog flows. While dialog flows are intuitively interpretable and good for simple scenarios, they fall short of performance in terms of the flexibility needed to handle complex dialogs. On the other hand, purely machine-learned models can handle complex dialogs, but they are considered to be black boxes and require large amounts of training data. In this demonstration, we showcase Conversation Learner, a machine teaching tool for building dialog managers. It combines the best of both approaches by enabling dialog authors to create a dialog flow using familiar tools, converting the dialog flow into a parametric model (e.g., neural networks), and allowing dialog authors to improve the dialog manager (i.e., the parametric model) over time by leveraging user-system dialog logs as training data through a machine teaching interface.
17. Severing the Edge Between Before and After: Neural Architectures for Temporal Ordering of Events [PDF]
Miguel Ballesteros, Rishita Anubhai, Shuai Wang, Nima Pourdamghani, Yogarshi Vyas, Jie Ma, Parminder Bhatia, Kathleen McKeown, Yaser Al-Onaizan
Abstract: In this paper, we propose a neural architecture and a set of training methods for ordering events by predicting temporal relations. Our proposed models receive a pair of events within a span of text as input and they identify temporal relations (Before, After, Equal, Vague) between them. Given that a key challenge with this task is the scarcity of annotated data, our models rely on either pretrained representations (i.e. RoBERTa, BERT or ELMo), transfer and multi-task learning (by leveraging complementary datasets), and self-training techniques. Experiments on the MATRES dataset of English documents establish a new state-of-the-art on this task.
18. The Spotify Podcasts Dataset [PDF]
Ann Clifton, Aasish Pappu, Sravana Reddy, Yongze Yu, Jussi Karlgren, Ben Carterette, Rosie Jones
Abstract: Podcasts are a relatively new form of audio media. Episodes appear on a regular cadence, and come in many different formats and levels of formality. They can be formal news journalism or conversational chat; fiction or non-fiction. They are rapidly growing in popularity and yet have been relatively little studied. As an audio format, podcasts are more varied in style and production types than, say, broadcast news, and contain many more genres than typically studied in video research. The medium is therefore a rich domain with many research avenues for the IR and NLP communities. We present the Spotify Podcasts Dataset, a set of approximately 100K podcast episodes comprised of raw audio files along with accompanying ASR transcripts. This represents over 47,000 hours of transcribed audio, and is an order of magnitude larger than previous speech-to-text corpora.
19. Error-correction and extraction in request dialogs [PDF]
Stefan Constantin, Alex Waibel
Abstract: We propose a component that gets a request and a correction and outputs a corrected request. To get this corrected request, the entities in the correction phrase replace their corresponding entities in the request. In addition, the proposed component outputs these pairs of corresponding reparandum and repair entity. These entity pairs can be used, for example, for learning in a life-long learning component of a dialog system to reduce the need for correction in future dialogs. For the approach described in this work, we fine-tune BERT for sequence labeling. We created a dataset to evaluate our component; for which we got an accuracy of 93.28 %. An accuracy of 88.58 % has been achieved for out-of-domain data. This accuracy shows that the proposed component is learning the concept of corrections and can be developed to be used as an upstream component to avoid the need for collecting data for request corrections for every new domain.
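The sequence-labeling setup can be sketched with standard tooling; the label inventory, base checkpoint, and example pair below are assumptions, not the paper's exact configuration:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-REPARANDUM", "I-REPARANDUM", "B-REPAIR", "I-REPAIR"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

# Request and correction packed into one sequence; the tagger should mark the
# entity to replace (reparandum) and its replacement (repair).
enc = tokenizer("put the red cup on the shelf", "no I said the blue cup",
                return_tensors="pt")
logits = model(**enc).logits  # (1, seq_len, num_labels), ready for fine-tuning
```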
20. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries [PDF]
Alex Wang, Kyunghyun Cho, Mike Lewis
Abstract: Practical applications of abstractive summarization models are limited by frequent factual inconsistencies with respect to their input. Existing automatic evaluation metrics for summarization are largely insensitive to such errors. We propose an automatic evaluation protocol called QAGS (pronounced "kags") that is designed to identify factual inconsistencies in a generated summary. QAGS is based on the intuition that if we ask questions about a summary and its source, we will receive similar answers if the summary is factually consistent with the source. To evaluate QAGS, we collect human judgments of factual consistency on model-generated summaries for the CNN/DailyMail (Hermann et al., 2015) and XSUM (Narayan et al., 2018) summarization datasets. QAGS has substantially higher correlations with these judgments than other automatic evaluation metrics. Also, QAGS offers a natural form of interpretability: The answers and questions generated while computing QAGS indicate which tokens of a summary are inconsistent and why. We believe QAGS is a promising tool in automatically generating usable and factually consistent text.
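The comparison step at the heart of this protocol is concrete enough to sketch; the question-generation and QA models are left as hypothetical callables here, and only the answer comparison is spelled out:

```python
from collections import Counter

def token_f1(a: str, b: str) -> float:
    """SQuAD-style token-overlap F1 between two answer strings."""
    ta, tb = a.lower().split(), b.lower().split()
    common = sum((Counter(ta) & Counter(tb)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)

def qags_score(summary, source, gen_questions, answer):
    # Ask the same questions of the summary and the source; a factually
    # consistent summary should yield similar answers from both.
    questions = gen_questions(summary)
    return sum(
        token_f1(answer(q, summary), answer(q, source)) for q in questions
    ) / max(len(questions), 1)
```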
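The protocol lends itself to a compact sketch. The Python outline below assumes pluggable question-generation and QA models and uses SQuAD-style token F1 as the answer-similarity measure; it illustrates the intuition rather than reproducing the authors' implementation:

```python
from collections import Counter

def token_f1(a, b):
    """SQuAD-style token-overlap F1 between two answer strings."""
    a_toks, b_toks = a.lower().split(), b.lower().split()
    common = sum((Counter(a_toks) & Counter(b_toks)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(a_toks), common / len(b_toks)
    return 2 * precision * recall / (precision + recall)

def qags_score(summary, source, generate_questions, answer):
    """Average agreement between answers extracted from the summary and
    from the source, over questions generated about the summary.

    generate_questions(text) -> list[str] and answer(question, text) -> str
    stand in for pretrained question-generation and QA models."""
    questions = generate_questions(summary)
    scores = [token_f1(answer(q, summary), answer(q, source)) for q in questions]
    return sum(scores) / len(scores) if scores else 0.0
```

A factually consistent summary yields near-identical answers from both texts and hence a high score; questions whose answers disagree point at the inconsistent tokens, which is the interpretability property the abstract mentions.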
21. Measuring Emotions in the COVID-19 Real World Worry Dataset [PDF]
Bennett Kleinberg, Isabelle van der Vegt, Maximilian Mozes
Abstract: The COVID-19 pandemic is having a dramatic impact on societies and economies around the world. With various measures of lockdowns and social distancing in place, it becomes important to understand emotional responses on a large scale. In this paper, we present the first ground truth dataset of emotional responses to COVID-19. We asked participants to indicate their emotions and express these in text and created the Real World Worry Dataset of 5,000 texts (2,500 short + 2,500 long texts). Our analyses suggest that emotional responses were correlated with linguistic measures. Topic modeling further revealed that people in the UK worry about their family and the economic situation. Tweet-sized texts functioned as a call for solidarity, while longer texts shed light on worries and concerns. Using predictive modeling approaches, we were able to approximate the emotional responses of participants from text within 14% of their actual value. We encourage others to use the dataset and improve how we can use automated methods to learn about emotional responses and worries about an urgent problem.
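As a rough illustration of the predictive-modeling idea (not the authors' reported models), one could regress worry ratings from text with a simple TF-IDF baseline; the data below is invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Invented stand-ins for (text, self-reported worry rating on a 1-9 scale).
texts = [
    "worried about my parents and my job",
    "staying home, feeling calm and grateful",
    "the economy scares me more every day",
    "we will get through this together",
]
ratings = [8.0, 3.0, 9.0, 4.0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(texts, ratings)
print(model.predict(["so worried about the economy"]))
```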
22. Generating Counter Narratives against Online Hate Speech: Data and Strategies [PDF]
Serra Sinem Tekiroglu, Yi-Ling Chung, Marco Guerini
Abstract: Recently, research has started focusing on avoiding the undesired effects that come with content moderation, such as censorship and overblocking, when dealing with hatred online. The core idea is to directly intervene in the discussion with textual responses that are meant to counter the hate content and prevent it from further spreading. Accordingly, automation strategies, such as natural language generation, are beginning to be investigated. Still, they suffer from a lack of sufficient quality data and tend to produce generic/repetitive responses. Being aware of the aforementioned limitations, we present a study on how to collect responses to hate effectively, employing large-scale unsupervised language models such as GPT-2 for the generation of silver data, and the best annotation strategies/neural architectures that can be used for data filtering before expert validation/post-editing.
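As an illustration of the silver-data generation step, the sketch below samples counter-narrative candidates from an off-the-shelf GPT-2 via the transformers pipeline; the prompt format is an assumption, and the paper's actual pipeline additionally involves fine-tuning, annotation strategies, and filtering before expert post-editing:

```python
from transformers import pipeline

# Off-the-shelf GPT-2 as a stand-in for the paper's silver-data generator.
# The prompt format is an assumption; real use would first fine-tune on
# hate-speech/counter-narrative pairs and filter candidates before
# expert validation/post-editing.
generator = pipeline("text-generation", model="gpt2")

hate_message = "..."  # a hate-speech example to respond to
prompt = f"Hate speech: {hate_message}\nCounter-narrative:"
candidates = generator(
    prompt, max_new_tokens=60, num_return_sequences=3, do_sample=True, top_p=0.9
)
for c in candidates:
    print(c["generated_text"])
```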
23. Re-conceptualising the Language Game Paradigm in the Framework of Multi-Agent Reinforcement Learning [PDF]
Paul Van Eecke, Katrien Beuls
Abstract: In this paper, we formulate the challenge of re-conceptualising the language game experimental paradigm in the framework of multi-agent reinforcement learning (MARL). If successful, future language game experiments will benefit from the rapid and promising methodological advances in the MARL community, while future MARL experiments on learning emergent communication will benefit from the insights and results gained from language game experiments. We strongly believe that this cross-pollination has the potential to lead to major breakthroughs in the modelling of how human-like languages can emerge and evolve in multi-agent systems.
24. Violent music vs violence and music: Drill rap and violent crime in London [PDF]
Bennett Kleinberg, Paul McFarlane
Abstract: The current policy of removing drill music videos from social media platforms such as YouTube remains controversial because it risks conflating the co-occurrence of drill rap and violence with a causal chain between the two. Empirically, we revisit the question of whether there is evidence to support the conjecture that drill music and gang violence are linked. We provide new empirical insights suggesting that: i) drill music lyrics have not become more negative over time; if anything, they have become more positive; ii) individual drill artists have similar sentiment trajectories to other artists in the drill genre; and iii) there is no meaningful relationship between drill music and real-life violence when compared to three kinds of police-recorded violent crime data in London. We suggest ideas for new work that can help build a much-needed evidence base around the problem.
25. Automatic Differentiation in ROOT [PDF]
Vassil Vassilev, Aleksandr Efremov, Oksana Shadura
Abstract: In mathematics and computer algebra, automatic differentiation (AD) is a set of techniques to evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.), elementary functions (exp, log, sin, cos, etc.) and control flow statements. AD takes source code of a function as input and produces source code of the derived function. By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program. This paper presents AD techniques available in ROOT, supported by Cling, to produce derivatives of arbitrary C/C++ functions through implementing source code transformation and employing the chain rule of differential calculus in both forward mode and reverse mode. We explain its current integration for gradient computation in TFormula. We demonstrate the correctness and performance improvements in ROOT's fitting algorithms.
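The chain-rule machinery AD relies on can be illustrated generically with forward-mode dual numbers; the Python sketch below only demonstrates the concept and is unrelated to the C++ source-code transformation that ROOT and Cling actually implement:

```python
import math

class Dual:
    """A dual number (value, derivative): every elementary operation
    propagates the derivative via the chain rule, giving forward-mode AD."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)  # product rule
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)  # chain rule

x = Dual(2.0, 1.0)       # seed dx/dx = 1
y = x * sin(x) + x       # f(x) = x*sin(x) + x
print(y.val, y.dot)      # f(2) and f'(2) = sin(2) + 2*cos(2) + 1
```

Because the derivative travels with the value through each elementary operation, the result is exact to working precision and costs only a small constant factor more than evaluating f itself, as the abstract states.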
26. Towards Exploiting Implicit Human Feedback for Improving RDF2vec Embeddings [PDF]
Ahmad Al Taweel, Heiko Paulheim
Abstract: RDF2vec is a technique for creating vector space embeddings from an RDF knowledge graph, i.e., representing each entity in the graph as a vector. It first creates sequences of nodes by performing random walks on the graph. In a second step, those sequences are processed by the word2vec algorithm for creating the actual embeddings. In this paper, we explore the use of external edge weights for guiding the random walks. As edge weights, transition probabilities between pages in Wikipedia are used as a proxy for human feedback on the importance of an edge. We show that in some scenarios, RDF2vec utilizing those transition probabilities can outperform both RDF2vec based on uniform random walks and RDF2vec using graph-internal edge weights.
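Both RDF2vec stages fit in a short sketch: weighted random walks followed by word2vec over the resulting node sequences. The toy graph and its transition weights below are invented stand-ins for the Wikipedia click-through probabilities:

```python
import random
from gensim.models import Word2Vec

# Toy weighted graph: node -> list of (neighbor, transition weight).
# The weights are invented stand-ins for Wikipedia click-through probabilities.
graph = {
    "Berlin": [("Germany", 0.8), ("Bundesliga", 0.2)],
    "Germany": [("Berlin", 0.5), ("Europe", 0.5)],
    "Europe": [("Germany", 1.0)],
    "Bundesliga": [("Germany", 1.0)],
}

def weighted_walk(start, length, rng):
    """One random walk whose transitions follow the external edge weights."""
    walk = [start]
    for _ in range(length - 1):
        nodes, weights = zip(*graph[walk[-1]])
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

rng = random.Random(0)
walks = [weighted_walk(node, 8, rng) for node in graph for _ in range(50)]
model = Word2Vec(sentences=walks, vector_size=32, window=4, sg=1, min_count=1)
print(model.wv.most_similar("Berlin", topn=2))
```

Swapping the weights for uniform probabilities recovers standard RDF2vec, which is exactly the comparison the paper draws.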
27. Large Arabic Twitter Dataset on COVID-19 [PDF]
Sarah Alqurashi, Ahmad Alhindi, Eisa Alanazi
Abstract: The 2019 coronavirus disease (COVID-19), which emerged in late December 2019 in China, is now rapidly spreading across the globe. At the time of writing this paper, the number of global confirmed cases has passed one and a half million, with over 75,000 fatalities. Many countries have enforced strict social distancing policies to contain the spread of the virus. This has changed the daily life of tens of millions of people and prompted people to move their discussions online, e.g., via online social media sites like Twitter. In this work, we describe the first Arabic tweets dataset on COVID-19, which we have been collecting since March 1st, 2020. The dataset would help researchers and policy makers in studying different societal issues related to the pandemic. Many other tasks related to behavioral change, information sharing, misinformation, and rumor spreading can also be analyzed.
28. Learning to Scale Multilingual Representations for Vision-Language Tasks [PDF]
Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, Bryan A. Plummer
Abstract: Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that represents many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed-size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for only a few. We use a novel masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
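The cross-lingual consistency idea can be sketched as follows, assuming that "comparable predictions" means similar image-retrieval distributions for a query and its machine translation; the symmetric-KL formulation below is our assumption, not necessarily the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def consistency_loss(query_emb, translation_emb, image_embs, tau=0.07):
    """Symmetric-KL penalty between the image-retrieval distributions induced
    by a query and by its machine translation (hypothetical formulation)."""
    q = F.softmax(query_emb @ image_embs.T / tau, dim=-1)
    t = F.softmax(translation_emb @ image_embs.T / tau, dim=-1)
    m = 0.5 * (q + t)  # mixture, as in Jensen-Shannon divergence
    return 0.5 * (F.kl_div(m.log(), q, reduction="batchmean")
                  + F.kl_div(m.log(), t, reduction="batchmean"))

queries = F.normalize(torch.randn(8, 256), dim=-1)        # sentence embeddings
translations = F.normalize(queries + 0.05 * torch.randn(8, 256), dim=-1)
images = F.normalize(torch.randn(100, 256), dim=-1)       # image embeddings
print(consistency_loss(queries, translations, images))
```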