
[arXiv Papers] Computation and Language 2021-01-01

Table of Contents

1. Intrinsic Bias Metrics Do Not Correlate with Application Bias [PDF] Abstract
2. Studying Strategically: Learning to Mask for Closed-book QA [PDF] Abstract
3. Using Natural Language Relations between Answer Choices for Machine Comprehension [PDF] Abstract
4. Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade [PDF] Abstract
5. Shortformer: Better Language Modeling using Shorter Inputs [PDF] Abstract
6. MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers [PDF] Abstract
7. UCCA's Foundational Layer: Annotation Guidelines v2.1 [PDF] Abstract
8. Promoting Graph Awareness in Linearized Graph-to-Text Generation [PDF] Abstract
9. Factual Error Correction of Claims [PDF] Abstract
10. Conditional Generation of Temporally-ordered Event Sequences [PDF] Abstract
11. Understanding Politics via Contextualized Discourse Processing [PDF] Abstract
12. Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection [PDF] Abstract
13. Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences [PDF] Abstract
14. Making Pre-trained Language Models Better Few-shot Learners [PDF] Abstract
15. FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [PDF] Abstract
16. Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring [PDF] Abstract
17. Revisiting Robust Neural Machine Translation: A Transformer Case Study [PDF] Abstract
18. BinaryBERT: Pushing the Limit of BERT Quantization [PDF] Abstract
19. Better Robustness by More Coverage: Adversarial Training with Mixup Augmentation for Robust Fine-tuning [PDF] Abstract
20. ERNIE-DOC: The Retrospective Long-Document Modeling Transformer [PDF] Abstract
21. A Closer Look at Few-Shot Crosslingual Transfer: Variance, Benchmarks and Baselines [PDF] Abstract
22. ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora [PDF] Abstract
23. VOLT: Improving Vocabularization via Optimal Transport for Machine Translation [PDF] Abstract
24. CoCoLM: COmplex COmmonsense Enhanced Language Model [PDF] Abstract
25. TexSmart: A Text Understanding System for Fine-Grained NER and Enhanced Semantic Analysis [PDF] Abstract
26. Open Korean Corpora: A Practical Report [PDF] Abstract
27. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [PDF] Abstract
28. HateCheck: Functional Tests for Hate Speech Detection Models [PDF] Abstract
29. Coreference Reasoning in Machine Reading Comprehension [PDF] Abstract
30. UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [PDF] Abstract
31. XLM-T: Scaling up Multilingual Machine Translation with Pretrained Cross-lingual Transformer Encoders [PDF] Abstract
32. HopRetriever: Retrieve Hops over Wikipedia to Answer Complex Questions [PDF] Abstract
33. BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining [PDF] Abstract
34. Linear-Time WordPiece Tokenization [PDF] Abstract
35. AraGPT2: Pre-Trained Transformer for Arabic Language Generation [PDF] Abstract
36. AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding [PDF] Abstract
37. Neural Machine Translation: A Review of Methods, Resources, and Tools [PDF] Abstract
38. Continual Learning in Task-Oriented Dialogue Systems [PDF] Abstract
39. Towards Zero-Shot Knowledge Distillation for Natural Language Processing [PDF] Abstract
40. Seeing is Knowing! Fact-based Visual Question Answering using Knowledge Graph Embeddings [PDF] Abstract
41. FiD-Ex: Improving Sequence-to-Sequence Models for Extractive Rationale Generation [PDF] Abstract
42. CLEAR: Contrastive Learning for Sentence Representation [PDF] Abstract
43. Exploring Monolingual Data for Neural Machine Translation with Knowledge Distillation [PDF] Abstract
44. Text-Free Image-to-Speech Synthesis Using Learned Segmental Units [PDF] Abstract
45. The jsRealB Text Realizer: Organization and Use Cases [PDF] Abstract
46. Verb Knowledge Injection for Multilingual Event Processing [PDF] Abstract
47. An Experimental Evaluation of Transformer-based Language Models in the Biomedical Domain [PDF] Abstract
48. Directed Beam Search: Plug-and-Play Lexically Constrained Language Generation [PDF] Abstract
49. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning [PDF] Abstract
50. Refine and Imitate: Reducing Repetition and Inconsistency in Persuasion Dialogues via Reinforcement Learning and Human Demonstration [PDF] Abstract
51. Optimizing Deeper Transformers on Small Datasets: An Application on Text-to-SQL Semantic Parsing [PDF] Abstract
52. Deriving Contextualised Semantic Features from BERT (and Other Transformer Model) Embeddings [PDF] Abstract
53. DynaSent: A Dynamic Benchmark for Sentiment Analysis [PDF] Abstract
54. kōan: A Corrected CBOW Implementation [PDF] Abstract
55. Generating Landmark Navigation Instructions from Maps as a Graph-to-Text Problem [PDF] Abstract
56. DEER: A Data Efficient Language Model for Event Temporal Reasoning [PDF] Abstract
57. Predicting cross-linguistic adjective order with information gain [PDF] Abstract
58. Robustness Testing of Language Understanding in Dialog Systems [PDF] Abstract
59. Unsupervised Label-aware Event Trigger and Argument Classification [PDF] Abstract
60. Can Sequence-to-Sequence Models Crack Substitution Ciphers? [PDF] Abstract
61. Introducing Orthogonal Constraint in Structural Probes [PDF] Abstract
62. SemGloVe: Semantic Co-occurrences for GloVe from BERT [PDF] Abstract
63. Out of Order: How important is the sequential order of words in a sentence in Natural Language Understanding tasks? [PDF] Abstract
64. Synthetic Source Language Augmentation for Colloquial Neural Machine Translation [PDF] Abstract
65. A Memory Efficient Baseline for Open Domain Question Answering [PDF] Abstract
66. Improving BERT with Syntax-aware Local Attention [PDF] Abstract
67. Improving Zero-Shot Translation by Disentangling Positional Information [PDF] Abstract
68. Joint Verification and Reranking for Open Fact Checking Over Tables [PDF] Abstract
69. Accurate Word Representations with Universal Visual Guidance [PDF] Abstract
70. A Subword Guided Neural Word Segmentation Model for Sindhi [PDF] Abstract
71. Human Evaluation of Spoken vs. Visual Explanations for Open-Domain QA [PDF] Abstract
72. Enhancing Pre-trained Language Model with Lexical Simplification [PDF] Abstract
73. Reservoir Transformer [PDF] Abstract
74. Language Identification of Devanagari Poems [PDF] Abstract
75. ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning [PDF] Abstract
76. OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [PDF] Abstract
77. Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness [PDF] Abstract
78. Few-Shot Named Entity Recognition: A Comprehensive Study [PDF] Abstract
79. Generating Natural Language Attacks in a Hard Label Black Box Setting [PDF] Abstract
80. Generating Wikipedia Article Sections from Diverse Data Sources [PDF] Abstract
81. Transformer Feed-Forward Layers Are Key-Value Memories [PDF] Abstract
82. The Parallel Meaning Bank: A Framework for Semantically Annotating Multiple Languages [PDF] Abstract
83. DRS at MRP 2020: Dressing up Discourse Representation Structures as Graphs [PDF] Abstract
84. Dialogue Graph Modeling for Conversational Machine Reading [PDF] Abstract
85. A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation [PDF] Abstract
86. Combining Semilattices and Semimodules [PDF] Abstract
87. Abstractive Query Focused Summarization with Query-Free Resources [PDF] Abstract
88. Generating Adversarial Examples in Chinese Texts Using Sentence-Pieces [PDF] Abstract
89. Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning [PDF] Abstract
90. CMV-BERT: Contrastive multi-vocab pretraining of BERT [PDF] Abstract
91. Dialogue Response Selection with Hierarchical Curriculum Learning [PDF] Abstract
92. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [PDF] Abstract
93. SIT3: Code Summarization with Structure-Induced Transformer [PDF] Abstract
94. Accelerating Pre-trained Language Models via Calibrated Cascade [PDF] Abstract
95. Faster Re-translation Using Non-Autoregressive Model For Simultaneous Neural Machine Translation [PDF] Abstract
96. RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems [PDF] Abstract
97. A Theoretical Analysis of the Repetition Problem in Text Generation [PDF] Abstract
98. Can You be More Social? Injecting Politeness and Positivity into Task-Oriented Conversational Agents [PDF] Abstract
99. Interpretable NLG for Task-oriented Dialogue Systems with Heterogeneous Rendering Machines [PDF] Abstract
100. Multiple Structural Priors Guided Self Attention Network for Language Understanding [PDF] Abstract
101. Unified Open-Domain Question Answering with Structured and Unstructured Knowledge [PDF] Abstract
102. Is human scoring the best criteria for summary evaluation? [PDF] Abstract
103. Understanding and Improving Lexical Choice in Non-Autoregressive Translation [PDF] Abstract
104. YASO: A New Benchmark for Targeted Sentiment Analysis [PDF] Abstract
105. Robust Dialogue Utterance Rewriting as Sequence Tagging [PDF] Abstract
106. A Paragraph-level Multi-task Learning Model for Scientific Fact-Verification [PDF] Abstract
107. Language-Mediated, Object-Centric Representation Learning [PDF] Abstract
108. FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging [PDF] Abstract
109. EfficientNet-Absolute Zero for Continuous Speech Keyword Spotting [PDF] Abstract
110. Discovering Dialog Structure Graph for Open-Domain Dialog Generation [PDF] Abstract
111. Unified Mandarin TTS Front-end Based on Distilled BERT Model [PDF] Abstract
112. Meta Adaptive Neural Ranking with Contrastive Synthetic Supervision [PDF] Abstract

Abstracts

1. Intrinsic Bias Metrics Do Not Correlate with Application Bias [PDF] Back to Contents
  Seraphina Goldfarb-Tarrant, Rebecca Marchant, Ricardo Muñoz Sanchez, Mugdha Pandya, Adam Lopez
Abstract: Natural Language Processing (NLP) systems learn harmful societal biases that cause them to extend and proliferate inequality widely, as they are deployed in more and more situations. To address and combat this, the NLP community has come to rely on a variety of metrics to identify and quantify bias in black-box models, which are used to monitor model behaviour and to guide efforts at debiasing. Some of these metrics are intrinsic, measured in word embedding spaces, and some are extrinsic, measuring the bias present downstream in the tasks that the word embeddings are plugged into. This research examines whether intrinsic metrics (which are easy to measure) correlate well with extrinsic metrics (which reflect real-world bias). We measure both intrinsic and extrinsic bias across hundreds of trained models covering different tasks and experimental conditions and find that there is no reliable correlation between these metrics that holds in more than extremely specific settings. We advise that efforts to debias embedding spaces always be paired with measurement of downstream model bias, and suggest that the community direct more effort into making downstream measurement simpler and easier.

2. Studying Strategically: Learning to Mask for Closed-book QA [PDF] Back to Contents
  Qinyuan Ye, Belinda Z. Li, Sinong Wang, Benjamin Bolte, Hao Ma, Xiang Ren, Wen-tau Yih, Madian Khabsa
Abstract: Closed-book question-answering (QA) is a challenging task that requires a model to directly answer questions without access to external knowledge. It has been shown that directly fine-tuning pre-trained language models with (question, answer) examples yields surprisingly competitive performance, which is further improved upon through adding an intermediate pre-training stage between general pre-training and fine-tuning. Prior work used a heuristic during this intermediate stage, whereby named entities and dates are masked, and the model is trained to recover these tokens. In this paper, we aim to learn the optimal masking strategy for the intermediate pretraining stage. We first train our masking policy to extract spans that are likely to be tested, using supervision from the downstream task itself, then deploy the learned policy during intermediate pre-training. Thus, our policy packs task-relevant knowledge into the parameters of a language model. Our approach is particularly effective on TriviaQA, outperforming strong heuristics when used to pre-train BART.
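
The heuristic baseline mentioned above (masking named entities and dates, sometimes called salient span masking) can be sketched in a few lines. The snippet below illustrates only that baseline, not the learned masking policy proposed in the paper, and it assumes spaCy with the en_core_web_sm model is available.

import spacy

nlp = spacy.load("en_core_web_sm")

def heuristic_salient_span_mask(text, mask_token="[MASK]"):
    """Replace named entities and dates with a mask token; during the
    intermediate pre-training stage the model is trained to recover the
    masked spans. Purely illustrative of the heuristic baseline."""
    doc = nlp(text)
    pieces, last = [], 0
    for ent in doc.ents:            # spaCy entities include DATE spans
        pieces.append(text[last:ent.start_char])
        pieces.append(mask_token)
        last = ent.end_char
    pieces.append(text[last:])
    return "".join(pieces)

print(heuristic_salient_span_mask("Marie Curie won the Nobel Prize in Physics in 1903."))

The learned policy in the paper instead scores candidate spans using supervision from the downstream QA task before being deployed during intermediate pre-training.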

3. Using Natural Language Relations between Answer Choices for Machine Comprehension [PDF] Back to Contents
  Rajkumar Pujari, Dan Goldwasser
Abstract: When evaluating an answer choice for Reading Comprehension task, other answer choices available for the question and the answers of related questions about the same paragraph often provide valuable information. In this paper, we propose a method to leverage the natural language relations between the answer choices, such as entailment and contradiction, to improve the performance of machine comprehension. We use a stand-alone question answering (QA) system to perform QA task and a Natural Language Inference (NLI) system to identify the relations between the choice pairs. Then we perform inference using an Integer Linear Programming (ILP)-based relational framework to re-evaluate the decisions made by the standalone QA system in light of the relations identified by the NLI system. We also propose a multitask learning model that learns both the tasks jointly.

4. Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade [PDF] Back to Contents
  Jiatao Gu, Xiang Kong
Abstract: Fully non-autoregressive neural machine translation (NAT) predicts tokens simultaneously with a single forward pass of the neural network, which significantly reduces inference latency at the expense of a quality drop compared to the Transformer baseline. In this work, we aim to close the performance gap while maintaining the latency advantage. We first inspect the fundamental issues of fully NAT models, and adopt dependency reduction in the learning space of output tokens as the basic guidance. Then, we revisit methods in four different aspects that have been proven effective for improving NAT models, and carefully combine these techniques with necessary modifications. Our extensive experiments on three translation benchmarks show that the proposed system achieves new state-of-the-art results for fully NAT models, and obtains performance comparable to autoregressive and iterative NAT systems. For instance, one of the proposed models achieves 27.49 BLEU points on WMT14 En-De with an approximately 16.5X speed-up at inference time.

5. Shortformer: Better Language Modeling using Shorter Inputs [PDF] Back to Contents
  Ofir Press, Noah A. Smith, Mike Lewis
Abstract: We explore the benefits of decreasing the input length of transformers. First, we show that initially training the model on short subsequences, before moving on to longer ones, both reduces overall training time and, surprisingly, gives a large improvement in perplexity. We then show how to improve the efficiency of recurrence methods in transformers, which let models condition on previously processed tokens (when generating sequences that are larger than the maximal length that the transformer can handle at once). Existing methods require computationally expensive relative position embeddings; we introduce a simple alternative of adding absolute position embeddings to queries and keys instead of to word embeddings, which efficiently produces superior results. By combining these techniques, we increase training speed by 65%, make generation nine times faster, and substantially improve perplexity on WikiText-103, without adding any parameters.
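
A minimal sketch of the position-infused attention idea described above, assuming a single unbatched sequence, one head, and no causal mask; all tensor names and sizes are illustrative rather than taken from the paper's code.

import torch
import torch.nn.functional as F

def position_infused_attention(x, pos_emb, w_q, w_k, w_v):
    """x: (seq, d_model) token representations with no position information.
    pos_emb: (seq, d_model) absolute position embeddings.
    Positions are added to the queries and keys only, so the values
    (and therefore the attention outputs) stay position-free."""
    q = (x + pos_emb) @ w_q
    k = (x + pos_emb) @ w_k
    v = x @ w_v
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

seq, d = 8, 16                                    # illustrative sizes
x, pos = torch.randn(seq, d), torch.randn(seq, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
print(position_infused_attention(x, pos, w_q, w_k, w_v).shape)  # torch.Size([8, 16])

Because the values carry no positional information, outputs for previously processed tokens can be cached and reused when the model later conditions on longer contexts.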

6. MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers [PDF] Back to Contents
  Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, Furu Wei
Abstract: We generalize deep self-attention distillation in MiniLM (Wang et al., 2020) by only using self-attention relation distillation for task-agnostic compression of pretrained Transformers. In particular, we define multi-head self-attention relations as scaled dot-product between the pairs of query, key, and value vectors within each self-attention module. Then we employ the above relational knowledge to train the student model. Besides its simplicity and unified principle, more favorably, there is no restriction in terms of the number of student's attention heads, while most previous work has to guarantee the same head number between teacher and student. Moreover, the fine-grained self-attention relations tend to fully exploit the interaction knowledge learned by Transformer. In addition, we thoroughly examine the layer selection strategy for teacher models, rather than just relying on the last layer as in MiniLM. Experimental results demonstrate that our models distilled from base-size and large-size teachers (BERT, and RoBERTa) outperform the state of the art.
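
A rough sketch of the self-attention relation distillation objective described above, assuming single unbatched sequences; the relation-head count and the use of a KL divergence follow the abstract's description, but the exact loss weighting and layer selection are not reproduced here.

import torch
import torch.nn.functional as F

def self_attention_relations(vectors, num_relation_heads):
    """vectors: (seq, hidden) queries, keys, or values of one layer.
    Re-split into `num_relation_heads` heads so that teacher and student
    may use different hidden sizes and head counts."""
    seq, hidden = vectors.shape
    d_head = hidden // num_relation_heads
    heads = vectors.view(seq, num_relation_heads, d_head).transpose(0, 1)
    relations = heads @ heads.transpose(-1, -2) / (d_head ** 0.5)  # scaled dot-products
    return F.log_softmax(relations, dim=-1)                        # (heads, seq, seq)

def relation_distillation_loss(teacher_qkv, student_qkv, num_relation_heads=12):
    """KL divergence between the teacher's and student's Q-Q, K-K, and V-V relations."""
    loss = 0.0
    for t, s in zip(teacher_qkv, student_qkv):
        t_rel = self_attention_relations(t, num_relation_heads)
        s_rel = self_attention_relations(s, num_relation_heads)
        loss = loss + F.kl_div(s_rel, t_rel, log_target=True, reduction="batchmean")
    return loss

seq = 10                                                   # illustrative sizes
teacher_qkv = [torch.randn(seq, 768) for _ in range(3)]    # teacher Q, K, V
student_qkv = [torch.randn(seq, 384) for _ in range(3)]    # smaller student Q, K, V
print(relation_distillation_loss(teacher_qkv, student_qkv))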

7. UCCA's Foundational Layer: Annotation Guidelines v2.1 [PDF] Back to Contents
  Omri Abend, Nathan Schneider, Dotan Dvir, Jakob Prange, Ari Rappoport
Abstract: This is the annotation manual for Universal Conceptual Cognitive Annotation (UCCA; Abend and Rappoport, 2013), specifically the Foundational Layer. UCCA is a graph-based semantic annotation scheme based on typological linguistic principles. It has been applied to several languages; for ease of exposition these guidelines give examples mainly in English. New annotators may wish to start with the tutorial on the UCCA framework (Abend et al., 2020). Further resources are available at the project homepage: this https URL

8. Promoting Graph Awareness in Linearized Graph-to-Text Generation [PDF] Back to Contents
  Alexander Hoyle, Ana Marasović, Noah Smith
Abstract: Generating text from structured inputs, such as meaning representations or RDF triples, has often involved the use of specialized graph-encoding neural networks. However, recent applications of pretrained transformers to linearizations of graph inputs have yielded state-of-the-art generation results on graph-to-text tasks. Here, we explore the ability of these linearized models to encode local graph structures, in particular their invariance to the graph linearization strategy and their ability to reconstruct corrupted inputs. Our findings motivate solutions to enrich the quality of models' implicit graph encodings via scaffolding. Namely, we use graph-denoising objectives implemented in a multi-task text-to-text framework. We find that these denoising scaffolds lead to substantial improvements in downstream generation in low-resource settings.

9. Factual Error Correction of Claims [PDF] Back to Contents
  James Thorne, Andreas Vlachos
Abstract: This paper introduces the task of factual error correction: performing edits to a claim so that the generated rewrite is supported by evidence. This serves two purposes: firstly this provides a mechanism to correct written texts that contain misinformation, and secondly, this acts as an inherent explanation for claims already partially supported by evidence. We demonstrate that factual error correction is possible without the need for any additional training data using distant-supervision and retrieved evidence. We release a dataset of 65,000 instances, based on a recent fact verification dataset, to compare our distantly-supervised method to a fully supervised ceiling system. Our manual evaluation indicates which automated evaluation metrics best correlate with human judgements of factuality and whether errors were actually corrected.

10. Conditional Generation of Temporally-ordered Event Sequences [PDF] Back to Contents
  Shih-Ting Lin, Nathanael Chambers, Greg Durrett
Abstract: Models encapsulating narrative schema knowledge have proven to be useful for a range of event-related tasks, but these models typically do not engage with temporal relationships between events. We present a BART-based conditional generation model capable of capturing event cooccurrence as well as the temporality of event sequences. This single model can address both temporal ordering, sorting a given sequence of events into the order they occurred, and event infilling, predicting new events which fit into a temporally-ordered sequence of existing ones. Our model is trained as a denoising autoencoder: we take temporally-ordered event sequences, shuffle them, delete some events, and then attempt to recover the original event sequence. In this fashion, the model learns to make inferences given incomplete knowledge about the events in an underlying scenario. On the temporal ordering task, we show that our model is able to unscramble event sequences from existing datasets without access to explicitly labeled temporal training data, outperforming both a BERT-based pairwise model and a BERT-based pointer network. On event infilling, human evaluation shows that our model is able to generate events that fit better temporally into the input events when compared to GPT-2 story completion models.
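
A small illustration of the denoising corruption described above (delete some of the temporally ordered events, then shuffle the rest); the event strings, deletion probability, and function name are invented for this example, and the actual model is a BART-style seq2seq trained to regenerate the original sequence.

import random

def corrupt_event_sequence(events, delete_prob=0.15, seed=None):
    """Randomly delete some events, then shuffle the remainder; the corrupted
    list is the encoder input and the original, temporally ordered list is
    the decoder target."""
    rng = random.Random(seed)
    kept = [event for event in events if rng.random() > delete_prob]
    rng.shuffle(kept)
    return kept

original = ["wake up", "brew coffee", "drive to work", "attend the meeting"]
print(corrupt_event_sequence(original, seed=0))   # shuffled, possibly shorter copy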

11. Understanding Politics via Contextualized Discourse Processing [PDF] Back to Contents
  Rajkumar Pujari, Dan Goldwasser
Abstract: Politicians often have underlying agendas when reacting to events. Arguments in contexts of various events reflect a fairly consistent set of agendas for a given entity. In spite of recent advances in Pretrained Language Models (PLMs), those text representations are not designed to capture such nuanced patterns. In this paper, we propose a Compositional Reader model consisting of encoder and composer modules, that attempts to capture and leverage such information to generate more effective representations for entities, issues, and events. These representations are contextualized by tweets, press releases, issues, news articles, and participating entities. Our model can process several documents at once and generate composed representations for multiple entities over several issues or events. Via qualitative and quantitative empirical analysis, we show that these representations are meaningful and effective.

12. Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection [PDF] Back to Contents
  Bertie Vidgen, Tristan Thrush, Zeerak Waseem, Douwe Kiela
Abstract: We present a first-of-its-kind large synthetic training dataset for online hate classification, created from scratch with trained annotators over multiple rounds of dynamic data collection. We provide a 40,623 example dataset with annotations for fine-grained labels, including a large number of challenging contrastive perturbation examples. Unusually for an abusive content dataset, it comprises 54% hateful and 46% not hateful entries. We show that model performance and robustness can be greatly improved using the dynamic data collection paradigm. The model error rate decreased across rounds, from 72.1% in the first round to 35.8% in the last round, showing that models became increasingly harder to trick -- even though content become progressively more adversarial as annotators became more experienced. Hate speech detection is an important and subtle problem that is still very challenging for existing AI methods. We hope that the models, dataset and dynamic system that we present here will help improve current approaches, having a positive social impact.

13. Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences [PDF] Back to Contents
  Denis Emelin, Ronan Le Bras, Jena D. Hwang, Maxwell Forbes, Yejin Choi
Abstract: In social settings, much of human behavior is governed by unspoken rules of conduct. For artificial systems to be fully integrated into social environments, adherence to such norms is a central prerequisite. We investigate whether contemporary NLG models can function as behavioral priors for systems deployed in social settings by generating action hypotheses that achieve predefined goals under moral constraints. Moreover, we examine if models can anticipate likely consequences of (im)moral actions, or explain why certain actions are preferable by generating relevant norms. For this purpose, we introduce 'Moral Stories', a crowd-sourced dataset of structured, branching narratives for the study of grounded, goal-oriented social reasoning. Finally, we propose decoding strategies that effectively combine multiple expert models, e.g. through abductive reasoning, to significantly improve the quality of generated actions, consequences, and norms compared to strong baselines.

14. Making Pre-trained Language Models Better Few-shot Learners [PDF] Back to Contents
  Tianyu Gao, Adam Fisch, Danqi Chen
Abstract: The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF--better few-shot fine-tuning of language models--a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples. Our approach includes (1) prompt-based fine-tuning together with a novel pipeline for automating prompt generation; and (2) a refined strategy for dynamically and selectively incorporating demonstrations into each context. Finally, we present a systematic evaluation for analyzing few-shot performance on a range of NLP tasks, including classification and regression. Our experiments demonstrate that our methods combine to dramatically outperform standard fine-tuning procedures in this low resource setting, achieving up to 30% absolute improvement, and 11% on average across all tasks. Our approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning.
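
A toy illustration of the prompt-based fine-tuning setup described above: a cloze-style template plus a verbalizer mapping labels to words, optionally prefixed with a few demonstrations. The template, label words, and helper function are invented for illustration; in the paper both are searched for automatically, and a masked language model scores the label words at the [MASK] position.

TEMPLATE = "{sentence} It was [MASK]."
VERBALIZER = {"positive": "great", "negative": "terrible"}   # illustrative label words

def build_prompt(sentence, demonstrations=()):
    """Prepend labelled demonstrations (with the mask filled by the label
    word), then append the cloze template for the example to classify."""
    demos = " ".join(
        TEMPLATE.replace("[MASK]", VERBALIZER[label]).format(sentence=s)
        for s, label in demonstrations
    )
    query = TEMPLATE.format(sentence=sentence)
    return (demos + " " + query).strip()

print(build_prompt(
    "The plot was gripping from start to finish.",
    demonstrations=[("A complete waste of time.", "negative")],
))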

15. FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [PDF] Back to Contents
  Wenhao Zhu, Shujian Huang, Tong Pu, Xu Zhang, Jian Yu, Wei Chen, Yanfeng Wang, Jiajun Chen
Abstract: Previous domain adaptation research usually neglects the diversity of translation within a single domain, which is a core problem for adapting a general neural machine translation (NMT) model to a specific domain in real-world scenarios. One representative of such challenging scenarios is deploying a translation system for a conference with a specific topic, e.g. computer networks or natural language processing, where resources are usually extremely limited due to the tight time schedule. To motivate wide investigation in such settings, we present a real-world fine-grained domain adaptation task in machine translation (FDMT). The FDMT dataset (Zh-En) consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks and smart phones. To be closer to reality, FDMT does not employ any in-domain bilingual training data. Instead, each sub-domain is equipped with monolingual data, a bilingual dictionary and a knowledge base, to encourage in-depth exploration of these available resources. A corresponding development set and test set are provided for evaluation purposes. We conduct quantitative experiments and deep analyses in this new setting, which benchmarks the fine-grained domain adaptation task and reveals several challenging problems that need to be addressed.

16. Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring [PDF] Back to Contents
  Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre
Abstract: Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have this limitation, while requiring a weak seed dictionary (e.g., a list of identical words) as the only form of supervision. Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them. To that end, we use an extension of skip-gram that leverages translated context words as anchor points, and incorporates self-learning and iterative restarts to reduce the dependency on the initial dictionary. Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.

17. Revisiting Robust Neural Machine Translation: A Transformer Case Study [PDF] Back to Contents
  Peyman Passban, Puneeth S.M. Saladi, Qun Liu
Abstract: Transformers (Vaswani et al., 2017) have brought a remarkable improvement in the performance of neural machine translation (NMT) systems, but they can be surprisingly vulnerable to noise. Accordingly, we investigate how noise breaks Transformers and whether there exist solutions to deal with such issues. There is a large body of work in the NMT literature on analyzing the behaviour of conventional models for the problem of noise, but Transformers seem understudied in this context. Therefore, we introduce a novel data-driven technique to incorporate noise during training. This idea is comparable to the well-known fine-tuning strategy. Moreover, we propose two new extensions to the original Transformer that modify the neural architecture as well as the training process to handle noise. We evaluated our techniques on translating the English-German pair in both directions. Experimental results show that our models have a higher tolerance to noise. More specifically, they perform with no deterioration even when up to 10% of all test words are affected by noise.

18. BinaryBERT: Pushing the Limit of BERT Quantization [PDF] Back to Contents
  Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, Xin Jiang, Qun Liu, Michael Lyu, Irwin King
Abstract: The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper, we propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization. We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape. Therefore, we propose ternary weight splitting, which initializes the binary model by equivalent splitting from a half-sized ternary network. The binary model thus inherits the good performance of the ternary model, and can be further enhanced by fine-tuning the new architecture after splitting. Empirical results show that BinaryBERT has a negligible performance drop compared to the full-precision BERT-base while being $24\times$ smaller, achieving state-of-the-art results on the GLUE and SQuAD benchmarks.
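
One simple way to split a ternary weight into two binary weights whose sum reproduces it, shown only to make the "equivalent splitting" idea above concrete; this is an assumed illustration, not necessarily the exact splitting formula used in the paper.

import numpy as np

def split_ternary_to_binary(w_ternary, scale):
    """Each ternary weight in {-scale, 0, +scale} becomes the sum of two
    binary weights in {-scale/2, +scale/2}, so the split pair is functionally
    equivalent to the ternary weight it is initialized from. Illustrative only."""
    half = scale / 2.0
    b1 = np.where(w_ternary > 0, half, np.where(w_ternary < 0, -half, half))
    b2 = np.where(w_ternary > 0, half, np.where(w_ternary < 0, -half, -half))
    assert np.allclose(b1 + b2, w_ternary)
    return b1, b2

w = 0.05 * np.array([1.0, 0.0, -1.0, 0.0])        # a toy ternary weight row
b1, b2 = split_ternary_to_binary(w, scale=0.05)
print(b1 + b2)                                     # recovers the ternary row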

19. Better Robustness by More Coverage: Adversarial Training with Mixup Augmentation for Robust Fine-tuning [PDF] Back to Contents
  Chenglei Si, Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Qun Liu, Maosong Sun
Abstract: Pre-trained language models (PLMs) fail miserably on adversarial attacks. To improve the robustness, adversarial data augmentation (ADA) has been widely adopted, which attempts to cover more search space of adversarial attacks by adding the adversarial examples during training. However, the number of adversarial examples added by ADA is extremely insufficient due to the enormously large search space. In this work, we propose a simple and effective method to cover much larger proportion of the attack search space, called Adversarial Data Augmentation with Mixup (MixADA). Specifically, MixADA linearly interpolates the representations of pairs of training examples to form new virtual samples, which are more abundant and diverse than the discrete adversarial examples used in conventional ADA. Moreover, to evaluate the robustness of different models fairly, we adopt a challenging setup, which dynamically generates new adversarial examples for each model. In the text classification experiments of BERT and RoBERTa, MixADA achieves significant robustness gains under two strong adversarial attacks and alleviates the performance degradation of ADA on the original data. Our source codes will be released to support further explorations.
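
Mixup itself is a standard technique; the sketch below shows the interpolation step that MixADA applies to pairs of training representations and their labels, with the mixing coefficient drawn from a Beta distribution. Variable names and alpha are illustrative, and the choice of which layer's representations to mix is not specified here.

import numpy as np

def mixup(h_a, h_b, y_a, y_b, alpha=0.4, rng=None):
    """Linearly interpolate a pair of hidden representations (e.g. an original
    example and one of its adversarial examples) and their one-hot labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * h_a + (1 - lam) * h_b, lam * y_a + (1 - lam) * y_b

h_orig, h_adv = np.random.randn(768), np.random.randn(768)   # illustrative vectors
y_orig, y_adv = np.array([1.0, 0.0]), np.array([1.0, 0.0])
h_mix, y_mix = mixup(h_orig, h_adv, y_orig, y_adv)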

20. ERNIE-DOC: The Retrospective Long-Document Modeling Transformer [PDF] Back to Contents
  Siyu Ding, Junyuan Shang, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
Abstract: Transformers are not suited to processing long document input due to their quadratically increasing memory and time consumption. Simply truncating a long document or applying a sparse attention mechanism incurs the context fragmentation problem or inferior modeling capability at a comparable model size. In this paper, we propose ERNIE-DOC, a document-level language pretraining model based on Recurrence Transformers. Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, give ERNIE-DOC a much longer effective context length for capturing the contextual information of a whole document. We pretrain ERNIE-DOC to explicitly learn the relationships among segments with an additional document-aware segment reordering objective. Various experiments on both English and Chinese document-level tasks are conducted. ERNIE-DOC achieves a SOTA language modeling result of 16.8 ppl on WikiText-103 and outperforms competitive pretraining models by a large margin on most language understanding tasks, such as text classification and question answering.

21. A Closer Look at Few-Shot Crosslingual Transfer: Variance, Benchmarks and Baselines [PDF] Back to Contents
  Mengjie Zhao, Yi Zhu, Ehsan Shareghi, Roi Reichart, Anna Korhonen, Hinrich Schütze
Abstract: We present a focused study of few-shot crosslingual transfer, a recently proposed NLP scenario: a pretrained multilingual encoder is first finetuned on many annotations in a high resource language (typically English), and then finetuned on a few annotations (the ``few shots'') in a target language. Few-shot transfer brings large improvements over zero-shot transfer. However, we show that it inherently has large variance and it is necessary to report results on multiple sets of few shots for stable results and to guarantee fair comparison of different algorithms. To address this problem, we publish our few-shot sets. In a study of why few-shot learning outperforms zero-shot transfer, we show that large models heavily rely on lexical hints when finetuned on a few shots and then overfit quickly. We evaluate different methods that use few-shot annotations, but do not observe significant improvements over the baseline. This calls for better ways of utilizing the few-shot annotations.

22. ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora [PDF] Back to Contents
  Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
Abstract: Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance on downstream cross-lingual tasks. This improvement stems from the learning of a large amount of monolingual and parallel corpora. While it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for the low-resource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to break the constraint of parallel corpus size on the model performance. Our key insight is to integrate the idea of back translation in the pre-training process. We generate pseudo-parallel sentences pairs on a monolingual corpus to enable the learning of semantic alignment between different languages, which enhances the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results on various cross-lingual downstream tasks. The codes and pre-trained models will be made publicly available.

23. VOLT: Improving Vocabularization via Optimal Transport for Machine Translation [PDF] Back to Contents
  Jingjing Xu, Hao Zhou, Chun Gan, Zaixiang Zheng, Lei Li
Abstract: It is well accepted that the choice of token vocabulary largely affects the performance of machine translation. However, due to expensive trial costs, most studies only conduct simple trials with dominant approaches (e.g BPE) and commonly used vocabulary sizes. In this paper, we find an exciting relation between an information-theoretic feature and BLEU scores. With this observation, we formulate the quest of vocabularization -- finding the best token dictionary with a proper size -- as an optimal transport problem. We then propose VOLT, a simple and efficient vocabularization solution without the full and costly trial training. We evaluate our approach on multiple machine translation tasks, including WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation. Empirical results show that VOLT beats widely-used vocabularies on diverse scenarios. For example, VOLT achieves 70% vocabulary size reduction and 0.6 BLEU gain on English-German translation. Also, one advantage of VOLT lies in its low resource consumption. Compared to naive BPE-search, VOLT reduces the search time from 288 GPU hours to 0.5 CPU hours.

24. CoCoLM: COmplex COmmonsense Enhanced Language Model [PDF] Back to Contents
  Changlong Yu, Hongming Zhang, Yangqiu Song, Wilfred Ng
Abstract: Large-scale pre-trained language models have demonstrated strong knowledge representation ability. However, recent studies suggest that even though these giant models contain rich simple commonsense knowledge (e.g., birds can fly and fish can swim), they often struggle with complex commonsense knowledge that involves multiple eventualities (verb-centric phrases, e.g., identifying the relationship between ``Jim yells at Bob'' and ``Bob is upset''). To address this problem, in this paper, we propose to help pre-trained language models better incorporate complex commonsense knowledge. Different from existing fine-tuning approaches, we do not focus on a specific task and propose a general language model named CoCoLM. Through careful training over a large-scale eventuality knowledge graph, ASER, we successfully teach pre-trained language models (i.e., BERT and RoBERTa) rich complex commonsense knowledge among eventualities. Experiments on multiple downstream commonsense tasks that require the correct understanding of eventualities demonstrate the effectiveness of CoCoLM.

25. TexSmart: A Text Understanding System for Fine-Grained NER and Enhanced Semantic Analysis [PDF] Back to Contents
  Haisong Zhang, Lemao Liu, Haiyun Jiang, Yangming Li, Enbo Zhao, Kun Xu, Linfeng Song, Suncong Zheng, Botong Zhou, Jianchen Zhu, Xiao Feng, Tao Chen, Tao Yang, Dong Yu, Feng Zhang, Zhanhui Kang, Shuming Shi
Abstract: This technical report introduces TexSmart, a text understanding system that supports fine-grained named entity recognition (NER) and enhanced semantic analysis functionalities. Compared to most previous publicly available text understanding systems and tools, TexSmart holds some unique features. First, the NER function of TexSmart supports over 1,000 entity types, while most other public tools typically support several to (at most) dozens of entity types. Second, TexSmart introduces new semantic analysis functions like semantic expansion and deep semantic representation, which are absent in most previous systems. Third, a spectrum of algorithms (from very fast algorithms to those that are relatively slow but more accurate) is implemented for one function in TexSmart, to fulfill the requirements of different academic and industrial applications. The adoption of unsupervised or weakly-supervised algorithms is especially emphasized, with the goal of easily updating our models to include fresh data with less human annotation effort. The main contents of this report include the major functions of TexSmart, the algorithms for achieving these functions, how to use the TexSmart toolkit and Web APIs, and evaluation results of some key algorithms.

26. Open Korean Corpora: A Practical Report [PDF] Back to Contents
  Won Ik Cho, Sangwhan Moon, Youngsook Song
Abstract: Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it is also because the availability of resources is inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.

27. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [PDF] Back to Contents
  Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych
Abstract: In this work we provide a \textit{systematic empirical comparison} of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first establish if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for a performance difference. To disentangle the impacting variables, we train new monolingual models on the same data, but with different tokenizers, both the monolingual and the multilingual version. We find that while the pretraining data size is an important factor, the designated tokenizer of the monolingual model plays an equally important role in the downstream performance. Our results show that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
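
The kind of tokenizer comparison described above can be reproduced informally in a few lines: tokenize the same sentence with a monolingual and a multilingual tokenizer and compare how many subwords each needs per word (fertility). This assumes the Hugging Face transformers package is installed; the model names and sentence are only examples, not the paper's experimental setup.

from transformers import AutoTokenizer

sentence = "Multilingual tokenizers often oversegment words in many languages."

for name in ["bert-base-cased", "bert-base-multilingual-cased"]:   # example models
    tokenizer = AutoTokenizer.from_pretrained(name)
    pieces = tokenizer.tokenize(sentence)
    fertility = len(pieces) / len(sentence.split())
    print(f"{name}: {len(pieces)} subwords, fertility {fertility:.2f}")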

28. HateCheck: Functional Tests for Hate Speech Detection Models [PDF] Back to Contents
  Paul Röttger, Bertram Vidgen, Dong Nguyen, Zeerak Waseem, Helen Margetts, Janet Pierrehumbert
Abstract: Detecting online hate is a difficult task that even state-of-the-art models struggle with. In previous research, hate speech detection models are typically evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model quality due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HateCheck, a first suite of functional tests for hate speech detection models. We specify 29 model functionalities, the selection of which we motivate by reviewing previous research and through a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate data quality through a structured annotation process. To illustrate HateCheck's utility, we test near-state-of-the-art transformer detection models as well as a popular commercial model, revealing critical model weaknesses.

29. Coreference Reasoning in Machine Reading Comprehension [PDF] Back to Contents
  Mingzhu Wu, Nafise Sadat Moosavi, Dan Roth, Iryna Gurevych
Abstract: The ability to reason about multiple references to a given entity is essential for natural language understanding and has been long studied in NLP. In recent years, as the format of Question Answering (QA) became a standard for machine reading comprehension (MRC), there have been data collection efforts, e.g., Dasigi et al. (2019), that attempt to evaluate the ability of MRC models to reason about coreference. However, as we show, coreference reasoning in MRC is a greater challenge than was earlier thought; MRC datasets do not reflect the natural distribution and, consequently, the challenges of coreference reasoning. Specifically, success on these datasets does not reflect a model's proficiency in coreference reasoning. We propose a methodology for creating reading comprehension datasets that better reflect the challenges of coreference reasoning and use it to show that state-of-the-art models still struggle with these phenomena. Furthermore, we develop an effective way to use naturally occurring coreference phenomena from annotated coreference resolution datasets when training MRC models. This allows us to show an improvement in the coreference reasoning abilities of state-of-the-art models across various MRC datasets. We will release all the code and the resulting dataset at this https URL.

30. UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [PDF] Back to Contents
  Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, Sebastian Ruder
Abstract: Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. However, due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages. The ultimate challenge is dealing with under-resourced languages not covered at all by the models, which are also written in scripts \textit{unseen} during pretraining. In this work, we propose a series of novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts. Relying on matrix factorization, our proposed methods capitalize on the existing latent knowledge about multiple languages already available in the pretrained model's embedding matrix. Furthermore, we show that learning of the new dedicated embedding matrix in the target language can be improved by leveraging a small number of vocabulary items (i.e., the so-called \textit{lexically overlapping} tokens) shared between mBERT's and target language vocabulary. Our adaptation techniques offer substantial performance gains for languages with unseen scripts. We also demonstrate that they can also yield improvements for low-resource languages written in scripts covered by the pretrained model.
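
A simplified sketch of initializing a new, target-language embedding matrix from a pretrained one by copying lexically overlapping tokens, which is one ingredient of the approach above; the matrix-factorization component is omitted, and all vocabularies and sizes here are toy values.

import numpy as np

def init_target_embeddings(new_vocab, old_vocab, old_emb, rng=None):
    """Tokens shared with the pretrained vocabulary copy their pretrained
    embedding; the remaining rows are drawn from a small Gaussian."""
    rng = rng or np.random.default_rng(0)
    new_emb = rng.normal(0.0, 0.02, size=(len(new_vocab), old_emb.shape[1]))
    for i, token in enumerate(new_vocab):
        if token in old_vocab:                      # lexically overlapping token
            new_emb[i] = old_emb[old_vocab[token]]
    return new_emb

old_vocab = {"the": 0, "##ing": 1, "[UNK]": 2}      # toy pretrained vocabulary
old_emb = np.random.randn(3, 8)
new_vocab = ["the", "##ing", "##lar", "ev"]         # toy target-language pieces
print(init_target_embeddings(new_vocab, old_vocab, old_emb).shape)   # (4, 8)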

31. XLM-T: Scaling up Multilingual Machine Translation with Pretrained Cross-lingual Transformer Encoders [PDF] Back to Contents
  Shuming Ma, Jian Yang, Haoyang Huang, Zewen Chi, Li Dong, Dongdong Zhang, Hany Hassan Awadalla, Alexandre Muzio, Akiko Eriguchi, Saksham Singhal, Xia Song, Arul Menezes, Furu Wei
Abstract: Multilingual machine translation enables a single model to translate between different languages. Most existing multilingual machine translation systems adopt a randomly initialized Transformer backbone. In this work, inspired by the recent success of language model pre-training, we present XLM-T, which initializes the model with an off-the-shelf pretrained cross-lingual Transformer encoder and fine-tunes it with multilingual parallel data. This simple method achieves significant improvements on a WMT dataset with 10 language pairs and the OPUS-100 corpus with 94 pairs. Surprisingly, the method is also effective even upon the strong baseline with back-translation. Moreover, extensive analysis of XLM-T on unsupervised syntactic parsing, word alignment, and multilingual classification explains its effectiveness for machine translation. The code will be at this https URL.
摘要:多语言机器翻译使单个模型可以在不同语言之间进行翻译。 大多数现有的多语言机器翻译系统都采用随机初始化的Transformer主干。 在这项工作中,受近期成功进行语言模型预训练的启发,我们介绍了XLM-T,它使用现成的预训练的跨语言Transformer编码器初始化模型,并使用多语言并行数据对其进行微调。 这种简单的方法对具有10种语言对的WMT数据集和具有94对语言的OPUS-100语料库进行了重大改进。 出乎意料的是,该方法即使在具有反向翻译的强大基线时也有效。 此外,对XLM-T的无监督句法分析,单词对齐和多语言分类的广泛分析说明了它对机器翻译的有效性。 该代码将位于此https URL。

32. HopRetriever: Retrieve Hops over Wikipedia to Answer Complex Questions [PDF] 返回目录
  Shaobo Li, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, Chengjie Sun, Zhenzhou Ji, Bingquan Liu
Abstract: Collecting supporting evidence from large corpora of text (e.g., Wikipedia) is of great challenge for open-domain Question Answering (QA). Especially, for multi-hop open-domain QA, scattered evidence pieces are required to be gathered together to support the answer extraction. In this paper, we propose a new retrieval target, hop, to collect the hidden reasoning evidence from Wikipedia for complex question answering. Specifically, the hop in this paper is defined as the combination of a hyperlink and the corresponding outbound link document. The hyperlink is encoded as the mention embedding which models the structured knowledge of how the outbound link entity is mentioned in the textual context, and the corresponding outbound link document is encoded as the document embedding representing the unstructured knowledge within it. Accordingly, we build HopRetriever which retrieves hops over Wikipedia to answer complex questions. Experiments on the HotpotQA dataset demonstrate that HopRetriever outperforms previously published evidence retrieval methods by large margins. Moreover, our approach also yields quantifiable interpretations of the evidence collection process.
摘要:从大型文本语料(例如Wikipedia)中收集支持证据对开放域问答(QA)来说是巨大的挑战。特别是对于多跳开放域问答,需要将分散的证据片段收集在一起以支持答案抽取。在本文中,我们提出了一个新的检索目标,即跳(hop),用于从Wikipedia收集隐藏的推理证据以回答复杂问题。具体来说,本文中的跳定义为超链接与相应出站链接文档的组合。超链接被编码为提及嵌入(mention embedding),用于建模出站链接实体在文本上下文中被提及方式的结构化知识;相应的出站链接文档则被编码为文档嵌入,表示其中的非结构化知识。据此,我们构建了HopRetriever,它通过在Wikipedia上检索跳来回答复杂问题。在HotpotQA数据集上的实验表明,HopRetriever大幅优于先前发表的证据检索方法。此外,我们的方法还能对证据收集过程给出可量化的解释。

33. BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining [PDF] 返回目录
  Weizhen Qi, Yeyun Gong, Jian Jiao, Yu Yan, Dayiheng Liu, Weizhu Chen, Kewen Tang, Houqiang Li, Jiusheng Chen, Ruofei Zhang, Ming Zhou, Nan Duan
Abstract: In this paper, we propose BANG, a new pretraining model to Bridge the gap between Autoregressive (AR) and Non-autoregressive (NAR) Generation. AR and NAR generation can be uniformly regarded as what extend of previous tokens can be attended to, and BANG bridges AR and NAR generation through designing a novel model structure for large-scale pre-training. A pretrained BANG model can simultaneously support AR, NAR, and semi-NAR generation to meet different requirements. Experiments on question generation (SQuAD 1.1), summarization (XSum), and dialogue (PersonaChat) show that BANG improves NAR and semi-NAR performance significantly as well as attaining comparable performance with strong AR pretrained models. Compared with the semi-NAR strong baselines, BANG achieves absolute improvements of 14.01 and 5.24 in overall scores of SQuAD and XSum, respectively. In addition, BANG achieves absolute improvements of 10.73, 6.39, and 5.90 in overall scores of SQuAD, XSUM, and PersonaChat compared with the NAR strong baselines, respectively. Our code will be made publicly available in the near future\footnote{this https URL}.
摘要:在本文中,我们提出了一种新的预训练模型BANG,以弥合自回归(AR)与非自回归(NAR)生成之间的差距。AR和NAR生成可以统一地看作在多大程度上可以关注之前的词元,BANG通过为大规模预训练设计一种新颖的模型结构来桥接AR与NAR生成。预训练的BANG模型可以同时支持AR、NAR和半NAR生成,以满足不同的需求。在问题生成(SQuAD 1.1)、摘要(XSum)和对话(PersonaChat)上的实验表明,BANG显著提升了NAR和半NAR的性能,并取得了与强AR预训练模型相当的性能。与强半NAR基线相比,BANG在SQuAD和XSum的总体得分上分别取得了14.01和5.24的绝对提升。此外,与强NAR基线相比,BANG在SQuAD、XSum和PersonaChat的总体得分上分别取得了10.73、6.39和5.90的绝对提升。我们的代码将在不久的将来公开(见此https URL)。

34. Linear-Time WordPiece Tokenization [PDF] 返回目录
  Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou
Abstract: WordPiece tokenization is a subword-based tokenization schema adopted by BERT: it segments the input text via a longest-match-first tokenization strategy, known as Maximum Matching or MaxMatch. To the best of our knowledge, all published MaxMatch algorithms are quadratic (or higher). In this paper, we propose LinMaxMatch, a novel linear-time algorithm for MaxMatch and WordPiece tokenization. Inspired by the Aho-Corasick algorithm, we introduce additional linkages on top of the trie built from the vocabulary, allowing smart transitions when the trie matching cannot continue. Experimental results show that our algorithm is 3x faster on average than two production systems by HuggingFace and TensorFlow Text. Regarding long-tail inputs, our algorithm is 4.5x faster at the 95 percentile. This work has immediate practical value (reducing inference latency, saving compute resources, etc.) and is of theoretical interest by providing an optimal complexity solution to the decades-old MaxMatch problem.
摘要:WordPiece分词是BERT采用的基于子词的分词方案:它通过最长匹配优先的分词策略(称为Maximum Matching或MaxMatch)对输入文本进行切分。据我们所知,所有已发表的MaxMatch算法都是二次(或更高)复杂度。在本文中,我们提出了LinMaxMatch,一种用于MaxMatch和WordPiece分词的新型线性时间算法。受Aho-Corasick算法的启发,我们在由词表构建的trie之上引入了额外的链接,使得在trie匹配无法继续时可以进行智能转移。实验结果表明,我们的算法平均比HuggingFace和TensorFlow Text的两个生产系统快3倍。对于长尾输入,我们的算法在第95百分位上快4.5倍。这项工作具有直接的实用价值(降低推理延迟、节省计算资源等),并且通过为存在数十年的MaxMatch问题提供最优复杂度的解决方案而具有理论意义。
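
The abstract describes MaxMatch (longest-match-first) tokenization; the sketch below implements that standard quadratic baseline in plain Python to make the scheme concrete. It is not the paper's linear-time LinMaxMatch algorithm, and the tiny vocabulary is hypothetical.

```python
def wordpiece_maxmatch(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first (MaxMatch) WordPiece tokenization.

    This is the quadratic baseline described in the abstract; LinMaxMatch
    replaces it with an Aho-Corasick-style trie to reach linear time.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:                 # try the longest span first
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand         # continuation-piece convention
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:                  # no match: whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "##ly", "aff"}
print(wordpiece_maxmatch("unaffable", vocab))   # ['un', '##aff', '##able']
```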

35. AraGPT2: Pre-Trained Transformer for Arabic Language Generation [PDF] 返回目录
  Wissam Antoun, Fady Baly, Hazem Hajj
Abstract: Recently, pretrained transformer-based architectures have proven to be very efficient at language modeling and understanding, given that they are trained on a large enough corpus. Applications in language generation for Arabic is still lagging in comparison to other NLP advances primarily due to the lack of advanced Arabic language generation models. In this paper, we develop the first advanced Arabic language generation model, AraGPT2, trained from scratch on large Arabic corpora of internet text and news articles. Our largest model, AraGPT2-mega, has 1.46 billion parameters, which makes it the largest Arabic language model available. We evaluate different size variants of AraGPT2 using the perplexity measure, where AraGPT2-mega achieves a perplexity of 29.8 on held-out articles from Wikipedia. Pretrained variants of AraGPT2 (base, medium, large, mega) are publicly available on this https URL hoping to encourage new research directions and applications for Arabic NLP.
摘要:最近的研究表明,预训练的基于Transformer的体系结构只要在足够大的语料库上训练,就能在语言建模和理解方面非常高效。与其他NLP进展相比,阿拉伯语的语言生成应用仍然落后,这主要是由于缺乏先进的阿拉伯语生成模型。在本文中,我们开发了第一个先进的阿拉伯语生成模型AraGPT2,它在由互联网文本和新闻文章组成的大型阿拉伯语语料库上从零开始训练。我们最大的模型AraGPT2-mega具有14.6亿个参数,是目前可用的最大阿拉伯语语言模型。我们使用困惑度指标评估了AraGPT2的不同规模变体,其中AraGPT2-mega在Wikipedia的留出文章上取得了29.8的困惑度。AraGPT2的预训练变体(base、medium、large、mega)已在此https URL上公开,希望能推动阿拉伯语NLP的新研究方向和应用。

36. AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding [PDF] 返回目录
  Wissam Antoun, Fady Baly, Hazem Hajj
Abstract: Advances in English language representation enabled a more sample-efficient pre-training task by Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA). Which, instead of training a model to recover masked tokens, it trains a discriminator model to distinguish true input tokens from corrupted tokens that were replaced by a generator network. On the other hand, current Arabic language representation approaches rely only on pretraining via masked language modeling. In this paper, we develop an Arabic language representation model, which we name AraELECTRA. Our model is pretrained using the replaced token detection objective on large Arabic text corpora. We evaluate our model on two Arabic reading comprehension tasks, and we show that AraELECTRA outperforms current state-of-the-art Arabic language representation models given the same pretraining data and with even a smaller model size.
摘要:英语语言表示方面的进展通过“高效学习能准确分类词元替换的编码器”(ELECTRA)实现了一种样本效率更高的预训练任务:它不是训练模型去恢复被掩码的词元,而是训练一个判别器模型来区分真实输入词元与被生成器网络替换的损坏词元。另一方面,目前的阿拉伯语表示方法仅依赖基于掩码语言建模的预训练。在本文中,我们开发了一种阿拉伯语表示模型,并将其命名为AraELECTRA。我们的模型在大型阿拉伯语文本语料库上使用替换词元检测目标进行预训练。我们在两个阿拉伯语阅读理解任务上评估了该模型,结果表明,在使用相同预训练数据且模型规模更小的情况下,AraELECTRA优于当前最先进的阿拉伯语表示模型。

37. Neural Machine Translation: A Review of Methods, Resources, and Tools [PDF] 返回目录
  Zhixing Tan, Shuo Wang, Zonghan Yang, Gang Chen, Xuancheng Huang, Maosong Sun, Yang Liu
Abstract: Machine translation (MT) is an important sub-field of natural language processing that aims to translate natural languages using computers. In recent years, end-to-end neural machine translation (NMT) has achieved great success and has become the new mainstream method in practical MT systems. In this article, we first provide a broad review of the methods for NMT and focus on methods relating to architectures, decoding, and data augmentation. Then we summarize the resources and tools that are useful for researchers. Finally, we conclude with a discussion of possible future research directions.
摘要:机器翻译(MT)是自然语言处理的重要子领域,旨在使用计算机翻译自然语言。 近年来,端到端神经机器翻译(NMT)取得了巨大的成功,并已成为实用MT系统中的新主流方法。 在本文中,我们首先对NMT的方法进行广泛的回顾,并重点介绍与体系结构,解码和数据增强有关的方法。 然后,我们总结了对研究人员有用的资源和工具。 最后,我们最后讨论了可能的未来研究方向。

38. Continual Learning in Task-Oriented Dialogue Systems [PDF] 返回目录
  Andrea Madotto, Zhaojiang Lin, Zhenpeng Zhou, Seungwhan Moon, Paul Crook, Bing Liu, Zhou Yu, Eunjoon Cho, Zhiguang Wang
Abstract: Continual learning in task-oriented dialogue systems can allow us to add new domains and functionalities through time without incurring the high cost of a whole system retraining. In this paper, we propose a continual learning benchmark for task-oriented dialogue systems with 37 domains to be learned continuously in four settings, such as intent recognition, state tracking, natural language generation, and end-to-end. Moreover, we implement and compare multiple existing continual learning baselines, and we propose a simple yet effective architectural method based on residual adapters. Our experiments demonstrate that the proposed architectural method and a simple replay-based strategy perform comparably well but they both achieve inferior performance to the multi-task learning baseline, in where all the data are shown at once, showing that continual learning in task-oriented dialogue systems is a challenging task. Furthermore, we reveal several trade-offs between different continual learning methods in term of parameter usage and memory size, which are important in the design of a task-oriented dialogue system. The proposed benchmark is released together with several baselines to promote more research in this direction.
摘要:在面向任务的对话系统中的持续学习可以使我们随着时间的推移增加新的领域和功能,而不会招致整个系统的高昂培训费用。在本文中,我们提出了面向任务的对话系统的持续学习基准,该对话系统具有37个域,可以在四种环境中进行连续学习,例如意图识别,状态跟踪,自然语言生成和端到端。此外,我们实现并比较了多个现有的持续学习基准,并提出了一种基于残差适配器的简单而有效的体系结构方法。我们的实验表明,所提出的体系结构方法和基于重播的简单策略在性能上可比,但是它们都比多任务学习基线的性能差,后者在一次显示所有数据的情况下,表明了面向任务的持续学习对话系统是一项艰巨的任务。此外,我们揭示了在参数使用和内存大小方面不同的持续学习方法之间的一些折衷,这在面向任务的对话系统的设计中很重要。拟议的基准与几个基准一起发布,以促进朝这个方向进行更多研究。

39. Towards Zero-Shot Knowledge Distillation for Natural Language Processing [PDF] 返回目录
  Ahmad Rashid, Vasileios Lioutas, Abbas Ghaddar, Mehdi Rezagholizadeh
Abstract: Knowledge Distillation (KD) is a common knowledge transfer algorithm used for model compression across a variety of deep learning based natural language processing (NLP) solutions. In its regular manifestations, KD requires access to the teacher's training data for knowledge transfer to the student network. However, privacy concerns, data regulations and proprietary reasons may prevent access to such data. We present, to the best of our knowledge, the first work on Zero-Shot Knowledge Distillation for NLP, where the student learns from the much larger teacher without any task specific data. Our solution combines out of domain data and adversarial training to learn the teacher's output distribution. We investigate six tasks from the GLUE benchmark and demonstrate that we can achieve between 75% and 92% of the teacher's classification score (accuracy or F1) while compressing the model 30 times.
摘要:知识蒸馏(KD)是一种常见的知识迁移算法,用于各种基于深度学习的自然语言处理(NLP)解决方案中的模型压缩。在其常规形式中,KD需要访问教师模型的训练数据,以便将知识迁移到学生网络。但是,隐私问题、数据法规和专有原因可能会阻止对此类数据的访问。据我们所知,我们提出了首个面向NLP的零样本知识蒸馏工作:学生在没有任何特定任务数据的情况下向规模大得多的教师学习。我们的方案结合域外数据和对抗训练来学习教师的输出分布。我们在GLUE基准的六个任务上进行了研究,结果表明在将模型压缩30倍的同时,可以达到教师分类得分(准确率或F1)的75%到92%。
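
As a rough illustration of the distillation component only, here is a temperature-scaled KL distillation loss that could be applied to teacher and student logits on unlabeled (e.g. out-of-domain or generated) inputs. The adversarial generator described in the abstract is omitted, and all names are placeholders rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence used to match the teacher's
    output distribution on unlabeled (e.g. out-of-domain) inputs."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, scaled by t^2 as in standard distillation
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)

# Toy usage: random logits stand in for teacher/student forward passes.
teacher_logits = torch.randn(4, 3)
student_logits = torch.randn(4, 3, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(float(loss))
```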

40. Seeing is Knowing! Fact-based Visual Question Answering using Knowledge Graph Embeddings [PDF] 返回目录
  Kiran Ramnath, Mark Hasegawa-Johnson
Abstract: Fact-based Visual Question Answering (FVQA), a challenging variant of VQA, requires a QA-system to include facts from a diverse knowledge graph (KG) in its reasoning process to produce an answer. Large KGs, especially common-sense KGs, are known to be incomplete, i.e. not all non-existent facts are always incorrect. Therefore, being able to reason over incomplete KGs for QA is a critical requirement in real-world applications that has not been addressed extensively in the literature. We develop a novel QA architecture that allows us to reason over incomplete KGs, something current FVQA state-of-the-art (SOTA) approaches lack.We use KG Embeddings, a technique widely used for KG completion, for the downstream task of FVQA. We also employ a new image representation technique we call "Image-as-Knowledge" to enable this capability, alongside a simple one-step co-Attention mechanism to attend to text and image during QA. Our FVQA architecture is faster during inference time, being O(m), as opposed to existing FVQA SOTA methods which are O(N logN), where m is number of vertices, N is number of edges (which is O(m^2)). We observe that our architecture performs comparably in the standard answer-retrieval baseline with existing methods; while for missing-edge reasoning, our KG representation outperforms the SOTA representation by 25%, and image representation outperforms the SOTA representation by 2.6%.
摘要:基于事实的视觉问答(FVQA)是VQA的一个具有挑战性的变体,它要求问答系统在推理过程中结合来自多样化知识图谱(KG)的事实来产生答案。大型KG(尤其是常识KG)是不完整的,即并非所有未出现在图谱中的事实都是错误的。因此,能够在不完整的KG上进行推理以回答问题,是现实应用中的一项关键需求,而文献中尚未对此进行充分研究。我们开发了一种新颖的问答架构,使我们能够在不完整的KG上进行推理,这是当前FVQA最先进(SOTA)方法所欠缺的。我们将广泛用于KG补全的KG嵌入技术用于FVQA这一下游任务,并采用一种称为“图像即知识”的新图像表示技术来实现这一能力,同时使用简单的单步共同注意机制在问答过程中同时关注文本和图像。我们的FVQA架构在推理时更快,复杂度为O(m),而现有FVQA SOTA方法为O(N logN),其中m为顶点数,N为边数(即O(m^2))。我们观察到,我们的架构在标准的答案检索基线上与现有方法表现相当;而在缺失边推理上,我们的KG表示比SOTA表示高出25%,图像表示比SOTA表示高出2.6%。

41. FiD-Ex: Improving Sequence-to-Sequence Models for Extractive Rationale Generation [PDF] 返回目录
  Kushal Lakhotia, Bhargavi Paranjape, Asish Ghoshal, Wen-tau Yih, Yashar Mehdad, Srinivasan Iyer
Abstract: Natural language (NL) explanations of model predictions are gaining popularity as a means to understand and verify decisions made by large black-box pre-trained models, for NLP tasks such as Question Answering (QA) and Fact Verification. Recently, pre-trained sequence to sequence (seq2seq) models have proven to be very effective in jointly making predictions, as well as generating NL explanations. However, these models have many shortcomings; they can fabricate explanations even for incorrect predictions, they are difficult to adapt to long input documents, and their training requires a large amount of labeled data. In this paper, we develop FiD-Ex, which addresses these shortcomings for seq2seq models by: 1) introducing sentence markers to eliminate explanation fabrication by encouraging extractive generation, 2) using the fusion-in-decoder architecture to handle long input contexts, and 3) intermediate fine-tuning on re-structured open domain QA datasets to improve few-shot performance. FiD-Ex significantly improves over prior work in terms of explanation metrics and task accuracy, on multiple tasks from the ERASER explainability benchmark, both in the fully supervised and in the few-shot settings.
摘要:模型预测的自然语言(NL)解释正日益流行,用于理解和验证大型黑盒预训练模型在问答(QA)和事实核查等NLP任务上做出的决策。最近,预训练的序列到序列(seq2seq)模型已被证明在联合进行预测和生成NL解释方面非常有效。但是,这些模型有许多缺点:它们甚至会为错误的预测编造解释,难以适应较长的输入文档,并且训练需要大量标注数据。在本文中,我们开发了FiD-Ex,它通过以下方式解决seq2seq模型的这些缺点:1)引入句子标记,通过鼓励抽取式生成来消除解释的编造;2)使用fusion-in-decoder架构来处理长输入上下文;3)在重构的开放域QA数据集上进行中间微调,以提高小样本性能。在全监督和小样本设置下,FiD-Ex在ERASER可解释性基准的多个任务上,在解释指标和任务准确率方面都显著优于先前工作。

42. CLEAR: Contrastive Learning for Sentence Representation [PDF] 返回目录
  Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, Hao Ma
Abstract: Pre-trained language models have proven their unique powers in capturing implicit language features. However, most pre-training approaches focus on the word-level training objective, while sentence-level objectives are rarely studied. In this paper, we propose Contrastive LEArning for sentence Representation (CLEAR), which employs multiple sentence-level augmentation strategies in order to learn a noise-invariant sentence representation. These augmentations include word and span deletion, reordering, and substitution. Furthermore, we investigate the key reasons that make contrastive learning effective through numerous experiments. We observe that different sentence augmentations during pre-training lead to different performance improvements on various downstream tasks. Our approach is shown to outperform multiple existing methods on both SentEval and GLUE benchmarks.
摘要:经过预训练的语言模型已经证明了其在捕获隐式语言特征方面的独特能力。 但是,大多数预训练方法都集中在单词级训练目标上,而句子级目标却很少被研究。 在本文中,我们提出了一种针对句子表示的对比学习方法(CLEAR),该方法采用多种句子级别的增强策略来学习噪声不变的句子表示。 这些扩充包括单词和跨度删除,重新排序和替换。 此外,我们研究了通过大量实验使对比学习有效的关键原因。 我们观察到,在预训练过程中,不同的句子扩充会导致在各种下游任务上的不同性能提升。 结果表明,我们的方法在SentEval和GLUE基准测试中均胜过多种现有方法。

43. Exploring Monolingual Data for Neural Machine Translation with Knowledge Distillation [PDF] 返回目录
  Alham Fikri Aji, Kenneth Heafield
Abstract: We explore two types of monolingual data that can be included in knowledge distillation training for neural machine translation (NMT). The first is the source-side monolingual data. Second, is the target-side monolingual data that is used as back-translation data. Both datasets are (forward-)translated by a teacher model from source-language to target-language, which are then combined into a dataset for smaller student models. We find that source-side monolingual data improves model performance when evaluated by test-set originated from source-side. Likewise, target-side data has a positive effect on the test-set in the opposite direction. We also show that it is not required to train the student model with the same data used by the teacher, as long as the domains are the same. Finally, we find that combining source-side and target-side yields in better performance than relying on just one side of the monolingual data.
摘要:我们探讨了两种可以包含在神经机器翻译(NMT)的知识蒸馏训练中的单语数据。 首先是源方的单语数据。 其次,是用作反向翻译数据的目标方单语数据。 教师模型将这两个数据集(正向)从源语言转换为目标语言,然后合并为较小的学生模型的数据集。 我们发现,当源于源端的测试集进行评估时,源端的单语数据可以提高模型性能。 同样,目标侧数据会在相反方向对测试集产生积极影响。 我们还表明,只要域相同,就不需要使用教师使用的相同数据来训练学生模型。 最后,我们发现结合源端和目标端的产出比仅依赖单语数据的一侧具有更好的性能。

44. Text-Free Image-to-Speech Synthesis Using Learned Segmental Units [PDF] 返回目录
  Wei-Ning Hsu, David Harwath, Christopher Song, James Glass
Abstract: In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations, and empirically find that the representation must satisfy several important properties to serve as drop-in replacements for text.
摘要:在本文中,我们提出了第一个模型,该模型可以直接合成不需要自然语言文本作为中间表示或监督来源的图像的流利,自然听起来的语音字幕。 取而代之的是,我们将图像字幕模块和语音合成模块与一组离散的,子单词的语音单元相连接,这些单元是通过自我监督的视觉基础任务发现的。 除了针对流行的MSCOCO数据集收集的新颖的语音字幕集之外,我们还对Flickr8k语音字幕集进行了实验,这表明我们生成的字幕还捕获了所描述图像的多种视觉语义。 我们研究了几种不同的中间语音表示形式,并根据经验发现,该表示形式必须满足几个重要的属性才能用作文本的直接替换。

45. The jsRealB Text Realizer: Organization and Use Cases [PDF] 返回目录
  Guy Lapalme
Abstract: This paper describes the design principles behind jsRealB, a surface realizer written in JavaScript for English or French sentences from a specification inspired by the constituent syntax formalism. It can be used either within a web page or as a node .js module. We show that the seemingly simple process of text realization involves many interesting implementation challenges in order to take into account the specifics of each language. jsRealB has a large coverage of English and French and has been used to develop realistic data-to-text applications and to reproduce existing literary texts and sentences with Universal Dependency annotations. Its source code and that of its applications are available on GitHub.
摘要:本文介绍了jsRealB背后的设计原则。jsRealB是一个用JavaScript编写的表面实现器(surface realizer),可根据受成分句法形式化启发的规约生成英语或法语句子。它既可以在网页中使用,也可以作为node.js模块使用。我们展示了看似简单的文本实现过程为了顾及每种语言的细节而涉及许多有趣的实现挑战。jsRealB对英语和法语有很大的覆盖范围,已被用于开发真实的数据到文本应用,以及复现带有Universal Dependency标注的现有文学文本和句子。它及其应用程序的源代码可在GitHub上获得。

46. Verb Knowledge Injection for Multilingual Event Processing [PDF] 返回目录
  Olga Majewska, Ivan Vulić, Goran Glavaš, Edoardo M. Ponti, Anna Korhonen
Abstract: In parallel to their overwhelming success across NLP tasks, language ability of deep Transformer networks, pretrained via language modeling (LM) objectives has undergone extensive scrutiny. While probing revealed that these models encode a range of syntactic and semantic properties of a language, they are still prone to fall back on superficial cues and simple heuristics to solve downstream tasks, rather than leverage deeper linguistic knowledge. In this paper, we target one such area of their deficiency, verbal reasoning. We investigate whether injecting explicit information on verbs' semantic-syntactic behaviour improves the performance of LM-pretrained Transformers in event extraction tasks -- downstream tasks for which accurate verb processing is paramount. Concretely, we impart the verb knowledge from curated lexical resources into dedicated adapter modules (dubbed verb adapters), allowing it to complement, in downstream tasks, the language knowledge obtained during LM-pretraining. We first demonstrate that injecting verb knowledge leads to performance gains in English event extraction. We then explore the utility of verb adapters for event extraction in other languages: we investigate (1) zero-shot language transfer with multilingual Transformers as well as (2) transfer via (noisy automatic) translation of English verb-based lexical constraints. Our results show that the benefits of verb knowledge injection indeed extend to other languages, even when verb adapters are trained on noisily translated constraints.
摘要:在各类NLP任务上取得压倒性成功的同时,通过语言建模(LM)目标预训练的深层Transformer网络的语言能力也受到了广泛审视。探测实验表明这些模型编码了语言的一系列句法和语义属性,但在解决下游任务时,它们仍然倾向于退回到表面线索和简单启发式,而不是利用更深层的语言知识。在本文中,我们针对其不足之处之一:动词推理。我们研究了注入关于动词语义-句法行为的显式信息,是否能提升经LM预训练的Transformer在事件抽取任务中的性能;在这类下游任务中,准确的动词处理至关重要。具体来说,我们将精选词汇资源中的动词知识注入专用的适配器模块(称为动词适配器),使其在下游任务中补充LM预训练期间获得的语言知识。我们首先证明,注入动词知识可以提升英语事件抽取的性能。然后,我们探索动词适配器在其他语言事件抽取中的效用:我们研究了(1)使用多语言Transformer的零样本语言迁移,以及(2)通过对基于英语动词的词汇约束进行(有噪声的自动)翻译来迁移。我们的结果表明,动词知识注入的好处确实可以扩展到其他语言,即使动词适配器是在经过噪声翻译的约束上训练的也是如此。

47. An Experimental Evaluation of Transformer-based Language Models in the Biomedical Domain [PDF] 返回目录
  Paul Grouchy, Shobhit Jain, Michael Liu, Kuhan Wang, Max Tian, Nidhi Arora, Hillary Ngai, Faiza Khan Khattak, Elham Dolatabadi, Sedef Akinli Kocak
Abstract: With the growing amount of text in health data, there have been rapid advances in large pre-trained models that can be applied to a wide variety of biomedical tasks with minimal task-specific modifications. Emphasizing the cost of these models, which renders technical replication challenging, this paper summarizes experiments conducted in replicating BioBERT and further pre-training and careful fine-tuning in the biomedical domain. We also investigate the effectiveness of domain-specific and domain-agnostic pre-trained models across downstream biomedical NLP tasks. Our finding confirms that pre-trained models can be impactful in some downstream NLP tasks (QA and NER) in the biomedical domain; however, this improvement may not justify the high cost of domain-specific pre-training.
摘要:随着健康数据中文本数量的增加,大型预训练模型有了快速的发展,这些模型可以以最少的任务特定修改应用于各种生物医学任务。 强调了这些模型的成本,这给技术复制带来了挑战,本文总结了复制BioBERT以及在生物医学领域进行进一步的预训练和仔细微调的实验。 我们还研究了跨下游生物医学NLP任务的领域特定和领域不可知的预训练模型的有效性。 我们的发现证实了预训练模型在生物医学领域的某些下游NLP任务(QA和NER)中可能会产生影响。 但是,这种改进可能无法证明特定于域的预培训的高成本。

48. Directed Beam Search: Plug-and-Play Lexically Constrained Language Generation [PDF] 返回目录
  Damian Pascual, Beni Egressy, Florian Bolli, Roger Wattenhofer
Abstract: Large pre-trained language models are capable of generating realistic text. However, controlling these models so that the generated text satisfies lexical constraints, i.e., contains specific words, is a challenging problem. Given that state-of-the-art language models are too large to be trained from scratch in a manageable time, it is desirable to control these models without re-training them. Methods capable of doing this are called plug-and-play. Recent plug-and-play methods have been successful in constraining small bidirectional language models as well as forward models in tasks with a restricted search space, e.g., machine translation. However, controlling large transformer-based models to meet lexical constraints without re-training them remains a challenge. In this work, we propose Directed Beam Search (DBS), a plug-and-play method for lexically constrained language generation. Our method can be applied to any language model, is easy to implement and can be used for general language generation. In our experiments we use DBS to control GPT-2. We demonstrate its performance on keyword-to-phrase generation and we obtain comparable results as a state-of-the-art non-plug-and-play model for lexically constrained story generation.
摘要:大型的预训练语言模型能够生成逼真的文本。但是,控制这些模型以使生成的文本满足词汇约束,即包含特定的单词,是一个具有挑战性的问题。鉴于最新的语言模型太大,无法在可管理的时间内从头开始进行训练,因此希望在不重新训练的情况下控制这些模型。能够做到这一点的方法称为即插即用。最近的即插即用方法已经成功地将小型双向语言模型以及正向模型约束在搜索空间受限的任务中,例如机器翻译。然而,控制大型的基于变压器的模型来满足词法约束而不重新训练它们仍然是一个挑战。在这项工作中,我们提出了定向波束搜索(DBS),这是一种用于词法约束语言生成的即插即用方法。我们的方法可以应用于任何语言模型,易于实现,并且可以用于通用语言生成。在我们的实验中,我们使用DBS来控制GPT-2。我们证明了其在关键字到短语生成上的性能,并且获得了可比的结果,这是用于词法约束故事生成的最新非即插即用模型。
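
The abstract does not spell out how DBS scores hypotheses, so the toy below only illustrates the general plug-and-play idea: re-rank beam candidates from an unmodified language model by adding a bonus for satisfied lexical constraints. The scoring rule and numbers are invented for illustration and are not the authors' method.

```python
def rescore_with_constraints(hypotheses, constraints, bonus=2.0):
    """Toy plug-and-play rescoring for lexically constrained decoding.

    `hypotheses` is a list of (tokens, log_prob) beam candidates from any
    language model. Each required keyword found in a candidate adds a fixed
    bonus to its score, steering the beam toward the constraints without
    retraining the model.
    """
    rescored = []
    for tokens, logp in hypotheses:
        hits = sum(1 for w in constraints if w in tokens)
        rescored.append((tokens, logp + bonus * hits))
    rescored.sort(key=lambda x: x[1], reverse=True)
    return rescored

beam = [
    (["the", "cat", "sat"], -3.1),
    (["a", "dog", "barked", "loudly"], -4.0),
    (["the", "dog", "sat"], -3.5),
]
print(rescore_with_constraints(beam, constraints={"dog", "loudly"}))
```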

49. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning [PDF] 返回目录
  Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang
Abstract: Existed pre-training methods either focus on single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e. text or image) or limited multi-modal data (i.e. image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large scale of free text corpus and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs. As the non-paired single-modal data is very rich, our model can utilize much larger scale of data to learn more generalizable representations. Moreover, the textual knowledge and visual knowledge can enhance each other in the unified semantic space. The experimental results show that UNIMO significantly improves the performance of several single-modal and multi-modal downstream tasks.
摘要:现有的预训练方法要么专注于单模态任务,要么专注于多模态任务,无法有效地相互适应。他们只能使用单模式数据(即文本或图像)或有限的多模式数据(即图像-文本对)。在这项工作中,我们提出了一种统一模式的预训练架构,即UNIMO,它可以有效地适应单模式和多模式的理解和生成任务。可以利用大规模的自由文本语料库和图像集合来提高视觉和文本理解的能力,并且可以利用跨模式对比学习(CMCL)将文本和视觉信息对准图像语料库上的统一语义空间-文本对。由于非配对单模态数据非常丰富,因此我们的模型可以利用更大范围的数据来学习更通用的表示形式。此外,文本知识和视觉知识可以在统一的语义空间中相互增强。实验结果表明,UNIMO大大提高了一些单模式和多模式下游任务的性能。

50. Refine and Imitate: Reducing Repetition and Inconsistency in Persuasion Dialogues via Reinforcement Learning and Human Demonstration [PDF] 返回目录
  Weiyan Shi, Yu Li, Saurav Sahay, Zhou Yu
Abstract: Despite the recent success of large-scale language models on various downstream NLP tasks, the repetition and inconsistency problems still persist in dialogue response generation. Previous approaches have attempted to avoid repetition by penalizing the language model's undesirable behaviors in the loss function. However, these methods focus on token-level information and can lead to incoherent responses and uninterpretable behaviors. To alleviate these issues, we propose to apply reinforcement learning to refine an MLE-based language model without user simulators, and distill sentence-level information about repetition, inconsistency and task relevance through rewards. In addition, to better accomplish the dialogue task, the model learns from human demonstration to imitate intellectual activities such as persuasion, and selects the most persuasive responses. Experiments show that our model outperforms previous state-of-the-art dialogue models on both automatic metrics and human evaluation results on a donation persuasion task, and generates more diverse, consistent and persuasive conversations according to the user feedback.
摘要:尽管最近在各种下游NLP任务上成功使用了大型语言模型,但是在对话响应生成中仍然存在重复和不一致的问题。先前的方法已经尝试通过惩罚语言模型在损失函数中的不良行为来避免重复。但是,这些方法侧重于令牌级别的信息,并且可能导致不连贯的响应和无法解释的行为。为了缓解这些问题,我们建议应用强化学习来完善基于MLE的语言模型,而无需用户模拟器,并通过奖励提取有关重复,不一致和任务相关性的句子级信息。此外,为了更好地完成对话任务,该模型从人类示范中学习以模仿诸如劝说之类的智力活动,并选择最具说服力的回应。实验表明,在捐赠说服任务上,我们的模型在自动指标和人工评估结果方面都优于以前的最新对话模型,并根据用户反馈生成了更多样,更一致和更有说服力的对话。

51. Optimizing Deeper Transformers on Small Datasets: An Application on Text-to-SQL Semantic Parsing [PDF] 返回目录
  Peng Xu, Wei Yang, Wenjie Zi, Keyi Tang, Chengyang Huang, Jackie Chi Kit Cheung, Yanshuai Cao
Abstract: Due to the common belief that training deep transformers from scratch requires large datasets, people usually only use shallow and simple additional layers on top of pre-trained models during fine-tuning on small datasets. We provide evidence that this does not always need to be the case: with proper initialization and training techniques, the benefits of very deep transformers are shown to carry over to hard structural prediction tasks, even using small datasets. In particular, we successfully train 48 layers of transformers for a semantic parsing task. These comprise 24 fine-tuned transformer layers from pre-trained RoBERTa and 24 relation-aware transformer layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain the state of the art performance on the challenging cross-domain Text-to-SQL semantic parsing benchmark Spider. We achieve this by deriving a novel Data dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work. Further error analysis demonstrates that increasing the depth of the transformer model can help improve generalization on the cases requiring reasoning and structural understanding.
摘要:由于人们普遍认为从零开始训练深层变压器需要大量数据集,因此在对小型数据集进行微调时,人们通常仅在预训练模型之上使用浅层和简单的附加层。我们提供的证据表明,并不一定总是这样:通过适当的初始化和训练技术,即使使用较小的数据集,非常深的变换器的好处也可以延续到艰苦的结构预测任务上。特别是,我们成功地训练了48个层的转换器进行语义解析任务。它们包括来自预训练的RoBERTa的24个微调的变压器层和从头开始训练的24个关系感知的变压器层。借助更少的培训步骤,并且没有特定于任务的预培训,我们就可以在具有挑战性的跨域Text-to-SQL语义解析基准Spider上获得最新的性能。我们通过从以前的T-Fixup工作中获得灵感,得出一种新颖的数据相关变压器固定更新初始化方案(DT-Fixup),来实现这一目标。进一步的误差分析表明,增加变压器模型的深度可以帮助改进需要推理和结构理解的案例的概括性。

52. Deriving Contextualised Semantic Features from BERT (and Other Transformer Model) Embeddings [PDF] 返回目录
  Jacob Turton, David Vinson, Robert Elliott Smith
Abstract: Models based on the transformer architecture, such as BERT, have marked a crucial step forward in the field of Natural Language Processing. Importantly, they allow the creation of word embeddings that capture important semantic information about words in context. However, as single entities, these embeddings are difficult to interpret and the models used to create them have been described as opaque. Binder and colleagues proposed an intuitive embedding space where each dimension is based on one of 65 core semantic features. Unfortunately, the space only exists for a small dataset of 535 words, limiting its uses. Previous work (Utsumi, 2018, 2020, Turton, Vinson & Smith, 2020) has shown that Binder features can be derived from static embeddings and successfully extrapolated to a large new vocabulary. Taking the next step, this paper demonstrates that Binder features can be derived from the BERT embedding space. This provides contextualised Binder embeddings, which can aid in understanding semantic differences between words in context. It additionally provides insights into how semantic features are represented across the different layers of the BERT model.
摘要:基于转换器架构的模型(例如BERT)标志着自然语言处理领域的重要一步。重要的是,它们允许创建单词嵌入,以捕获有关上下文中单词的重要语义信息。但是,作为单个实体,这些嵌入难以解释,并且用于创建它们的模型已描述为不透明的。 Binder和同事提出了一个直观的嵌入空间,其中每个维度均基于65个核心语义特征之一。不幸的是,该空间仅存在于535个单词的小型数据集中,限制了其使用。以前的工作(Utsumi,2018,2020,Turton,Vinson&Smith,2020)显示,活页夹特征可以从静态嵌入中派生出来,并成功地推断出大量的新词汇。下一步,本文演示了可以从BERT嵌入空间派生出Binder特征。这提供了上下文相关的Binder嵌入,可以帮助理解上下文中单词之间的语义差异。此外,它还提供了有关如何在BERT模型的不同层上表示语义特征的见解。

53. DynaSent: A Dynamic Benchmark for Sentiment Analysis [PDF] 返回目录
  Christopher Potts, Zhengxuan Wu, Atticus Geiger, Douwe Kiela
Abstract: We introduce DynaSent ('Dynamic Sentiment'), a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis. DynaSent combines naturally occurring sentences with sentences created using the open-source Dynabench Platform, which facilities human-and-model-in-the-loop dataset creation. DynaSent has a total of 121,634 sentences, each validated by five crowdworkers, and its development and test splits are designed to produce chance performance for even the best models we have been able to develop; when future models solve this task, we will use them to create DynaSent version 2, continuing the dynamic evolution of this benchmark. Here, we report on the dataset creation effort, focusing on the steps we took to increase quality and reduce artifacts. We also present evidence that DynaSent's Neutral category is more coherent than the comparable category in other benchmarks, and we motivate training models from scratch for each round over successive fine-tuning.
摘要:我们介绍了DynaSent(“动态情感”),这是一种用于三元(正/负/中性)情感分析的新英语基准任务。 DynaSent将自然发生的句子与使用开源Dynabench平台创建的句子相结合,该平台可简化人与模型在环数据集的创建。 DynaSent总共有121,634个句子,每个句子都由五名群众工作人员进行了验证,其开发和测试拆分旨在甚至为我们已经能够开发的最佳模型提供机会表现。当将来的模型解决此任务时,我们将使用它们来创建DynaSent版本2,从而继续该基准的动态演变。在这里,我们报告了数据集的创建工作,重点介绍了为提高质量和减少工件而采取的步骤。我们还提供了证据,表明DynaSent的“中性”类别比其他基准中的同类类别更连贯,并且在连续的微调中,我们从头开始针对每轮训练模型。

54. kōan: A Corrected CBOW Implementation [PDF] 返回目录
  Ozan İrsoy, Adrian Benton, Karl Stratos
Abstract: It is a common belief in the NLP community that continuous bag-of-words (CBOW) word embeddings tend to underperform skip-gram (SG) embeddings. We find that this belief is founded less on theoretical differences in their training objectives but more on faulty CBOW implementations in standard software libraries such as the official implementation word2vec.c and Gensim. We show that our correct implementation of CBOW yields word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks while being more than three times as fast to train. We release our implementation, kōan, at this https URL.
摘要:NLP社区中普遍认为,连续词袋(CBOW)词嵌入的表现往往不如跳字(SG)嵌入。我们发现,这种看法与其说源于两者训练目标上的理论差异,不如说源于标准软件库(例如官方实现word2vec.c和Gensim)中有缺陷的CBOW实现。我们证明,正确实现的CBOW所产生的词嵌入在各种内在和外在任务上与SG完全可比,而训练速度是其三倍以上。我们在此https URL发布了我们的实现kōan。
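
For readers unfamiliar with the objective under discussion, here is a minimal CBOW negative-sampling update under the averaged-context formulation; the shared gradient sent back to the context vectors is scaled accordingly. This is an illustrative sketch with made-up sizes, not the kōan implementation, and the specific bug the paper identifies is not claimed here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_neg_sampling_step(W_in, W_out, context_ids, center_id, neg_ids, lr=0.05):
    """One CBOW update with negative sampling.

    The hidden vector is the *average* of the context input vectors, so the
    gradient sent back to each context vector is scaled by 1/len(context_ids).
    """
    h = W_in[context_ids].mean(axis=0)                 # averaged context
    grad_h = np.zeros_like(h)
    for wid, label in [(center_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
        score = sigmoid(W_out[wid] @ h)
        g = score - label                              # d(loss)/d(score input)
        grad_h += g * W_out[wid]
        W_out[wid] -= lr * g * h                       # update output vector
    W_in[context_ids] -= lr * grad_h / len(context_ids)  # shared, scaled gradient
    return W_in, W_out

rng = np.random.default_rng(0)
W_in, W_out = rng.normal(0, 0.1, (10, 4)), rng.normal(0, 0.1, (10, 4))
cbow_neg_sampling_step(W_in, W_out, context_ids=[1, 2, 4], center_id=3, neg_ids=[7, 9])
```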

55. Generating Landmark Navigation Instructions from Maps as a Graph-to-Text Problem [PDF] 返回目录
  Raphael Schumann, Stefan Riezler
Abstract: Car-focused navigation services are based on turns and distances of named streets, whereas navigation instructions naturally used by humans are centered around physical objects called landmarks. We present a neural model that takes OpenStreetMap representations as input and learns to generate navigation instructions that contain visible and salient landmarks from human natural language instructions. Routes on the map are encoded in a location- and rotation-invariant graph representation that is decoded into natural language instructions. Our work is based on a novel dataset of 7,672 crowd-sourced instances that have been verified by human navigation in Street View. Our evaluation shows that the navigation instructions generated by our system have similar properties as human-generated instructions, and lead to successful human navigation in Street View.
摘要:面向汽车的导航服务基于命名街道的转弯和距离,而人类自然使用的导航指令则以称为地标的实体对象为中心。我们提出了一个神经模型,它以OpenStreetMap表示作为输入,学习从人类自然语言指令中生成包含可见且显著地标的导航指令。地图上的路线以对位置和旋转不变的图表示进行编码,然后解码为自然语言指令。我们的工作基于一个包含7,672个众包实例的新数据集,这些实例已通过街景(Street View)中的人工导航进行了验证。我们的评估表明,我们系统生成的导航指令与人工生成的指令具有相似的属性,并能引导人们在街景中成功导航。

56. DEER: A Data Efficient Language Model for Event Temporal Reasoning [PDF] 返回目录
  Rujun Han, Xiang Ren, Nanyun Peng
Abstract: Pretrained language models (LMs) such as BERT, RoBERTa, and ELECTRA are effective at improving the performances of a variety of downstream NLP tasks. Recently, researchers have incorporated domain and task-specific knowledge in these LMs' training objectives and further enhanced models' capability of handling downstream tasks. However, none of these LMs are designed specifically for event temporal reasoning. We propose DEER, a language model that is trained to focus on event temporal relations and performs better under low-resource settings than original LMs. More specifically, we create a large number of training samples to simulate the machine reading comprehension and information extraction tasks for event temporal understanding and leverage a generator-discriminator structure to reinforce the LMs' capability of event temporal reasoning. Our experimental results show that DEER can achieve SOTA results and works particularly well in low-resource settings across 5 widely used datasets.
摘要:诸如BERT,RoBERTa和ELECTRA之类的预训练语言模型(LM)可有效提高各种下游NLP任务的性能。最近,研究人员将领域和特定于任务的知识纳入了这些LM的训练目标,并进一步增强了模型处理下游任务的能力。但是,这些LM都不是专门为事件时间推理设计的。我们提出DEER,这是一种语言模型,经过训练可专注于事件时间关系,并且在低资源设置下比原始LM表现更好。更具体地说,我们创建了大量的训练样本来模拟机器阅读理解和信息提取任务以实现事件时态理解,并利用生成器-判别器结构来增强LM的事件时态推理能力。我们的实验结果表明,DEER可以取得SOTA结果,并且在5个广泛使用的数据集的低资源设置中特别有效。

57. Predicting cross-linguistic adjective order with information gain [PDF] 返回目录
  William Dyer, Richard Futrell, Zoey Liu, Gregory Scontras
Abstract: Languages vary in their placement of multiple adjectives before, after, or surrounding the noun, but they typically exhibit strong intra-language tendencies on the relative order of those adjectives (e.g., the preference for `big blue box' in English, `grande boîte bleue' in French, and `alsundūq al'azraq alkab\=ır' in Arabic). We advance a new quantitative account of adjective order across typologically-distinct languages based on maximizing information gain. Our model addresses the left-right asymmetry of French-type ANA sequences with the same approach as AAN and NAA orderings, without appeal to other mechanisms. We find that, across 32 languages, the preferred order of adjectives largely mirrors an efficient algorithm of maximizing information gain.
摘要:不同语言将多个形容词置于名词之前、之后或两侧的方式各不相同,但在这些形容词的相对顺序上,各语言内部通常表现出强烈的倾向性(例如,英语偏好“big blue box”,法语偏好“grande boîte bleue”,阿拉伯语偏好“alsundūq al'azraq alkabīr”)。我们基于最大化信息增益,为类型学上不同语言中的形容词顺序提出了一种新的定量解释。我们的模型用与AAN和NAA顺序相同的方法处理法语式ANA序列的左右不对称性,而无需借助其他机制。我们发现,在32种语言中,形容词的偏好顺序在很大程度上符合最大化信息增益的高效算法。
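
A small worked example of the quantity involved: the information gain (mutual information) about the upcoming noun from reading the adjective in a given position, computed over a hypothetical mini-corpus. The paper's actual corpora and estimation details are not reproduced here.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def information_gain(pairs, position):
    """Expected reduction in entropy over the noun after reading the
    adjective at `position` (0 = first, 1 = second) in each phrase."""
    nouns = Counter(noun for _, noun in pairs)
    by_adj = defaultdict(Counter)
    for adjs, noun in pairs:
        by_adj[adjs[position]][noun] += 1
    total = len(pairs)
    cond = sum(sum(c.values()) / total * entropy(c) for c in by_adj.values())
    return entropy(nouns) - cond

# Hypothetical mini-corpus of (adjective sequence, noun) phrases.
corpus = [
    (("big", "blue"), "box"), (("big", "blue"), "car"),
    (("big", "red"), "box"), (("small", "blue"), "cup"),
]
print("gain from 1st adjective:", round(information_gain(corpus, 0), 3))
print("gain from 2nd adjective:", round(information_gain(corpus, 1), 3))
```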

58. Robustness Testing of Language Understanding in Dialog Systems [PDF] 返回目录
  Jiexi Liu, Ryuichi Takanobu, Jiaxin Wen, Dazhen Wan, Weiran Nie, Hongyan Li, Cheng Li, Wei Peng, Minlie Huang
Abstract: Most language understanding models in dialog systems are trained on a small amount of annotated training data, and evaluated in a small set from the same distribution. However, these models can lead to system failure or undesirable outputs when being exposed to natural perturbation in practice. In this paper, we conduct comprehensive evaluation and analysis with respect to the robustness of natural language understanding models, and introduce three important aspects related to language understanding in real-world dialog systems, namely, language variety, speech characteristics, and noise perturbation. We propose a model-agnostic toolkit LAUG to approximate natural perturbation for testing the robustness issues in dialog systems. Four data augmentation approaches covering the three aspects are assembled in LAUG, which reveals critical robustness issues in state-of-the-art models. The augmented dataset through LAUG can be used to facilitate future research on the robustness testing of language understanding in dialog systems.
摘要:对话系统中的大多数语言理解模型都是在少量带注释的训练数据上进行训练的,并从相同的分布中进行少量评估。但是,这些模型在实践中暴露于自然扰动时会导致系统故障或不良输出。在本文中,我们对自然语言理解模型的鲁棒性进行了全面的评估和分析,并介绍了在现实对话系统中与语言理解有关的三个重要方面,即语言多样性,语音特征和噪声扰动。我们提出了一个与模型无关的工具包LAUG,以近似自然扰动来测试对话框系统中的鲁棒性问题。在LAUG中组装了涵盖这三个方面的四种数据增强方法,这些方法揭示了最新模型中的关键鲁棒性问题。通过LAUG扩充的数据集可用于促进未来对对话系统中语言理解的鲁棒性测试的研究。

59. Unsupervised Label-aware Event Trigger and Argument Classification [PDF] 返回目录
  Hongming Zhang, Haoyu Wang, Dan Roth
Abstract: Identifying events and mapping them to pre-defined event types has long been an important natural language processing problem. Most previous work has been heavily relying on labor-intensive and domain-specific annotations while ignoring the semantic meaning contained in the labels of the event types. As a result, the learned models cannot effectively generalize to new domains, where new event types could be introduced. In this paper, we propose an unsupervised event extraction pipeline, which first identifies events with available tools (e.g., SRL) and then automatically maps them to pre-defined event types with our proposed unsupervised classification model. Rather than relying on annotated data, our model matches the semantics of identified events with those of event type labels. Specifically, we leverage pre-trained language models to contextually represent pre-defined types for both event triggers and arguments. After we map identified events to the target types via representation similarity, we use the event ontology (e.g., argument type "Victim" can only appear as the argument of event type "Attack") as global constraints to regularize the prediction. The proposed approach is shown to be very effective when tested on the ACE-2005 dataset, which has 33 trigger and 22 argument types. Without using any annotation, we successfully map 83% of the triggers and 54% of the arguments to the correct types, almost doubling the performance of previous zero-shot approaches.
摘要:识别事件并将其映射到预定义的事件类型长期以来一直是重要的自然语言处理问题。先前的大多数工作都在很大程度上依赖于劳动密集型和特定于域的注释,而忽略了事件类型的标签中包含的语义。结果,学习的模型无法有效地推广到可以引入新事件类型的新域。在本文中,我们提出了一种无监督事件提取管道,该管道首先使用可用工具(例如SRL)识别事件,然后使用我们提出的无监督分类模型将其自动映射到预定义的事件类型。我们的模型不是依赖注释的数据,而是将已识别事件的语义与事件类型标签的语义进行匹配。具体来说,我们利用预先训练的语言模型来上下文地表示事件触发器和参数的预定义类型。在通过表示相似性将识别的事件映射到目标类型之后,我们使用事件本体(例如,参数类型“受害者”只能作为事件类型“攻击”的参数出现)作为全局约束来规范化预测。当在ACE-2005数据集(包含33个触发器和22个参数类型)上进行测试时,该方法被证明是非常有效的。在不使用任何注释的情况下,我们成功地将83%的触发器和54%的参数映射到正确的类型,几乎使以前的零击方法的性能提高了一倍。
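
A minimal sketch of the label-aware mapping step: an identified trigger is assigned to the pre-defined event type whose label representation it is most similar to, with no annotated data involved. A real system would obtain both sides from a pretrained language model; random vectors stand in for them here.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def map_to_event_type(trigger_vec, label_vecs):
    """Assign an identified trigger to the pre-defined event type whose
    label representation is most similar; purely similarity-based."""
    scores = {label: cosine(trigger_vec, v) for label, v in label_vecs.items()}
    return max(scores, key=scores.get), scores

# Hypothetical vectors; in practice both sides come from a pretrained LM.
rng = np.random.default_rng(0)
label_vecs = {t: rng.normal(size=16) for t in ["Attack", "Transport", "Meet"]}
trigger_vec = label_vecs["Attack"] + 0.1 * rng.normal(size=16)  # noisy copy
best, scores = map_to_event_type(trigger_vec, label_vecs)
print(best)
```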

60. Can Sequence-to-Sequence Models Crack Substitution Ciphers? [PDF] 返回目录
  Nada Aldarrab, Jonathan May
Abstract: Decipherment of historical ciphers is a challenging problem. The language of the target plaintext might be unknown, and ciphertext can have a lot of noise. State-of-the-art decipherment methods use beam search and a neural language model to score candidate plaintext hypotheses for a given cipher, assuming plaintext language is known. We propose an end-to-end multilingual model for solving simple substitution ciphers. We test our model on synthetic and real historical ciphers and show that our proposed method can decipher text without explicit language identification and can still be robust to noise.
摘要:解密历史密码是一个具有挑战性的问题。 目标明文的语言可能是未知的,并且密文会带来很多干扰。 假设已知明文语言,则最新的解密方法使用波束搜索和神经语言模型对给定密码的候选明文假设进行评分。 我们提出了一种用于解决简单替换密码的端到端多语言模型。 我们在合成和真实的历史密码上测试了我们的模型,结果表明,我们提出的方法可以在无需显式语言识别的情况下解密文本,并且仍然对噪声具有鲁棒性。

61. Introducing Orthogonal Constraint in Structural Probes [PDF] 返回目录
  Tomasz Limisiewicz, David Mareček
Abstract: With the recent success of pre-trained models in NLP, a significant focus was put on interpreting their representations. One of the most prominent approaches is structural probing (Hewitt and Manning, 2019), where a linear projection of language vector space is performed in order to approximate the topology of linguistic structures. In this work, we decompose this mapping into 1. isomorphic space rotation; 2. linear scaling that identifies and scales the most relevant directions. We introduce novel structural tasks to exam our method's ability to disentangle information hidden in the embeddings. We experimentally show that our approach can be performed in a multitask setting. Moreover, the orthogonal constraint identifies embedding subspaces encoding specific linguistic features and make the probe less vulnerable to memorization.
摘要:随着NLP中预训练模型的最新成功,人们将重点放在解释其表示上。 最突出的方法之一是结构探测(Hewitt和Manning,2019),其中执行语言向量空间的线性投影以逼近语言结构的拓扑。 在这项工作中,我们将此映射分解为1.同构空间旋转; 2.线性缩放,用于识别和缩放最相关的方向。 我们介绍了新颖的结构任务,以检查我们的方法解开隐藏在嵌入中的信息的能力。 我们通过实验证明了我们的方法可以在多任务设置中执行。 此外,正交约束可识别编码特定语言特征的嵌入子空间,并使探针不易记忆。
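
As a rough sketch of the decomposition described above, the probe below factors the projection into a rotation and a per-dimension scaling and uses a soft orthogonality penalty; the paper's exact parameterization and training objective may differ, and the class is hypothetical.

```python
import torch

class OrthogonalScalingProbe(torch.nn.Module):
    """Structural probe factored into a rotation Q and a diagonal scaling s.

    Distances are computed as ||diag(s) Q (h_i - h_j)||^2; Q is pushed toward
    orthogonality with a soft penalty instead of a hard constraint.
    """
    def __init__(self, dim):
        super().__init__()
        self.Q = torch.nn.Parameter(torch.eye(dim) + 0.01 * torch.randn(dim, dim))
        self.log_s = torch.nn.Parameter(torch.zeros(dim))

    def distance(self, h_i, h_j):
        d = (h_i - h_j) @ self.Q.T * self.log_s.exp()
        return (d ** 2).sum(dim=-1)

    def orthogonality_penalty(self):
        eye = torch.eye(self.Q.shape[0], device=self.Q.device)
        return ((self.Q.T @ self.Q - eye) ** 2).sum()

probe = OrthogonalScalingProbe(dim=8)
h = torch.randn(5, 8)                       # 5 token vectors
print(probe.distance(h[0], h[1]).item(), probe.orthogonality_penalty().item())
```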

62. SemGloVe: Semantic Co-occurrences for GloVe from BERT [PDF] 返回目录
  Leilei Gan, Zhiyang Teng, Yue Zhang, Linchao Zhu, Fei Wu, Yi Yang
Abstract: GloVe learns word embeddings by leveraging statistical information from word co-occurrence matrices. However, word pairs in the matrices are extracted from a predefined local context window, which might lead to limited word pairs and potentially semantic irrelevant word pairs. In this paper, we propose SemGloVe, which distills semantic co-occurrences from BERT into static GloVe word embeddings. Particularly, we propose two models to extract co-occurrence statistics based on either the masked language model or the multi-head attention weights of BERT. Our methods can extract word pairs without limiting by the local window assumption and can define the co-occurrence weights by directly considering the semantic distance between word pairs. Experiments on several word similarity datasets and four external tasks show that SemGloVe can outperform GloVe.
摘要:GloVe通过利用单词共现矩阵的统计信息来学习单词嵌入。 但是,矩阵中的单词对是从预定义的本地上下文窗口中提取的,这可能会导致单词对数量有限以及潜在的语义上不相关的单词对。 在本文中,我们提出了SemGloVe,它将语义共现从BERT提取到静态的GloVe词嵌入中。 特别是,我们提出了两种基于掩码语言模型或BERT的多头注意权重来提取共现统计的模型。 我们的方法可以提取单词对而不受本地窗口假设的限制,并且可以通过直接考虑单词对之间的语义距离来定义共现权重。 对几个单词相似性数据集和四个外部任务进行的实验表明,SemGloVe的性能优于GloVe。
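
A hedged sketch of the general recipe of distilling co-occurrence weights from attention: accumulate attention mass between word pairs across sentences, which could then feed a GloVe-style objective. The attention matrix here is a random stand-in for BERT's heads, and word-piece alignment is ignored; the function is illustrative, not the paper's procedure.

```python
import numpy as np
from collections import defaultdict

def accumulate_cooccurrence(tokens, attention, counts):
    """Add attention-weighted co-occurrence mass for every word pair in one
    sentence. `attention` is a (len(tokens), len(tokens)) matrix, e.g. the
    average of a model's attention heads."""
    for i, wi in enumerate(tokens):
        for j, wj in enumerate(tokens):
            if i != j:
                counts[(wi, wj)] += float(attention[i, j])
    return counts

counts = defaultdict(float)
tokens = ["the", "striped", "cat", "slept"]
attention = np.random.default_rng(0).dirichlet(np.ones(4), size=4)  # rows sum to 1
accumulate_cooccurrence(tokens, attention, counts)
print(sorted(counts.items(), key=lambda kv: -kv[1])[:3])
```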

63. Out of Order: How important is the sequential order of words in a sentence in Natural Language Understanding tasks? [PDF] 返回目录
  Thang M. Pham, Trung Bui, Long Mai, Anh Nguyen
Abstract: Do state-of-the-art natural language understanding models care about word order - one of the most important characteristics of a sequence? Not always! We found 75% to 90% of the correct predictions of BERT-based classifiers, trained on many GLUE tasks, remain constant after input words are randomly shuffled. Despite BERT embeddings are famously contextual, the contribution of each individual word to downstream tasks is almost unchanged even after the word's context is shuffled. BERT-based models are able to exploit superficial cues (e.g. the sentiment of keywords in sentiment analysis; or the word-wise similarity between sequence-pair inputs in natural language inference) to make correct decisions when tokens are arranged in random orders. Encouraging classifiers to capture word order information improves the performance on most GLUE tasks, SQuAD 2.0 and out-of-samples. Our work suggests that many GLUE tasks are not challenging machines to understand the meaning of a sentence.
摘要:最先进的自然语言理解模型是否关心单词顺序-序列最重要的特征之一?不总是!我们发现在许多GLUE任务上训练的,基于BERT的分类器的正确预测的75%至90%在输入单词被随机混洗后保持不变。尽管BERT嵌入在上下文中是著名的,但是即使单词的上下文被改组后,每个单词对下游任务的贡献也几乎没有变化。当基于随机顺序排列令牌时,基于BERT的模型能够利用表面线索(例如,情感分析中的关键字情感;或者自然语言推理中的序列对输入之间的逐字相似性)做出正确的决定。鼓励分类器捕获单词顺序信息可提高大多数GLUE任务,SQuAD 2.0和样本外的性能。我们的工作表明,许多GLUE任务并不是挑战机器以理解句子的含义。
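
A minimal harness for the shuffling probe described above: shuffle each input's words and measure how often an arbitrary classifier's prediction survives. The bag-of-words classifier is a deliberately order-insensitive placeholder, not one of the paper's BERT models.

```python
import random

def prediction_constancy(sentences, classify, seed=0):
    """Fraction of inputs whose predicted label is unchanged after the
    words are randomly shuffled; `classify` is any text -> label function."""
    rng = random.Random(seed)
    unchanged = 0
    for sent in sentences:
        original = classify(sent)
        words = sent.split()
        rng.shuffle(words)
        if classify(" ".join(words)) == original:
            unchanged += 1
    return unchanged / len(sentences)

# Placeholder bag-of-words "classifier"; being order-insensitive, its
# constancy is 1.0, while the paper reports 75%-90% for BERT-based models.
classify = lambda s: "positive" if "good" in s.split() else "negative"
data = ["the movie was good", "a dull and tired plot"]
print(prediction_constancy(data, classify))
```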

64. Synthetic Source Language Augmentation for Colloquial Neural Machine Translation [PDF] 返回目录
  Asrul Sani Ariesandy, Mukhlis Amien, Alham Fikri Aji, Radityo Eko Prasojo
Abstract: Neural machine translation (NMT) is typically domain-dependent and style-dependent, and it requires lots of training data. State-of-the-art NMT models often fall short in handling colloquial variations of its source language and the lack of parallel data in this regard is a challenging hurdle in systematically improving the existing models. In this work, we develop a novel colloquial Indonesian-English test-set collected from YouTube transcript and Twitter. We perform synthetic style augmentation to the source of formal Indonesian language and show that it improves the baseline Id-En models (in BLEU) over the new test data.
摘要:神经机器翻译(NMT)通常依赖于域和依赖于样式,并且需要大量的训练数据。 最新的NMT模型在处理其源语言的口语变体时常常不尽人意,因此在这方面缺乏并行数据是系统地改进现有模型的挑战。 在这项工作中,我们开发了一种新颖的口语印尼英语测试集,该测试集是从YouTube成绩单和Twitter收集的。 我们对正式的印尼语源进行了合成样式增强,并显示它比新的测试数据改善了基线Id-En模型(在BLEU中)。

65. A Memory Efficient Baseline for Open Domain Question Answering [PDF] 返回目录
  Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Sebastian Riedel, Edouard Grave
Abstract: Recently, retrieval systems based on dense representations have led to important improvements in open-domain question answering, and related tasks. While very effective, this approach is also memory intensive, as the dense vectors for the whole knowledge source need to be kept in memory. In this paper, we study how the memory footprint of dense retriever-reader systems can be reduced. We consider three strategies to reduce the index size: dimension reduction, vector quantization and passage filtering. We evaluate our approach on two question answering benchmarks: TriviaQA and NaturalQuestions, showing that it is possible to get competitive systems using less than 6Gb of memory.
摘要:最近,基于稠密表示的检索系统在开放域问答及相关任务上带来了重要改进。这种方法虽然非常有效,但也非常占用内存,因为整个知识源的稠密向量都需要保存在内存中。在本文中,我们研究如何减小稠密检索器-阅读器系统的内存占用。我们考虑了三种减小索引大小的策略:降维、向量量化和段落过滤。我们在TriviaQA和NaturalQuestions这两个问答基准上评估了我们的方法,结果表明可以用不到6GB的内存得到具有竞争力的系统。
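
Two of the three index-shrinking strategies are easy to sketch: dimension reduction with a truncated PCA projection and 8-bit scalar quantization of the passage vectors. The exact techniques in the paper may differ (e.g. product quantization), and the sizes below are illustrative.

```python
import numpy as np

def fit_pca(X, k):
    """Return the mean and a projection onto the top-k principal directions."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def quantize_int8(X):
    """Symmetric per-matrix 8-bit scalar quantization."""
    scale = np.abs(X).max() / 127.0
    return np.round(X / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
passages = rng.normal(size=(1000, 768)).astype(np.float32)   # dense index
mean, P = fit_pca(passages, k=128)                            # 768 -> 128 dims
reduced = (passages - mean) @ P.T
codes, scale = quantize_int8(reduced)                         # float32 -> int8
print(passages.nbytes, "->", codes.nbytes, "bytes")           # ~24x smaller
```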

66. Improving BERT with Syntax-aware Local Attention [PDF] 返回目录
  Zhongli Li, Qingyu Zhou, Chao Li, Ke Xu, Yunbo Cao
Abstract: Pre-trained Transformer-based neural language models, such as BERT, have achieved remarkable results on varieties of NLP tasks. Recent works have shown that attention-based models can benefit from more focused attention over local regions. Most of them restrict the attention scope within a linear span, or confine to certain tasks such as machine translation and question answering. In this paper, we propose a syntax-aware local attention, where the attention scopes are restrained based on the distances in the syntactic structure. The proposed syntax-aware local attention can be integrated with pretrained language models, such as BERT, to render the model to focus on syntactically relevant words. We conduct experiments on various single-sentence benchmarks, including sentence classification and sequence labeling tasks. Experimental results show consistent gains over BERT on all benchmark datasets. The extensive studies verify that our model achieves better performance owing to more focused attention over syntactically relevant words.
摘要:预训练的基于Transformer的神经语言模型(例如BERT)在各种NLP任务上取得了显著成果。最近的研究表明,基于注意力的模型可以从对局部区域更集中的注意力中获益。它们大多将注意力范围限制在线性跨度内,或局限于机器翻译和问答等特定任务。在本文中,我们提出了一种语法感知的局部注意力,其注意力范围根据句法结构中的距离加以限制。所提出的语法感知局部注意力可以与BERT等预训练语言模型集成,使模型聚焦于句法相关的词。我们在多种单句基准上进行了实验,包括句子分类和序列标注任务。实验结果表明,在所有基准数据集上都相对BERT取得了一致的提升。深入的研究验证了我们的模型由于更关注句法相关的词而取得了更好的性能。
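
A small sketch of the masking idea: restrict each token's attention to tokens within a few hops in the dependency tree before the softmax. How the paper integrates this with BERT's pretrained attention is more involved; the distances below are hypothetical.

```python
import numpy as np

def syntax_local_attention(scores, tree_dist, max_dist=2):
    """Mask attention logits so each token only attends to tokens within
    `max_dist` hops in the dependency tree, then apply a row-wise softmax."""
    masked = np.where(tree_dist <= max_dist, scores, -1e9)
    masked = masked - masked.max(axis=-1, keepdims=True)      # stabilise
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy 4-token sentence with hypothetical dependency-tree distances.
scores = np.random.default_rng(0).normal(size=(4, 4))
tree_dist = np.array([[0, 1, 2, 3],
                      [1, 0, 1, 2],
                      [2, 1, 0, 1],
                      [3, 2, 1, 0]])
print(syntax_local_attention(scores, tree_dist, max_dist=2).round(2))
```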

67. Improving Zero-Shot Translation by Disentangling Positional Information [PDF] 返回目录
  Danni Liu, Jan Niehues, James Cross, Francisco Guzmán, Xian Li
Abstract: Multilingual neural machine translation has shown the capability of directly translating between language pairs unseen in training, i.e. zero-shot translation. Despite being conceptually attractive, it often suffers from low output quality. The difficulty of generalizing to new translation directions suggests the model representations are highly specific to those language pairs seen in training. We demonstrate that a main factor causing the language-specific representations is the positional correspondence to input tokens. We show that this can be easily alleviated by removing residual connections in an encoder layer. With this modification, we gain up to 18.5 BLEU points on zero-shot translation while retaining quality on supervised directions. The improvements are particularly prominent between related languages, where our proposed model outperforms pivot-based translation. Moreover, our approach allows easy integration of new languages, which substantially expands translation coverage. By thorough inspections of the hidden layer outputs, we show that our approach indeed leads to more language-independent representations.
摘要:多语言神经机器翻译已展现出在训练中未见过的语言对之间直接翻译的能力,即零样本翻译。尽管这一概念很有吸引力,但其输出质量往往较低。难以泛化到新的翻译方向表明,模型表示与训练中见过的语言对高度绑定。我们证明,导致这种语言特定表示的一个主要因素是与输入词元的位置对应关系。我们表明,通过移除某个编码器层中的残差连接,可以轻松缓解这一问题。经过此修改,我们在零样本翻译上最多获得18.5个BLEU点的提升,同时在有监督方向上保持质量。这些改进在相关语言之间尤为突出,我们提出的模型在这些语言上优于基于枢轴(pivot)语言的翻译。此外,我们的方法便于集成新语言,从而大幅扩展翻译覆盖范围。通过对隐藏层输出的仔细检查,我们表明我们的方法确实带来了更加语言无关的表示。
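
The concrete intervention described, removing the residual connection around self-attention in an encoder layer, can be sketched with a simplified PyTorch layer carrying a switch for that connection. This is a toy layer for illustration, not the authors' full multilingual NMT model.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Simplified Transformer encoder layer with an optional switch that
    drops the residual connection around self-attention (the modification
    the abstract credits with more language-independent representations)."""
    def __init__(self, d_model=512, n_heads=8, keep_residual=True):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.keep_residual = keep_residual

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out if self.keep_residual else attn_out)
        return self.norm2(x + self.ff(x))

x = torch.randn(2, 7, 512)                       # (batch, positions, d_model)
layer = EncoderLayer(keep_residual=False)        # e.g. applied to one middle layer
print(layer(x).shape)
```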

68. Joint Verification and Reranking for Open Fact Checking Over Tables [PDF] 返回目录
  Michael Schlichtkrull, Vladimir Karpukhin, Barlas Oğuz, Mike Lewis, Wen-tau Yih, Sebastian Riedel
Abstract: Structured information is an important knowledge source for automatic verification of factual claims. Nevertheless, the majority of existing research into this task has focused on textual data, and the few recent inquiries into structured data have been for the closed-domain setting where appropriate evidence for each claim is assumed to have already been retrieved. In this paper, we investigate verification over structured data in the open-domain setting, introducing a joint reranking-and-verification model which fuses evidence documents in the verification component. Our open-domain model achieves performance comparable to the closed-domain state-of-the-art on the TabFact dataset, and demonstrates performance gains from the inclusion of multiple tables as well as a significant improvement over a heuristic retrieval baseline.
摘要:结构化信息是自动验证事实主张的重要知识来源。 但是,有关此任务的大多数现有研究都集中在文本数据上,最近对结构化数据的查询很少用于封闭域设置,在这种情况下,假定已经检索到每个索赔的适当证据。 在本文中,我们研究了在开放域环境中对结构化数据的验证,引入了联合重排和验证模型,该模型将证据文档融合到了验证组件中。 我们的开放域模型实现的性能可与TabFact数据集上的封闭域最新技术相媲美,并展示了由于包含多个表而带来的性能提升,以及启发式检索基线的显着改进。

69. Accurate Word Representations with Universal Visual Guidance [PDF] 返回目录
  Zhuosheng Zhang, Haojie Yu, Hai Zhao, Rui Wang, Masao Utiyama
Abstract: Word representation is a fundamental component in neural language understanding models. Recently, pre-trained language models (PrLMs) offer a new performant method of contextualized word representations by leveraging the sequence-level context for modeling. Although the PrLMs generally give more accurate contextualized word representations than non-contextualized models do, they are still subject to a sequence of text contexts without diverse hints for word representation from multimodality. This paper thus proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance. In detail, we build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images. The texts and paired images are encoded in parallel, followed by an attention layer to integrate the multimodal representations. We show that the method substantially improves the accuracy of disambiguation. Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
摘要:单词表示是神经语言理解模型的基本组成部分。最近,预训练语言模型(PrLM)通过利用序列级上下文进行建模,提供了一种新型的上下文化单词表示的高效方法。尽管PrLM通常会提供比非上下文化模型更准确的上下文化单词表示形式,但是它们仍然受到一系列文本上下文的约束,而没有来自多模态的单词表示形式的各种提示。因此,本文提出了一种视觉表示方法,以通过视觉指导显着增强具有多方面意义的常规单词嵌入。详细地说,我们从多峰种子数据集中构建了一个小规模的单词图像词典,其中每个单词对应于各种相关图像。文本和成对的图像被并行编码,随后是注意层以集成多模式表示。我们表明该方法大大提高了消除歧义的准确性。对12种自然语言理解和机器翻译任务的实验进一步验证了该方法的有效性和泛化能力。

70. A Subword Guided Neural Word Segmentation Model for Sindhi [PDF] 返回目录
  Wazir Ali, Jay Kumar, Zenglin Xu, Congjian Luo, Junyu Lu, Junming Shao, Rajesh Kumar, Yazhou Ren
Abstract: Deep neural networks employ multiple processing layers for learning text representations to alleviate the burden of manual feature engineering in Natural Language Processing (NLP). Such text representations are widely used to extract features from unlabeled data. The word segmentation is a fundamental and inevitable prerequisite for many languages. Sindhi is an under-resourced language, whose segmentation is challenging as it exhibits space omission, space insertion issues, and lacks the labeled corpus for segmentation. In this paper, we investigate supervised Sindhi Word Segmentation (SWS) using unlabeled data with a Subword Guided Neural Word Segmenter (SGNWS) for Sindhi. In order to learn text representations, we incorporate subword representations to recurrent neural architecture to capture word information at morphemic-level, which takes advantage of Bidirectional Long-Short Term Memory (BiLSTM), self-attention mechanism, and Conditional Random Field (CRF). Our proposed SGNWS model achieves an F1 value of 98.51% without relying on feature engineering. The empirical results demonstrate the benefits of the proposed model over the existing Sindhi word segmenters.
摘要:深度神经网络采用多层处理来学习文本表示，以减轻自然语言处理（NLP）中人工特征工程的负担。这样的文本表示被广泛用于从未标注数据中提取特征。分词是许多语言处理中基本且不可避免的前提。信德语是一种资源匮乏的语言，其分词颇具挑战性：它同时存在空格省略与空格误插问题，并且缺少用于分词的标注语料。在本文中，我们研究利用未标注数据进行有监督的信德语分词（SWS），并提出了面向信德语的子词引导神经分词器（SGNWS）。为了学习文本表示，我们将子词表示融入循环神经网络结构，以在词素层面捕获词信息，并结合双向长短期记忆网络（BiLSTM）、自注意力机制和条件随机场（CRF）。我们提出的SGNWS模型在不依赖特征工程的情况下取得了98.51%的F1值。实验结果表明，该模型优于现有的信德语分词器。
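
As a rough illustration of the architecture sketched in the abstract, the following minimal PyTorch snippet combines character and subword embeddings and feeds them to a BiLSTM tagger. The self-attention and CRF layers used in the paper are omitted here, and all class names, sizes, and the B/I tag scheme are assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class SubwordGuidedSegmenter(nn.Module):
    """Minimal sketch: character + subword embeddings -> BiLSTM -> per-character
    boundary tags (B/I). The paper's self-attention and CRF layers are omitted."""
    def __init__(self, n_chars, n_subwords, emb=64, hidden=128, n_tags=2):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, emb)
        self.sub_emb = nn.Embedding(n_subwords, emb)
        self.bilstm = nn.LSTM(2 * emb, hidden, batch_first=True, bidirectional=True)
        self.tagger = nn.Linear(2 * hidden, n_tags)

    def forward(self, char_ids, subword_ids):
        # char_ids, subword_ids: (batch, seq_len); one subword id per character position
        x = torch.cat([self.char_emb(char_ids), self.sub_emb(subword_ids)], dim=-1)
        h, _ = self.bilstm(x)
        return self.tagger(h)  # (batch, seq_len, n_tags) emission scores

model = SubwordGuidedSegmenter(n_chars=500, n_subwords=2000)
chars = torch.randint(0, 500, (2, 12))
subs = torch.randint(0, 2000, (2, 12))
print(model(chars, subs).shape)  # torch.Size([2, 12, 2])
```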

71. Human Evaluation of Spoken vs. Visual Explanations for Open-Domain QA [PDF] 返回目录
  Ana Valeria Gonzalez, Gagan Bansal, Angela Fan, Robin Jia, Yashar Mehdad, Srinivasan Iyer
Abstract: While research on explaining predictions of open-domain QA systems (ODQA) to users is gaining momentum, most works have failed to evaluate the extent to which explanations improve user trust. While few works evaluate explanations using user studies, they employ settings that may deviate from the end-user's usage in-the-wild: ODQA is most ubiquitous in voice-assistants, yet current research only evaluates explanations using a visual display, and may erroneously extrapolate conclusions about the most performant explanations to other modalities. To alleviate these issues, we conduct user studies that measure whether explanations help users correctly decide when to accept or reject an ODQA system's answer. Unlike prior work, we control for explanation modality, e.g., whether they are communicated to users through a spoken or visual interface, and contrast effectiveness across modalities. Our results show that explanations derived from retrieved evidence passages can outperform strong baselines (calibrated confidence) across modalities but the best explanation strategy in fact changes with the modality. We show common failure cases of current explanations, emphasize end-to-end evaluation of explanations, and caution against evaluating them in proxy modalities that are different from deployment.
摘要:尽管向用户解释开放域问答系统（ODQA）预测结果的研究势头正盛，但大多数工作并未评估解释在多大程度上提高了用户信任。少数通过用户研究评估解释的工作，其实验设置也可能偏离最终用户的真实使用场景：ODQA在语音助手中最为普遍，而当前研究仅通过视觉界面评估解释，可能会错误地将关于最佳解释形式的结论外推到其他模态。为了缓解这些问题，我们开展用户研究，衡量解释是否有助于用户正确决定何时接受或拒绝ODQA系统的答案。与以往工作不同，我们控制解释的模态（例如通过语音界面还是视觉界面传达给用户），并对比不同模态下的有效性。结果表明，基于检索证据段落的解释在各种模态下都能胜过强基线（校准的置信度），但最佳解释策略实际上会随模态而变化。我们展示了当前解释的常见失败案例，强调对解释进行端到端评估，并警告不要在与实际部署不同的代理模态中评估解释。

72. Enhancing Pre-trained Language Model with Lexical Simplification [PDF] 返回目录
  Rongzhou Bao, Jiayi Wang, Zhuosheng Zhang, Hai Zhao
Abstract: For both human readers and pre-trained language models (PrLMs), lexical diversity may lead to confusion and inaccuracy when understanding the underlying semantic meanings of given sentences. By substituting complex words with simple alternatives, lexical simplification (LS) is a recognized method to reduce such lexical diversity, and therefore to improve the understandability of sentences. In this paper, we leverage LS and propose a novel approach which can effectively improve the performance of PrLMs in text classification. A rule-based simplification process is applied to a given sentence. PrLMs are encouraged to predict the real label of the given sentence with auxiliary inputs from the simplified version. Using strong PrLMs (BERT and ELECTRA) as baselines, our approach can still further improve the performance in various text classification tasks.
摘要:对于人类读者和预先训练的语言模型(PrLM),在理解给定句子的基本语义含义时,词汇多样性可能会导致混乱和不准确。 通过用简单的替代词替换复杂的单词,词汇简化(LS)是减少此类词汇多样性并因此提高句子的可理解性的公认方法。 在本文中,我们利用LS提出了一种可以有效提高PrLM在文本分类中性能的新颖方法。 基于规则的简化过程将应用于给定的句子。 鼓励PrLM使用简化版本的辅助输入来预测给定句子的真实标签。 使用强大的PrLM(BERT和ELECTRA)作为基准,我们的方法仍可以进一步提高各种文本分类任务的性能。
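
One plausible way to realize the described setup is to apply a small rule-based substitution table and feed the original and simplified sentences to a pre-trained classifier as a sentence pair. The snippet below sketches this with the Hugging Face transformers API; the substitution dictionary, model checkpoint, and pairing strategy are illustrative assumptions, not the paper's implementation.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Hypothetical rule table mapping complex words to simpler alternatives.
SIMPLE = {"utilize": "use", "commence": "begin", "terminate": "end"}

def simplify(sentence: str) -> str:
    return " ".join(SIMPLE.get(w.lower(), w) for w in sentence.split())

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

text = "We commence the experiment and utilize the new corpus."
# Feed the original and simplified text as a sentence pair so the encoder sees both views.
batch = tok(text, simplify(text), return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(logits.shape)  # torch.Size([1, 2])
```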

73. Reservoir Transformer [PDF] 返回目录
  Sheng Shen, Alexei Baevski, Ari S. Morcos, Kurt Keutzer, Michael Auli, Douwe Kiela
Abstract: We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.
摘要:我们证明，即使部分层被随机初始化且从不更新，Transformer仍能取得令人印象深刻的性能。受机器学习中历史悠久且成熟的思想启发，我们探索了与常规Transformer层交错排列的多种非线性“储层（reservoir）”层，并在多个机器翻译和（掩码）语言建模任务上展示了收敛所需实际计算时间（wall-clock）以及整体性能的改进。
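
A minimal sketch of the core idea, assuming the "reservoir" layers can be approximated by standard transformer encoder layers that keep their random initialization and are frozen; the interleaving period and layer type are illustrative choices, not the paper's exact configuration.

```python
import torch.nn as nn

def build_reservoir_encoder(n_layers=6, d_model=512, n_heads=8, reservoir_every=3):
    """Sketch: every `reservoir_every`-th layer keeps its random initialization and
    is frozen (never updated); the remaining layers train normally."""
    layers = []
    for i in range(n_layers):
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        if (i + 1) % reservoir_every == 0:  # reservoir layer: random + frozen
            for p in layer.parameters():
                p.requires_grad = False
        layers.append(layer)
    return nn.Sequential(*layers)

encoder = build_reservoir_encoder()
trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
total = sum(p.numel() for p in encoder.parameters())
print(f"trainable {trainable}/{total} parameters")
```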

74. Language Identification of Devanagari Poems [PDF] 返回目录
  Priyankit Acharya, Aditya Ku. Pathak, Rakesh Ch. Balabantaray, Anil Ku. Singh
Abstract: Language Identification is a very important part of several text processing pipelines. Extensive research has been done in this field. This paper proposes a procedure for automatic language identification of poems for poem analysis task, consisting of 10 Devanagari based languages of India i.e. Angika, Awadhi, Braj, Bhojpuri, Chhattisgarhi, Garhwali, Haryanvi, Hindi, Magahi, and Maithili. We collated corpora of poems of varying length and studied the similarity of poems among the 10 languages at the lexical level. Finally, various language identification systems based on supervised machine learning and deep learning techniques are applied and evaluated.
摘要:语言识别是几个文本处理管道中非常重要的一部分。 在该领域已经进行了广泛的研究。 本文提出了一种用于诗歌分析任务的诗歌自动语言识别程序,该程序由10种基于Devanagari的印度语言组成,即安吉卡,阿瓦迪,布拉杰,博普普里,恰蒂斯加里语,加瓦里,哈里扬维,印地语,马加希和麦蒂希利。 我们整理了不同长度的诗的语料库,并在词汇层面研究了十种语言之间的诗的相似性。 最后,应用和评估了基于监督机器学习和深度学习技术的各种语言识别系统。

75. ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning [PDF] 返回目录
  Yujia Qin, Yankai Lin, Ryuichi Takanobu, Zhiyuan Liu, Peng Li, Heng Ji, Minlie Huang, Maosong Sun, Jie Zhou
Abstract: Pre-trained Language Models (PLMs) have shown strong performance in various downstream Natural Language Processing (NLP) tasks. However, PLMs still cannot well capture the factual knowledge in the text, which is crucial for understanding the whole text, especially for document-level language understanding tasks. To address this issue, we propose a novel contrastive learning framework named ERICA in pre-training phase to obtain a deeper understanding of the entities and their relations in text. Specifically, (1) to better understand entities, we propose an entity discrimination task that distinguishes which tail entity can be inferred by the given head entity and relation. (2) Besides, to better understand relations, we employ a relation discrimination task which distinguishes whether two entity pairs are close or not in relational semantics. Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks, including relation extraction and reading comprehension, especially under low resource setting. Meanwhile, ERICA achieves comparable or better performance on sentence-level tasks. We will release the datasets, source codes and pre-trained language models for further research explorations.
摘要:预训练语言模型(PLM)在各种下游自然语言处理(NLP)任务中均表现出出色的性能。但是,PLM仍然无法很好地捕获文本中的事实知识,这对于理解整个文本(尤其是对于文档级语言理解任务)至关重要。为了解决这个问题,我们在预训练阶段提出了一个名为ERICA的新型对比学习框架,以更深入地了解实体及其在文本中的关系。具体来说,(1)为了更好地理解实体,我们提出了一个实体区分任务,该任务区分给定的头部实体和关系可以推断出哪个尾部实体。 (2)此外,为了更好地理解关系,我们采用了关系区分任务,该任务区分两个实体对在关系语义上是否接近。实验结果表明,我们提出的ERICA框架在一些文档级语言理解任务上实现了一致的改进,包括关系提取和阅读理解,尤其是在资源不足的情况下。同时,ERICA在句子级任务上达到了相当或更好的性能。我们将发布数据集,源代码和预训练的语言模型,以进行进一步的研究探索。

76. OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [PDF] 返回目录
  Yuxian Meng, Shuhe Wang, Qinghong Han, Xiaofei Sun, Fei Wu, Rui Yan, Jiwei Li
Abstract: When humans converse, what a speaker will say next significantly depends on what he sees. Unfortunately, existing dialogue models generate dialogue utterances only based on preceding textual contexts, and visual contexts are rarely considered. This is due to a lack of a large-scale multi-module dialogue dataset with utterances paired with visual contexts. In this paper, we release {\bf OpenViDial}, a large-scale multi-module dialogue dataset. The dialogue turns and visual contexts are extracted from movies and TV series, where each dialogue turn is paired with the corresponding visual context in which it takes place. OpenViDial contains a total number of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored in images. Based on this dataset, we propose a family of encoder-decoder models leveraging both textual and visual contexts, from coarse-grained image features extracted from CNNs to fine-grained object features extracted from Faster R-CNNs. We observe that visual information significantly improves dialogue generation qualities, verifying the necessity of integrating multi-modal features for dialogue learning. Our work marks an important step towards large-scale multi-modal dialogue learning.
摘要:当人们交谈时，说话者接下来要说什么在很大程度上取决于他所看到的内容。不幸的是，现有的对话模型仅基于先前的文本上下文生成对话话语，很少考虑视觉上下文。这是因为缺少将话语与视觉上下文配对的大规模多模块对话数据集。在本文中，我们发布了OpenViDial，这是一个大规模多模块对话数据集。对话轮次和视觉上下文提取自电影和电视剧，其中每个对话轮次都与其发生时对应的视觉上下文配对。OpenViDial总共包含110万个对话轮次，以及相应存储为图像的110万个视觉上下文。基于该数据集，我们提出了一系列同时利用文本和视觉上下文的编码器-解码器模型，所用视觉特征从CNN提取的粗粒度图像特征到Faster R-CNN提取的细粒度物体特征不等。我们观察到视觉信息显著提高了对话生成质量，验证了融合多模态特征进行对话学习的必要性。我们的工作标志着迈向大规模多模态对话学习的重要一步。

77. Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness [PDF] 返回目录
  Sabrina J. Mielke, Arthur Szlam, Y-Lan Boureau, Emily Dinan
Abstract: Open-domain dialogue agents have vastly improved, but still confidently hallucinate knowledge or express doubt when asked straightforward questions. In this work, we analyze whether state-of-the-art chit-chat models can express metacognition capabilities through their responses: does a verbalized expression of doubt (or confidence) match the likelihood that the model's answer is incorrect (or correct)? We find that these models are poorly calibrated in this sense, yet we show that the representations within the models can be used to accurately predict likelihood of correctness. By incorporating these correctness predictions into the training of a controllable generation model, we obtain a dialogue agent with greatly improved linguistic calibration.
摘要:开放域对话代理已经得到了很大的改进,但是当问到简单的问题时,他们仍然自信地使知识产生幻觉或表达怀疑。 在这项工作中,我们分析了最先进的聊天模型是否可以通过响应来表达元认知能力:疑问(或置信度)的口头表达是否与模型答案不正确(或正确)的可能性相匹配? 我们发现这些模型在这种意义上的校准较差,但是我们表明模型内的表示可以用来准确预测正确性的可能性。 通过将这些正确性预测并入可控生成模型的训练中,我们获得了具有大大改进的语言校准的对话代理。

78. Few-Shot Named Entity Recognition: A Comprehensive Study [PDF] 返回目录
  Jiaxin Huang, Chunyuan Li, Krishan Subudhi, Damien Jose, Shobana Balakrishnan, Weizhu Chen, Baolin Peng, Jianfeng Gao, Jiawei Han
Abstract: This paper presents a comprehensive study to efficiently build named entity recognition (NER) systems when a small number of in-domain labeled data is available. Based upon recent Transformer-based self-supervised pre-trained language models (PLMs), we investigate three orthogonal schemes to improve the model generalization ability for few-shot settings: (1) meta-learning to construct prototypes for different entity types, (2) supervised pre-training on noisy web data to extract entity-related generic representations and (3) self-training to leverage unlabeled in-domain data. Different combinations of these schemes are also considered. We perform extensive empirical comparisons on 10 public NER datasets with various proportions of labeled data, suggesting useful insights for future research. Our experiments show that (i) in the few-shot learning setting, the proposed NER schemes significantly improve or outperform the commonly used baseline, a PLM-based linear classifier fine-tuned on domain labels; (ii) We create new state-of-the-art results on both few-shot and training-free settings compared with existing methods. We will release our code and pre-trained models for reproducible research.
摘要:本文提出了一项综合研究，探讨在仅有少量域内标注数据时如何高效构建命名实体识别（NER）系统。基于最近基于Transformer的自监督预训练语言模型（PLM），我们研究了三种正交方案以提高少样本设置下的模型泛化能力：（1）元学习，为不同实体类型构建原型；（2）在有噪声的网络数据上进行有监督预训练，以提取与实体相关的通用表示；（3）自训练，以利用未标注的域内数据。我们还考察了这些方案的不同组合。我们在10个公开NER数据集上、针对不同比例的标注数据进行了广泛的实证比较，为未来研究提供了有益的见解。实验表明：（i）在少样本学习设置中，所提出的NER方案显著改进或超越常用基线，即在域标签上微调的基于PLM的线性分类器；（ii）与现有方法相比，我们在少样本和免训练设置上都取得了新的最先进结果。我们将发布代码和预训练模型，以便复现研究。

79. Generating Natural Language Attacks in a Hard Label Black Box Setting [PDF] 返回目录
  Rishabh Maheshwary, Saket Maheshwary, Vikram Pudi
Abstract: We study an important and challenging task of attacking natural language processing models in a hard label black box setting. We propose a decision-based attack strategy that crafts high quality adversarial examples on text classification and entailment tasks. Our proposed attack strategy leverages population-based optimization algorithm to craft plausible and semantically similar adversarial examples by observing only the top label predicted by the target model. At each iteration, the optimization procedure allow word replacements that maximizes the overall semantic similarity between the original and the adversarial text. Further, our approach does not rely on using substitute models or any kind of training data. We demonstrate the efficacy of our proposed approach through extensive experimentation and ablation studies on five state-of-the-art target models across seven benchmark datasets. In comparison to attacks proposed in prior literature, we are able to achieve a higher success rate with lower word perturbation percentage that too in a highly restricted setting.
摘要:我们研究了在硬标签黑盒设置下攻击自然语言处理模型这一重要且富有挑战性的任务。我们提出了一种基于决策的攻击策略，可针对文本分类和文本蕴含任务构造高质量的对抗样本。该攻击策略利用基于种群的优化算法，仅通过观察目标模型预测的最高标签，就能构造合理且语义相似的对抗样本。在每次迭代中，优化过程允许进行能最大化原始文本与对抗文本整体语义相似度的单词替换。此外，我们的方法不依赖替代模型或任何训练数据。我们在七个基准数据集上对五个最先进的目标模型进行了广泛的实验和消融研究，证明了所提方法的有效性。与已有文献中提出的攻击相比，即使在高度受限的设置下，我们也能以更低的单词扰动比例取得更高的成功率。

80. Generating Wikipedia Article Sections from Diverse Data Sources [PDF] 返回目录
  Mingda Chen, Sam Wiseman, Kevin Gimpel
Abstract: Datasets for data-to-text generation typically focus either on multi-domain, single-sentence generation or on single-domain, long-form generation. In this work, we create a large-scale dataset, WikiTableT, that pairs Wikipedia sections with their corresponding tabular data and various metadata. WikiTableT contains millions of instances, covering a broad range of topics, as well as a variety of flavors of generation tasks with different levels of flexibility. We benchmark several training and decoding strategies on WikiTableT. Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they sometimes struggle with coherence.
摘要:用于数据到文本生成的数据集通常要么关注多领域的单句生成，要么关注单领域的长文本生成。在这项工作中，我们创建了一个大规模数据集WikiTableT，它将Wikipedia章节与对应的表格数据及各种元数据配对。WikiTableT包含数百万个实例，涵盖广泛的主题以及灵活程度不同的多种生成任务。我们在WikiTableT上对多种训练和解码策略进行了基准测试。定性分析表明，最好的方法可以生成流畅且高质量的文本，但有时在连贯性方面仍有不足。

81. Transformer Feed-Forward Layers Are Key-Value Memories [PDF] 返回目录
  Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy
Abstract: Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. Our experiments show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones. The values complement the keys' input patterns by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern, particularly in the upper layers. Finally, we demonstrate that the output of a feed-forward layer is a composition of its memories, which is subsequently refined throughout the model's layers via residual connections to produce the final output distribution.
摘要:前馈层占Transformer模型参数的三分之二，但它们在网络中的作用仍未得到充分研究。我们展示了基于Transformer的语言模型中的前馈层可被视为键值记忆（key-value memories），其中每个键与训练样本中的文本模式相关，而每个值则在输出词表上诱导出一个分布。实验表明，学到的模式是人类可解释的：较低的层倾向于捕获浅层模式，而较高的层学习更具语义性的模式。这些值通过诱导输出分布来补充键的输入模式，该分布将概率质量集中在每个模式之后很可能紧接出现的词上，在较高的层中尤其如此。最后，我们证明前馈层的输出是其记忆的组合，随后通过残差连接在模型各层中不断细化，以产生最终的输出分布。
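
The key-value view described above can be written compactly: with the rows of the first feed-forward matrix acting as keys and the rows of the second as values, the layer computes f(x K^T) V. A small numpy sketch of this reading (bias terms omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                   # d_ff rows = number of "memories"
K = rng.normal(size=(d_ff, d_model))    # keys: rows of the first FFN matrix
V = rng.normal(size=(d_ff, d_model))    # values: rows of the second FFN matrix

def ffn(x):
    # Standard transformer FFN (bias omitted): f(x K^T) V with f = ReLU.
    m = np.maximum(x @ K.T, 0.0)        # memory coefficients: how strongly each key fires
    return m @ V                        # output = coefficient-weighted sum of value vectors

x = rng.normal(size=(d_model,))
coeffs = np.maximum(x @ K.T, 0.0)
top = np.argsort(-coeffs)[:3]
print("top firing memories:", top, "output head:", ffn(x)[:3])
```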

82. The Parallel Meaning Bank: A Framework for Semantically Annotating Multiple Languages [PDF] 返回目录
  Lasha Abzianidze, Rik van Noord, Chunliu Wang, Johan Bos
Abstract: This paper gives a general description of the ideas behind the Parallel Meaning Bank, a framework with the aim to provide an easy way to annotate compositional semantics for texts written in languages other than English. The annotation procedure is semi-automatic, and comprises seven layers of linguistic information: segmentation, symbolisation, semantic tagging, word sense disambiguation, syntactic structure, thematic role labelling, and co-reference. New languages can be added to the meaning bank as long as the documents are based on translations from English, but also introduce new interesting challenges on the linguistics assumptions underlying the Parallel Meaning Bank.
摘要:本文对“并行含义库”背后的思想进行了一般性描述,该框架旨在提供一种简便的方法来注释非英语语言文字的构成语义。 注释过程是半自动的,包括七层语言信息:分段,符号化,语义标记,词义歧义消除,句法结构,主题角色标记和共同引用。 只要文档基于英语翻译,便可以将新语言添加到含义库中,但同时也为基于并行含义库的语言学假设带来了新的有趣挑战。

83. DRS at MRP 2020: Dressing up Discourse Representation Structures as Graphs [PDF] 返回目录
  Lasha Abzianidze, Johan Bos, Stephan Oepen
Abstract: Discourse Representation Theory (DRT) is a formal account for representing the meaning of natural language discourse. Meaning in DRT is modeled via a Discourse Representation Structure (DRS), a meaning representation with a model-theoretic interpretation, which is usually depicted as nested boxes. In contrast, a directed labeled graph is a common data structure used to encode semantics of natural language texts. The paper describes the procedure of dressing up DRSs as directed labeled graphs to include DRT as a new framework in the 2020 shared task on Cross-Framework and Cross-Lingual Meaning Representation Parsing. Since one of the goals of the shared task is to encourage unified models for several semantic graph frameworks, the conversion procedure was biased towards making the DRT graph framework somewhat similar to other graph-based meaning representation frameworks.
摘要:话语表示理论（DRT）是一种用于表示自然语言语篇含义的形式化理论。在DRT中，语义通过话语表示结构（DRS）建模，这是一种具有模型论解释的意义表示，通常被画成嵌套的方框。相比之下，有向标注图是编码自然语言文本语义的常见数据结构。本文描述了将DRS改写为有向标注图的流程，从而把DRT作为新框架纳入2020年跨框架与跨语言意义表示解析共享任务。由于该共享任务的目标之一是鼓励针对多个语义图框架的统一模型，转换流程有意使DRT图框架与其他基于图的意义表示框架较为相似。

84. Dialogue Graph Modeling for Conversational Machine Reading [PDF] 返回目录
  Siru Ouyang, Zhuosheng Zhang, Hai Zhao
Abstract: Conversational Machine Reading (CMR) aims at answering questions in a complicated manner. The machine needs to answer questions through interactions with users based on a given rule document, user scenario and dialogue history, and ask questions to clarify if necessary. In this paper, we propose a dialogue graph modeling framework to improve the understanding and reasoning ability of machines on the CMR task. There are three types of graph in total. Specifically, the Discourse Graph is designed to explicitly learn and extract the discourse relations among rule texts as well as extra knowledge of the scenario; the Decoupling Graph is used for understanding local and contextualized connections within rule texts; and finally, a Global Graph fuses the information together so that the system can reply to the user with a final decision of "Yes/No/Irrelevant" or ask a follow-up question for clarification.
摘要:对话式机器阅读（CMR）旨在以较复杂的方式回答问题：机器需要基于给定的规则文档、用户场景和对话历史，通过与用户交互来回答问题，并在必要时提问以澄清。在本文中，我们提出了一个对话图建模框架，以提高机器在CMR任务上的理解与推理能力。框架共包含三种图：语篇图用于显式学习并提取规则文本之间的语篇关系以及场景中的额外知识；解耦图用于理解规则文本内部的局部与上下文关联；最后，全局图将信息融合在一起，使系统能够以“是/否/无关”作为最终决策回复用户，或提出后续问题以作澄清。

85. A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation [PDF] 返回目录
  Jiangnan Li, Zheng Lin, Peng Fu, Qingyi Si, Weiping Wang
Abstract: Emotion Recognition in Conversation (ERC) is a more challenging task than conventional text emotion recognition. It can be regarded as a personalized and interactive emotion recognition task, which is supposed to consider not only the semantic information of text but also the influences from speakers. The current method models speakers' interactions by building a relation between every two speakers. However, this fine-grained but complicated modeling is computationally expensive, hard to extend, and can only consider local context. To address this problem, we simplify the complicated modeling to a binary version: Intra-Speaker and Inter-Speaker dependencies, without identifying every unique speaker for the targeted speaker. To better achieve the simplified interaction modeling of speakers in Transformer, which shows excellent ability to settle long-distance dependency, we design three types of masks and respectively utilize them in three independent Transformer blocks. The designed masks respectively model the conventional context modeling, Intra-Speaker dependency, and Inter-Speaker dependency. Furthermore, different speaker-aware information extracted by Transformer blocks diversely contributes to the prediction, and therefore we utilize the attention mechanism to automatically weight them. Experiments on two ERC datasets indicate that our model is efficacious to achieve better performance.
摘要:对话情感识别（ERC）比常规的文本情感识别更具挑战性。它可以被视为一种个性化且交互式的情感识别任务，不仅要考虑文本的语义信息，还要考虑说话人的影响。现有方法通过在每两个说话人之间建立关系来建模说话人交互。然而，这种细粒度但复杂的建模计算开销大、难以扩展，且只能考虑局部上下文。为了解决该问题，我们将复杂的建模简化为二元形式：说话人内依赖与说话人间依赖，而无需针对目标说话人逐一区分每个具体说话人。Transformer在处理长距离依赖方面表现出色，为了在其中更好地实现这种简化的说话人交互建模，我们设计了三种注意力掩码，并分别在三个独立的Transformer模块中使用：它们分别建模常规的上下文、说话人内依赖和说话人间依赖。此外，不同Transformer模块提取的说话人感知信息对预测的贡献各不相同，因此我们利用注意力机制对其自动加权。在两个ERC数据集上的实验表明，我们的模型能够有效取得更好的性能。
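
A minimal sketch of how the three attention masks could be derived from per-utterance speaker ids, assuming boolean masks where True marks an allowed attention edge. Whether a turn may attend to itself under the inter-speaker mask, and how the masks are wired into the Transformer blocks, are details the abstract leaves open.

```python
import torch

def speaker_masks(speaker_ids):
    """speaker_ids: (seq_len,) tensor of speaker indices, one per utterance.
    Returns boolean masks of shape (seq_len, seq_len); True = attention allowed."""
    same = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)
    conventional = torch.ones_like(same)   # full conversational context
    intra = same                           # only the same speaker's turns
    inter = ~same                          # only other speakers' turns
    return conventional, intra, inter

speakers = torch.tensor([0, 1, 0, 2, 1])
conv, intra, inter = speaker_masks(speakers)
print(intra.int())
```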

86. Combining Semilattices and Semimodules [PDF] 返回目录
  Filippo Bonchi, Alessio Santamaria
Abstract: We describe the canonical weak distributive law $\delta \colon \mathcal S \mathcal P \to \mathcal P \mathcal S$ of the powerset monad $\mathcal P$ over the $S$-left-semimodule monad $\mathcal S$, for a class of semirings $S$. We show that the composition of $\mathcal P$ with $\mathcal S$ by means of such $\delta$ yields almost the monad of convex subsets previously introduced by Jacobs: the only difference consists in the absence in Jacobs's monad of the empty convex set. We provide a handy characterisation of the canonical weak lifting of $\mathcal P$ to $\mathbb{EM}(\mathcal S)$ as well as an algebraic theory for the resulting composed monad. Finally, we restrict the composed monad to finitely generated convex subsets and we show that it is presented by an algebraic theory combining semimodules and semilattices with bottom, which are the algebras for the finite powerset monad $\mathcal P_f$.
摘要:对于某一类半环 $S$，我们描述了幂集单子 $\mathcal{P}$ 相对于 $S$-左半模单子 $\mathcal{S}$ 的典范弱分配律 $\delta \colon \mathcal{S}\mathcal{P} \to \mathcal{P}\mathcal{S}$。我们证明，借助这样的 $\delta$ 将 $\mathcal{P}$ 与 $\mathcal{S}$ 复合，几乎就得到了 Jacobs 先前引入的凸子集单子：唯一的区别在于 Jacobs 的单子中不包含空凸集。我们给出了 $\mathcal{P}$ 到 $\mathbb{EM}(\mathcal{S})$ 的典范弱提升的一个简便刻画，以及所得复合单子的代数理论。最后，我们将复合单子限制到有限生成的凸子集上，并证明它可由一个将半模与带底元半格（即有限幂集单子 $\mathcal{P}_f$ 的代数）相结合的代数理论来表示。

87. Abstractive Query Focused Summarization with Query-Free Resources [PDF] 返回目录
  Yumo Xu, Mirella Lapata
Abstract: The availability of large-scale datasets has driven the development of neural sequence-to-sequence models to generate generic summaries, i.e., summaries which do not correspond to any pre-specified queries. However, due to the lack of training data, query focused summarization (QFS) has been studied mainly with extractive methods. In this work, we consider the problem of leveraging only generic summarization resources to build an abstractive QFS system. We propose Marge, a Masked ROUGE Regression framework composed of a novel unified representation for summaries and queries, and a distantly supervised training task for answer evidence estimation. To further utilize generic data for generation, three attributes are incorporated during training and inference to control the shape of the final summary: evidence rank, query guidance, and summary length. Despite learning from minimal supervision, our system achieves state-of-the-art results in the distantly supervised setting across domains and query types.
摘要:大规模数据集的可用性推动了神经序列到序列模型的发展,以生成通用摘要,即与任何预先指定的查询都不对应的摘要。但是,由于缺少训练数据,因此主要使用提取方法研究了查询集中的摘要(QFS)。在这项工作中,我们考虑了仅利用通用摘要资源来构建抽象QFS系统的问题。我们提出Marge,这是一个Masked ROUGE回归框架,由用于汇总和查询的新颖统一表示形式以及用于答案证据估计的远程监督训练任务组成。为了进一步利用通用数据进行生成,在训练和推理过程中合并了三个属性以控制最终摘要的形状:证据等级,查询指导和摘要长度。尽管从最低限度的监督中学到了经验,但我们的系统仍在跨域和查询类型的远程监督设置中实现了最新的结果。

88. Generating Adversarial Examples in Chinese Texts Using Sentence-Pieces [PDF] 返回目录
  Linyang Li, Yunfan Shao, Demin Song, Xipeng Qiu, Xuanjing Huang
Abstract: Adversarial attacks in texts are mostly substitution-based methods that replace words or characters in the original texts to achieve success attacks. Recent methods use pre-trained language models as the substitutes generator. While in Chinese, such methods are not applicable since words in Chinese require segmentations first. In this paper, we propose a pre-train language model as the substitutes generator using sentence-pieces to craft adversarial examples in Chinese. The substitutions in the generated adversarial examples are not characters or words but \textit{'pieces'}, which are more natural to Chinese readers. Experiments results show that the generated adversarial samples can mislead strong target models and remain fluent and semantically preserved.
摘要:文本对抗攻击大多是基于替换的方法，通过替换原文中的单词或字符来实现攻击。最近的方法使用预训练语言模型作为替换词生成器。而在中文中，这类方法并不适用，因为中文单词首先需要分词。在本文中，我们提出以预训练语言模型为替换生成器，利用sentence-piece（句子片段）来构造中文对抗样本。所生成对抗样本中的替换单位不是字符或单词，而是“片段（pieces）”，对中文读者而言更为自然。实验结果表明，生成的对抗样本能够误导强大的目标模型，同时保持流畅性并保留语义。

89. Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning [PDF] 返回目录
  Xuebo Liu, Longyue Wang, Derek F. Wong, Liang Ding, Lidia S. Chao, Zhaopeng Tu
Abstract: Encoder layer fusion (EncoderFusion) is a technique to fuse all the encoder layers (instead of the uppermost layer) for sequence-to-sequence (Seq2Seq) models, which has proven effective on various NLP tasks. However, it is still not entirely clear why and when EncoderFusion should work. In this paper, our main contribution is to take a step further in understanding EncoderFusion. Many of previous studies believe that the success of EncoderFusion comes from exploiting surface and syntactic information embedded in lower encoder layers. Unlike them, we find that the encoder embedding layer is more important than other intermediate encoder layers. In addition, the uppermost decoder layer consistently pays more attention to the encoder embedding layer across NLP tasks. Based on this observation, we propose a simple fusion method, SurfaceFusion, by fusing only the encoder embedding layer for the softmax layer. Experimental results show that SurfaceFusion outperforms EncoderFusion on several NLP benchmarks, including machine translation, text summarization, and grammatical error correction. It obtains the state-of-the-art performance on WMT16 Romanian-English and WMT14 English-French translation tasks. Extensive analyses reveal that SurfaceFusion learns more expressive bilingual word embeddings by building a closer relationship between relevant source and target embeddings. The source code will be released.
摘要:编码器层融合（EncoderFusion）是在序列到序列（Seq2Seq）模型中融合所有编码器层（而非仅最上层）的一种技术，已被证明对多种NLP任务有效。然而，EncoderFusion为何有效、何时有效仍不完全清楚。本文的主要贡献是进一步加深对EncoderFusion的理解。以往许多研究认为，EncoderFusion的成功来自利用嵌入在较低编码器层中的表层与句法信息。与之不同，我们发现编码器的词嵌入层比其他中间编码器层更重要。此外，最上层的解码器层在各类NLP任务中始终对编码器词嵌入层给予更多关注。基于这一观察，我们提出了一种简单的融合方法SurfaceFusion，仅将编码器词嵌入层融合到softmax层。实验结果表明，SurfaceFusion在机器翻译、文本摘要和语法纠错等多个NLP基准上均优于EncoderFusion，并在WMT16罗马尼亚语-英语和WMT14英语-法语翻译任务上取得了最先进的性能。大量分析表明，SurfaceFusion通过在相关的源端与目标端词嵌入之间建立更紧密的联系，学习到表达能力更强的双语词嵌入。源代码将会发布。

90. CMV-BERT: Contrastive multi-vocab pretraining of BERT [PDF] 返回目录
  Wei Zhu, Daniel Cheung
Abstract: In this work, we represent CMV-BERT, which improves the pretraining of a language model via two ingredients: (a) contrastive learning, which is well studied in the area of computer vision; (b) multiple vocabularies, one of which is fine-grained and the other is coarse-grained. The two methods both provide different views of an original sentence, and both are shown to be beneficial. Downstream tasks demonstrate our proposed CMV-BERT are effective in improving the pretrained language models.
摘要:在这项工作中，我们提出了CMV-BERT，它通过两个要素改进语言模型的预训练：（a）对比学习，这在计算机视觉领域已有深入研究；（b）多个词表，其中一个是细粒度的，另一个是粗粒度的。这两种方式都为原始句子提供了不同的视角，并且都被证明是有益的。下游任务表明，我们提出的CMV-BERT能有效改进预训练语言模型。
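
A hedged sketch of the contrastive ingredient: an InfoNCE-style loss that treats the representations of the same sentence under the fine-grained and coarse-grained vocabularies as a positive pair and other in-batch sentences as negatives. The temperature and the use of in-batch negatives are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(fine_repr, coarse_repr, temperature=0.1):
    """fine_repr, coarse_repr: (batch, dim) sentence representations obtained by
    encoding the same sentences under two different vocabularies.
    The i-th pair is positive; all other in-batch pairs serve as negatives."""
    f = F.normalize(fine_repr, dim=-1)
    c = F.normalize(coarse_repr, dim=-1)
    logits = f @ c.t() / temperature                    # (batch, batch) cosine similarities
    targets = torch.arange(f.size(0), device=f.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(4, 768), torch.randn(4, 768))
print(loss.item())
```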

91. Dialogue Response Selection with Hierarchical Curriculum Learning [PDF] 返回目录
  Yixuan Su, Deng Cai, Qingyu Zhou, Zibo Lin, Simon Baker, Yunbo Cao, Shuming Shi, Nigel Collier, Yan Wang
Abstract: We study the learning of a matching model for dialogue response selection. Motivated by the recent finding that random negatives are often too trivial to train a reliable model, we propose a hierarchical curriculum learning (HCL) framework that consists of two complementary curricula: (1) corpus-level curriculum (CC); and (2) instance-level curriculum (IC). In CC, the model gradually increases its ability in finding the matching clues between the dialogue context and response. On the other hand, IC progressively strengthens the model's ability in identifying the mismatched information between the dialogue context and response. Empirical studies on two benchmark datasets with three state-of-the-art matching models demonstrate that the proposed HCL significantly improves the model performance across various evaluation metrics.
摘要:我们研究用于对话回复选择的匹配模型的学习。最近的研究发现，随机负例往往过于简单，不足以训练出可靠的模型。受此启发，我们提出了一个分层课程学习（HCL）框架，由两个互补的课程组成：（1）语料级课程（CC）；（2）实例级课程（IC）。在CC中，模型逐步提高在对话上下文与回复之间寻找匹配线索的能力；另一方面，IC逐步增强模型识别对话上下文与回复之间不匹配信息的能力。在两个基准数据集上、结合三种最先进匹配模型的实证研究表明，所提出的HCL在各项评估指标上都显著提升了模型性能。

92. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [PDF] 返回目录
  Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou
Abstract: Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this paper, we present \textbf{LayoutLMv2} by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852), RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672).
摘要:由于有效的模型结构以及大规模无标注的扫描/电子文档的优势，文本与版面的联合预训练已被证明在多种视觉富文档理解任务中行之有效。在本文中，我们提出LayoutLMv2，在多模态框架中对文本、版面和图像进行预训练，并引入了新的模型结构和预训练任务。具体而言，LayoutLMv2不仅使用已有的掩码视觉-语言建模任务，还在预训练阶段引入了新的文本-图像对齐和文本-图像匹配任务，从而更好地学习跨模态交互。同时，它还将空间感知的自注意力机制集成到Transformer结构中，使模型能够充分理解不同文本块之间的相对位置关系。实验结果表明，LayoutLMv2优于强基线，并在多种下游视觉富文档理解任务上取得了新的最先进结果，包括FUNSD（0.7895 -> 0.8420）、CORD（0.9493 -> 0.9601）、SROIE（0.9524 -> 0.9781）、Kleister-NDA（0.834 -> 0.852）、RVL-CDIP（0.9443 -> 0.9564）和DocVQA（0.7295 -> 0.8672）。
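
One common way to realize spatial-aware self-attention, and presumably close in spirit to what is described, is to add learned relative-position biases (1D token order plus 2D layout coordinates) to the raw attention scores. The clipping scheme and per-head bias tables below are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SpatialBias(nn.Module):
    """Sketch: learned biases indexed by clipped relative offsets, added to raw
    attention scores. Real implementations typically bucket offsets log-scale."""
    def __init__(self, max_rel=32, n_heads=12):
        super().__init__()
        self.rel_1d = nn.Embedding(2 * max_rel + 1, n_heads)
        self.rel_x = nn.Embedding(2 * max_rel + 1, n_heads)
        self.rel_y = nn.Embedding(2 * max_rel + 1, n_heads)
        self.max_rel = max_rel

    def forward(self, positions, xs, ys):
        def offsets(v):  # (seq,) -> clipped pairwise offsets, shifted to be >= 0
            d = v.unsqueeze(0) - v.unsqueeze(1)
            return d.clamp(-self.max_rel, self.max_rel) + self.max_rel
        bias = (self.rel_1d(offsets(positions))
                + self.rel_x(offsets(xs))
                + self.rel_y(offsets(ys)))      # (seq, seq, heads)
        return bias.permute(2, 0, 1)            # (heads, seq, seq), added to QK^T scores

bias = SpatialBias()(torch.arange(4), torch.tensor([0, 5, 5, 20]), torch.tensor([0, 0, 9, 9]))
print(bias.shape)  # torch.Size([12, 4, 4])
```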

93. SIT3: Code Summarization with Structure-Induced Transformer [PDF] 返回目录
  Hongqiu Wu, Hai Zhao, Min Zhang
Abstract: Code summarization (CS) is becoming a promising area in recent natural language understanding, which aims to generate sensible annotations automatically for source code and is known as programmer oriented. Previous works attempt to apply structure-based traversal (SBT) or non-sequential models like Tree-LSTM and GNN to learn structural program semantics. They both meet the following drawbacks: 1) it is shown ineffective to incorporate SBT into Transformer; 2) it is limited to capture global information through GNN; 3) it is underestimated to capture structural semantics only using Transformer. In this paper, we propose a novel model based on structure-induced self-attention, which encodes sequential inputs with highly-effective structure modeling. Extensive experiments show that our newly-proposed model achieves new state-of-the-art results on popular benchmarks. To our best knowledge, it is the first work on code summarization that uses Transformer to model structural information with high efficiency and no extra parameters. We also provide a tutorial on how we pre-process.
摘要:代码摘要（CS）正成为自然语言理解领域一个有前景的方向，其目标是为源代码自动生成合理的注释，被认为是面向程序员的任务。以往的工作尝试使用基于结构的遍历（SBT）或Tree-LSTM、GNN等非序列模型来学习程序的结构语义。这些方法存在以下不足：1）将SBT融入Transformer被证明效果不佳；2）通过GNN捕获全局信息的能力有限；3）仅使用Transformer捕获结构语义的潜力被低估。在本文中，我们提出了一种基于结构诱导自注意力的新模型，用高效的结构建模对序列输入进行编码。大量实验表明，我们新提出的模型在常用基准上取得了新的最先进结果。据我们所知，这是代码摘要领域首个使用Transformer高效建模结构信息且不引入额外参数的工作。我们还提供了预处理流程的教程。

94. Accelerating Pre-trained Language Models via Calibrated Cascade [PDF] 返回目录
  Lei Li, Yankai Lin, Shuhuai Ren, Deli Chen, Xuancheng Ren, Peng Li, Jie Zhou, Xu Sun
Abstract: Dynamic early exiting aims to accelerate pre-trained language models' (PLMs) inference by exiting in shallow layer without passing through the entire model. In this paper, we analyze the working mechanism of dynamic early exiting and find it cannot achieve a satisfying trade-off between inference speed and performance. On one hand, the PLMs' representations in shallow layers are not sufficient for accurate prediction. One the other hand, the internal off-ramps cannot provide reliable exiting decisions. To remedy this, we instead propose CascadeBERT, which dynamically selects a proper-sized, complete model in a cascading manner. To obtain more reliable model selection, we further devise a difficulty-aware objective, encouraging the model output class probability to reflect the real difficulty of each instance. Extensive experimental results demonstrate the superiority of our proposal over strong baseline models of PLMs' acceleration including both dynamic early exiting and knowledge distillation methods.
摘要:动态提前退出旨在通过在浅层退出、不经过整个模型来加速预训练语言模型（PLM）的推理。在本文中，我们分析了动态提前退出的工作机制，发现它难以在推理速度与性能之间取得令人满意的折中。一方面，PLM浅层的表示不足以支持准确的预测；另一方面，内部的退出分支（off-ramps）无法提供可靠的退出决策。为此，我们提出CascadeBERT，它以级联方式动态选择一个规模合适的完整模型。为了获得更可靠的模型选择，我们进一步设计了一个难度感知目标，鼓励模型输出的类别概率反映每个样本的真实难度。大量实验结果表明，我们的方案优于包括动态提前退出和知识蒸馏方法在内的强PLM加速基线模型。
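
A minimal sketch of the cascading inference described above: run the smaller complete model first and accept its prediction only if a calibrated confidence clears a threshold, otherwise fall back to the larger model. Temperature scaling stands in for the paper's difficulty-aware calibration, and the threshold value is arbitrary.

```python
import torch
import torch.nn.functional as F

def cascade_predict(x, small_model, large_model, threshold=0.9, temperature=1.5):
    """Sketch of calibrated cascading: `small_model` and `large_model` are complete
    classifiers returning logits; temperature scaling stands in for calibration."""
    with torch.no_grad():
        probs = F.softmax(small_model(x) / temperature, dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:          # confident enough: exit with the small model
            return pred.item(), "small"
        probs = F.softmax(large_model(x), dim=-1)
        return probs.argmax(dim=-1).item(), "large"

# toy stand-ins for the two complete models
small = torch.nn.Linear(16, 3)
large = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
print(cascade_predict(torch.randn(1, 16), small, large))
```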

95. Faster Re-translation Using Non-Autoregressive Model For Simultaneous Neural Machine Translation [PDF] 返回目录
  Hyojung Han, Sathish Indurthi, Mohd Abbas Zaidi, Nikhil Kumar Lakumarapu, Beomseok Lee, Sangha Kim, Chanwoo Kim, Inchul Hwang
Abstract: Recently, simultaneous translation has gathered a lot of attention since it enables compelling applications such as subtitle translation for a live event or real-time video-call translation. Some of these translation applications allow editing of partial translation, giving rise to re-translation approaches. The current re-translation approaches are based on autoregressive sequence generation models (ReTA), which generate target tokens in the (partial) translation sequentially. The multiple re-translations with sequential generation in ReTA models lead to an increased inference time gap between the incoming source input and the corresponding target output as the source input grows. Besides, due to the large number of inference operations involved, the ReTA models are not favorable for resource-constrained devices. In this work, we propose a faster re-translation system based on a non-autoregressive sequence generation model (FReTNA) to overcome the aforementioned limitations. We evaluate the proposed model on multiple translation tasks and our model reduces the inference times by several orders and achieves a competitive BLEU score compared to the ReTA and streaming (Wait-k) models. The proposed model reduces the average computation time by a factor of 20 when compared to the ReTA model, by incurring a small drop in the translation quality. It also outperforms the streaming-based Wait-k model both in terms of computation time (1.5 times lower) and translation quality.
摘要:近年来，同声翻译备受关注，因为它支持诸如现场活动字幕翻译或实时视频通话翻译等引人注目的应用。其中一些翻译应用允许对部分译文进行修改，由此产生了重翻译（re-translation）方法。当前的重翻译方法基于自回归序列生成模型（ReTA），在（部分）译文中顺序生成目标词。随着源输入不断增长，ReTA模型中顺序生成的多次重翻译会导致输入源与相应目标输出之间的推理时间差不断增大。此外，由于涉及大量推理操作，ReTA模型并不适合资源受限的设备。在这项工作中，我们提出了一种基于非自回归序列生成模型的更快重翻译系统（FReTNA），以克服上述局限。我们在多个翻译任务上评估了所提模型：与ReTA和流式（Wait-k）模型相比，它将推理时间降低了数个量级，并取得了有竞争力的BLEU分数。与ReTA模型相比，所提模型以轻微的翻译质量下降为代价，将平均计算时间降低了20倍；在计算时间（低1.5倍）和翻译质量上也均优于基于流式的Wait-k模型。

96. RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems [PDF] 返回目录
  Baolin Peng, Chunyuan Li, Zhu Zhang, Chenguang Zhu, Jinchao Li, Jianfeng Gao
Abstract: For task-oriented dialog systems to be maximally useful, it must be able to process conversations in a way that is (1) generalizable with a small number of training examples for new task domains, and (2) robust to user input in various styles, modalities or domains. In pursuit of these goals, we introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains. By including tasks with limited training data, RADDLE is designed to favor and encourage models with a strong generalization ability. RADDLE also includes a diagnostic checklist that facilitates detailed robustness analysis in aspects such as language variations, speech errors, unseen entities, and out-of-domain utterances. We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain. Overall, existing models are less than satisfactory in robustness evaluation, which suggests opportunities for future improvement.
摘要:为了使面向任务的对话系统发挥最大作用,它必须能够以以下方式处理对话:(1)可通过针对新任务域的少量培训示例进行概括,以及(2)对用户输入的鲁棒性各种样式,形式或领域。为了实现这些目标,我们引入了RADDLE基准测试,这是一套语料库和工具,用于评估跨多个领域的模型的性能。通过包含训练数据有限的任务,RADDLE旨在支持和鼓励具有强大泛化能力的模型。 RADDLE还包括一个诊断清单,该清单有助于在语言变化,语音错误,看不见的实体和域外语音等方面进行详细的鲁棒性分析。我们评估了基于预训练和微调的最新技术,发现在异类对话语料库上扎根的预训练比训练每个域的单独模型要好。总体而言,现有模型在鲁棒性评估中不尽人意,这表明未来有改进的机会。

97. A Theoretical Analysis of the Repetition Problem in Text Generation [PDF] 返回目录
  Zihao Fu, Wai Lam, Anthony Man-Cho So, Bei Shi
Abstract: Text generation tasks, including translation, summarization, language models, and etc. see rapid growth during recent years. Despite the remarkable achievements, the repetition problem has been observed in nearly all text generation models undermining the generation performance extensively. To solve the repetition problem, many methods have been proposed, but there is no existing theoretical analysis to show why this problem happens and how it is resolved. In this paper, we propose a new framework for theoretical analysis for the repetition problem. We first define the Average Repetition Probability (ARP) to characterize the repetition problem quantitatively. Then, we conduct an extensive analysis of the Markov generation model and derive several upper bounds of the average repetition probability with intuitive understanding. We show that most of the existing methods are essentially minimizing the upper bounds explicitly or implicitly. Grounded on our theory, we show that the repetition problem is, unfortunately, caused by the traits of our language itself. One major reason is attributed to the fact that there exist too many words predicting the same word as the subsequent word with high probability. Consequently, it is easy to go back to that word and form repetitions and we dub it as the high inflow problem. Furthermore, we derive a concentration bound of the average repetition probability for a general generation model. Finally, based on the theoretical upper bounds, we propose a novel rebalanced encoding approach to alleviate the high inflow problem. The experimental results show that our theoretical framework is applicable in general generation models and our proposed rebalanced encoding approach alleviates the repetition problem significantly. The source code of this paper can be obtained from \url{this https URL}.
摘要:近年来，翻译、摘要、语言模型等文本生成任务发展迅速。尽管成就显著，但几乎所有文本生成模型都存在重复问题，严重影响生成性能。为了解决重复问题，已有许多方法被提出，但尚无理论分析说明该问题为何发生以及如何解决。在本文中，我们为重复问题提出了一个新的理论分析框架。我们首先定义平均重复概率（ARP）来定量刻画重复问题。然后，我们对马尔可夫生成模型进行了深入分析，并在直观理解的基础上推导出平均重复概率的若干上界。我们表明，大多数现有方法本质上是在显式或隐式地最小化这些上界。基于我们的理论，我们指出重复问题不幸地源于语言本身的特性：一个主要原因是存在太多的单词会以高概率预测出与其后续单词相同的单词，因此很容易回到该单词并形成重复，我们将其称为高流入（high inflow）问题。此外，我们为一般生成模型推导了平均重复概率的集中界。最后，基于理论上界，我们提出了一种新颖的再平衡编码方法来缓解高流入问题。实验结果表明，我们的理论框架适用于一般生成模型，所提出的再平衡编码方法显著缓解了重复问题。本文的源代码可从\url{this https URL}获取。

98. Can You be More Social? Injecting Politeness and Positivity into Task-Oriented Conversational Agents [PDF] 返回目录
  Yi-Chia Wang, Alexandros Papangelis, Runze Wang, Zhaleh Feizollahi, Gokhan Tur, Robert Kraut
Abstract: Goal-oriented conversational agents are becoming prevalent in our daily lives. For these systems to engage users and achieve their goals, they need to exhibit appropriate social behavior as well as provide informative replies that guide users through tasks. The first component of the research in this paper applies statistical modeling techniques to understand conversations between users and human agents for customer service. Analyses show that social language used by human agents is associated with greater users' responsiveness and task completion. The second component of the research is the construction of a conversational agent model capable of injecting social language into an agent's responses while still preserving content. The model uses a sequence-to-sequence deep learning architecture, extended with a social language understanding element. Evaluation in terms of content preservation and social language level using both human judgment and automatic linguistic measures shows that the model can generate responses that enable agents to address users' issues in a more socially appropriate way.
摘要:面向目标的对话代理在我们的日常生活中正变得越来越普遍。为了使这些系统吸引用户并实现他们的目标,他们需要表现出适当的社交行为并提供指导用户完成任务的信息丰富的答复。本文研究的第一部分应用统计建模技术来理解用户与人工代理之间的客户服务对话。分析表明,人工代理使用的社交语言与更大的用户响应能力和任务完成率相关。该研究的第二部分是构建对话代理模型,该模型能够将社交语言注入代理的响应中,同时仍然保留内容。该模型使用序列到序列的深度学习架构,并扩展了社交语言理解元素。使用人工判断和自动语言措施对内容保存和社会语言水平进行评估表明,该模型可以生成响应,使代理能够以更适合社会的方式解决用户的问题。

99. Interpretable NLG for Task-oriented Dialogue Systems with Heterogeneous Rendering Machines [PDF] 返回目录
  Yangming Li, Kaisheng Yao
Abstract: End-to-end neural networks have achieved promising performances in natural language generation (NLG). However, they are treated as black boxes and lack interpretability. To address this problem, we propose a novel framework, heterogeneous rendering machines (HRM), that interprets how neural generators render an input dialogue act (DA) into an utterance. HRM consists of a renderer set and a mode switcher. The renderer set contains multiple decoders that vary in both structure and functionality. For every generation step, the mode switcher selects an appropriate decoder from the renderer set to generate an item (a word or a phrase). To verify the effectiveness of our method, we have conducted extensive experiments on 5 benchmark datasets. In terms of automatic metrics (e.g., BLEU), our model is competitive with the current state-of-the-art method. The qualitative analysis shows that our model can interpret the rendering process of neural generators well. Human evaluation also confirms the interpretability of our proposed approach.
摘要:端到端神经网络在自然语言生成(NLG)中取得了令人鼓舞的性能。但是,它们被视为黑匣子,缺乏可解释性。为了解决这个问题,我们提出了一个新颖的框架,即异构渲染机器(HRM),该框架解释了神经生成器如何将输入对话动作(DA)渲染为语音。 HRM由一个渲染器集和一个模式切换器组成。渲染器集包含多个解码器,这些解码器的结构和功能均不同。对于每个生成步骤,模式切换器都会从渲染器集中选择适当的解码器以生成项(单词或短语)。为了验证我们方法的有效性,我们对5个基准数据集进行了广泛的实验。在自动指标方面(例如BLEU),我们的模型与当前的最新方法相比具有竞争力。定性分析表明,我们的模型可以很好地解释神经发生器的渲染过程。人工评估也证实了我们提出的方法的可解释性。

100. Multiple Structural Priors Guided Self Attention Network for Language Understanding [PDF] 返回目录
  Le Qi, Yu Zhang, Qingyu Yin, Ting Liu
Abstract: Self attention networks (SANs) have been widely utilized in recent NLP studies. Unlike CNNs or RNNs, standard SANs are usually position-independent, and thus are incapable of capturing the structural priors between sequences of words. Existing studies commonly apply one single mask strategy on SANs for incorporating structural priors while failing at modeling more abundant structural information of texts. In this paper, we aim at introducing multiple types of structural priors into SAN models, proposing the Multiple Structural Priors Guided Self Attention Network (MS-SAN) that transforms different structural priors into different attention heads by using a novel multi-mask based multi-head attention mechanism. In particular, we integrate two categories of structural priors, including the sequential order and the relative position of words. For the purpose of capturing the latent hierarchical structure of the texts, we extract these information not only from the word contexts but also from the dependency syntax trees. Experimental results on two tasks show that MS-SAN achieves significant improvements against other strong baselines.
摘要:自注意力网络（SAN）在近期的NLP研究中得到了广泛应用。与CNN或RNN不同，标准SAN通常与位置无关，因此无法捕获词序列之间的结构先验。现有研究通常只在SAN上应用单一的掩码策略来引入结构先验，难以建模文本中更丰富的结构信息。在本文中，我们旨在将多种类型的结构先验引入SAN模型，提出了多结构先验引导的自注意力网络（MS-SAN），它通过一种新颖的基于多掩码的多头注意力机制，将不同的结构先验分配给不同的注意力头。具体而言，我们整合了两类结构先验，包括词的顺序和相对位置。为了捕获文本的潜在层级结构，我们不仅从词的上下文中提取这些信息，还从依存句法树中提取。在两项任务上的实验结果表明，MS-SAN相对于其他强基线取得了显著改进。
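
A hedged sketch of multi-mask multi-head attention: each head receives its own structural mask (here a sequential-order mask, a local-window mask, and a toy dependency mask) before the softmax. The specific masks and how they map onto the paper's two categories of priors are illustrative assumptions.

```python
import torch

def multi_mask_attention(q, k, v, head_masks):
    """q, k, v: (heads, seq, d); head_masks: (heads, seq, seq) boolean, True = may attend.
    Each head applies its own structural mask before the softmax."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~head_masks, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

heads, seq, d = 3, 5, 8
q = k = v = torch.randn(heads, seq, d)
positions = torch.arange(seq)
seq_mask = torch.tril(torch.ones(seq, seq)).bool()                # sequential-order prior
win_mask = (positions[:, None] - positions[None, :]).abs() <= 1   # relative-position window
dep_mask = torch.eye(seq, dtype=torch.bool)
dep_mask[0, :] = dep_mask[:, 0] = True                            # toy dependency tree rooted at token 0
out = multi_mask_attention(q, k, v, torch.stack([seq_mask, win_mask, dep_mask]))
print(out.shape)  # torch.Size([3, 5, 8])
```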

101. Unified Open-Domain Question Answering with Structured and Unstructured Knowledge [PDF] 返回目录
  Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, Scott Yih
Abstract: We study open-domain question answering (ODQA) with structured, unstructured and semi-structured knowledge sources, including text, tables, lists, and knowledge bases. Our approach homogenizes all sources by reducing them to text, and applies recent, powerful retriever-reader models which have so far been limited to text sources only. We show that knowledge-base QA can be greatly improved when reformulated in this way. Contrary to previous work, we find that combining sources always helps, even for datasets which target a single source by construction. As a result, our unified model produces state-of-the-art results on 3 popular ODQA benchmarks.
摘要:我们研究具有结构化,非结构化和半结构化知识源的开放域问答(ODQA),包括文本,表格,列表和知识库。 我们的方法通过将所有来源简化为文本来使所有来源同质化,并应用了最新的,功能强大的检索器-阅读器模型,到目前为止,该模型仅限于文本来源。 我们表明,以这种方式重新编写时,可以大大改善基于知识库的质量检查。 与以前的工作相反,我们发现合并源总是有帮助的,即使对于以构造为目标的单个源的数据集也是如此。 结果,我们的统一模型在3种流行的ODQA基准上产生了最新的结果。

102. Is human scoring the best criteria for summary evaluation? [PDF] 返回目录
  Oleg Vasilyev, John Bohannon
Abstract: Normally, summary quality measures are compared with quality scores produced by human annotators. A higher correlation with human scores is considered to be a fair indicator of a better measure. We discuss observations that cast doubt on this view. We attempt to show a possibility of an alternative indicator. Given a family of measures, we explore a criterion of selecting the best measure not relying on correlations with human scores. Our observations for the BLANC family of measures suggest that the criterion is universal across very different styles of summaries.
摘要:通常,将摘要质量度量与人工注释者产生的质量得分进行比较。 与人类得分的较高相关性被认为是更好度量的公平指标。 我们讨论的观察结果对此观点表示怀疑。 我们试图显示替代指标的可能性。 给定一系列度量,我们将探索一种选择最佳度量的标准,该准则不依赖于与人类得分的相关性。 我们对BLANC度量标准系列的观察表明,该标准在非常不同的摘要样式中是通用的。

103. Understanding and Improving Lexical Choice in Non-Autoregressive Translation [PDF] 返回目录
  Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, Zhaopeng Tu
Abstract: Knowledge distillation (KD) is essential for training non-autoregressive translation (NAT) models by reducing the complexity of the raw data with an autoregressive teacher model. In this study, we empirically show that as a side effect of this training, the lexical choice errors on low-frequency words are propagated to the NAT model from the teacher model. To alleviate this problem, we propose to expose the raw data to NAT models to restore the useful information of low-frequency words, which are missed in the distilled data. To this end, we introduce an extra Kullback-Leibler divergence term derived by comparing the lexical choice of NAT model and that embedded in the raw data. Experimental results across language pairs and model architectures demonstrate the effectiveness and universality of the proposed approach. Extensive analyses confirm our claim that our approach improves performance by reducing the lexical choice errors on low-frequency words. Encouragingly, our approach pushes the SOTA NAT performance on the WMT14 English-German and WMT16 Romanian-English datasets up to 27.8 and 33.8 BLEU points, respectively. The source code will be released.
摘要:知识蒸馏（KD）通过自回归教师模型降低原始数据的复杂度，是训练非自回归翻译（NAT）模型不可或缺的手段。在这项研究中，我们通过实验证明，作为这种训练的副作用，低频词上的词汇选择错误会从教师模型传播到NAT模型。为缓解该问题，我们建议让NAT模型接触原始数据，以恢复蒸馏数据中丢失的低频词有用信息。为此，我们引入了一个额外的Kullback-Leibler散度项，通过比较NAT模型的词汇选择与原始数据中蕴含的词汇选择得到。跨语言对和模型结构的实验结果证明了所提方法的有效性和普适性。大量分析证实了我们的论断：该方法通过减少低频词上的词汇选择错误来提升性能。令人鼓舞的是，我们的方法将WMT14英语-德语和WMT16罗马尼亚语-英语数据集上的SOTA NAT性能分别推高到27.8和33.8 BLEU。源代码将会发布。

104. YASO: A New Benchmark for Targeted Sentiment Analysis [PDF] 返回目录
  Matan Orbach, Orith Toledo-Ronen, Artem Spector, Ranit Aharonov, Yoav Katz, Noam Slonim
Abstract: Sentiment analysis research has shifted over the years from the analysis of full documents or single sentences to a finer-level of detail -- identifying the sentiment towards single words or phrases -- with the task of Targeted Sentiment Analysis (TSA). While this problem is attracting a plethora of works focusing on algorithmic aspects, they are typically evaluated on a selection from a handful of datasets, and little effort, if any, is dedicated to the expansion of the available evaluation data. In this work, we present YASO -- a new crowd-sourced TSA evaluation dataset, collected using a new annotation scheme for labeling targets and their sentiments. The dataset contains 2,215 English sentences from movie, business and product reviews, and 7,415 terms and their corresponding sentiments annotated within these sentences. Our analysis verifies the reliability of our annotations, and explores the characteristics of the collected data. Lastly, benchmark results using five contemporary TSA systems lay the foundation for future work, and show there is ample room for improvement on this challenging new dataset.
摘要:多年来，情感分析研究已从分析完整文档或单个句子转向更细的粒度，即识别针对单个词或短语的情感，也就是目标情感分析（TSA）任务。尽管该问题吸引了大量聚焦算法层面的工作，它们通常只在少数几个数据集中选取部分数据进行评估，而在扩充可用评估数据方面投入甚少。在这项工作中，我们提出了YASO，这是一个新的众包TSA评估数据集，采用一种新的标注方案来标注目标及其情感。该数据集包含来自电影、商业和产品评论的2,215个英文句子，以及在这些句子中标注的7,415个目标词项及其对应情感。我们的分析验证了标注的可靠性，并探究了所收集数据的特点。最后，使用五个当代TSA系统得到的基准结果为后续工作奠定了基础，并表明在这个具有挑战性的新数据集上仍有很大的改进空间。

105. Robust Dialogue Utterance Rewriting as Sequence Tagging [PDF] 返回目录
  Jie Hao, Linfeng Song, Liwei Wang, Kun Xu, Zhaopeng Tu, Dong Yu
Abstract: The task of dialogue rewriting aims to reconstruct the latest dialogue utterance by copying the missing content from the dialogue context. Until now, the existing models for this task suffer from the robustness issue, i.e., performances drop dramatically when testing on a different domain. We address this robustness issue by proposing a novel sequence-tagging-based model so that the search space is significantly reduced, yet the core of this task is still well covered. As a common issue of most tagging models for text generation, the model's outputs may lack fluency. To alleviate this issue, we inject the loss signal from BLEU or GPT-2 under a REINFORCE framework. Experiments show huge improvements of our model over the current state-of-the-art systems on domain transfer.
摘要:对话改写任务旨在通过从对话上下文中复制缺失内容来还原最新一轮对话话语的完整表述。迄今为止，现有模型都存在鲁棒性问题：在不同领域上测试时性能会急剧下降。我们通过提出一种新颖的基于序列标注的模型来解决该鲁棒性问题，在显著缩小搜索空间的同时仍能很好地覆盖该任务的核心。与大多数用于文本生成的标注模型一样，该模型的输出可能缺乏流畅性。为缓解此问题，我们在REINFORCE框架下注入来自BLEU或GPT-2的损失信号。实验表明，在领域迁移方面，我们的模型相比当前最先进的系统有巨大改进。
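
A minimal sketch of the REINFORCE ingredient, assuming a per-token tagging model: sample a tag sequence, score the rewrite it induces with a reward (BLEU or a GPT-2-based signal in the paper; a toy placeholder below), and weight the sequence log-likelihood by the baselined reward.

```python
import torch

def reinforce_loss(tag_logits, reward_fn, baseline=0.5):
    """tag_logits: (seq_len, n_tags) scores from the tagging model for one dialogue turn.
    Sample a tag sequence, score the rewrite it induces with `reward_fn`, and weight
    the sequence log-likelihood by the baselined reward (REINFORCE)."""
    dist = torch.distributions.Categorical(logits=tag_logits)
    tags = dist.sample()                      # (seq_len,) sampled tag decisions
    log_prob = dist.log_prob(tags).sum()      # log-likelihood of the sampled sequence
    reward = reward_fn(tags)                  # e.g. BLEU of the rewrite, in [0, 1]
    return -(reward - baseline) * log_prob    # minimizing pushes up high-reward samples

# purely illustrative reward: fraction of tokens tagged "keep" (tag id 0)
toy_reward = lambda tags: (tags == 0).float().mean().item()
loss = reinforce_loss(torch.randn(10, 3, requires_grad=True), toy_reward)
print(loss.item())   # in training one would call loss.backward()
```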

106. A Paragraph-level Multi-task Learning Model for Scientific Fact-Verification [PDF] 返回目录
  Xiangci Li, Gully Burns, Nanyun Peng
Abstract: Even for domain experts, it is a non-trivial task to verify a scientific claim by providing supporting or refuting evidence rationales. The situation worsens as misinformation is proliferated on social media or news websites, manually or programmatically, at every moment. As a result, an automatic fact-verification tool becomes crucial for combating the spread of misinformation. In this work, we propose a novel, paragraph-level, multi-task learning model for the SciFact task by directly computing a sequence of contextualized sentence embeddings from a BERT model and jointly training the model on rationale selection and stance prediction.
摘要:即使对于领域专家而言，通过提供支持或反驳的证据依据来验证科学主张也并非易事。随着错误信息每时每刻以人工或程序化的方式在社交媒体或新闻网站上扩散，情况进一步恶化。因此，自动事实核查工具对于遏制错误信息传播至关重要。在这项工作中，我们为SciFact任务提出了一个新颖的段落级多任务学习模型：直接从BERT模型计算一系列上下文化的句子嵌入，并在依据选择和立场预测两个任务上联合训练模型。

107. Language-Mediated, Object-Centric Representation Learning [PDF] 返回目录
  Ruocheng Wang, Jiayuan Mao, Samuel J. Gershman, Jiajun Wu
Abstract: We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object segmentation, notably MONet and Slot Attention. While these algorithms learn an object-centric representation just by reconstructing the input image, LORL enables them to further learn to associate the learned representations to concepts, i.e., words for object categories, properties, and spatial relationships, from language input. These object-centric concepts derived from language facilitate the learning of object-centric representations. LORL can be integrated with various unsupervised segmentation algorithms that are language-agnostic. Experiments show that the integration of LORL consistently improves the object segmentation performance of MONet and Slot Attention on two datasets via the help of language. We also show that concepts learned by LORL, in conjunction with segmentation algorithms such as MONet, aid downstream tasks such as referring expression comprehension.
摘要:我们提出了语言中介的,以对象为中心的表示学习(LORL),这是一种从视觉和语言中学习纠缠的,以对象为中心的场景表示的范例。 LORL基于无监督对象分割的最新进展,尤其是MONet和Slot Attention。虽然这些算法仅通过重构输入图像来学习以对象为中心的表示,但LORL使他们能够进一步学习将学习到的表示与来自语言输入的概念(即用于对象类别,属性和空间关系的单词)相关联。这些源自语言的以对象为中心的概念有助于学习以对象为中心的表示形式。 LORL可以与各种与语言无关的无监督分割算法集成在一起。实验表明,通过语言的帮助,LORL的集成不断提高了两个数据集上的MONet和Slot Attention对象分割性能。我们还表明,由LORL学习的概念与诸如MONet之类的分段算法相结合,有助于下游任务,例如引用表达式理解。

108. FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging [PDF] 返回目录
  Han Guo, Nazneen Fatema Rajani, Peter Hase, Mohit Bansal, Caiming Xiong
Abstract: Influence functions approximate the 'influences' of training data-points for test predictions and have a wide variety of applications. Despite the popularity, their computational cost does not scale well with model and training data size. We present FastIF, a set of simple modifications to influence functions that significantly improves their run-time. We use k-Nearest Neighbors (kNN) to narrow the search space down to a subset of good candidate data points, identify the configurations that best balance the speed-quality trade-off in estimating the inverse Hessian-vector product, and introduce a fast parallel variant. Our proposed method achieves about 80x speedup while being highly correlated with the original influence values. With the availability of the fast influence functions, we demonstrate their usefulness in four applications. First, we examine whether influential data-points can 'explain' test time behavior using the framework of simulatability. Second, we visualize the influence interactions between training and test data-points. Third, we show that we can correct model errors by additional fine-tuning on certain influential data-points, improving the accuracy of a trained MNLI model by 2.6% on the HANS challenge set using a small number of gradient updates. Finally, we experiment with a data-augmentation setup where we use influence functions to search for new data-points unseen during training to improve model performance. Overall, our fast influence functions can be efficiently applied to large models and datasets, and our experiments demonstrate the potential of influence functions in model interpretation and correcting model errors. Code is available at this https URL
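A simplified sketch of the core FastIF idea, restricting influence scoring to the k nearest training neighbours of a test point, is shown below. It is not the released code: the inverse Hessian is replaced by the identity for brevity, whereas the paper tunes a fast stochastic estimator of the inverse Hessian-vector product. The toy model and data are placeholders.

```python
# Sketch under strong simplifying assumptions (not the FastIF implementation):
# influence(z) is approximated as  -g_test . H^{-1} . g_train,  and FastIF's key
# speed-up is to score only the k nearest training neighbours of the test point.
# Here H^{-1} is replaced by the identity to keep the sketch short.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                      # toy model
loss_fn = nn.CrossEntropyLoss()

def grad_vector(x, y):
    # flattened gradient of the loss at a single data point
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

# toy training set and a single test point
train_x, train_y = torch.randn(100, 16), torch.randint(0, 2, (100,))
test_x, test_y = torch.randn(16), torch.tensor(1)

# 1) kNN search in feature space to pick candidate training points
k = 10
dists = (train_x - test_x).norm(dim=1)
candidates = dists.topk(k, largest=False).indices

# 2) score only the candidates (identity Hessian as a simplification)
g_test = grad_vector(test_x, test_y)
influences = {int(i): float(-g_test @ grad_vector(train_x[i], train_y[i]))
              for i in candidates}
print(sorted(influences.items(), key=lambda kv: kv[1]))   # most harmful points first
```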

109. EfficientNet-Absolute Zero for Continuous Speech Keyword Spotting [PDF] Back to contents
  Amir Mohammad Rostami, Ali Karimi, Mohammad Ali Akhaee
Abstract: Keyword spotting is the task of automatically finding specific words or phrases in recorded speech. Deep neural network algorithms, as a powerful engine, can handle this problem if they are trained on an appropriate dataset. To this end, the football keyword dataset (FKD), a new keyword spotting dataset in Persian, is collected with crowdsourcing. This dataset contains nearly 31000 samples in 18 classes. A continuous speech synthesis method is proposed to make FKD usable in practical applications that work with continuous speech. Besides, we propose a lightweight architecture called EfficientNet-A0 (absolute zero) by applying the compound scaling method to EfficientNet-B0 for the keyword spotting task. Finally, the proposed architecture is evaluated against various models. We find that the EfficientNet-A0 and ResNet models outperform the other models on this dataset.
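For reference, compound scaling multiplies network depth, width, and input resolution by alpha^phi, beta^phi, and gamma^phi respectively; a negative phi shrinks the network, which is presumably how a lighter variant of EfficientNet-B0 can be derived. The coefficients below are the original EfficientNet constants; the exact values used for EfficientNet-A0 are an assumption.

```python
# Illustrative arithmetic only: compound scaling of depth, width and resolution.
# alpha, beta, gamma are the original EfficientNet constants (alpha * beta^2 * gamma^2 ~ 2);
# a negative phi scales the base network down rather than up.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi, base_depth=1.0, base_width=1.0, base_resolution=224):
    return {
        "depth_mult": base_depth * alpha ** phi,
        "width_mult": base_width * beta ** phi,
        "resolution": round(base_resolution * gamma ** phi),
    }

print(compound_scale(phi=1))    # B1-like: larger than B0
print(compound_scale(phi=-1))   # hypothetical shrunken variant in the spirit of A0
```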

110. Discovering Dialog Structure Graph for Open-Domain Dialog Generation [PDF] Back to contents
  Jun Xu, Zeyang Lei, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, Ting Liu
Abstract: Learning interpretable dialog structure from human-human dialogs yields basic insights into the structure of conversation, and also provides background knowledge to facilitate dialog generation. In this paper, we conduct unsupervised discovery of dialog structure from chitchat corpora, and then leverage it to facilitate dialog generation in downstream systems. To this end, we present a Discrete Variational Auto-Encoder with Graph Neural Network (DVAE-GNN), to discover a unified human-readable dialog structure. The structure is a two-layer directed graph that contains session-level semantics in the upper-layer vertices, utterance-level semantics in the lower-layer vertices, and edges among these semantic vertices. In particular, we integrate GNN into DVAE to fine-tune utterance-level semantics for more effective recognition of session-level semantic vertices. Furthermore, to alleviate the difficulty of discovering a large number of utterance-level semantics, we design a coupling mechanism that binds each utterance-level semantic vertex with a distinct phrase to provide prior semantics. Experimental results on two benchmark corpora confirm that DVAE-GNN can discover meaningful dialog structure, and that using the dialog structure graph as background knowledge can help a graph-grounded conversational system conduct coherent multi-turn dialog generation.
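The following toy sketch shows only the discrete-latent core of such a model: an utterance encoding is assigned to one of M discrete semantic vertices through a Gumbel-Softmax relaxation. The session-level layer, the GNN refinement, and the phrase-coupling mechanism from the paper are omitted, and all sizes are placeholders.

```python
# Minimal sketch (assumptions, not the DVAE-GNN code): a discrete variational
# encoder that maps an utterance encoding to one of M "utterance-level semantic
# vertices" with a Gumbel-Softmax relaxation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteUtteranceEncoder(nn.Module):
    def __init__(self, hidden=256, num_vertices=50):
        super().__init__()
        self.to_logits = nn.Linear(hidden, num_vertices)        # posterior over discrete vertices
        self.vertex_emb = nn.Embedding(num_vertices, hidden)    # embedding of each vertex

    def forward(self, utt_enc, tau=1.0):
        logits = self.to_logits(utt_enc)                  # (B, M)
        z = F.gumbel_softmax(logits, tau=tau, hard=True)  # one-hot vertex assignment
        vertex_state = z @ self.vertex_emb.weight         # (B, hidden) embedding of chosen vertex
        return logits, vertex_state

enc = DiscreteUtteranceEncoder()
utt_enc = torch.randn(8, 256)                   # dummy utterance encodings
logits, state = enc(utt_enc)
# a KL term towards a uniform prior keeps vertex usage spread out (standard discrete-VAE practice)
kl = F.kl_div(F.log_softmax(logits, dim=-1),
              torch.full_like(logits, 1.0 / logits.size(-1)), reduction='batchmean')
```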

111. Unified Mandarin TTS Front-end Based on Distilled BERT Model [PDF] Back to contents
  Yang Zhang, Liqun Deng, Yasheng Wang
Abstract: The front-end module in a typical Mandarin text-to-speech (TTS) system is composed of a long pipeline of text processing components, which requires extensive effort to build and is prone to a large accumulative model size and cascading errors. In this paper, a pre-trained language model (PLM) based model is proposed to simultaneously tackle the two most important tasks in a TTS front-end, i.e., prosodic structure prediction (PSP) and grapheme-to-phoneme (G2P) conversion. We use a pre-trained Chinese BERT[1] as the text encoder and employ a multi-task learning technique to adapt it to the two TTS front-end tasks. Then, the BERT encoder is distilled into a smaller model using a knowledge distillation technique called TinyBERT[2], making the whole model 25% the size of the benchmark pipeline models while maintaining competitive performance on both tasks. With the proposed methods, we are able to run the whole TTS front-end module in a light and unified manner, which is more friendly to deployment on mobile devices.
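A hedged sketch of the overall shape of such a front-end, a shared lightweight encoder with one head for prosodic structure prediction and one for grapheme-to-phoneme conversion, is given below; it is not the paper's model, the encoder is a generic transformer stand-in for a distilled BERT, and all head sizes are placeholders.

```python
# Sketch under assumptions (not the paper's model): one shared encoder, two
# token-level heads. PSP is framed as boundary tagging, G2P as per-character
# pronunciation classification.
import torch
import torch.nn as nn

class TTSFrontEnd(nn.Module):
    def __init__(self, hidden=312, psp_tags=4, g2p_classes=1500):
        super().__init__()
        # stand-in for a distilled BERT-style encoder (TinyBERT-sized hidden dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.psp_head = nn.Linear(hidden, psp_tags)       # prosodic boundary tag per character
        self.g2p_head = nn.Linear(hidden, g2p_classes)    # pronunciation class per character

    def forward(self, char_emb):
        h = self.encoder(char_emb)                        # (B, T, H)
        return self.psp_head(h), self.g2p_head(h)

model = TTSFrontEnd()
char_emb = torch.randn(2, 20, 312)                        # dummy character embeddings
psp_logits, g2p_logits = model(char_emb)
# joint multi-task loss over both heads (equal weighting assumed for the sketch)
loss = nn.CrossEntropyLoss()(psp_logits.reshape(-1, 4), torch.randint(0, 4, (40,))) + \
       nn.CrossEntropyLoss()(g2p_logits.reshape(-1, 1500), torch.randint(0, 1500, (40,)))
loss.backward()
```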

112. Meta Adaptive Neural Ranking with Contrastive Synthetic Supervision [PDF] Back to contents
  Si Sun, Yingzhuo Qian, Zhenghao Liu, Chenyan Xiong, Kaitao Zhang, Jie Bao, Zhiyuan Liu, Paul Bennett
Abstract: Neural Information Retrieval (Neu-IR) models have shown their effectiveness and thrive on end-to-end training with massive high-quality relevance labels. Nevertheless, relevance labels in such quantities are a luxury and are unavailable in many ranking scenarios, for example, in biomedical search. This paper improves Neu-IR in such few-shot search scenarios by meta-adaptively training neural rankers with synthetic weak supervision. We first leverage contrastive query generation (ContrastQG) to synthesize more informative queries as in-domain weak relevance labels, and then filter them with meta adaptive learning to rank (MetaLTR) to better generalize neural rankers to the target few-shot domain. Experiments on three different search domains: web, news, and biomedical, demonstrate that our weak supervision framework significantly improves the few-shot accuracy of neural rankers. The code of this paper will be open-sourced.
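As a rough sketch of how synthetic weak supervision could be reweighted for ranking, the snippet below trains a toy pairwise ranker on generated query-document pairs with per-pair weights produced by a small selector network. In the paper those weights are meta-learned against a handful of target-domain queries; that outer loop is omitted here, and every component name is hypothetical.

```python
# Illustrative sketch (assumptions, not the released code): a pairwise margin
# loss over synthetic weak relevance pairs, reweighted per pair by a selector.
import torch
import torch.nn as nn

ranker = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))       # toy neural ranker
selector = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

q_pos = torch.randn(16, 64)    # features of (synthetic query, pseudo-relevant doc)
q_neg = torch.randn(16, 64)    # features of (synthetic query, sampled negative doc)

s_pos, s_neg = ranker(q_pos), ranker(q_neg)
weights = selector(torch.cat([q_pos, q_neg], dim=-1))      # per-pair weight in [0, 1]
pair_loss = torch.clamp(1.0 - (s_pos - s_neg), min=0.0)    # hinge loss on the score margin
loss = (weights * pair_loss).mean()                        # weak pairs are soft-filtered by weight
loss.backward()
```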
