
[arXiv papers] Computation and Language 2020-10-07

Contents

1. Keep CALM and Explore: Language Models for Action Generation in Text-based Games [PDF] Abstract
2. COD3S: Diverse Generation with Discrete Semantic Signatures [PDF] Abstract
3. LOGAN: Local Group Bias Detection by Clustering [PDF] Abstract
4. A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration [PDF] Abstract
5. Robustness and Reliability of Gender Bias Assessment in Word Embeddings: The Role of Base Pairs [PDF] Abstract
6. Semantic Evaluation for Text-to-SQL with Distilled Test Suites [PDF] Abstract
7. PRover: Proof Generation for Interpretable Reasoning over Rules [PDF] Abstract
8. QADiscourse -- Discourse Relations as QA Pairs: Representation, Crowdsourcing and Baselines [PDF] Abstract
9. Intrinsic Probing through Dimension Selection [PDF] Abstract
10. Swiss Parliaments Corpus, an Automatically Aligned Swiss German Speech to Standard German Text Corpus [PDF] Abstract
11. Learning to Ignore: Long Document Coreference with Bounded Memory Neural Networks [PDF] Abstract
12. Textual Supervision for Visually Grounded Spoken Language Understanding [PDF] Abstract
13. Tackling the Low-resource Challenge for Canonical Segmentation [PDF] Abstract
14. COSMIC: COmmonSense knowledge for eMotion Identification in Conversations [PDF] Abstract
15. An Exploration of Arbitrary-Order Sequence Labeling via Energy-Based Inference Networks [PDF] Abstract
16. A Multi-Task Incremental Learning Framework with Category Name Embedding for Aspect-Category Sentiment Analysis [PDF] Abstract
17. Stepwise Extractive Summarization and Planning with Structured Transformers [PDF] Abstract
18. Towards Coalgebras in Stylometry [PDF] Abstract
19. Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation [PDF] Abstract
20. Aspect Sentiment Classification with Aspect-Specific Opinion Spans [PDF] Abstract
21. Analyzing Individual Neurons in Pre-trained Language Models [PDF] Abstract
22. SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling [PDF] Abstract
23. BERT Knows Punta Cana is not just beautiful, it's gorgeous: Ranking Scalar Adjectives with Contextualised Representations [PDF] Abstract
24. Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder [PDF] Abstract
25. Incorporating Behavioral Hypotheses for Query Generation [PDF] Abstract
26. Automatic Metaphor Interpretation Using Word Embeddings [PDF] Abstract
27. Detecting Attackable Sentences in Arguments [PDF] Abstract
28. Multi-Instance Multi-Label Learning Networks for Aspect-Category Sentiment Analysis [PDF] Abstract
29. Extracting Implicitly Asserted Propositions in Argumentation [PDF] Abstract
30. If beam search is the answer, what was the question? [PDF] Abstract
31. Context Modeling with Evidence Filter for Multiple Choice Question Answering [PDF] Abstract
32. On the Sub-Layer Functionalities of Transformer Decoder [PDF] Abstract
33. On the Sparsity of Neural Machine Translation Models [PDF] Abstract
34. Neural Speech Synthesis for Estonian [PDF] Abstract
35. On the Interplay Between Fine-tuning and Sentence-level Probing for Linguistic Knowledge in Pre-trained Transformers [PDF] Abstract
36. Position-Aware Tagging for Aspect Sentiment Triplet Extraction [PDF] Abstract
37. DaNetQA: a yes/no Question Answering Dataset for the Russian Language [PDF] Abstract
38. Converting the Point of View of Messages Spoken to Virtual Assistants [PDF] Abstract
39. Embedding Words in Non-Vector Space with Unsupervised Graph Learning [PDF] Abstract
40. Semantically Driven Sentence Fusion: Modeling and Evaluation [PDF] Abstract
41. Scene Graph Modification Based on Natural Language Commands [PDF] Abstract
42. CoRefi: A Crowd Sourcing Suite for Coreference Annotation [PDF] Abstract
43. Dissecting Span Identification Tasks with Performance Prediction [PDF] Abstract
44. Knowing What You Know: Calibrating Dialogue Belief State Distributions via Ensembles [PDF] Abstract
45. Universal Natural Language Processing with Limited Annotations: Try Few-shot Textual Entailment as a Start [PDF] Abstract
46. The Multilingual Amazon Reviews Corpus [PDF] Abstract
47. Does the Objective Matter? Comparing Training Objectives for Pronoun Resolution [PDF] Abstract
48. StyleDGPT: Stylized Response Generation with Pre-trained Language Models [PDF] Abstract
49. SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy [PDF] Abstract
50. Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher [PDF] Abstract
51. LEGAL-BERT: The Muppets straight out of Law School [PDF] Abstract
52. PolicyQA: A Reading Comprehension Dataset for Privacy Policies [PDF] Abstract
53. Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation [PDF] Abstract
54. Please Mind the Root: Decoding Arborescences for Dependency Parsing [PDF] Abstract
55. Do Explicit Alignments Robustly Improve Multilingual Encoders? [PDF] Abstract
56. An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks [PDF] Abstract
57. Multi-task Learning for Multilingual Neural Machine Translation [PDF] Abstract
58. Investigating African-American Vernacular English in Transformer-Based Text Generation [PDF] Abstract
59. Efficient Meta Lifelong-Learning with Limited Memory [PDF] Abstract
60. GRUEN for Evaluating Linguistic Quality of Generated Text [PDF] Abstract
61. Joint Turn and Dialogue level User Satisfaction Estimation on Multi-Domain Conversations [PDF] Abstract
62. Help! Need Advice on Identifying Advice [PDF] Abstract
63. Dynamic Semantic Matching and Aggregation Network for Few-shot Intent Detection [PDF] Abstract
64. Pretrained Language Model Embryology: The Birth of ALBERT [PDF] Abstract
65. Iterative Domain-Repaired Back-Translation [PDF] Abstract
66. Are Words Commensurate with Actions? Quantifying Commitment to a Cause from Online Public Messaging [PDF] Abstract
67. On the Branching Bias of Syntax Extracted from Pre-trained Language Models [PDF] Abstract
68. Multi-Fact Correction in Abstractive Text Summarization [PDF] Abstract
69. Modeling Preconditions in Text with a Crowd-sourced Dataset [PDF] Abstract
70. UNQOVERing Stereotyping Biases via Underspecified Questions [PDF] Abstract
71. On the Role of Supervision in Unsupervised Constituency Parsing [PDF] Abstract
72. Efficient Inference For Neural Machine Translation [PDF] Abstract
73. Efficient One-Pass End-to-End Entity Linking for Questions [PDF] Abstract
74. Adversarial Grammatical Error Correction [PDF] Abstract
75. Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning [PDF] Abstract
76. Guiding Attention for Self-Supervised Learning with Transformers [PDF] Abstract
77. Plan Optimization to Bilingual Dictionary Induction for Low-Resource Language Families [PDF] Abstract
78. A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families [PDF] Abstract
79. Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [PDF] Abstract
80. Interactive Fiction Game Playing as Multi-Paragraph Reading Comprehension with Reinforcement Learning [PDF] Abstract
81. Fine-Grained Grounding for Multimodal Speech Recognition [PDF] Abstract
82. Improving Neural Topic Models using Knowledge Distillation [PDF] Abstract
83. Investigating representations of verb bias in neural language models [PDF] Abstract
84. Understanding the Mechanics of SPIGOT: Surrogate Gradients for Latent Structure Learning [PDF] Abstract
85. Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages [PDF] Abstract
86. Inference Strategies for Machine Translation with Conditional Masking [PDF] Abstract
87. We Don't Speak the Same Language: Interpreting Polarization through Machine Translation [PDF] Abstract
88. CAT-Gen: Improving Robustness in NLP Models via Controlled Adversarial Text Generation [PDF] Abstract
89. InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [PDF] Abstract
90. SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup [PDF] Abstract
91. Sentiment Analysis for Reinforcement Learning [PDF] Abstract
92. KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [PDF] Abstract
93. Conversational Document Prediction to Assist Customer Care Agents [PDF] Abstract
94. PAIR: Planning and Iterative Refinement in Pre-trained Transformers for Long Text Generation [PDF] Abstract
95. Semi-Supervised Speech-Language Joint Pre-Training for Spoken Language Understanding [PDF] Abstract
96. Effects of Naturalistic Variation in Goal-Oriented Dialog [PDF] Abstract
97. An Ensemble Approach to Automatic Structuring of Radiology Reports [PDF] Abstract
98. MedFilter: Extracting Task-relevant Utterances from Medical Dialogue through Integration of Discourse Structure and Ontological Knowledge [PDF] Abstract
99. Acrostic Poem Generation [PDF] Abstract
100. Learning to Generalize for Sequential Decision Making [PDF] Abstract
101. Joint Semantics and Data-Driven Path Representation for Knowledge Graph Inference [PDF] Abstract
102. Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering [PDF] Abstract
103. Unsupervised Hierarchical Concept Learning [PDF] Abstract
104. A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments [PDF] Abstract
105. Learning Visual-Semantic Embeddings for Reporting Abnormal Findings on Chest X-rays [PDF] Abstract
106. Identifying Spurious Correlations for Robust Text Classification [PDF] Abstract
107. The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [PDF] Abstract
108. ERFit: Entropic Regression Fit Matlab Package, for Data-Driven System Identification of Underlying Dynamic Equations [PDF] Abstract

Abstracts

1. Keep CALM and Explore: Language Models for Action Generation in Text-based Games [PDF] Back to contents
  Shunyu Yao, Rohan Rao, Matthew Hausknecht, Karthik Narasimhan
Abstract: Text-based games present a unique challenge for autonomous agents to operate in natural language and handle enormous action spaces. In this paper, we propose the Contextual Action Language Model (CALM) to generate a compact set of action candidates at each game state. Our key insight is to train language models on human gameplay, where people demonstrate linguistic priors and a general game sense for promising actions conditioned on game history. We combine CALM with a reinforcement learning agent which re-ranks the generated action candidates to maximize in-game rewards. We evaluate our approach using the Jericho benchmark, on games unseen by CALM during training. Our method obtains a 69% relative improvement in average game score over the previous state-of-the-art model. Surprisingly, on half of these games, CALM is competitive with or better than other models that have access to ground truth admissible actions. Code and data are available at this https URL.

2. COD3S: Diverse Generation with Discrete Semantic Signatures [PDF] Back to contents
  Nathaniel Weir, João Sedoc, Benjamin Van Durme
Abstract: We present COD3S, a novel method for generating semantically diverse sentences using neural sequence-to-sequence (seq2seq) models. Conditioned on an input, seq2seq models typically produce semantically and syntactically homogeneous sets of sentences and thus perform poorly on one-to-many sequence generation tasks. Our two-stage approach improves output diversity by conditioning generation on locality-sensitive hash (LSH)-based semantic sentence codes whose Hamming distances highly correlate with human judgments of semantic textual similarity. Though it is generally applicable, we apply COD3S to causal generation, the task of predicting a proposition's plausible causes or effects. We demonstrate through automatic and human evaluation that responses produced using our method exhibit improved diversity without degrading task performance.
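As a rough illustration of the idea behind such signatures, the sketch below (not the authors' implementation) hashes sentence embeddings with random hyperplanes into short bit codes whose Hamming distance loosely tracks vector similarity; the 16-bit width and the toy embeddings are assumptions made for the example.

```python
# Minimal LSH "semantic signature" sketch: random-hyperplane hashing of sentence
# embeddings into bit codes whose Hamming distance roughly tracks cosine distance.
import numpy as np

rng = np.random.default_rng(0)
DIM, BITS = 300, 16
hyperplanes = rng.standard_normal((BITS, DIM))  # one random hyperplane per bit

def lsh_signature(embedding: np.ndarray) -> np.ndarray:
    """Map a sentence embedding to a BITS-length binary signature."""
    return (hyperplanes @ embedding > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.sum(a != b))

# Toy usage with random "sentence embeddings"; a nearby vector gets a close code.
e1 = rng.standard_normal(DIM)
e2 = e1 + 0.05 * rng.standard_normal(DIM)   # semantically close stand-in
e3 = rng.standard_normal(DIM)               # unrelated stand-in
print(hamming(lsh_signature(e1), lsh_signature(e2)))  # small
print(hamming(lsh_signature(e1), lsh_signature(e3)))  # around BITS/2 on average
```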

3. LOGAN: Local Group Bias Detection by Clustering [PDF] Back to contents
  Jieyu Zhao, Kai-Wei Chang
Abstract: Machine learning techniques have been widely used in natural language processing (NLP). However, as revealed by many recent studies, machine learning models often inherit and amplify the societal biases in data. Various metrics have been proposed to quantify biases in model predictions. In particular, several of them evaluate disparity in model performance between protected groups and advantaged groups in the test corpus. However, we argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model. In fact, a model with similar aggregated performance between different groups on the entire data may behave differently on instances in a local region. To analyze and detect such local bias, we propose LOGAN, a new bias detection technique based on clustering. Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region and allows us to better analyze the biases in model predictions.
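The following sketch illustrates the general recipe of cluster-level bias detection under toy assumptions: cluster instance representations, then compare model accuracy between two groups inside each cluster rather than only over the whole corpus. The data, group labels, and gap statistic are invented for illustration and are not the paper's exact setup.

```python
# Cluster instances, then inspect the per-cluster accuracy gap between groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
features = rng.standard_normal((200, 16))   # instance representations (toy)
group = rng.integers(0, 2, size=200)        # 0 = advantaged, 1 = protected (toy)
correct = rng.random(200) < 0.8             # whether the model was right (toy)

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)

for c in range(5):
    mask = clusters == c
    for g in (0, 1):
        sel = mask & (group == g)
        if sel.any():
            acc = correct[sel].mean()
            print(f"cluster {c} group {g}: accuracy {acc:.2f} (n={sel.sum()})")
```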

4. A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration [PDF] Back to contents
  Avi Shmidman, Joshua Guedalia, Shaltiel Shmidman, Moshe Koppel, Reut Tsarfaty
Abstract: One of the primary tasks of morphological parsers is the disambiguation of homographs. Particularly difficult are cases of unbalanced ambiguity, where one of the possible analyses is far more frequent than the others. In such cases, there may not exist sufficient examples of the minority analyses in order to properly evaluate performance, nor to train effective classifiers. In this paper we address the issue of unbalanced morphological ambiguities in Hebrew. We offer a challenge set for Hebrew homographs -- the first of its kind - containing substantial attestation of each analysis of 21 Hebrew homographs. We show that the current SOTA of Hebrew disambiguation performs poorly on cases of unbalanced ambiguity. Leveraging our new dataset, we achieve a new state-of-the-art for all 21 words, improving the overall average F1 score from 0.67 to 0.95. Our resulting annotated datasets are made publicly available for further research.

5. Robustness and Reliability of Gender Bias Assessment in Word Embeddings: The Role of Base Pairs [PDF] Back to contents
  Haiyang Zhang, Alison Sneyd, Mark Stevenson
Abstract: It has been shown that word embeddings can exhibit gender bias, and various methods have been proposed to quantify this. However, the extent to which the methods are capturing social stereotypes inherited from the data has been debated. Bias is a complex concept and there exist multiple ways to define it. Previous work has leveraged gender word pairs to measure bias and extract biased analogies. We show that the reliance on these gendered pairs has strong limitations: bias measures based off of them are not robust and cannot identify common types of real-world bias, whilst analogies utilising them are unsuitable indicators of bias. In particular, the well-known analogy "man is to computer-programmer as woman is to homemaker" is due to word similarity rather than societal bias. This has important implications for work on measuring bias in embeddings and related work debiasing embeddings.
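To make the "base pairs" notion concrete, here is a minimal sketch of one common pair-based bias score: a word's bias is the average difference of its cosine similarity to the male versus female member of each gendered pair. The pair list and the random vectors standing in for trained embeddings are assumptions of the example, not the paper's protocol.

```python
# Pair-based bias score over toy vectors standing in for word embeddings.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_bias(word_vec, embeddings, pairs):
    diffs = [cos(word_vec, embeddings[m]) - cos(word_vec, embeddings[f])
             for m, f in pairs]
    return sum(diffs) / len(diffs)   # >0 leans "male", <0 leans "female"

rng = np.random.default_rng(1)
emb = {w: rng.standard_normal(50) for w in ["he", "she", "man", "woman", "doctor"]}
base_pairs = [("he", "she"), ("man", "woman")]   # illustrative base pairs
print(pair_bias(emb["doctor"], emb, base_pairs))
```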

6. Semantic Evaluation for Text-to-SQL with Distilled Test Suites [PDF] Back to contents
  Ruiqi Zhong, Tao Yu, Dan Klein
Abstract: We propose test suite accuracy to approximate semantic accuracy for Text-to-SQL models. Our method distills a small test suite of databases that achieves high code coverage for the gold query from a large number of randomly generated databases. At evaluation time, it computes the denotation accuracy of the predicted queries on the distilled test suite, hence calculating a tight upper-bound for semantic accuracy efficiently. We use our proposed method to evaluate 21 models submitted to the Spider leader board and manually verify that our method is always correct on 100 examples. In contrast, the current Spider metric leads to a 2.5% false negative rate on average and 8.1% in the worst case, indicating that test suite accuracy is needed. Our implementation, along with distilled test suites for eleven Text-to-SQL datasets, is publicly available.
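A minimal sketch of the denotation-based check this method builds on, under simplified assumptions (toy in-memory SQLite databases, sorted-rows comparison): a predicted query counts as correct only if it matches the gold query's results on every database in the suite.

```python
# Accept a predicted SQL query only if its denotation matches the gold query's
# on every database in a (toy) test suite.
import sqlite3

def denotation(db: sqlite3.Connection, sql: str):
    return sorted(db.execute(sql).fetchall())

def test_suite_match(databases, gold_sql, pred_sql) -> bool:
    try:
        return all(denotation(db, gold_sql) == denotation(db, pred_sql)
                   for db in databases)
    except sqlite3.Error:
        return False  # an unexecutable prediction counts as wrong

suite = []
for rows in ([(1, "a", 30), (2, "b", 25)], [(1, "a", 25), (2, "b", 25), (3, "c", 40)]):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (id INT, name TEXT, age INT)")
    db.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    suite.append(db)

gold = "SELECT name FROM t WHERE age > 25"
pred = "SELECT name FROM t WHERE age >= 30"   # happens to match on both toy DBs
print(test_suite_match(suite, gold, pred))    # True here; more DBs tighten the check
```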

7. PRover: Proof Generation for Interpretable Reasoning over Rules [PDF] Back to contents
  Swarnadeep Saha, Sayan Ghosh, Shashank Srivastava, Mohit Bansal
Abstract: Recent work by Clark et al. (2020) shows that transformers can act as 'soft theorem provers' by answering questions over explicitly provided knowledge in natural language. In our work, we take a step closer to emulating formal theorem provers, by proposing PROVER, an interpretable transformer-based model that jointly answers binary questions over rule-bases and generates the corresponding proofs. Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm. During inference, a valid proof, satisfying a set of global constraints is generated. We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation, with strong generalization performance. First, PROVER generates proofs with an accuracy of 87%, while retaining or improving performance on the QA task, compared to RuleTakers (up to 6% improvement on zero-shot evaluation). Second, when trained on questions requiring lower depths of reasoning, it generalizes significantly better to higher depths (up to 15% improvement). Third, PROVER obtains near perfect QA accuracy of 98% using only 40% of the training data. However, generating proofs for questions requiring higher depths of reasoning becomes challenging, and the accuracy drops to 65% for 'depth 5', indicating significant scope for future work. Our code and models are publicly available at this https URL

8. QADiscourse -- Discourse Relations as QA Pairs: Representation, Crowdsourcing and Baselines [PDF] Back to contents
  Valentina Pyatkin, Ayal Klein, Reut Tsarfaty, Ido Dagan
Abstract: Discourse relations describe how two propositions relate to one another, and identifying them automatically is an integral part of natural language understanding. However, annotating discourse relations typically requires expert annotators. Recently, different semantic aspects of a sentence have been represented and crowd-sourced via question-and-answer (QA) pairs. This paper proposes a novel representation of discourse relations as QA pairs, which in turn allows us to crowd-source wide-coverage data annotated with discourse relations, via an intuitively appealing interface for composing such questions and answers. Based on our proposed representation, we collect a novel and wide-coverage QADiscourse dataset, and present baseline algorithms for predicting QADiscourse relations.

9. Intrinsic Probing through Dimension Selection [PDF] Back to contents
  Lucas Torroba Hennigen, Adina Williams, Ryan Cotterell
Abstract: Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks. Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it. In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted. To enable intrinsic probing, we propose a novel framework based on a decomposable multivariate Gaussian probe that allows us to determine whether the linguistic information in word embeddings is dispersed or focal. We then probe fastText and BERT for various morphosyntactic attributes across 36 languages. We find that most attributes are reliably encoded by only a few neurons, with fastText concentrating its linguistic structure more than BERT.

10. Swiss Parliaments Corpus, an Automatically Aligned Swiss German Speech to Standard German Text Corpus [PDF] Back to contents
  Michel Plüss, Lukas Neukom, Manfred Vogel
Abstract: We present a forced sentence alignment procedure for Swiss German speech and Standard German text. It is able to create a speech-to-text corpus in a fully automatic fashion, given an audio recording and the corresponding unaligned transcript. Compared to a manual alignment, it achieves a mean IoU of 0.8401 with a sentence recall of 0.9491. When applying our IoU estimate filter, the mean IoU can be further improved to 0.9271 at the cost of a lower sentence recall of 0.4881. Using this procedure, we created the Swiss Parliaments Corpus, an automatically aligned Swiss German speech to Standard German text corpus. 65 % of the raw data could be transformed to sentence-level audio-text-pairs, resulting in 293 hours of training data. We have made the corpus freely available for download.
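For reference, the interval IoU used to score an automatic segment against a manual one can be computed as below; the timestamps are made up for the example.

```python
# Intersection-over-union of two time intervals (in seconds).
def interval_iou(auto, manual):
    (a0, a1), (m0, m1) = auto, manual
    inter = max(0.0, min(a1, m1) - max(a0, m0))
    union = (a1 - a0) + (m1 - m0) - inter
    return inter / union if union > 0 else 0.0

print(interval_iou((12.3, 17.8), (12.0, 17.5)))  # high overlap -> IoU close to 1
print(interval_iou((12.3, 17.8), (20.0, 24.0)))  # no overlap -> 0.0
```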

11. Learning to Ignore: Long Document Coreference with Bounded Memory Neural Networks [PDF] Back to contents
  Shubham Toshniwal, Sam Wiseman, Allyson Ettinger, Karen Livescu, Kevin Gimpel
Abstract: Long document coreference resolution remains a challenging task due to the large memory and runtime requirements of current models. Recent work doing incremental coreference resolution using just the global representation of entities shows practical benefits but requires keeping all entities in memory, which can be impractical for long documents. We argue that keeping all entities in memory is unnecessary, and we propose a memory-augmented neural network that tracks only a small bounded number of entities at a time, thus guaranteeing a linear runtime in length of document. We show that (a) the model remains competitive with models with high memory and computational requirements on OntoNotes and LitBank, and (b) the model learns an efficient memory management strategy easily outperforming a rule-based strategy.
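A minimal sketch of the bounded-memory idea, with an assumed least-recently-mentioned eviction rule standing in for the paper's learned policy:

```python
# Bounded entity memory: at most MAX_ENTS slots; evict the stalest entity when full.
from collections import OrderedDict

MAX_ENTS = 3

def update_memory(memory: OrderedDict, mention: str, entity_id: str):
    """memory maps entity id -> list of mentions, ordered by recency."""
    if entity_id in memory:
        memory.move_to_end(entity_id)          # refresh recency
        memory[entity_id].append(mention)
    else:
        if len(memory) >= MAX_ENTS:
            memory.popitem(last=False)         # evict the least recently mentioned
        memory[entity_id] = [mention]

memory: OrderedDict = OrderedDict()
for mention, ent in [("Alice", "E1"), ("the CEO", "E1"), ("Bob", "E2"),
                     ("a reporter", "E3"), ("Carol", "E4")]:
    update_memory(memory, mention, ent)
print(list(memory.items()))   # E2, E3, E4 survive; E1 was evicted
```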

12. Textual Supervision for Visually Grounded Spoken Language Understanding [PDF] Back to contents
  Bertrand Higy, Desmond Eliott, Grzegorz Chrupała
Abstract: Visually-grounded models of spoken language understanding extract semantic information directly from speech, without relying on transcriptions. This is useful for low-resource languages, where transcriptions can be expensive or impossible to obtain. Recent work showed that these models can be improved if transcriptions are available at training time. However, it is not clear how an end-to-end approach compares to a traditional pipeline-based approach when one has access to transcriptions. Comparing different strategies, we find that the pipeline approach works better when enough text is available. With low-resource languages in mind, we also show that translations can be effectively used in place of transcriptions but more data is needed to obtain similar results.

13. Tackling the Low-resource Challenge for Canonical Segmentation [PDF] Back to contents
  Manuel Mager, Özlem Çetinoğlu, Katharina Kann
Abstract: Canonical morphological segmentation consists of dividing words into their standardized morphemes. Here, we are interested in approaches for the task when training data is limited. We compare model performance in a simulated low-resource setting for the high-resource languages German, English, and Indonesian to experiments on new datasets for the truly low-resource languages Popoluca and Tepehua. We explore two new models for the task, borrowing from the closely related area of morphological generation: an LSTM pointer-generator and a sequence-to-sequence model with hard monotonic attention trained with imitation learning. We find that, in the low-resource setting, the novel approaches outperform existing ones on all languages by up to 11.4% accuracy. However, while accuracy in emulated low-resource scenarios is over 50% for all languages, for the truly low-resource languages Popoluca and Tepehua, our best model only obtains 37.4% and 28.4% accuracy, respectively. Thus, we conclude that canonical segmentation is still a challenging task for low-resource languages.

14. COSMIC: COmmonSense knowledge for eMotion Identification in Conversations [PDF] Back to contents
  Deepanway Ghosal, Navonil Majumder, Alexander Gelbukh, Rada Mihalcea, Soujanya Poria
Abstract: In this paper, we address the task of utterance level emotion recognition in conversations using commonsense knowledge. We propose COSMIC, a new framework that incorporates different elements of commonsense such as mental states, events, and causal relations, and build upon them to learn interactions between interlocutors participating in a conversation. Current state-of-the-art methods often encounter difficulties in context propagation, emotion shift detection, and differentiating between related emotion classes. By learning distinct commonsense representations, COSMIC addresses these challenges and achieves new state-of-the-art results for emotion recognition on four different benchmark conversational datasets. Our code is available at this https URL.

15. An Exploration of Arbitrary-Order Sequence Labeling via Energy-Based Inference Networks [PDF] Back to contents
  Lifu Tu, Tianyu Liu, Kevin Gimpel
Abstract: Many tasks in natural language processing involve predicting structured outputs, e.g., sequence labeling, semantic role labeling, parsing, and machine translation. Researchers are increasingly applying deep representation learning to these problems, but the structured component of these approaches is usually quite simplistic. In this work, we propose several high-order energy terms to capture complex dependencies among labels in sequence labeling, including several that consider the entire label sequence. We use neural parameterizations for these energy terms, drawing from convolutional, recurrent, and self-attention networks. We use the framework of learning energy-based inference networks (Tu and Gimpel, 2018) for dealing with the difficulties of training and inference with such models. We empirically demonstrate that this approach achieves substantial improvement using a variety of high-order energy terms on four sequence labeling tasks, while having the same decoding speed as simple, local classifiers. We also find high-order energies to help in noisy data conditions.

16. A Multi-Task Incremental Learning Framework with Category Name Embedding for Aspect-Category Sentiment Analysis [PDF] Back to contents
  Zehui Dai, Cheng Peng, Huajie Chen, Yadong Ding
Abstract: (T)ACSA tasks, including aspect-category sentiment analysis (ACSA) and targeted aspect-category sentiment analysis (TACSA), aims at identifying sentiment polarity on predefined categories. Incremental learning on new categories is necessary for (T)ACSA real applications. Though current multi-task learning models achieve good performance in (T)ACSA tasks, they suffer from catastrophic forgetting problems in (T)ACSA incremental learning tasks. In this paper, to make multi-task learning feasible for incremental learning, we proposed Category Name Embedding network (CNE-net). We set both encoder and decoder shared among all categories to weaken the catastrophic forgetting problem. Besides the origin input sentence, we applied another input feature, i.e., category name, for task discrimination. Our model achieved state-of-the-art on two (T)ACSA benchmark datasets. Furthermore, we proposed a dataset for (T)ACSA incremental learning and achieved the best performance compared with other strong baselines.

17. Stepwise Extractive Summarization and Planning with Structured Transformers [PDF] Back to contents
  Shashi Narayan, Joshua Maynez, Jakub Adamek, Daniele Pighin, Blaž Bratanič, Ryan McDonald
Abstract: We propose encoder-centric stepwise models for extractive summarization using structured transformers -- HiBERT and Extended Transformers. We enable stepwise summarization by injecting the previously generated summary into the structured transformer as an auxiliary sub-structure. Our models are not only efficient in modeling the structure of long inputs, but they also do not rely on task-specific redundancy-aware modeling, making them a general purpose extractive content planner for different tasks. When evaluated on CNN/DailyMail extractive summarization, stepwise models achieve state-of-the-art performance in terms of Rouge without any redundancy aware modeling or sentence filtering. This also holds true for Rotowire table-to-text generation, where our models surpass previously reported metrics for content selection, planning and ordering, highlighting the strength of stepwise modeling. Amongst the two structured transformers we test, stepwise Extended Transformers provides the best performance across both datasets and sets a new standard for these challenges.

18. Towards Coalgebras in Stylometry [PDF] Back to contents
  Joël A. Doat
Abstract: The syntactic behaviour of texts can highly vary depending on their contexts (e.g. author, genre, etc.). From the standpoint of stylometry, it can be helpful to objectively measure this behaviour. In this paper, we discuss how coalgebras are used to formalise the notion of behaviour by embedding syntactic features of a given text into probabilistic transition systems. By introducing the behavioural distance, we are then able to quantitatively measure differences between points in these systems and thus, comparing features of different texts. Furthermore, the behavioural distance of points can be approximated by a polynomial-time algorithm.

19. Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation [PDF] Back to contents
  Minki Kang, Moonsu Han, Sung Ju Hwang
Abstract: We propose a method to automatically generate a domain- and task-adaptive maskings of the given text for self-supervised pre-training, such that we can effectively adapt the language model to a particular target task (e.g. question answering). Specifically, we present a novel reinforcement learning-based framework which learns the masking policy, such that using the generated masks for further pre-training of the target language model helps improve task performance on unseen texts. We use off-policy actor-critic with entropy regularization and experience replay for reinforcement learning, and propose a Transformer-based policy network that can consider the relative importance of words in a given text. We validate our Neural Mask Generator (NMG) on several question answering and text classification datasets using BERT and DistilBERT as the language models, on which it outperforms rule-based masking strategies, by automatically learning optimal adaptive maskings.

20. Aspect Sentiment Classification with Aspect-Specific Opinion Spans [PDF] Back to contents
  Lu Xu, Lidong Bing, Wei Lu, Fei Huang
Abstract: Aspect based sentiment analysis, predicting sentiment polarity of given aspects, has drawn extensive attention. Previous attention-based models emphasize using aspect semantics to help extract opinion features for classification. However, these works are either not able to capture opinion spans as a whole, or not able to capture variable-length opinion spans. In this paper, we present a neat and effective structured attention model by aggregating multiple linear-chain CRFs. Such a design allows the model to extract aspect-specific opinion spans and then evaluate sentiment polarity by exploiting the extracted opinion features. The experimental results on four datasets demonstrate the effectiveness of the proposed model, and our analysis demonstrates that our model can capture aspect-specific opinion spans.

21. Analyzing Individual Neurons in Pre-trained Language Models [PDF] Back to contents
  Nadir Durrani, Hassan Sajjad, Fahim Dalvi, Yonatan Belinkov
Abstract: While a lot of analysis has been carried out to demonstrate linguistic knowledge captured by the representations learned within deep NLP models, very little attention has been paid towards individual neurons. We carry out a neuron-level analysis using core linguistic tasks of predicting morphology, syntax and semantics, on pre-trained language models, with questions like: i) do individual neurons in pre-trained models capture linguistic information? ii) which parts of the network learn more about certain linguistic phenomena? iii) how distributed or focused is the information? and iv) how do various architectures differ in learning these properties? We found small subsets of neurons to predict linguistic tasks, with lower level tasks (such as morphology) localized in fewer neurons, compared to higher level task of predicting syntax. Our study also reveals interesting cross architectural comparisons. For example, we found neurons in XLNet to be more localized and disjoint when predicting properties compared to BERT and others, where they are more distributed and coupled.

22. SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling [PDF] Back to contents
  Di Wu, Liang Ding, Fan Lu, Jian Xie
Abstract: Slot filling and intent detection are two main tasks in spoken language understanding (SLU) system. In this paper, we propose a novel non-autoregressive model named SlotRefine for joint intent detection and slot filling. Besides, we design a novel two-pass iteration mechanism to handle the uncoordinated slots problem caused by conditional independence of non-autoregressive model. Experiments demonstrate that our model significantly outperforms previous models in slot filling task, while considerably speeding up the decoding (up to X 10.77). In-depth analyses show that 1) pretraining schemes could further enhance our model; 2) two-pass mechanism indeed remedy the uncoordinated slots.

23. BERT Knows Punta Cana is not just beautiful, it's gorgeous: Ranking Scalar Adjectives with Contextualised Representations [PDF] Back to contents
  Aina Garí Soler, Marianna Apidianaki
Abstract: Adjectives like pretty, beautiful and gorgeous describe positive properties of the nouns they modify but with different intensity. These differences are important for natural language understanding and reasoning. We propose a novel BERT-based approach to intensity detection for scalar adjectives. We model intensity by vectors directly derived from contextualised representations and show they can successfully rank scalar adjectives. We evaluate our models both intrinsically, on gold standard datasets, and on an Indirect Question Answering task. Our results demonstrate that BERT encodes rich knowledge about the semantics of scalar adjectives, and is able to provide better quality intensity rankings than static embeddings and previous models with access to dedicated resources.
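One simple way to turn contextual vectors into an intensity ranking, sketched below under toy assumptions (random vectors in place of BERT representations, a direction built from one mild/extreme pair), is to project each adjective onto a reference intensity direction and sort by the projection; this is an illustration of the general idea, not the paper's exact method.

```python
# Rank scalar adjectives by projection onto an assumed "intensity direction".
import numpy as np

rng = np.random.default_rng(2)
vecs = {w: rng.standard_normal(64) for w in ["pretty", "beautiful", "gorgeous"]}

# Reference direction: from a mild word towards an extreme one on the same scale.
direction = vecs["gorgeous"] - vecs["pretty"]
direction /= np.linalg.norm(direction)

def intensity(word: str) -> float:
    return float(vecs[word] @ direction)

ranking = sorted(vecs, key=intensity)
print(ranking)  # with real contextual vectors, ideally pretty < beautiful < gorgeous
```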

24. Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder [PDF] Back to contents
  Alvin Chan, Yi Tay, Yew-Soon Ong, Aston Zhang
Abstract: This paper demonstrates a fatal vulnerability in natural language inference (NLI) and text classification systems. More concretely, we present a 'backdoor poisoning' attack on NLP models. Our poisoning attack utilizes conditional adversarially regularized autoencoder (CARA) to generate poisoned training samples by poison injection in latent space. Just by adding 1% poisoned data, our experiments show that a victim BERT finetuned classifier's predictions can be steered to the poison target class with success rates of >80% when the input hypothesis is injected with the poison signature, demonstrating that NLI and text classification systems face a huge security risk.

25. Incorporating Behavioral Hypotheses for Query Generation [PDF] Back to contents
  Ruey-Cheng Chen, Chia-Jung Lee
Abstract: Generative neural networks have been shown effective on query suggestion. Commonly posed as a conditional generation problem, the task aims to leverage earlier inputs from users in a search session to predict queries that they will likely issue at a later time. User inputs come in various forms such as querying and clicking, each of which can imply different semantic signals channeled through the corresponding behavioral patterns. This paper induces these behavioral biases as hypotheses for query generation, where a generic encoder-decoder Transformer framework is presented to aggregate arbitrary hypotheses of choice. Our experimental results show that the proposed approach leads to significant improvements on top-$k$ word error rate and Bert F1 Score compared to a recent BART model.

26. Automatic Metaphor Interpretation Using Word Embeddings [PDF] Back to contents
  Kfir Bar, Nachum Dershowitz, Lena Dankin
Abstract: We suggest a model for metaphor interpretation using word embeddings trained over a relatively large corpus. Our system handles nominal metaphors, like "time is money". It generates a ranked list of potential interpretations of given metaphors. Candidate meanings are drawn from collocations of the topic ("time") and vehicle ("money") components, automatically extracted from a dependency-parsed corpus. We explore adding candidates derived from word association norms (common human responses to cues). Our ranking procedure considers similarity between candidate interpretations and metaphor components, measured in a semantic vector space. Lastly, a clustering algorithm removes semantically related duplicates, thereby allowing other candidate interpretations to attain higher rank. We evaluate using a set of annotated metaphors.
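A minimal sketch of the general idea of ranking candidate interpretations by their closeness, in a word-vector space, to both the topic and the vehicle; the candidate list and toy vectors are assumptions, and the paper's collocation extraction and clustering steps are omitted.

```python
# Score candidate interpretations of "time is money" by combined similarity
# to the topic ("time") and the vehicle ("money"), using toy vectors.
import numpy as np

rng = np.random.default_rng(4)
emb = {w: rng.standard_normal(100)
       for w in ["time", "money", "valuable", "scarce", "green", "loud"]}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score(candidate, topic="time", vehicle="money"):
    return cos(emb[candidate], emb[topic]) + cos(emb[candidate], emb[vehicle])

candidates = ["valuable", "scarce", "green", "loud"]
for c in sorted(candidates, key=score, reverse=True):
    print(c, round(score(c), 3))
```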

27. Detecting Attackable Sentences in Arguments [PDF] Back to contents
  Yohan Jo, Seojin Bang, Emaad Manzoor, Eduard Hovy, Chris Reed
Abstract: Finding attackable sentences in an argument is the first step toward successful refutation in argumentation. We present a first large-scale analysis of sentence attackability in online arguments. We analyze driving reasons for attacks in argumentation and identify relevant characteristics of sentences. We demonstrate that a sentence's attackability is associated with many of these characteristics regarding the sentence's content, proposition types, and tone, and that an external knowledge source can provide useful information about attackability. Building on these findings, we demonstrate that machine learning models can automatically detect attackable sentences in arguments, significantly better than several baselines and comparably well to laypeople.

28. Multi-Instance Multi-Label Learning Networks for Aspect-Category Sentiment Analysis [PDF] Back to contents
  Yuncong Li, Cunxiang Yin, Sheng-hua Zhong, Xu Pan
Abstract: Aspect-category sentiment analysis (ACSA) aims to predict sentiment polarities of sentences with respect to given aspect categories. To detect the sentiment toward a particular aspect category in a sentence, most previous methods first generate an aspect category-specific sentence representation for the aspect category, then predict the sentiment polarity based on the representation. These methods ignore the fact that the sentiment of an aspect category mentioned in a sentence is an aggregation of the sentiments of the words indicating the aspect category in the sentence, which leads to suboptimal performance. In this paper, we propose a Multi-Instance Multi-Label Learning Network for Aspect-Category sentiment analysis (AC-MIMLLN), which treats sentences as bags, words as instances, and the words indicating an aspect category as the key instances of the aspect category. Given a sentence and the aspect categories mentioned in the sentence, AC-MIMLLN first predicts the sentiments of the instances, then finds the key instances for the aspect categories, finally obtains the sentiments of the sentence toward the aspect categories by aggregating the key instance sentiments. Experimental results on three public datasets demonstrate the effectiveness of AC-MIMLLN.

29. Extracting Implicitly Asserted Propositions in Argumentation [PDF] Back to contents
  Yohan Jo, Jacky Visser, Chris Reed, Eduard Hovy
Abstract: Argumentation accommodates various rhetorical devices, such as questions, reported speech, and imperatives. These rhetorical tools usually assert argumentatively relevant propositions rather implicitly, so understanding their true meaning is key to understanding certain arguments properly. However, most argument mining systems and computational linguistics research have paid little attention to implicitly asserted propositions in argumentation. In this paper, we examine a wide range of computational methods for extracting propositions that are implicitly asserted in questions, reported speech, and imperatives in argumentation. By evaluating the models on a corpus of 2016 U.S. presidential debates and online commentary, we demonstrate the effectiveness and limitations of the computational models. Our study may inform future research on argument mining and the semantics of these rhetorical devices in argumentation.

30. If beam search is the answer, what was the question? [PDF] Back to contents
  Clara Meister, Tim Vieira, Ryan Cotterell
Abstract: Quite surprisingly, exact maximum a posteriori (MAP) decoding of neural language generators frequently leads to low-quality results. Rather, most state-of-the-art results on language generation tasks are attained using beam search despite its overwhelmingly high search error rate. This implies that the MAP objective alone does not express the properties we desire in text, which merits the question: if beam search is the answer, what was the question? We frame beam search as the exact solution to a different decoding objective in order to gain insights into why high probability under a model alone may not indicate adequacy. We find that beam search enforces uniform information density in text, a property motivated by cognitive science. We suggest a set of decoding objectives that explicitly enforce this property and find that exact decoding with these objectives alleviates the problems encountered when decoding poorly calibrated language generation models. Additionally, we analyze the text produced using various decoding strategies and see that, in our neural machine translation experiments, the extent to which this property is adhered to strongly correlates with BLEU.
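For readers who want the baseline procedure in front of them, here is a minimal beam search sketch over a toy next-token scorer; the scorer and vocabulary stand in for a real language model.

```python
# Minimal beam search: keep the k highest-scoring prefixes at each step,
# extending each with every vocabulary item.
VOCAB = ["the", "cat", "sat", "<eos>"]

def log_prob(prefix, token):
    # Toy scorer: mildly prefers tokens that differ from the previous one.
    return -1.0 if (prefix and prefix[-1] == token) else -0.5

def beam_search(beam_size=2, max_len=4):
    beams = [([], 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, score))  # finished hypotheses carry over
                continue
            for tok in VOCAB:
                candidates.append((prefix + [tok], score + log_prob(prefix, tok)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

for prefix, score in beam_search():
    print(" ".join(prefix), round(score, 2))
```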

31. Context Modeling with Evidence Filter for Multiple Choice Question Answering [PDF] Back to contents
  Sicheng Yu, Hao Zhang, Wei Jing, Jing Jiang
Abstract: Multiple-Choice Question Answering (MCQA) is a challenging task in machine reading comprehension. The main challenge in MCQA is to extract "evidence" from the given context that supports the correct answer. In the OpenbookQA dataset, the requirement of extracting "evidence" is particularly important due to the mutual independence of sentences in the context. Existing work tackles this problem with annotated evidence or distant supervision with rules, which overly rely on human effort. To address the challenge, we propose a simple yet effective approach termed evidence filtering to model the relationships between the encoded contexts with respect to different options collectively, and to potentially highlight the evidence sentences and filter out unrelated sentences. In addition to effectively reducing the human effort required compared with previous approaches, extensive experiments on OpenbookQA show that the proposed approach outperforms models that use the same backbone and more training data; our parameter analysis also demonstrates the interpretability of our approach.

32. On the Sub-Layer Functionalities of Transformer Decoder [PDF] Back to contents
  Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, Zhaopeng Tu
Abstract: There have been significant efforts to interpret the encoder of Transformer-based encoder-decoder architectures for neural machine translation (NMT); meanwhile, the decoder remains largely unexamined despite its critical role. During translation, the decoder must predict output tokens by considering both the source-language text from the encoder and the target-language prefix produced in previous steps. In this work, we study how Transformer-based decoders leverage information from the source and target languages - developing a universal probe task to assess how information is propagated through each module of each decoder layer. We perform extensive experiments on three major translation datasets (WMT En-De, En-Fr, and En-Zh). Our analysis provides insight on when and where decoders leverage different sources. Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance -- a significant reduction in computation and number of parameters, and consequently a significant boost to both training and inference speed.

33. On the Sparsity of Neural Machine Translation Models [PDF] Back to contents
  Yong Wang, Longyue Wang, Victor O.K. Li, Zhaopeng Tu
Abstract: Modern neural machine translation (NMT) models employ a large number of parameters, which leads to serious over-parameterization and typically causes the underutilization of computational resources. In response to this problem, we empirically investigate whether the redundant parameters can be reused to achieve better performance. Experiments and analyses are systematically conducted on different datasets and NMT architectures. We show that: 1) the pruned parameters can be rejuvenated to improve the baseline model by up to +0.8 BLEU points; 2) the rejuvenated parameters are reallocated to enhance the ability of modeling low-level lexical information.

34. Neural Speech Synthesis for Estonian [PDF] Back to contents
  Liisa Rätsep, Liisi Piits, Hille Pajupuu, Indrek Hein, Mark Fišel
Abstract: This technical report describes the results of a collaboration between the NLP research group at the University of Tartu and the Institute of Estonian Language on improving neural speech synthesis for Estonian. The report (written in Estonian) describes the project results, the summary of which is: (1) Speech synthesis data from 6 speakers for a total of 92.4 hours is collected and openly released (CC-BY-4.0). Data available at this https URL and this https URL. (2) software and models for neural speech synthesis is released open-source (MIT license). Available at this https URL . (3) We ran evaluations of the new models and compared them to other existing solutions (HMM-based HTS models from EKI, this http URL, and Google's speech synthesis for Estonian, accessed via this https URL). Evaluation includes voice acceptability MOS scores for sentence-level and longer excerpts, detailed error analysis and evaluation of the pre-processing module.

35. On the Interplay Between Fine-tuning and Sentence-level Probing for Linguistic Knowledge in Pre-trained Transformers [PDF] Back to contents
  Marius Mosbach, Anna Khokhlova, Michael A. Hedderich, Dietrich Klakow
Abstract: Fine-tuning pre-trained contextualized embedding models has become an integral part of the NLP pipeline. At the same time, probing has emerged as a way to investigate the linguistic knowledge captured by pre-trained models. Very little is, however, understood about how fine-tuning affects the representations of pre-trained models and thereby the linguistic knowledge they encode. This paper contributes towards closing this gap. We study three different pre-trained models: BERT, RoBERTa, and ALBERT, and investigate through sentence-level probing how fine-tuning affects their representations. We find that for some probing tasks fine-tuning leads to substantial changes in accuracy, possibly suggesting that fine-tuning introduces or even removes linguistic knowledge from a pre-trained model. These changes, however, vary greatly across different models, fine-tuning and probing tasks. Our analysis reveals that while fine-tuning indeed changes the representations of a pre-trained model and these changes are typically larger for higher layers, only in very few cases, fine-tuning has a positive effect on probing accuracy that is larger than just using the pre-trained model with a strong pooling method. Based on our findings, we argue that both positive and negative effects of fine-tuning on probing require a careful interpretation.

36. Position-Aware Tagging for Aspect Sentiment Triplet Extraction [PDF] 返回目录
  Lu Xu, Hao Li, Wei Lu, Lidong Bing
Abstract: Aspect Sentiment Triplet Extraction (ASTE) is the task of extracting the triplets of target entities, their associated sentiment, and opinion spans explaining the reason for the sentiment. Existing research efforts mostly solve this problem using pipeline approaches, which break the triplet extraction process into several stages. Our observation is that the three elements within a triplet are highly related to each other, and this motivates us to build a joint model to extract such triplets using a sequence tagging approach. However, how to effectively design a tagging approach to extract the triplets that can capture the rich interactions among the elements is a challenging research question. In this work, we propose the first end-to-end model with a novel position-aware tagging scheme that is capable of jointly extracting the triplets. Our experimental results on several existing datasets show that jointly capturing elements in the triplet using our approach leads to improved performance over the existing approaches. We also conducted extensive experiments to investigate the model effectiveness and robustness.

37. DaNetQA: a yes/no Question Answering Dataset for the Russian Language [PDF] 返回目录
  Taisia Glushkova, Alexey Machnev, Alena Fenogenova, Tatiana Shavrina, Ekaterina Artemova, Dmitry I. Ignatov
Abstract: DaNetQA, a new question-answering corpus, follows the design of Clark et al. (2019): it comprises natural yes/no questions. Each question is paired with a paragraph from Wikipedia and an answer derived from the paragraph. The task is to take both the question and a paragraph as input and come up with a yes/no answer, i.e. to produce a binary output. In this paper, we present a reproducible approach to DaNetQA creation and investigate transfer learning methods for task and language transfer. For task transfer we leverage three similar sentence modelling tasks: 1) a corpus of paraphrases, Paraphraser, 2) an NLI task, for which we use the Russian part of XNLI, 3) another question answering task, SberQUAD. For language transfer we use English-to-Russian translation together with multilingual language fine-tuning.

38. Converting the Point of View of Messages Spoken to Virtual Assistants [PDF] 返回目录
  Isabelle G. Lee, Vera Zu, Sai Srujana Buddi, Dennis Liang, Jack G.M. Fitzgerald
Abstract: Virtual Assistants can be quite literal at times. If the user says "tell Bob I love him," most virtual assistants will extract the message "I love him" and send it to the user's contact named Bob, rather than properly converting the message to "I love you." We designed a system to allow virtual assistants to take a voice message from one user, convert the point of view of the message, and then deliver the result to its target user. We developed a rule-based model, which integrates a linear text classification model, part-of-speech tagging, and constituency parsing with rule-based transformation methods. We also investigated Neural Machine Translation (NMT) approaches, including LSTMs, CopyNet, and T5. We explored 5 metrics to gauge both naturalness and faithfulness automatically, and we chose to use BLEU plus METEOR for faithfulness and relative perplexity using a separately trained language model (GPT) for naturalness. Transformer-CopyNet and T5 performed similarly on faithfulness metrics, with T5 achieving a slight edge: a BLEU score of 63.8 and a METEOR score of 83.0. CopyNet was the most natural, with a relative perplexity of 1.59. CopyNet also has 37 times fewer parameters than T5. We have publicly released our dataset, which is composed of 46,565 crowd-sourced samples.
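
A sketch of how a perplexity-based naturalness score could be computed with a separately trained language model. Both the use of the small public gpt2 checkpoint and the reading of "relative perplexity" as a hypothesis/reference ratio are assumptions made for illustration; the abstract does not spell these details out.

```python
# Illustrative sketch: perplexity under an external LM as a naturalness proxy.
# The gpt2 checkpoint and the ratio-based "relative" definition are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss              # mean token-level cross-entropy
    return torch.exp(loss).item()

reference = "Bob, your friend says that they love you."
hypothesis = "Bob, I love you."
print(perplexity(hypothesis) / perplexity(reference))  # lower ratio = more fluent than the reference
```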

39. Embedding Words in Non-Vector Space with Unsupervised Graph Learning [PDF] 返回目录
  Max Ryabinin, Sergei Popov, Liudmila Prokhorenkova, Elena Voita
Abstract: It has become a de-facto standard to represent words as elements of a vector space (word2vec, GloVe). While this approach is convenient, it is unnatural for language: words form a graph with a latent hierarchical structure, and this structure has to be revealed and encoded by word embeddings. We introduce GraphGlove: unsupervised graph word representations which are learned end-to-end. In our setting, each word is a node in a weighted graph and the distance between words is the shortest path distance between the corresponding nodes. We adopt a recent method learning a representation of data in the form of a differentiable weighted graph and use it to modify the GloVe training algorithm. We show that our graph-based representations substantially outperform vector-based methods on word similarity and analogy tasks. Our analysis reveals that the structure of the learned graphs is hierarchical and similar to that of WordNet, the geometry is highly non-trivial and contains subgraphs with different local topology.
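
The distance computation described here is easy to picture with a toy graph; the example below uses networkx and an invented hierarchy, not the learned GraphGlove structure.

```python
# Illustrative sketch: words as nodes of a weighted graph; the distance between two
# words is the shortest-path length between their nodes. The graph is a toy example.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("animal", "dog", 1.0),
    ("animal", "cat", 1.0),
    ("dog", "puppy", 0.5),
    ("cat", "kitten", 0.5),
])

def word_distance(u, v):
    return nx.shortest_path_length(G, u, v, weight="weight")

print(word_distance("puppy", "kitten"))   # 3.0, via dog -> animal -> cat
```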

40. Semantically Driven Sentence Fusion: Modeling and Evaluation [PDF] 返回目录
  Eyal Ben-David, Orgad Keller, Eric Malmi, Idan Szpektor, Roi Reichart
Abstract: Sentence fusion is the task of joining related sentences into coherent text. Current training and evaluation schemes for this task are based on single reference ground-truths and do not account for valid fusion variants. We show that this hinders models from robustly capturing the semantic relationship between input sentences. To alleviate this, we present an approach in which ground-truth solutions are automatically expanded into multiple references via curated equivalence classes of connective phrases. We apply this method to a large-scale dataset and use the augmented dataset for both model training and evaluation. To improve the learning of semantic representation using multiple references, we enrich the model with auxiliary discourse classification tasks under a multi-tasking framework. Our experiments highlight the improvements of our approach over state-of-the-art models.

41. Scene Graph Modification Based on Natural Language Commands [PDF] 返回目录
  Xuanli He, Quan Hung Tran, Gholamreza Haffari, Walter Chang, Trung Bui, Zhe Lin, Franck Dernoncourt, Nhan Dam
Abstract: Structured representations like graphs and parse trees play a crucial role in many Natural Language Processing systems. In recent years, the advancements in multi-turn user interfaces necessitate the need for controlling and updating these structured representations given new sources of information. Although there have been many efforts focusing on improving the performance of the parsers that map text to graphs or parse trees, very few have explored the problem of directly manipulating these representations. In this paper, we explore the novel problem of graph modification, where the systems need to learn how to update an existing scene graph given a new user's command. Our novel models based on graph-based sparse transformer and cross attention information fusion outperform previous systems adapted from the machine translation and graph generation literature. We further contribute our large graph modification datasets to the research community to encourage future research for this new problem.

42. CoRefi: A Crowd Sourcing Suite for Coreference Annotation [PDF] 返回目录
  Aaron Bornstein, Arie Cattan, Ido Dagan
Abstract: Coreference annotation is an important, yet expensive and time-consuming, task, which often involves expert annotators trained on complex decision guidelines. To enable cheaper and more efficient annotation, we present CoRefi, a web-based coreference annotation suite, oriented for crowdsourcing. Beyond the core coreference annotation tool, CoRefi provides guided onboarding for the task as well as a novel algorithm for a reviewing phase. CoRefi is open source and directly embeds into any website, including popular crowdsourcing platforms. CoRefi Demo: this http URL Video Tour: this http URL Github Repo: this https URL

43. Dissecting Span Identification Tasks with Performance Prediction [PDF] 返回目录
  Sean Papay, Roman Klinger, Sebastian Padó
Abstract: Span identification (in short, span ID) tasks such as chunking, NER, or code-switching detection, ask models to identify and classify relevant spans in a text. Despite being a staple of NLP, and sharing a common structure, there is little insight on how these tasks' properties influence their difficulty, and thus little guidance on what model families work well on span ID tasks, and why. We analyze span ID tasks via performance prediction, estimating how well neural architectures do on different tasks. Our contributions are: (a) we identify key properties of span ID tasks that can inform performance prediction; (b) we carry out a large-scale experiment on English data, building a model to predict performance for unseen span ID tasks that can support architecture choices; (c), we investigate the parameters of the meta model, yielding new insights on how model and task properties interact to affect span ID performance. We find, e.g., that span frequency is especially important for LSTMs, and that CRFs help when spans are infrequent and boundaries non-distinctive.

44. Knowing What You Know: Calibrating Dialogue Belief State Distributions via Ensembles [PDF] 返回目录
  Carel van Niekerk, Michael Heck, Christian Geishauser, Hsien-Chin Lin, Nurul Lubis, Marco Moresi, Milica Gašić
Abstract: The ability to accurately track what happens during a conversation is essential for the performance of a dialogue system. Current state-of-the-art multi-domain dialogue state trackers achieve just over 55% accuracy on the current go-to benchmark, which means that in almost every second dialogue turn they place full confidence in an incorrect dialogue state. Belief trackers, on the other hand, maintain a distribution over possible dialogue states. However, they lack in performance compared to dialogue state trackers, and do not produce well calibrated distributions. In this work we present state-of-the-art performance in calibration for multi-domain dialogue belief trackers using a calibrated ensemble of models. Our resulting dialogue belief tracker also outperforms previous dialogue belief tracking models in terms of accuracy.
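
One simple way to turn several dialogue state trackers into a belief tracker is to average their per-slot softmax outputs; the numbers below are invented, and the paper's actual calibration procedure may be more involved than this plain average.

```python
# Illustrative sketch: an ensemble belief over three candidate slot values, formed by
# averaging member distributions. The probabilities are toy numbers.
import numpy as np

member_probs = np.array([
    [0.70, 0.20, 0.10],
    [0.55, 0.35, 0.10],
    [0.60, 0.25, 0.15],
])
belief = member_probs.mean(axis=0)
print(belief)            # distribution a belief tracker would report
print(belief.argmax())   # value a state tracker would commit to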

45. Universal Natural Language Processing with Limited Annotations: Try Few-shot Textual Entailment as a Start [PDF] 返回目录
  Wenpeng Yin, Nazneen Fatema Rajani, Dragomir Radev, Richard Socher, Caiming Xiong
Abstract: A standard way to address different NLP problems is by first constructing a problem-specific dataset, then building a model to fit this dataset. To build the ultimate artificial intelligence, we desire a single machine that can handle diverse new problems, for which task-specific annotations are limited. We bring up textual entailment as a unified solver for such NLP problems. However, current research of textual entailment has not spilled much ink on the following questions: (i) How well does a pretrained textual entailment system generalize across domains with only a handful of domain-specific examples? and (ii) When is it worth transforming an NLP task into textual entailment? We argue that the transforming is unnecessary if we can obtain rich annotations for this task. Textual entailment really matters particularly when the target NLP task has insufficient annotations. Universal NLP can be probably achieved through different routines. In this work, we introduce Universal Few-shot textual Entailment (UFO-Entail). We demonstrate that this framework enables a pretrained entailment model to work well on new entailment domains in a few-shot setting, and show its effectiveness as a unified solver for several downstream NLP tasks such as question answering and coreference resolution when the end-task annotations are limited. Code: this https URL
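
Recasting a task as entailment mostly amounts to templating a premise-hypothesis pair; the formatting below is a hypothetical example of how a QA instance could be converted, not the exact template used by UFO-Entail.

```python
# Illustrative sketch: turning a QA example into a premise/hypothesis pair for an
# entailment model. The template is hypothetical.
def qa_to_entailment(context, question, candidate_answer):
    premise = context
    hypothesis = f"{question} The answer is {candidate_answer}."
    return premise, hypothesis

premise, hypothesis = qa_to_entailment(
    context="Tartu is the second largest city of Estonia.",
    question="What is the second largest city of Estonia?",
    candidate_answer="Tartu",
)
print(premise)
print(hypothesis)   # an entailment model scores whether the premise entails this hypothesis
```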

46. The Multilingual Amazon Reviews Corpus [PDF] 返回目录
  Phillip Keung, Yichao Lu, György Szarvas, Noah A. Smith
Abstract: We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification. The corpus contains reviews in English, Japanese, German, French, Spanish, and Chinese, which were collected between 2015 and 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., 'books', 'appliances', etc.) The corpus is balanced across the 5 possible star ratings, so each rating constitutes 20% of the reviews in each language. For each language, there are 200,000, 5,000, and 5,000 reviews in the training, development, and test sets, respectively. We report baseline results for supervised text classification and zero-shot cross-lingual transfer learning by fine-tuning a multilingual BERT model on reviews data. We propose the use of mean absolute error (MAE) instead of classification accuracy for this task, since MAE accounts for the ordinal nature of the ratings.
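
The argument for MAE over accuracy is easy to see on a toy example: two systems can have identical accuracy while making errors of very different severity on the 1-5 star scale. The ratings below are invented for illustration.

```python
# Illustrative sketch: MAE distinguishes a one-star error from a four-star error,
# while plain accuracy treats them the same. Ratings are toy values.
gold   = [5, 5, 1, 3, 2]
pred_a = [4, 5, 1, 3, 2]   # one prediction off by one star
pred_b = [1, 5, 1, 3, 2]   # one prediction off by four stars

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def mae(gold, pred):
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)

print(accuracy(gold, pred_a), accuracy(gold, pred_b))   # 0.8 0.8 -- identical
print(mae(gold, pred_a), mae(gold, pred_b))             # 0.2 0.8 -- MAE separates them
```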

47. Does the Objective Matter? Comparing Training Objectives for Pronoun Resolution [PDF] 返回目录
  Yordan Yordanov, Oana-Maria Camburu, Vid Kocijan, Thomas Lukasiewicz
Abstract: Hard cases of pronoun resolution have been used as a long-standing benchmark for commonsense reasoning. In the recent literature, pre-trained language models have been used to obtain state-of-the-art results on pronoun resolution. Overall, four categories of training and evaluation objectives have been introduced. The variety of training datasets and pre-trained language models used in these works makes it unclear whether the choice of training objective is critical. In this work, we make a fair comparison of the performance and seed-wise stability of four models that represent the four categories of objectives. Our experiments show that the objective of sequence ranking performs the best in-domain, while the objective of semantic similarity between candidates and pronoun performs the best out-of-domain. We also observe a seed-wise instability of the model using sequence ranking, which is not the case when the other objectives are used.

48. StyleDGPT: Stylized Response Generation with Pre-trained Language Models [PDF] 返回目录
  Ze Yang, Wei Wu, Can Xu, Xinnian Liang, Jiaqi Bai, Liran Wang, Wei Wang, Zhoujun Li
Abstract: Generating responses following a desired style has great potential to extend applications of open-domain dialogue systems, yet is held back by the lack of parallel data for training. In this work, we explore this challenging task with pre-trained language models that have brought breakthroughs to various natural language tasks. To this end, we introduce a KL loss and a style classifier to the fine-tuning step in order to steer response generation towards the target style at both the word level and the sentence level. Comprehensive empirical studies with two public datasets indicate that our model can significantly outperform state-of-the-art methods in terms of both style consistency and contextual coherence.

49. SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy [PDF] 返回目录
  Umanga Bista, Alexander Patrick Mathews, Aditya Krishna Menon, Lexing Xie
Abstract: Most work on multi-document summarization has focused on generic summarization of information present in each individual document set. However, the under-explored setting of update summarization, where the goal is to identify the new information present in each set, is of equal practical interest (e.g., presenting readers with updates on an evolving news topic). In this work, we present SupMMD, a novel technique for generic and update summarization based on the maximum mean discrepancy from kernel two-sample testing. SupMMD combines both supervised learning for salience and unsupervised learning for coverage and diversity. Further, we adapt multiple kernel learning to make use of similarity across multiple information sources (e.g., text features and knowledge based concepts). We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.
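
For readers unfamiliar with kernel two-sample testing, a minimal (biased) estimate of squared MMD between two sets of vectors looks like the sketch below; the RBF kernel, bandwidth, and random "document embeddings" are illustrative choices, not the paper's setup.

```python
# Illustrative sketch: a biased estimate of squared maximum mean discrepancy (MMD)
# between two samples, using an RBF kernel. All inputs are random toy data.
import numpy as np

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def mmd2(X, Y, gamma=1.0):
    kxx = np.mean([rbf(a, b, gamma) for a in X for b in X])
    kyy = np.mean([rbf(a, b, gamma) for a in Y for b in Y])
    kxy = np.mean([rbf(a, b, gamma) for a in X for b in Y])
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(50, 4))   # e.g. embeddings of already-seen documents
Y = rng.normal(0.5, 1.0, size=(50, 4))   # embeddings of the update set
print(mmd2(X, Y))
```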

50. Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher [PDF] 返回目录
  Giannis Karamanolakis, Daniel Hsu, Luis Gravano
Abstract: Cross-lingual text classification alleviates the need for manually labeled documents in a target language by leveraging labeled documents from other languages. Existing approaches for transferring supervision across languages require expensive cross-lingual resources, such as parallel corpora, while less expensive cross-lingual representation learning approaches train classifiers without target labeled documents. In this work, we propose a cross-lingual teacher-student method, CLTS, that generates "weak" supervision in the target language using minimal cross-lingual resources, in the form of a small number of word translations. Given a limited translation budget, CLTS extracts and transfers only the most important task-specific seed words across languages and initializes a teacher classifier based on the translated seed words. Then, CLTS iteratively trains a more powerful student that also exploits the context of the seed words in unlabeled target documents and outperforms the teacher. CLTS is simple and surprisingly effective in 18 diverse languages: by transferring just 20 seed words, even a bag-of-words logistic regression student outperforms state-of-the-art cross-lingual methods (e.g., based on multilingual BERT). Moreover, CLTS can accommodate any type of student classifier: leveraging a monolingual BERT student leads to further improvements and outperforms even more expensive approaches by up to 12% in accuracy. Finally, CLTS addresses emerging tasks in low-resource languages using just a small number of word translations.
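
The "weak teacher" idea can be pictured as a vote over translated seed words; the Spanish lexicon below is an invented toy example rather than the seed words CLTS actually selects and translates.

```python
# Illustrative sketch: a weak teacher that labels target-language text by voting with
# translated task-specific seed words. The lexicon is a toy example.
seed_words = {
    "excelente": "positive", "bueno": "positive",
    "horrible": "negative", "malo": "negative",
}

def weak_label(tokens):
    votes = [seed_words[t] for t in tokens if t in seed_words]
    if not votes:
        return None                              # abstain; the student generalizes to these cases
    return max(set(votes), key=votes.count)

print(weak_label("una película excelente".split()))   # -> "positive"
```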

51. LEGAL-BERT: The Muppets straight out of Law School [PDF] 返回目录
  Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, Ion Androutsopoulos
Abstract: BERT has achieved impressive performance in several NLP tasks. However, there has been limited investigation on its adaptation guidelines in specialised domains. Here we focus on the legal domain, where we explore several approaches for applying BERT models to downstream legal tasks, evaluating on multiple datasets. Our findings indicate that the previous guidelines for pre-training and fine-tuning, often blindly followed, do not always generalize well in the legal domain. Thus we propose a systematic investigation of the available strategies when applying BERT in specialised domains. These are: (a) use the original BERT out of the box, (b) adapt BERT by additional pre-training on domain-specific corpora, and (c) pre-train BERT from scratch on domain-specific corpora. We also propose a broader hyper-parameter search space when fine-tuning for downstream tasks and we release LEGAL-BERT, a family of BERT models intended to assist legal NLP research, computational law, and legal technology applications.

52. PolicyQA: A Reading Comprehension Dataset for Privacy Policies [PDF] 返回目录
  Wasi Uddin Ahmad, Jianfeng Chi, Yuan Tian, Kai-Wei Chang
Abstract: Privacy policy documents are long and verbose. A question answering (QA) system can assist users in finding the information that is relevant and important to them. Prior studies in this domain frame the QA task as retrieving the most relevant text segment or a list of sentences from the policy document given a question. On the contrary, we argue that providing users with a short text span from policy documents reduces the burden of searching the target information from a lengthy text segment. In this paper, we present PolicyQA, a dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies. PolicyQA provides 714 human-annotated questions written for a wide range of privacy practices. We evaluate two existing neural QA models and perform rigorous analysis to reveal the advantages and challenges offered by PolicyQA.

53. Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation [PDF] 返回目录
  Wenxiang Jiao, Xing Wang, Shilin He, Irwin King, Michael R. Lyu, Zhaopeng Tu
Abstract: Large-scale training datasets lie at the core of the recent success of neural machine translation (NMT) models. However, the complex patterns and potential noises in the large-scale data make training NMT models difficult. In this work, we explore to identify the inactive training examples which contribute less to the model performance, and show that the existence of inactive examples depends on the data distribution. We further introduce data rejuvenation to improve the training of NMT models on large-scale datasets by exploiting inactive examples. The proposed framework consists of three phases. First, we train an identification model on the original training data, and use it to distinguish inactive examples and active examples by their sentence-level output probabilities. Then, we train a rejuvenation model on the active examples, which is used to re-label the inactive examples with forward-translation. Finally, the rejuvenated examples and the active examples are combined to train the final NMT model. Experimental results on WMT14 English-German and English-French datasets show that the proposed data rejuvenation consistently and significantly improves performance for several strong NMT models. Extensive analyses reveal that our approach stabilizes and accelerates the training process of NMT models, resulting in final models with better generalization capability.

54. Please Mind the Root: Decoding Arborescences for Dependency Parsing [PDF] 返回目录
  Ran Zmigrod, Tim Vieira, Ryan Cotterell
Abstract: The connection between dependency trees and spanning trees is exploited by the NLP community to train and to decode graph-based dependency parsers. However, the NLP literature has missed an important difference between the two structures: only one edge may emanate from the root in a dependency tree. We analyzed the output of state-of-the-art parsers on many languages from the Universal Dependency Treebank: although these parsers are often able to learn that trees which violate the constraint should be assigned lower probabilities, their ability to do so unsurprisingly degrades as the size of the training set decreases. In fact, the worst constraint-violation rate we observe is 24%. Prior work has proposed an inefficient algorithm to enforce the constraint, which adds a factor of n to the decoding runtime. We adapt an algorithm due to Gabow and Tarjan (1984) to dependency parsing, which satisfies the constraint without compromising the original runtime.
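
The constraint itself is simple to check on a predicted head sequence; the decoding algorithm adapted from Gabow and Tarjan (1984) is what makes enforcing it cheap. A toy check, with an assumed CoNLL-style convention where head 0 denotes the root:

```python
# Illustrative sketch: detect violations of the "single edge from the root" constraint,
# assuming heads[i] is the head of token i+1 and 0 marks the root.
def violates_root_constraint(heads):
    return sum(1 for h in heads if h == 0) != 1

print(violates_root_constraint([0, 1, 1]))   # False: exactly one edge leaves the root
print(violates_root_constraint([0, 1, 0]))   # True: two edges emanate from the root
```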

55. Do Explicit Alignments Robustly Improve Multilingual Encoders? [PDF] 返回目录
  Shijie Wu, Mark Dredze
Abstract: Multilingual BERT (mBERT), XLM-RoBERTa (XLMR) and other unsupervised multilingual encoders can effectively learn cross-lingual representation. Explicit alignment objectives based on bitexts like Europarl or MultiUN have been shown to further improve these representations. However, word-level alignments are often suboptimal and such bitexts are unavailable for many languages. In this paper, we propose a new contrastive alignment objective that can better utilize such signal, and examine whether these previous alignment methods can be adapted to noisier sources of aligned data: a randomly sampled 1 million pair subset of the OPUS collection. Additionally, rather than report results on a single dataset with a single model run, we report the mean and standard deviation of multiple runs with different seeds, on four datasets and tasks. Our more extensive analysis finds that, while our new objective outperforms previous work, overall these methods do not improve performance with a more robust evaluation framework. Furthermore, the gains from using a better underlying model eclipse any benefits from alignment training. These negative results dictate more care in evaluating these methods and suggest limitations in applying explicit alignment objectives.

56. An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks [PDF] 返回目录
  Kyubyong Park, Joohong Lee, Seongbo Jang, Dawoon Jung
Abstract: Typically, tokenization is the very first step in most text processing works. As a token serves as an atomic unit that embeds the contextual information of text, how to define a token plays a decisive role in the performance of a model. Even though Byte Pair Encoding (BPE) has been considered the de facto standard tokenization method due to its simplicity and universality, it still remains unclear whether BPE works best across all languages and tasks. In this paper, we test several tokenization strategies in order to answer our primary research question, that is, "What is the best tokenization strategy for Korean NLP tasks?" Experimental results demonstrate that a hybrid approach of morphological segmentation followed by BPE works best in Korean to/from English machine translation and natural language understanding tasks such as KorNLI, KorSTS, NSMC, and PAWS-X. As an exception, for KorQuAD, the Korean extension of SQuAD, BPE segmentation turns out to be the most effective.

57. Multi-task Learning for Multilingual Neural Machine Translation [PDF] 返回目录
  Yiren Wang, ChengXiang Zhai, Hany Hassan Awadalla
Abstract: While monolingual data has been shown to be useful in improving bilingual neural machine translation (NMT), effectively and efficiently leveraging monolingual data for Multilingual NMT (MNMT) systems is a less explored area. In this work, we propose a multi-task learning (MTL) framework that jointly trains the model with the translation task on bitext data and two denoising tasks on the monolingual data. We conduct extensive empirical studies on MNMT systems with 10 language pairs from WMT datasets. We show that the proposed approach can effectively improve the translation quality for both high-resource and low-resource languages with large margin, achieving significantly better results than the individual bilingual models. We also demonstrate the efficacy of the proposed approach in the zero-shot setup for language pairs without bitext training data. Furthermore, we show the effectiveness of MTL over pre-training approaches for both NMT and cross-lingual transfer learning NLU tasks; the proposed approach outperforms massive scale models trained on single task.

58. Investigating African-American Vernacular English in Transformer-Based Text Generation [PDF] 返回目录
  Sophie Groenwold, Lily Ou, Aesha Parekh, Samhita Honnavalli, Sharon Levy, Diba Mirza, William Yang Wang
Abstract: The growth of social media has encouraged the written use of African American Vernacular English (AAVE), which has traditionally been used only in oral contexts. However, NLP models have historically been developed using dominant English varieties, such as Standard American English (SAE), due to text corpora availability. We investigate the performance of GPT-2 on AAVE text by creating a dataset of intent-equivalent parallel AAVE/SAE tweet pairs, thereby isolating syntactic structure and AAVE- or SAE-specific language for each pair. We evaluate each sample and its GPT-2 generated text with pretrained sentiment classifiers and find that while AAVE text results in more classifications of negative sentiment than SAE, the use of GPT-2 generally increases occurrences of positive sentiment for both. Additionally, we conduct human evaluation of AAVE and SAE text generated with GPT-2 to compare contextual rigor and overall quality.

59. Efficient Meta Lifelong-Learning with Limited Memory [PDF] 返回目录
  Zirui Wang, Sanket Vaibhav Mehta, Barnabás Póczos, Jaime Carbonell
Abstract: Current natural language processing models work well on a single task, yet they often fail to continuously learn new tasks without forgetting previous ones as they are re-trained throughout their lifetime, a challenge known as lifelong learning. State-of-the-art lifelong language learning methods store past examples in episodic memory and replay them at both training and inference time. However, as we show later in our experiments, there are three significant impediments: (1) needing unrealistically large memory module to achieve good performance, (2) suffering from negative transfer, (3) requiring multiple local adaptation steps for each test example that significantly slows down the inference speed. In this paper, we identify three common principles of lifelong learning methods and propose an efficient meta-lifelong framework that combines them in a synergistic fashion. To achieve sample efficiency, our method trains the model in a manner that it learns a better initialization for local adaptation. Extensive experiments on text classification and question answering benchmarks demonstrate the effectiveness of our framework by achieving state-of-the-art performance using merely 1% memory size and narrowing the gap with multi-task learning. We further show that our method alleviates both catastrophic forgetting and negative transfer at the same time.

60. GRUEN for Evaluating Linguistic Quality of Generated Text [PDF] 返回目录
  Wanzheng Zhu, Suma Bhat
Abstract: Automatic evaluation metrics are indispensable for evaluating generated text. To date, these metrics have focused almost exclusively on the content selection aspect of the system output, ignoring the linguistic quality aspect altogether. We bridge this gap by proposing GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text. GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output. Unlike most existing evaluation metrics which require human references as an input, GRUEN is reference-less and requires only the system output. Besides, it has the advantage of being unsupervised, deterministic, and adaptable to various tasks. Experiments on seven datasets over four language generation tasks show that the proposed metric correlates highly with human judgments.

61. Joint Turn and Dialogue level User Satisfaction Estimation on Multi-Domain Conversations [PDF] 返回目录
  Praveen Kumar Bodigutla, Aditya Tiwari, Josep Vallas Vargas, Lazaros Polymenakos, Spyros Matsoukas
Abstract: Dialogue level quality estimation is vital for optimizing data driven dialogue management. Current automated methods to estimate turn and dialogue level user satisfaction employ hand-crafted features and rely on complex annotation schemes, which reduce the generalizability of the trained models. We propose a novel user satisfaction estimation approach which minimizes an adaptive multi-task loss function in order to jointly predict turn-level Response Quality labels provided by experts and explicit dialogue-level ratings provided by end users. The proposed BiLSTM based deep neural net model automatically weighs each turn's contribution towards the estimated dialogue-level rating, implicitly encodes temporal dependencies, and removes the need to hand-craft features. On dialogues sampled from 28 Alexa domains, two dialogue systems and three user groups, the joint dialogue-level satisfaction estimation model achieved up to an absolute 27% (0.43->0.70) and 7% (0.63->0.70) improvement in linear correlation performance over baseline deep neural net and benchmark Gradient boosting regression models, respectively.

62. Help! Need Advice on Identifying Advice [PDF] 返回目录
  Venkata Subrahmanyan Govindarajan, Benjamin T Chen, Rebecca Warholic, Katrin Erk, Junyi Jessy Li
Abstract: Humans use language to accomplish a wide variety of tasks - asking for and giving advice being one of them. In online advice forums, advice is mixed in with non-advice, like emotional support, and is sometimes stated explicitly, sometimes implicitly. Understanding the language of advice would equip systems with a better grasp of language pragmatics; practically, the ability to identify advice would drastically increase the efficiency of advice-seeking online, as well as advice-giving in natural language generation systems. We present a dataset in English from two Reddit advice forums - r/AskParents and r/needadvice - annotated for whether sentences in posts contain advice or not. Our analysis reveals rich linguistic phenomena in advice discourse. We present preliminary models showing that while pre-trained language models are able to capture advice better than rule-based systems, advice identification is challenging, and we identify directions for future research. Comments: To be presented at EMNLP 2020.

63. Dynamic Semantic Matching and Aggregation Network for Few-shot Intent Detection [PDF] 返回目录
  Hoang Nguyen, Chenwei Zhang, Congying Xia, Philip S. Yu
Abstract: Few-shot Intent Detection is challenging due to the scarcity of available annotated utterances. Although recent works demonstrate that multi-level matching plays an important role in transferring learned knowledge from seen training classes to novel testing classes, they rely on a static similarity measure and overly fine-grained matching components. These limitations inhibit generalizing capability towards Generalized Few-shot Learning settings where both seen and novel classes are co-existent. In this paper, we propose a novel Semantic Matching and Aggregation Network where semantic components are distilled from utterances via multi-head self-attention with additional dynamic regularization constraints. These semantic components capture high-level information, resulting in more effective matching between instances. Our multi-perspective matching method provides a comprehensive matching measure to enhance representations of both labeled and unlabeled instances. We also propose a more challenging evaluation setting that considers classification on the joint all-class label space. Extensive experimental results demonstrate the effectiveness of our method. Our code and data are publicly available.

64. Pretrained Language Model Embryology: The Birth of ALBERT [PDF] 返回目录
  David C. Chiang, Sung-Feng Huang, Hung-yi Lee
Abstract: While behaviors of pretrained language models (LMs) have been thoroughly examined, what happened during pretraining is rarely studied. We thus investigate the developmental process from a set of randomly initialized parameters to a totipotent language model, which we refer to as the embryology of a pretrained language model. Our results show that ALBERT learns to reconstruct and predict tokens of different parts of speech (POS) in different learning speeds during pretraining. We also find that linguistic knowledge and world knowledge do not generally improve as pretraining proceeds, nor do downstream tasks' performance. These findings suggest that knowledge of a pretrained model varies during pretraining, and having more pretrain steps does not necessarily provide a model with more comprehensive knowledge. We will provide source codes and pretrained models to reproduce our results at this https URL.

65. Iterative Domain-Repaired Back-Translation [PDF] 返回目录
  Hao-Ran Wei, Zhirui Zhang, Boxing Chen, Weihua Luo
Abstract: In this paper, we focus on the domain-specific translation with low resources, where in-domain parallel corpora are scarce or nonexistent. One common and effective strategy for this case is exploiting in-domain monolingual data with the back-translation method. However, the synthetic parallel data is very noisy because they are generated by imperfect out-of-domain systems, resulting in the poor performance of domain adaptation. To address this issue, we propose a novel iterative domain-repaired back-translation framework, which introduces the Domain-Repair (DR) model to refine translations in synthetic bilingual data. To this end, we construct corresponding data for the DR model training by round-trip translating the monolingual sentences, and then design the unified training framework to optimize paired DR and NMT models jointly. Experiments on adapting NMT models between specific domains and from the general domain to specific domains demonstrate the effectiveness of our proposed approach, achieving 15.79 and 4.47 BLEU improvements on average over unadapted models and back-translation.

66. Are Words Commensurate with Actions? Quantifying Commitment to a Cause from Online Public Messaging [PDF] 返回目录
  Zhao Wang, Jennifer Cutler, Aron Culotta
Abstract: Public entities such as companies and politicians increasingly use online social networks to communicate directly with their constituencies. Often, this public messaging is aimed at aligning the entity with a particular cause or issue, such as the environment or public health. However, as a consumer or voter, it can be difficult to assess an entity's true commitment to a cause based on public messaging. In this paper, we present a text classification approach to categorize a message according to its commitment level toward a cause. We then compare the volume of such messages with external ratings based on entities' actions (e.g., a politician's voting record with respect to the environment or a company's rating from environmental non-profits). We find that by distinguishing between low- and high- level commitment messages, we can more reliably identify truly committed entities. Furthermore, by measuring the discrepancy between classified messages and external ratings, we can identify entities whose public messaging does not align with their actions, thereby providing a methodology to identify potentially "inauthentic" messaging campaigns.

67. On the Branching Bias of Syntax Extracted from Pre-trained Language Models [PDF] 返回目录
  Huayang Li, Lemao Liu, Guoping Huang, Shuming Shi
Abstract: Many efforts have been devoted to extracting constituency trees from pre-trained language models, often proceeding in two stages: feature definition and parsing. However, this kind of methods may suffer from the branching bias issue, which will inflate the performances on languages with the same branch it biases to. In this work, we propose quantitatively measuring the branching bias by comparing the performance gap on a language and its reversed language, which is agnostic to both language models and extracting methods. Furthermore, we analyze the impacts of three factors on the branching bias, namely parsing algorithms, feature definitions, and language models. Experiments show that several existing works exhibit branching biases, and some implementations of these three factors can introduce the branching bias.
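
The measurement relies on a "reversed language": every sentence (and its gold tree) is mirrored, so a method without branching bias should score comparably on both versions. A toy sketch of the corpus-side transformation, with invented sentences:

```python
# Illustrative sketch: building the reversed corpus used to quantify branching bias
# by mirroring the token order of each sentence. Sentences are toy examples.
sentences = [
    ["the", "cat", "chased", "the", "mouse"],
    ["she", "reads", "books"],
]
reversed_sentences = [list(reversed(s)) for s in sentences]
print(reversed_sentences[0])   # ['mouse', 'the', 'chased', 'cat', 'the']
# A bias-free extraction method should parse the original and reversed corpora
# (against correspondingly reversed gold trees) about equally well.
```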

68. Multi-Fact Correction in Abstractive Text Summarization [PDF] 返回目录
  Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, Jingjing Liu
Abstract: Pre-trained neural abstractive summarization systems have dominated extractive strategies on news summarization performance, at least in terms of ROUGE. However, system-generated abstractive summaries often face the pitfall of factual inconsistency: generating incorrect facts with respect to the source text. To address this challenge, we propose Span-Fact, a suite of two factual correction models that leverages knowledge learned from question answering models to make corrections in system-generated summaries via span selection. Our models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities in order to ensure semantic consistency w.r.t. the source text, while retaining the syntactic structure of summaries generated by abstractive summarization models. Experiments show that our models significantly boost the factual consistency of system-generated summaries without sacrificing summary quality in terms of both automatic metrics and human evaluation.

69. Modeling Preconditions in Text with a Crowd-sourced Dataset [PDF] 返回目录
  Heeyoung Kwon, Mahnaz Koupaee, Pratyush Singh, Gargi Sawhney, Anmol Shukla, Keerthi Kumar Kallur, Nathanael Chambers, Niranjan Balasubramanian
Abstract: Preconditions provide a form of logical connection between events that explains why some events occur together and information that is complementary to the more widely studied relations such as causation, temporal ordering, entailment, and discourse relations. Modeling preconditions in text has been hampered in part due to the lack of large scale labeled data grounded in text. This paper introduces PeKo, a crowd-sourced annotation of preconditions between event pairs in newswire, an order of magnitude larger than prior text annotations. To complement this new corpus, we also introduce two challenge tasks aimed at modeling preconditions: (i) Precondition Identification -- a standard classification task defined over pairs of event mentions, and (ii) Precondition Generation -- a generative task aimed at testing a more general ability to reason about a given event. Evaluation on both tasks shows that modeling preconditions is challenging even for today's large language models (LM). This suggests that precondition knowledge is not easily accessible in LM-derived representations alone. Our generation results show that fine-tuning an LM on PeKo yields better conditional relations than when trained on raw text or temporally-ordered corpora.

70. UNQOVERing Stereotyping Biases via Underspecified Questions [PDF] 返回目录
  Tao Li, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Vivek Srikumar
Abstract: While language embeddings have been shown to have stereotyping biases, how these biases affect downstream question answering (QA) models remains unexplored. We present UNQOVER, a general framework to probe and quantify biases through underspecified questions. We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors: positional dependence and question independence. We design a formalism that isolates the aforementioned errors. As case studies, we use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion. We probe five transformer-based QA models trained on two QA datasets, along with their underlying language models. Our broad study reveals that (1) all these models, with and without fine-tuning, have notable stereotyping biases in these classes; (2) larger models often have higher bias; and (3) the effect of fine-tuning on bias varies strongly with the dataset and the model size.
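
An underspecified probe pairs two subjects in a context that supports neither answer; the template and attribute below are invented stand-ins, not items from the released UNQOVER templates.

```python
# Illustrative sketch: instantiating an underspecified question for both subject orders,
# so positional effects can be averaged out. Template and attribute are hypothetical.
from itertools import permutations

subjects = ["John", "Mary"]
context = "{a} lives in the same city as {b}."
question = "Who is a bad driver?"

for a, b in permutations(subjects, 2):
    print(context.format(a=a, b=b), question)
# A QA model's answer preference between the two subjects, aggregated over both orders,
# is evidence of a stereotyping bias, since the context licenses neither answer.
```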

71. On the Role of Supervision in Unsupervised Constituency Parsing [PDF] 返回目录
  Haoyue Shi, Karen Livescu, Kevin Gimpel
Abstract: We analyze several recent unsupervised constituency parsing models, which are tuned with respect to the parsing F1 score on the Wall Street Journal (WSJ) development set (1,700 sentences). We introduce strong baselines for them, by training an existing supervised parsing model (Kitaev and Klein, 2018) on the same labeled examples they access. When training on the 1,700 examples, or even when using only 50 examples for training and 5 for development, such a few-shot parsing approach can outperform all the unsupervised parsing methods by a significant margin. Few-shot parsing can be further improved by a simple data augmentation method and self-training. This suggests that, in order to arrive at fair conclusions, we should carefully consider the amount of labeled data used for model development. We propose two protocols for future work on unsupervised parsing: (i) use fully unsupervised criteria for hyperparameter tuning and model selection; (ii) use as few labeled examples as possible for model development, and compare to few-shot parsing trained on the same labeled examples.

72. Efficient Inference For Neural Machine Translation [PDF] 返回目录
  Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, Ilya Chatsviorkin
Abstract: Large Transformer models have achieved state-of-the-art results in neural machine translation and have become standard in the field. In this work, we look for the optimal combination of known techniques to optimize inference speed without sacrificing translation quality. We conduct an empirical study that stacks various approaches and demonstrates that combination of replacing decoder self-attention with simplified recurrent units, adopting a deep encoder and a shallow decoder architecture and multi-head attention pruning can achieve up to 109% and 84% speedup on CPU and GPU respectively and reduce the number of parameters by 25% while maintaining the same translation quality in terms of BLEU.

73. Efficient One-Pass End-to-End Entity Linking for Questions [PDF] 返回目录
  Belinda Z. Li, Sewon Min, Srinivasan Iyer, Yashar Mehdad, Wen-tau Yih
Abstract: We present ELQ, a fast end-to-end entity linking model for questions, which uses a biencoder to jointly perform mention detection and linking in one pass. Evaluated on WebQSP and GraphQuestions with extended annotations that cover multiple entities per question, ELQ outperforms the previous state of the art by a large margin of +12.7% and +19.6% F1, respectively. With a very fast inference time (1.57 examples/s on a single CPU), ELQ can be useful for downstream question answering systems. In a proof-of-concept experiment, we demonstrate that using ELQ significantly improves the downstream QA performance of GraphRetriever (arXiv:1911.03868). Code and data available at this https URL
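
In a bi-encoder, linking reduces to dot products between mention-span vectors and pre-computed entity vectors; the random vectors and entity IDs below are placeholders for the learned encoders, not ELQ's actual components.

```python
# Illustrative sketch: scoring detected mention spans against cached entity encodings
# with dot products. Vectors are random stand-ins for a trained bi-encoder.
import numpy as np

rng = np.random.default_rng(0)
mention_vecs = rng.normal(size=(3, 8))    # one vector per detected mention span
entity_vecs = rng.normal(size=(5, 8))     # pre-computed entity table
entity_ids = ["Q1", "Q2", "Q3", "Q4", "Q5"]

scores = mention_vecs @ entity_vecs.T     # (num_mentions, num_entities)
for i, row in enumerate(scores):
    print(f"mention {i} -> {entity_ids[int(row.argmax())]}")
```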

74. Adversarial Grammatical Error Correction [PDF] 返回目录
  Vipul Raheja, Dimitrios Alikaniotis
Abstract: Recent works in Grammatical Error Correction (GEC) have leveraged the progress in Neural Machine Translation (NMT), to learn rewrites from parallel corpora of grammatically incorrect and corrected sentences, achieving state-of-the-art results. At the same time, Generative Adversarial Networks (GANs) have been successful in generating realistic texts across many different tasks by learning to directly minimize the difference between human-generated and synthetic text. In this work, we present an adversarial learning approach to GEC, using the generator-discriminator framework. The generator is a Transformer model, trained to produce grammatically correct sentences given grammatically incorrect ones. The discriminator is a sentence-pair classification model, trained to judge a given pair of grammatically incorrect-correct sentences on the quality of grammatical correction. We pre-train both the discriminator and the generator on parallel texts and then fine-tune them further using a policy gradient method that assigns high rewards to sentences which could be true corrections of the grammatically incorrect text. Experimental results on FCE, CoNLL-14, and BEA-19 datasets show that Adversarial-GEC can achieve competitive GEC quality compared to NMT-based baselines.

75. Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning [PDF] 返回目录
  Yi Yang, Arzoo Katiyar
Abstract: We present a simple few-shot named entity recognition (NER) system based on nearest neighbor learning and structured inference. Our system uses a supervised NER model trained on the source domain, as a feature extractor. Across several test domains, we show that a nearest neighbor classifier in this feature-space is far more effective than the standard meta-learning approaches. We further propose a cheap but effective method to capture the label dependencies between entity tags without expensive CRF training. We show that our method of combining structured decoding with nearest neighbor learning achieves state-of-the-art performance on standard few-shot NER evaluation tasks, improving F1 scores by $6\%$ to $16\%$ absolute points over prior meta-learning based systems.
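A hedged sketch of the two ingredients the abstract names: nearest-neighbor token classification in a feature space, followed by Viterbi decoding with a transition constraint instead of a trained CRF. The toy features, support set, and single forbidden transition are assumptions; the real system extracts features from a supervised NER model trained on the source domain.

```python
import numpy as np

TAGS = ["O", "B-PER", "I-PER"]

def nn_emission_scores(query_feats, support_feats, support_tags):
    """Negative distance to the nearest support token of each tag serves as the emission score."""
    scores = np.full((len(query_feats), len(TAGS)), -1e9)
    for i, q in enumerate(query_feats):
        for s, tag in zip(support_feats, support_tags):
            t = TAGS.index(tag)
            scores[i, t] = max(scores[i, t], -np.linalg.norm(q - s))
    return scores

def viterbi(emissions, transitions):
    """Standard Viterbi decoding over tag sequences; transitions[prev, cur] is a log-score."""
    n, k = emissions.shape
    dp = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        cand = dp[:, None] + transitions + emissions[i][None, :]
        back[i] = cand.argmax(axis=0)
        dp = cand.max(axis=0)
    path = [int(dp.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return [TAGS[t] for t in reversed(path)]

# Toy support set: a handful of labeled target-domain tokens in feature space.
rng = np.random.default_rng(1)
support_feats = rng.normal(size=(6, 8))
support_tags = ["O", "O", "B-PER", "I-PER", "O", "B-PER"]
query_feats = support_feats[[2, 3, 0]] + 0.01 * rng.normal(size=(3, 8))

# Structured constraint: forbid the invalid transition O -> I-PER.
transitions = np.zeros((3, 3))
transitions[TAGS.index("O"), TAGS.index("I-PER")] = -1e9

emissions = nn_emission_scores(query_feats, support_feats, support_tags)
print(viterbi(emissions, transitions))  # -> ['B-PER', 'I-PER', 'O'] for these toy features
```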

76. Guiding Attention for Self-Supervised Learning with Transformers [PDF] 返回目录
  Ameet Deshpande, Karthik Narasimhan
Abstract: In this paper, we propose a simple and effective technique to allow for efficient self-supervised learning with bi-directional Transformers. Our approach is motivated by recent studies demonstrating that self-attention patterns in trained models contain a majority of non-linguistic regularities. We propose a computationally efficient auxiliary loss function to guide attention heads to conform to such patterns. Our method is agnostic to the actual pre-training objective and results in faster convergence of models as well as better performance on downstream tasks compared to the baselines, achieving state of the art results in low-resource settings. Surprisingly, we also find that linguistic properties of attention heads are not necessarily correlated with language modeling performance.
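A minimal sketch of an auxiliary guidance loss of the kind the abstract describes: a head's attention map is penalized for deviating from a fixed, non-linguistic target pattern (here, "attend to the previous token"). The pattern, the KL form of the penalty, and the toy attention matrix are assumptions for illustration.

```python
import numpy as np

def previous_token_pattern(seq_len):
    """Target attention map: each position attends to the previous token (the first to itself)."""
    pattern = np.zeros((seq_len, seq_len))
    pattern[0, 0] = 1.0
    for i in range(1, seq_len):
        pattern[i, i - 1] = 1.0
    return pattern

def attention_guidance_loss(attn, pattern, eps=1e-9):
    """KL(pattern || attn), averaged over query positions."""
    kl = pattern * (np.log(pattern + eps) - np.log(attn + eps))
    return float(kl.sum(axis=-1).mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 5))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # one toy head's attention

aux = attention_guidance_loss(attn, previous_token_pattern(5))
print(f"auxiliary guidance loss: {aux:.3f}")  # added to the pre-training loss with a small weight
```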

77. Plan Optimization to Bilingual Dictionary Induction for Low-Resource Language Families [PDF] 返回目录
  Arbi Haza Nasution, Yohei Murakami, Toru Ishida
Abstract: Creating a bilingual dictionary is the first crucial step in enriching low-resource languages. Especially for closely-related ones, it has been shown that the constraint-based approach is useful for inducing bilingual lexicons from two bilingual dictionaries via the pivot language. However, if there are no machine-readable dictionaries available as input, we need to consider manual creation by bilingual native speakers. To reach the goal of comprehensively creating multiple bilingual dictionaries, even if we already have several existing machine-readable bilingual dictionaries, it is still difficult to determine the execution order of the constraint-based approach so as to reduce the total cost. Plan optimization is crucial in composing the order of bilingual dictionary creation with consideration of the methods and their costs. We formalize the plan optimization for creating bilingual dictionaries by utilizing a Markov Decision Process (MDP), with the goal of obtaining a more accurate estimation of the most feasible optimal plan with the least total cost before fully implementing the constraint-based bilingual lexicon induction. We model a prior beta distribution of bilingual lexicon induction precision, with language similarity and polysemy of the topology as the $\alpha$ and $\beta$ parameters. It is further used to model the cost function and state transition probability. We estimate the cost of the all-investment plan as a baseline for evaluating the proposed MDP-based approach, with total cost as the evaluation metric. After utilizing the posterior beta distribution from the first batch of experiments to construct the prior beta distribution for the second batch of experiments, the result shows a 61.5% cost reduction compared to the estimated all-investment plan and a 39.4% cost reduction compared to the estimated MDP optimal plan. The MDP-based proposal outperformed the baseline on total cost.

78. A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families [PDF] 返回目录
  Arbi Haza Nasution, Yohei Murakami, Toru Ishida
Abstract: The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have been proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely-related languages by extending constraints from the recent pivot-based induction technique and further enabling multiple symmetry assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This paper utilizes four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. We use three constraint-based methods from our previous work, the Inverse Consultation method and translation pairs generated from the Cartesian product of input dictionaries as baselines. We evaluate our result using the metrics of precision, recall and F-score. Our customizable approach allows the user to conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) with various combinations of heuristics and the number of symmetry assumption cycles to gain the highest F-score. Our proposed methods have statistically significant improvement of precision and F-score compared to our previous constraint-based methods. The results show that our method demonstrates the potential to complement other bilingual dictionary creation methods like word alignment models using parallel corpora for high-resource languages while well handling low-resource languages.

80. Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [PDF] 返回目录
  Lichao Sun, Congying Xia, Wenpeng Yin, Tingting Liang, Philip S. Yu, Lifang He
Abstract: Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels. It has shown strong effectiveness in image classification by interpolating images at the pixel level. Inspired by this line of research, in this paper, we explore i) how to apply mixup to natural language processing tasks, since text data can hardly be mixed in the raw format; ii) whether mixup is still effective in transformer-based learning models such as BERT. To achieve the goal, we incorporate mixup into a transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks while keeping the whole end-to-end training system. We evaluate the proposed framework by running extensive experiments on the GLUE benchmark. Furthermore, we also examine the performance of mixup-transformer in low-resource scenarios by reducing the training data by a certain ratio. Our studies show that mixup is a domain-independent data augmentation technique for pre-trained language models, resulting in significant performance improvements for transformer-based models.
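Because raw text cannot be interpolated directly, mixup is applied to transformer representations. Below is a hedged sketch of that step for sentence classification, assuming mixing on pooled encoder outputs with soft labels; the Beta parameter and the pooling choice are illustrative, not the paper's exact settings.

```python
import numpy as np

def mixup_hidden(h_a, h_b, y_a, y_b, alpha=0.4, rng=None):
    """Interpolate pooled hidden states and one-hot labels with lambda ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * h_a + (1.0 - lam) * h_b, lam * y_a + (1.0 - lam) * y_b, lam

rng = np.random.default_rng(0)
h_a, h_b = rng.normal(size=768), rng.normal(size=768)  # pooled encoder outputs of two examples
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot labels for a 2-class task

h_mix, y_mix, lam = mixup_hidden(h_a, h_b, y_a, y_b, rng=rng)
print(f"lambda={lam:.2f}, mixed label={y_mix}")
# The classification head is then trained on (h_mix, y_mix) with the usual soft-label cross-entropy.
```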

80. Interactive Fiction Game Playing as Multi-Paragraph Reading Comprehension with Reinforcement Learning [PDF] 返回目录
  Xiaoxiao Guo, Mo Yu, Yupeng Gao, Chuang Gan, Murray Campbell, Shiyu Chang
Abstract: Interactive Fiction (IF) games with real human-written natural language texts provide a new natural evaluation for language understanding techniques. In contrast to previous text games with mostly synthetic texts, IF games pose language understanding challenges on the human-written textual descriptions of diverse and sophisticated game worlds and language generation challenges on the action command generation from less restricted combinatorial space. We take a novel perspective of IF game solving and re-formulate it as Multi-Passage Reading Comprehension (MPRC) tasks. Our approaches utilize the context-query attention mechanisms and the structured prediction in MPRC to efficiently generate and evaluate action outputs and apply an object-centric historical observation retrieval strategy to mitigate the partial observability of the textual observations. Extensive experiments on the recent IF benchmark (Jericho) demonstrate clear advantages of our approaches achieving high winning rates and low data requirements compared to all previous approaches. Our source code is available at: this https URL.

81. Fine-Grained Grounding for Multimodal Speech Recognition [PDF] 返回目录
  Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott
Abstract: Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality, by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering entities that have been masked in the audio, these models should be capable of recovering a broader range of word types. Existing systems rely on global visual features that represent the entire image, but localizing the relevant regions of the image will make it possible to recover a larger set of words, such as adjectives and verbs. In this paper, we propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals. In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's ability to localize the correct proposals.

82. Improving Neural Topic Models using Knowledge Distillation [PDF] 返回目录
  Alexander Hoyle, Pranav Goel, Philip Resnik
Abstract: Topic models are often used to identify human-interpretable topics to help make sense of large document collections. We use knowledge distillation to combine the best attributes of probabilistic topic models and pretrained transformers. Our modular method can be straightforwardly applied with any neural topic model to improve topic quality, which we demonstrate using two models having disparate architectures, obtaining state-of-the-art topic coherence. We show that our adaptable framework not only improves performance in the aggregate over all estimated topics, as is commonly reported, but also in head-to-head comparisons of aligned topics.

83. Investigating representations of verb bias in neural language models [PDF] 返回目录
  Robert D. Hawkins, Takateru Yamakoshi, Thomas L. Griffiths, Adele E. Goldberg
Abstract: Languages typically provide more than one grammatical construction to express certain types of messages. A speaker's choice of construction is known to depend on multiple factors, including the choice of main verb -- a phenomenon known as \emph{verb bias}. Here we introduce DAIS, a large benchmark dataset containing 50K human judgments for 5K distinct sentence pairs in the English dative alternation. This dataset includes 200 unique verbs and systematically varies the definiteness and length of arguments. We use this dataset, as well as an existing corpus of naturally occurring data, to evaluate how well recent neural language models capture human preferences. Results show that larger models perform better than smaller models, and transformer architectures (e.g. GPT-2) tend to out-perform recurrent architectures (e.g. LSTMs) even under comparable parameter and training settings. Additional analyses of internal feature representations suggest that transformers may better integrate specific lexical information with grammatical constructions.

84. Understanding the Mechanics of SPIGOT: Surrogate Gradients for Latent Structure Learning [PDF] 返回目录
  Tsvetomila Mihaylova, Vlad Niculae, André F. T. Martins
Abstract: Latent structure models are a powerful tool for modeling language data: they can mitigate the error propagation and annotation bottleneck in pipeline systems, while simultaneously uncovering linguistic insights about the data. One challenge with end-to-end training of these models is the argmax operation, which has null gradient. In this paper, we focus on surrogate gradients, a popular strategy to deal with this problem. We explore latent structure learning through the angle of pulling back the downstream learning objective. In this paradigm, we discover a principled motivation for both the straight-through estimator (STE) as well as the recently-proposed SPIGOT - a variant of STE for structured models. Our perspective leads to new algorithms in the same family. We empirically compare the known and the novel pulled-back estimators against the popular alternatives, yielding new insight for practitioners and revealing intriguing failure cases.
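A minimal PyTorch sketch of the straight-through estimator (STE) discussed above, for an unstructured argmax over a small label set: the forward pass emits the discrete one-hot choice, while gradients flow through the softmax relaxation. SPIGOT's structured variant and the pulled-back estimators proposed in the paper are not shown.

```python
import torch
import torch.nn.functional as F

def straight_through_argmax(logits: torch.Tensor) -> torch.Tensor:
    """Forward: one-hot argmax. Backward: gradient of the softmax relaxation."""
    y_soft = F.softmax(logits, dim=-1)
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    # (y_hard - y_soft) is treated as a constant, so d(output)/d(logits) = d(y_soft)/d(logits).
    return (y_hard - y_soft).detach() + y_soft

logits = torch.tensor([[0.2, 1.5, -0.3]], requires_grad=True)
z = straight_through_argmax(logits)                       # exactly one-hot in the forward pass
downstream_loss = ((z - torch.tensor([[0.0, 0.0, 1.0]])) ** 2).sum()
downstream_loss.backward()
print(z, logits.grad)  # the null gradient of argmax is replaced by a usable surrogate
```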

85. Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages [PDF] 返回目录
  Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddee Hassan Muhammad, Salomon Kabongo, Salomey Osei, Sackey Freshia, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa, Mofe Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Jane Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Idris Abdulkabir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Espoir Murhabazi, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Emezue, Bonaventure Dossou, Blessing Sibanda, Blessing Itoro Bassey, Ayodele Olabiyi, Arshath Ramkilowan
Abstract: Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), that plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released under this https URL.

86. Inference Strategies for Machine Translation with Conditional Masking [PDF] 返回目录
  Julia Kreutzer, George Foster, Colin Cherry
Abstract: Conditional masked language model (CMLM) training has proven successful for non-autoregressive and semi-autoregressive sequence generation tasks, such as machine translation. Given a trained CMLM, however, it is not clear what the best inference strategy is. We formulate masked inference as a factorization of conditional probabilities of partial sequences, show that this does not harm performance, and investigate a number of simple heuristics motivated by this perspective. We identify a thresholding strategy that has advantages over the standard "mask-predict" algorithm, and provide analyses of its behavior on machine translation tasks.
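A hedged sketch of mask-predict decoding with the thresholding idea described above: rather than re-masking a fixed number of lowest-confidence tokens per iteration, every position whose probability falls below a threshold is re-masked. The stub model returns made-up probabilities purely so the loop runs; it stands in for a trained CMLM, and the threshold value is an assumption.

```python
import random

MASK = "<mask>"

def cmlm_predict(tokens):
    """Stub CMLM: fills masked positions and returns (token, probability) per position.

    A real system would call a trained conditional masked language model here.
    """
    vocab = ["the", "cat", "sat", "on", "mat"]
    out = []
    for tok in tokens:
        if tok == MASK:
            out.append((random.choice(vocab), random.uniform(0.3, 1.0)))
        else:
            out.append((tok, 1.0))
    return out

def mask_predict_threshold(length, steps=4, threshold=0.7, seed=0):
    random.seed(seed)
    tokens = [MASK] * length
    preds = []
    for _ in range(steps):
        preds = cmlm_predict(tokens)
        # Keep confident predictions; re-mask every position below the threshold.
        tokens = [tok if p >= threshold else MASK for tok, p in preds]
        if MASK not in tokens:
            break
    # Positions still masked after the final step keep their last best guess.
    return [preds[i][0] if tok == MASK else tok for i, tok in enumerate(tokens)]

print(mask_predict_threshold(5))
```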

87. We Don't Speak the Same Language: Interpreting Polarization through Machine Translation [PDF] 返回目录
  Ashiqur R. KhudaBukhsh, Rupak Sarkar, Mark S. Kamlet, Tom M. Mitchell
Abstract: Polarization among US political parties, media and elites is a widely studied topic. Prominent lines of prior research across multiple disciplines have observed and analyzed growing polarization in social media. In this paper, we present a new methodology that offers a fresh perspective on interpreting polarization through the lens of machine translation. With a novel proposition that two sub-communities are speaking in two different \emph{languages}, we demonstrate that modern machine translation methods can provide a simple yet powerful and interpretable framework to understand the differences between two (or more) large-scale social media discussion data sets at the granularity of words. Via a substantial corpus of 86.6 million comments by 6.5 million users on over 200,000 news videos hosted by YouTube channels of four prominent US news networks, we demonstrate that simple word-level and phrase-level translation pairs can reveal deep insights into the current political divide - what is \emph{black lives matter} to one can be \emph{all lives matter} to the other.

88. CAT-Gen: Improving Robustness in NLP Models via Controlled Adversarial Text Generation [PDF] 返回目录
  Tianlu Wang, Xuezhi Wang, Yao Qin, Ben Packer, Kang Li, Jilin Chen, Alex Beutel, Ed Chi
Abstract: NLP models are shown to suffer from robustness issues, i.e., a model's prediction can be easily changed under small perturbations to the input. In this work, we present a Controlled Adversarial Text Generation (CAT-Gen) model that, given an input text, generates adversarial texts through controllable attributes that are known to be invariant to task labels. For example, in order to attack a model for sentiment classification over product reviews, we can use the product categories as the controllable attribute which would not change the sentiment of the reviews. Experiments on real-world NLP datasets demonstrate that our method can generate more diverse and fluent adversarial texts, compared to many existing adversarial text generation approaches. We further use our generated adversarial examples to improve models through adversarial training, and we demonstrate that our generated attacks are more robust against model re-training and different model architectures.

89. InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [PDF] 返回目录
  Boxin Wang, Shuohang Wang, Yu Cheng, Zhe Gan, Ruoxi Jia, Bo Li, Jingjing Liu
Abstract: Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks. Recent studies, however, show that such BERT-based models are vulnerable facing the threats of textual adversarial attacks. We aim to address this problem from an information-theoretic perspective, and propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models. InfoBERT contains two mutual-information-based regularizers for model training: (i) an Information Bottleneck regularizer, which suppresses noisy mutual information between the input and the feature representation; and (ii) a Robust Feature regularizer, which increases the mutual information between local robust features and global features. We provide a principled way to theoretically analyze and improve the robustness of representation learning for language models in both standard and adversarial training. Extensive experiments demonstrate that InfoBERT achieves state-of-the-art robust accuracy over several adversarial datasets on Natural Language Inference (NLI) and Question Answering (QA) tasks.

90. SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup [PDF] 返回目录
  Rongzhi Zhang, Yue Yu, Chao Zhang
Abstract: Active learning is an important technique for low-resource sequence labeling tasks. However, current active sequence labeling methods use the queried samples alone in each iteration, which is an inefficient way of leveraging human annotations. We propose a simple but effective data augmentation method to improve the label efficiency of active sequence labeling. Our method, SeqMix, simply augments the queried samples by generating extra labeled sequences in each iteration. The key difficulty is to generate plausible sequences along with token-level labels. In SeqMix, we address this challenge by performing mixup for both sequences and token-level labels of the queried samples. Furthermore, we design a discriminator during sequence mixup, which judges whether the generated sequences are plausible or not. Our experiments on Named Entity Recognition and Event Detection tasks show that SeqMix can improve the standard active sequence labeling method by $2.27\%$--$3.75\%$ in terms of $F_1$ scores. The code and data for SeqMix can be found at this https URL
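A minimal sketch of the core augmentation step: the token embeddings and token-level one-hot labels of two queried sequences of the same length are linearly mixed. The toy embeddings, tag set, and Beta parameter are assumptions, and the discriminator that SeqMix uses to filter implausible sequences is omitted.

```python
import numpy as np

TAGS = ["O", "B-LOC", "I-LOC"]

def one_hot(tags):
    y = np.zeros((len(tags), len(TAGS)))
    for i, t in enumerate(tags):
        y[i, TAGS.index(t)] = 1.0
    return y

def seq_mixup(emb_a, emb_b, tags_a, tags_b, alpha=8.0, rng=None):
    """Mix token embeddings and token-level labels of two same-length sequences."""
    assert emb_a.shape == emb_b.shape
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # a large alpha keeps lambda near 0.5
    emb_mix = lam * emb_a + (1.0 - lam) * emb_b
    y_mix = lam * one_hot(tags_a) + (1.0 - lam) * one_hot(tags_b)
    return emb_mix, y_mix, lam

rng = np.random.default_rng(0)
emb_a, emb_b = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
tags_a, tags_b = ["O", "B-LOC", "I-LOC", "O"], ["B-LOC", "I-LOC", "O", "O"]

emb_mix, y_mix, lam = seq_mixup(emb_a, emb_b, tags_a, tags_b, rng=rng)
print(f"lambda={lam:.2f}")
print(y_mix.round(2))  # soft token-level labels for the generated sequence
```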

91. Sentiment Analysis for Reinforcement Learning [PDF] 返回目录
  Ameet Deshpande, Eve Fleisig
Abstract: While reinforcement learning (RL) has been successful in natural language processing (NLP) domains such as dialogue generation and text-based games, it typically faces the problem of sparse rewards that leads to slow or no convergence. Traditional methods that use text descriptions to extract only a state representation ignore the feedback inherently present in them. In text-based games, for example, descriptions like "Good Job! You ate the food" indicate progress, and descriptions like "You entered a new room" indicate exploration. Positive and negative cues like these can be converted to rewards through sentiment analysis. This technique converts the sparse reward problem into a dense one, which is easier to solve. Furthermore, this can enable reinforcement learning without rewards, in which the agent learns entirely from these intrinsic sentiment rewards. This framework is similar to intrinsic motivation, where the environment does not necessarily provide the rewards, but the agent analyzes and realizes them by itself. We find that providing dense rewards in text-based games using sentiment analysis improves performance under some conditions.
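A hedged sketch of converting textual feedback into a dense reward as described above. The tiny word lists stand in for a trained sentiment model, and the shaping weight is an arbitrary assumption.

```python
POSITIVE = {"good", "great", "delicious", "progress", "ate"}
NEGATIVE = {"locked", "died", "cannot", "nothing"}

def sentiment_reward(description):
    """Map a game's textual feedback to a scalar score via word-level sentiment."""
    words = description.lower().replace("!", " ").replace(".", " ").split()
    return float(sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words))

def shaped_reward(env_reward, description, weight=0.1):
    """Add the scaled intrinsic sentiment reward to the sparse environment reward."""
    return env_reward + weight * sentiment_reward(description)

print(shaped_reward(0.0, "Good job! You ate the food."))            # dense positive signal
print(shaped_reward(0.0, "The door is locked. Nothing happens."))   # dense negative signal
```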

92. KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [PDF] 返回目录
  Wenhu Chen, Yu Su, Xifeng Yan, William Yang Wang
Abstract: Data-to-text generation has recently attracted substantial interest due to its wide applications. Existing methods have shown impressive performance on an array of tasks. However, they rely on a significant amount of labeled data for each task, which is costly to acquire and thus limits their application to new tasks and domains. In this paper, we propose to leverage pre-training and transfer learning to address this issue. We propose a knowledge-grounded pre-training (KGPT) approach, which consists of two parts: 1) a general knowledge-grounded generation model to generate knowledge-enriched text, and 2) a pre-training paradigm on a massive knowledge-grounded text corpus crawled from the web. The pre-trained model can be fine-tuned on various data-to-text generation tasks to generate task-specific text. We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness. Under the fully-supervised setting, our model can achieve remarkable gains over the known baselines. Under the zero-shot setting, our model without seeing any examples achieves over 30 ROUGE-L on WebNLG while all other baselines fail. Under the few-shot setting, our model only needs about one-fifteenth as many labeled examples to achieve the same level of performance as baseline models. These experiments consistently prove the strong generalization ability of our proposed framework (this https URL).

93. Conversational Document Prediction to Assist Customer Care Agents [PDF] 返回目录
  Jatin Ganhotra, Haggai Roitman, Doron Cohen, Nathaniel Mills, Chulaka Gunasekara, Yosi Mass, Sachindra Joshi, Luis Lastras, David Konopnicki
Abstract: A frequent pattern in customer care conversations is the agents responding with appropriate webpage URLs that address users' needs. We study the task of predicting the documents that customer care agents can use to facilitate users' needs. We also introduce a new public dataset which supports the aforementioned problem. Using this dataset and two others, we investigate state-of-the-art deep learning (DL) and information retrieval (IR) models for the task. Additionally, we analyze the practicality of such systems in terms of inference time complexity. We show that a hybrid IR+DL approach provides the best of both worlds.

94. PAIR: Planning and Iterative Refinement in Pre-trained Transformers for Long Text Generation [PDF] 返回目录
  Xinyu Hua, Lu Wang
Abstract: Pre-trained Transformers have enabled impressive breakthroughs in generating long and fluent text, yet their outputs are often "rambling" without coherently arranged content. In this work, we present a novel content-controlled text generation framework, PAIR, with planning and iterative refinement, which is built upon a large model, BART. We first adapt the BERT model to automatically construct the content plans, consisting of keyphrase assignments and their corresponding sentence-level positions. The BART model is employed for generation without modifying its structure. We then propose a refinement algorithm to gradually enhance the generation quality within the sequence-to-sequence framework. Evaluation with automatic metrics shows that adding planning consistently improves the generation quality on three distinct domains, with an average of 20 BLEU points and 12 METEOR points improvements. In addition, human judges rate our system outputs to be more relevant and coherent than comparisons without planning.

95. Semi-Supervised Speech-Language Joint Pre-Training for Spoken Language Understanding [PDF] 返回目录
  Yu-An Chung, Chenguang Zhu, Michael Zeng
Abstract: Spoken language understanding (SLU) requires a model to analyze input acoustic signals to understand its linguistic content and make predictions. To boost the models' performance, various pre-training methods have been proposed to utilize large-scale unlabeled text and speech data. However, the inherent disparities between the two modalities necessitate a mutual analysis. In this paper, we propose a novel semi-supervised learning method, AlignNet, to jointly pre-train the speech and language modules. Besides a self-supervised masked language modeling of the two individual modules, AlignNet aligns representations from paired speech and transcripts in a shared latent semantic space. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and contextual semantic knowledge. Experimental results verify the effectiveness of our approach on various SLU tasks. For example, AlignNet improves the previous state-of-the-art accuracy on the Spoken SQuAD dataset by 6.2%.

96. Effects of Naturalistic Variation in Goal-Oriented Dialog [PDF] 返回目录
  Jatin Ganhotra, Robert Moore, Sachindra Joshi, Kahini Wadhawan
Abstract: Existing benchmarks used to evaluate the performance of end-to-end neural dialog systems lack a key component: natural variation present in human conversations. Most datasets are constructed through crowdsourcing, where the crowd workers follow a fixed template of instructions while enacting the role of a user/agent. This results in straight-forward, somewhat routine, and mostly trouble-free conversations, as crowd workers do not think to represent the full range of actions that occur naturally with real users. In this work, we investigate the impact of naturalistic variation on two goal-oriented datasets: bAbI dialog task and Stanford Multi-Domain Dataset (SMD). We also propose new and more effective testbeds for both datasets, by introducing naturalistic variation by the user. We observe that there is a significant drop in performance (more than 60% in Ent. F1 on SMD and 85% in per-dialog accuracy on bAbI task) of recent state-of-the-art end-to-end neural methods such as BossNet and GLMP on both datasets.

97. An Ensemble Approach to Automatic Structuring of Radiology Reports [PDF] 返回目录
  Morteza Pourreza Shahri, Amir Tahmasebi, Bingyang Ye, Henghui Zhu, Javed Aslam, Timothy Ferris
Abstract: Automatic structuring of electronic medical records is of high demand for clinical workflow solutions to facilitate extraction, storage, and querying of patient care information. However, developing a scalable solution is extremely challenging, specifically for radiology reports, as most healthcare institutes use either no template or department/institute specific templates. Moreover, radiologists' reporting style varies from one to another as sentences are telegraphic and do not follow general English grammar rules. We present an ensemble method that consolidates the predictions of three models, capturing various attributes of textual information for automatic labeling of sentences with section labels. These three models are: 1) Focus Sentence model, capturing context of the target sentence; 2) Surrounding Context model, capturing the neighboring context of the target sentence; and finally, 3) Formatting/Layout model, aimed at learning report formatting cues. We utilize Bi-directional LSTMs, followed by sentence encoders, to acquire the context. Furthermore, we define several features that incorporate the structure of reports. We compare our proposed approach against multiple baselines and state-of-the-art approaches on a proprietary dataset as well as 100 manually annotated radiology notes from the MIMIC-III dataset, which we are making publicly available. Our proposed approach significantly outperforms other approaches by achieving 97.1% accuracy.
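A minimal sketch of the consolidation step: each of the three models emits a distribution over section labels for a sentence, and the ensemble averages them before taking the argmax. The label set, toy distributions, and uniform weights are assumptions for illustration.

```python
import numpy as np

SECTIONS = ["Findings", "Impression", "Comparison", "Indication"]

def ensemble_label(per_model_probs, weights=None):
    """Average per-model label distributions for one sentence and return the top section."""
    probs = np.stack(per_model_probs)
    if weights is None:
        weights = np.full(len(per_model_probs), 1.0 / len(per_model_probs))
    combined = (np.asarray(weights)[:, None] * probs).sum(axis=0)
    return SECTIONS[int(combined.argmax())]

# Toy outputs of the focus-sentence, surrounding-context, and formatting/layout models.
focus      = np.array([0.70, 0.20, 0.05, 0.05])
context    = np.array([0.40, 0.50, 0.05, 0.05])
formatting = np.array([0.60, 0.25, 0.10, 0.05])

print(ensemble_label([focus, context, formatting]))  # -> 'Findings'
```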

98. MedFilter: Extracting Task-relevant Utterances from Medical Dialogue through Integration of Discourse Structure and Ontological Knowledge [PDF] 返回目录
  Sopan Khosla, Shikhar Vashishth, Jill Fain Lehman, Carolyn Rose
Abstract: Information extraction from conversational data is particularly challenging because the task-centric nature of conversation allows for effective communication of implicit information by humans, but is challenging for machines. The challenges may differ between utterances depending on the role of the speaker within the conversation, especially when relevant expertise is distributed asymmetrically across roles. Further, the challenges may also increase over the conversation as more shared context is built up through information communicated implicitly earlier in the dialogue. In this paper, we propose the novel modeling approach MedFilter, which addresses these insights in order to increase performance at identifying and categorizing task-relevant utterances, and in so doing, positively impacts performance at a downstream information extraction task. We evaluate this approach on a corpus of nearly 7,000 doctor-patient conversations where MedFilter is used to identify medically relevant contributions to the discussion (achieving a 10% improvement over SOTA baselines in terms of area under the PR curve). Identifying task-relevant utterances benefits downstream medical processing, achieving improvements of 15%, 105%, and 23% respectively for the extraction of symptoms, medications, and complaints.

99. Acrostic Poem Generation [PDF] 返回目录
  Rajat Agarwal, Katharina Kann
Abstract: We propose a new task in the area of computational creativity: acrostic poem generation in English. Acrostic poems are poems that contain a hidden message; typically, the first letter of each line spells out a word or short phrase. We define the task as a generation task with multiple constraints: given an input word, 1) the initial letters of each line should spell out the provided word, 2) the poem's semantics should also relate to it, and 3) the poem should conform to a rhyming scheme. We further provide a baseline model for the task, which consists of a conditional neural language model in combination with a neural rhyming model. Since no dedicated datasets for acrostic poem generation exist, we create training data for our task by first training a separate topic prediction model on a small set of topic-annotated poems and then predicting topics for additional poems. Our experiments show that the acrostic poems generated by our baseline are received well by humans and do not lose much quality due to the additional constraints. Last, we confirm that poems generated by our model are indeed closely related to the provided prompts, and that pretraining on Wikipedia can boost performance.
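The acrostic constraint itself is easy to state in code. Below is a small sketch of checking it and of filtering candidate line openings so that decoding can only start a line with the required letter; the candidate words are toy data, and the rhyme and topic constraints from the abstract are not modeled.

```python
def satisfies_acrostic(poem_lines, word):
    """True if the first letters of the lines spell the given word (case-insensitive)."""
    if len(poem_lines) != len(word):
        return False
    return all(line.strip()[:1].lower() == ch.lower() for line, ch in zip(poem_lines, word))

def allowed_first_tokens(candidates, letter):
    """Constrain decoding: keep only candidate opening words starting with the required letter."""
    return [w for w in candidates if w.lower().startswith(letter.lower())]

poem = ["Silver clouds drift by", "Under a patient moon", "Night settles softly"]
print(satisfies_acrostic(poem, "sun"))                               # True
print(allowed_first_tokens(["sky", "moon", "silver", "sun"], "s"))   # ['sky', 'silver', 'sun']
```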

100. Learning to Generalize for Sequential Decision Making [PDF] 返回目录
  Xusen Yin, Ralph Weischedel, Jonathan May
Abstract: We consider problems of making sequences of decisions to accomplish tasks, interacting via the medium of language. These problems are often tackled with reinforcement learning approaches. We find that these models do not generalize well when applied to novel task domains. However, the large amount of computation necessary to adequately train and explore the search space of sequential decision making, under a reinforcement learning paradigm, precludes the inclusion of large contextualized language models, which might otherwise enable the desired generalization ability. We introduce a teacher-student imitation learning methodology and a means of converting a reinforcement learning model into a natural language understanding model. Together, these methodologies enable the introduction of contextualized language models into the sequential decision making problem space. We show that models can learn faster and generalize more, leveraging both the imitation learning and the reformulation. Our models exceed teacher performance on various held-out decision problems, by up to 7% on in-domain problems and 24% on out-of-domain problems.

101. Joint Semantics and Data-Driven Path Representation for Knowledge Graph Inference [PDF] 返回目录
  Guanglin Niu, Bo Li, Yongfei Zhang, Yongpan Sheng, Chuan Shi, Jingyang Li, Shiliang Pu
Abstract: Inference on a large-scale knowledge graph (KG) is of great importance for KG applications like question answering. Path-based reasoning models can leverage much information over paths beyond pure triples in the KG, but they face several challenges: all the existing path-based methods are data-driven and lack explainability for path representation. Besides, some methods either consider only relational paths or ignore the heterogeneity between entities and relations contained in paths, and thus cannot capture the rich semantics of paths well. To address the above challenges, in this work, we propose a novel joint semantics and data-driven path representation that balances explainability and generalization in the framework of KG embedding. More specifically, we inject Horn rules to obtain condensed paths through a transparent and explainable path composition procedure. The entity converter is designed to transform the entities along paths into representations at the semantic level, similar to relations, to reduce the heterogeneity between entities and relations; KGs both with and without type information are considered. Our proposed model is evaluated on two classes of tasks: link prediction and path query answering. The experimental results show that it has a significant performance gain over several different state-of-the-art baselines.

102. Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering [PDF] 返回目录
  Wei Han, Hantao Huang, Tao Han
Abstract: Image text carries essential information to understand the scene and perform reasoning. Text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. Positional information of text is underused and there is a lack of evidence for the generated answer. As such, this paper proposes a localization-aware answer prediction network (LaAP-Net) to address this challenge. Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.

103. Unsupervised Hierarchical Concept Learning [PDF] 返回目录
  Sumegh Roychowdhury, Sumedh A. Sontakke, Nikaash Puri, Mausoom Sarkar, Milan Aggarwal, Pinkesh Badjatiya, Balaji Krishnamurthy, Laurent Itti
Abstract: Discovering concepts (or temporal abstractions) in an unsupervised manner from demonstration data in the absence of an environment is an important problem. Organizing these discovered concepts hierarchically at different levels of abstraction is useful in discovering patterns, building ontologies, and generating tutorials from demonstration data. However, recent work to discover such concepts without access to any environment does not discover relationships (or a hierarchy) between these discovered concepts. In this paper, we present a Transformer-based concept abstraction architecture UNHCLE (pronounced uncle) that extracts a hierarchy of concepts in an unsupervised way from demonstration data. We empirically demonstrate how UNHCLE discovers meaningful hierarchies using datasets from Chess and Cooking domains. Finally, we show how UNHCLE learns meaningful language labels for concepts by using demonstration data augmented with natural language for cooking and chess. All of our code is available at this https URL

104. A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments [PDF] 返回目录
  Youngmoon Jung, Yeunju Choi, Hyungjun Lim, Hoirin Kim
Abstract: Speaker verification (SV) has recently attracted considerable research interest due to the growing popularity of virtual assistants. At the same time, there is an increasing requirement for an SV system: it should be robust to short speech segments, especially in noisy and reverberant environments. In this paper, we consider one more important requirement for practical applications: the system should be robust to an audio stream containing long non-speech segments, where a voice activity detection (VAD) is not applied. To meet these two requirements, we introduce feature pyramid module (FPM)-based multi-scale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD). We present the FPM-based MSA to deal with short speech segments in noisy and reverberant environments. Also, we use the SAS-VAD to increase the robustness to long non-speech segments. To further improve the robustness to acoustic distortions (i.e., noise and reverberation), we apply a masking-based speech enhancement (SE) method. We combine SV, VAD, and SE models in a unified deep learning framework and jointly train the entire network in an end-to-end manner. To the best of our knowledge, this is the first work combining these three models in a deep learning framework. We conduct experiments on Korean indoor (KID) and VoxCeleb datasets, which are corrupted by noise and reverberation. The results show that the proposed method is effective for SV in the challenging conditions and performs better than the baseline i-vector and deep speaker embedding systems.

105. Learning Visual-Semantic Embeddings for Reporting Abnormal Findings on Chest X-rays [PDF] 返回目录
  Jianmo Ni, Chun-Nan Hsu, Amilcare Gentili, Julian McAuley
Abstract: Automatic medical image report generation has drawn growing attention due to its potential to alleviate radiologists' workload. Existing work on report generation often trains encoder-decoder networks to generate complete reports. However, such models are affected by data bias (e.g.~label imbalance) and face common issues inherent in text generation models (e.g.~repetition). In this work, we focus on reporting abnormal findings on radiology images; instead of training on complete radiology reports, we propose a method to identify abnormal findings from the reports in addition to grouping them with unsupervised clustering and minimal rules. We formulate the task as cross-modal retrieval and propose Conditional Visual-Semantic Embeddings to align images and fine-grained abnormal findings in a joint embedding space. We demonstrate that our method is able to retrieve abnormal findings and outperforms existing generation models on both clinical correctness and text generation metrics.

106. Identifying Spurious Correlations for Robust Text Classification [PDF] 返回目录
  Zhao Wang, Aron Culotta
Abstract: The predictions of text classifiers are often driven by spurious correlations -- e.g., the term "Spielberg" correlates with positively reviewed movies, even though the term itself does not semantically convey a positive sentiment. In this paper, we propose a method to distinguish spurious and genuine correlations in text classification. We treat this as a supervised classification problem, using features derived from treatment effect estimators to distinguish spurious correlations from "genuine" ones. Due to the generic nature of these features and their small dimensionality, we find that the approach works well even with limited training examples, and that it is possible to transport the word classifier to new domains. Experiments on four datasets (sentiment classification and toxicity detection) suggest that using this approach to inform feature selection also leads to more robust classification, as measured by improved worst-case accuracy on the samples affected by spurious correlations.

107. The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [PDF] 返回目录
  Wen-Chin Huang, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda
Abstract: This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020. We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model, followed by using the transcriptions to generate the voice of the target with a text-to-speech (TTS) model. We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit, and the many well-configured pretrained models provided by the community. Official evaluation results show that our system comes out on top among the participating systems in terms of conversion similarity, demonstrating the promising ability of seq2seq models to convert speaker identity. The implementation is made open-source at: this https URL.

108. ERFit: Entropic Regression Fit Matlab Package, for Data-Driven System Identification of Underlying Dynamic Equations [PDF] 返回目录
  Abd AlRahman AlMomani, Erik Bollt
Abstract: Data-driven sparse system identification becomes the general framework for a wide range of problems in science and engineering. It is a problem of growing importance in applied machine learning and artificial intelligence algorithms. In this work, we developed the Entropic Regression Software Package (ERFit), a MATLAB package for sparse system identification using the entropic regression method. The code requires minimal supervision, with a wide range of options that make it adapt easily to different problems in science and engineering. The ERFit is available at this https URL
