
[arXiv Papers] Computation and Language 2020-10-09

Contents

1. DART: A Lightweight Quality-Suggestive Data-to-Text Annotation Tool [PDF] Abstract
2. Towards Topic-Guided Conversational Recommender System [PDF] Abstract
3. Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [PDF] Abstract
4. BERTering RAMS: What and How Much does BERT Already Know About Event Arguments? -- A Study on the RAMS Dataset [PDF] Abstract
5. Precise Task Formalization Matters in Winograd Schema Evaluations [PDF] Abstract
6. GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems [PDF] Abstract
7. Generating Instructions at Different Levels of Abstraction [PDF] Abstract
8. Predicting Typological Features in WALS using Language Embeddings and Conditional Probabilities: ÚFAL Submission to the SIGTYP 2020 Shared Task [PDF] Abstract
9. Injecting Word Information with Multi-Level Word Adapter for Chinese Spoken Language Understanding [PDF] Abstract
10. Population Based Training for Data Augmentation and Regularization in Speech Recognition [PDF] Abstract
11. Large Product Key Memory for Pretrained Language Models [PDF] Abstract
12. A Co-Interactive Transformer for Joint Slot Filling and Intent Detection [PDF] Abstract
13. What Can We Do to Improve Peer Review in NLP? [PDF] Abstract
14. Two are Better than One: Joint Entity and Relation Extraction with Table-Sequence Encoders [PDF] Abstract
15. Extracting a Knowledge Base of Mechanisms from COVID-19 Papers [PDF] Abstract
16. On the importance of pre-training data volume for compact language models [PDF] Abstract
17. TextSETTR: Label-Free Text Style Extraction and Tunable Targeted Restyling [PDF] Abstract
18. An Empirical Study on Model-agnostic Debiasing Strategies for Robust Natural Language Inference [PDF] Abstract
19. Detect All Abuse! Toward Universal Abusive Language Detection Models [PDF] Abstract
20. Improving Long-Tail Relation Extraction with Collaborating Relation-Augmented Attention [PDF] Abstract
21. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [PDF] Abstract
22. Improving Attention Mechanism with Query-Value Interaction [PDF] Abstract
23. Assessing Phrasal Representation and Composition in Transformers [PDF] Abstract
24. Discriminatively-Tuned Generative Classifiers for Robust Natural Language Inference [PDF] Abstract
25. Generalizable and Explainable Dialogue Generation via Explicit Action Learning [PDF] Abstract
26. Infusing Disease Knowledge into BERT for Health Question Answering, Medical Inference and Disease Name Recognition [PDF] Abstract
27. Multi-hop Inference for Question-driven Summarization [PDF] Abstract
28. Shallow-to-Deep Training for Neural Machine Translation [PDF] Abstract
29. Leveraging Discourse Rewards for Document-Level Neural Machine Translation [PDF] Abstract
30. Learning to Fuse Sentences with Transformers for Summarization [PDF] Abstract
31. PARADE: A New Dataset for Paraphrase Identification Requiring Computer Science Domain Knowledge [PDF] Abstract
32. A Cascade Approach to Neural Abstractive Summarization with Content Selection and Fusion [PDF] Abstract
33. Don't Parse, Insert: Multilingual Semantic Parsing with Insertion Based Decoding [PDF] Abstract
34. Learning to Recombine and Resample Data for Compositional Generalization [PDF] Abstract
35. Optimizing Transformers with Approximate Computing for Faster, Smaller and more Accurate NLP Models [PDF] Abstract
36. Adaptive Self-training for Few-shot Neural Sequence Labeling [PDF] Abstract
37. Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank [PDF] Abstract
38. Exposing Shallow Heuristics of Relation Extraction Models with Challenge Data [PDF] Abstract
39. Cross-Thought for Sentence Encoder Pre-training [PDF] Abstract
40. A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks [PDF] Abstract
41. Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations [PDF] Abstract
42. Zero-Shot Stance Detection: A Dataset and Model using Generalized Topic Representations [PDF] Abstract
43. MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics [PDF] Abstract
44. MuSeM: Detecting Incongruent News Headlines using Mutual Attentive Semantic Matching [PDF] Abstract
45. Combining Deep Learning and String Kernels for the Localization of Swiss German Tweets [PDF] Abstract
46. SRLGRN: Semantic Role Labeling Graph Reasoning Network [PDF] Abstract
47. Characterizing the Value of Information in Medical Notes [PDF] Abstract
48. Dense Relational Image Captioning via Multi-task Triple-Stream Networks [PDF] Abstract
49. Text-based RL Agents with Commonsense Knowledge: New Challenges, Environments and Baselines [PDF] Abstract
50. Latent linguistic embedding for cross-lingual text-to-speech and voice conversion [PDF] Abstract
51. Domain Adversarial Neural Networks for Dysarthric Speech Recognition [PDF] Abstract

Abstracts

1. DART: A Lightweight Quality-Suggestive Data-to-Text Annotation Tool [PDF] Back to Contents
  Ernie Chang, Jeriah Caplinger, Alex Marin, Xiaoyu Shen, Vera Demberg
Abstract: We present a lightweight annotation tool, the Data AnnotatoR Tool (DART), for the general task of labeling structured data with textual descriptions. The tool is implemented as an interactive application that reduces human efforts in annotating large quantities of structured data, e.g. in the format of a table or tree structure. By using a backend sequence-to-sequence model, our system iteratively analyzes the annotated labels in order to better sample unlabeled data. In a simulation experiment performed on annotating large quantities of structured data, DART has been shown to reduce the total number of annotations needed with active learning and automatically suggesting relevant labels.

2. Towards Topic-Guided Conversational Recommender System [PDF] Back to Contents
  Kun Zhou, Yuanhang Zhou, Wayne Xin Zhao, Xiaoke Wang, Ji-Rong Wen
Abstract: Conversational recommender systems (CRS) aim to recommend high-quality items to users through interactive conversations. To develop an effective CRS, the support of high-quality datasets is essential. Existing CRS datasets mainly focus on immediate requests from users, while lack proactive guidance to the recommendation scenario. In this paper, we contribute a new CRS dataset named TG-ReDial (Recommendation through Topic-Guided Dialog). Our dataset has two major features. First, it incorporates topic threads to enforce natural semantic transitions towards the recommendation scenario. Second, it is created in a semi-automatic way, hence human annotation is more reasonable and controllable. Based on TG-ReDial, we present the task of topic-guided conversational recommendation, and propose an effective approach to this task. Extensive experiments have demonstrated the effectiveness of our approach on three sub-tasks, namely topic prediction, item recommendation and response generation. TG-ReDial is available at this https URL.

3. Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [PDF] Back to Contents
  Peter Hase, Shiyue Zhang, Harry Xie, Mohit Bansal
Abstract: Data collection for natural language (NL) understanding tasks has increasingly included human explanations alongside data points, allowing past works to introduce models that both perform a task and generate NL explanations for their outputs. Yet to date, model-generated explanations have been evaluated on the basis of surface-level similarities to human explanations, both through automatic metrics like BLEU and human evaluations. We argue that these evaluations are insufficient, since they fail to indicate whether explanations support actual model behavior (faithfulness), rather than simply match what a human would say (plausibility). In this work, we address the problem of evaluating explanations from the model simulatability perspective. Our contributions are as follows: (1) We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations, which measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output. We use a model as a proxy for a human observer, and validate this choice with two human subject experiments. (2) Using the CoS-E and e-SNLI datasets, we evaluate two existing generative graphical models and two new approaches; one rationalizing method we introduce achieves roughly human-level LAS scores. (3) Lastly, we frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage, which can improve LAS scores. We provide code for the experiments in this paper at this https URL
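
The leakage-adjusted simulatability idea can be pictured with a small sketch: per-example simulatability gains are averaged separately over leaking and non-leaking explanations and then macro-averaged. This is only an assumed reading of the metric with hypothetical inputs; the exact definition is given in the paper.

    import numpy as np

    def las_score(correct_with_expl, correct_without_expl, explanation_leaks):
        # correct_with_expl[i]: 1 if a simulator predicts the model's output for
        #                       example i when given the input plus the explanation
        # correct_without_expl[i]: 1 if it predicts the output from the input alone
        # explanation_leaks[i]: 1 if the explanation alone already reveals the output
        with_e = np.asarray(correct_with_expl, dtype=float)
        without_e = np.asarray(correct_without_expl, dtype=float)
        leaks = np.asarray(explanation_leaks, dtype=bool)

        gain = with_e - without_e                    # per-example simulatability gain
        groups = [gain[leaks], gain[~leaks]]         # control for label leakage
        return float(np.mean([g.mean() for g in groups if g.size > 0]))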

4. BERTering RAMS: What and How Much does BERT Already Know About Event Arguments? -- A Study on the RAMS Dataset [PDF] Back to Contents
  Varun Gangal, Eduard Hovy
Abstract: Using the attention map based probing framework from (Clark et al., 2019), we observe that BERT's attention heads have modest but well above-chance ability to spot event arguments sans any training or domain finetuning, varying from a low of 17.77% for Place to a high of 51.61% for Artifact. Next, we find that linear combinations of these heads, estimated with approx 11% of available total supervision, can push performance well-higher for some roles - highest two being Victim (68.29% Accuracy) and Artifact (58.82% Accuracy). Furthermore, we investigate how well our methods do for cross-sentence event arguments. We propose a procedure to isolate "best heads" for cross-sentence argument detection separately of those for intra-sentence arguments. The heads thus estimated have superior cross-sentence performance compared to their jointly estimated equivalents. Lastly, we seek to isolate to what extent our numbers stem from lexical frequency based associations between gold arguments and roles. We propose NONCE, a scheme to create adversarial test examples by replacing gold arguments with randomly generated "nonce" words. We find that learnt linear combinations are robust to NONCE, though individual best heads can be much more sensitive.

5. Precise Task Formalization Matters in Winograd Schema Evaluations [PDF] Back to Contents
  Haokun Liu, William Huang, Dhara A. Mungra, Samuel R. Bowman
Abstract: Performance on the Winograd Schema Challenge (WSC), a respected English commonsense reasoning benchmark, recently rocketed from chance accuracy to 89% on the SuperGLUE leaderboard, with relatively little corroborating evidence of a correspondingly large improvement in reasoning ability. We hypothesize that much of this improvement comes from recent changes in task formalization (the combination of input specification, loss function, and reuse of pretrained parameters) by users of the dataset, rather than improvements in the pretrained model's reasoning ability. We perform an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after this surge, and find (i) framing the task as multiple choice improves performance by 2-6 points and (ii) several additional techniques, including the reuse of a pretrained language modeling head, can mitigate the model's extreme sensitivity to hyperparameters. We urge future benchmark creators to impose additional structure to minimize the impact of formalization decisions on reported results.

6. GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems [PDF] Back to Contents
  Lishan Huang, Zheng Ye, Jinghui Qin, Liang Lin, Xiaodan Liang
Abstract: Automatically evaluating dialogue coherence is a challenging but high-demand ability for developing high-quality open-domain dialogue systems. However, current evaluation metrics consider only surface features or utterance-level semantics, without explicitly considering the fine-grained topic transition dynamics of dialogue flows. Here, we first consider that the graph structure constituted with topics in a dialogue can accurately depict the underlying communication logic, which is a more natural way to produce persuasive metrics. Capitalized on the topic-level dialogue graph, we propose a new evaluation metric GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation. Specifically, GRADE incorporates both coarse-grained utterance-level contextualized representations and fine-grained topic-level graph representations to evaluate dialogue coherence. The graph representations are obtained by reasoning over topic-level dialogue graphs enhanced with the evidence from a commonsense graph, including k-hop neighboring representations and hop-attention weights. Experimental results show that our GRADE significantly outperforms other state-of-the-art metrics on measuring diverse dialogue models in terms of the Pearson and Spearman correlations with human judgements. Besides, we release a new large-scale human evaluation benchmark to facilitate future research on automatic metrics.

7. Generating Instructions at Different Levels of Abstraction [PDF] Back to Contents
  Arne Köhn, Julia Wichlacz, Álvaro Torralba, Daniel Höller, Jörg Hoffmann, Alexander Koller
Abstract: When generating technical instructions, it is often convenient to describe complex objects in the world at different levels of abstraction. A novice user might need an object explained piece by piece, while for an expert, talking about the complex object (e.g. a wall or railing) directly may be more succinct and efficient. We show how to generate building instructions at different levels of abstraction in Minecraft. We introduce the use of hierarchical planning to this end, a method from AI planning which can capture the structure of complex objects neatly. A crowdsourcing evaluation shows that the choice of abstraction level matters to users, and that an abstraction strategy which balances low-level and high-level object descriptions compares favorably to ones which don't.

8. Predicting Typological Features in WALS using Language Embeddings and Conditional Probabilities: ÚFAL Submission to the SIGTYP 2020 Shared Task [PDF] Back to Contents
  Martin Vastl, Daniel Zeman, Rudolf Rosa
Abstract: We present our submission to the SIGTYP 2020 Shared Task on the prediction of typological features. We submit a constrained system, predicting typological features only based on the WALS database. We investigate two approaches. The simpler of the two is a system based on estimating correlation of feature values within languages by computing conditional probabilities and mutual information. The second approach is to train a neural predictor operating on precomputed language embeddings based on WALS features. Our submitted system combines the two approaches based on their self-estimated confidence scores. We reach the accuracy of 70.7% on the test data and rank first in the shared task.
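
As a toy illustration of the simpler, conditional-probability approach, a candidate value for the target feature can be scored by how often it co-occurs with the language's known feature values in the WALS training data. The sketch below is an assumed baseline with a hypothetical data layout, not the submitted system (which also uses mutual information and language embeddings).

    from collections import Counter

    def predict_feature(train_langs, known_values, target_feature):
        # train_langs: list of dicts mapping WALS feature name -> value
        # known_values: dict of observed feature values for the language to complete
        # target_feature: name of the feature whose value we want to predict
        scores = Counter()
        for feat, val in known_values.items():
            cooc = Counter(lang[target_feature] for lang in train_langs
                           if lang.get(feat) == val and target_feature in lang)
            total = sum(cooc.values())
            for target_val, count in cooc.items():
                # accumulate estimated P(target_feature = target_val | feat = val)
                scores[target_val] += count / total
        return scores.most_common(1)[0][0] if scores else None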

9. Injecting Word Information with Multi-Level Word Adapter for Chinese Spoken Language Understanding [PDF] Back to Contents
  Dechuang Teng, Libo Qin, Wanxiang Che, Sendong Zhao, Ting Liu
Abstract: Intent detection and slot filling are two closely related tasks for building a spoken language understanding (SLU) system. In this paper, we focus on improving Chinese SLU by flexibly injecting word information, which has shown effectiveness for various Chinese NLP tasks. Previous studies on Chinese SLU do not consider the word information, failing to detect the word boundaries that are beneficial for intent detection and slot filling. To address this issue, we propose a multi-level word adapter to inject word information for Chinese SLU: (1) sentence-level word adapter, which directly fuses the sentence representations of the word information and character information to perform intent detection; (2) character-level word adapter, which is applied at each character for selectively controlling weights on word information as well as character information. In addition, the proposed word adapter is applied at the output layer, which can be utilized as a plugin and easily combined with other pre-trained models (e.g., BERT). Experimental results on two Chinese SLU datasets show that our model can capture useful word information and achieve state-of-the-art performance. More importantly, our framework substantially gains further improvements when we plug the word adapters into a BERT, which again demonstrates the effectiveness and flexibility of our word adapter.

10. Population Based Training for Data Augmentation and Regularization in Speech Recognition [PDF] Back to Contents
  Daniel Haziza, Jérémy Rapin, Gabriel Synnaeve
Abstract: Varying data augmentation policies and regularization over the course of optimization has led to performance improvements over using fixed values. We show that population based training is a useful tool to continuously search those hyperparameters, within a fixed budget. This greatly simplifies the experimental burden and computational cost of finding such optimal schedules. We experiment in speech recognition by optimizing SpecAugment this way, as well as dropout. It compares favorably to a baseline that does not change those hyperparameters over the course of training, with an 8% relative WER improvement. We obtain 5.18% word error rate on LibriSpeech's test-other.
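
Population based training alternates ordinary training with an exploit/explore step over hyperparameters such as the SpecAugment policy and dropout. The sketch below shows only that bookkeeping step, with an assumed worker representation; the schedules and selection thresholds used in the paper may differ.

    import copy
    import random

    def pbt_exploit_explore(population, perturb=0.2):
        # population: list of workers, each {"score": float, "weights": ..., "hparams": {...}}
        ranked = sorted(population, key=lambda w: w["score"], reverse=True)
        quarter = max(1, len(ranked) // 4)
        top, bottom = ranked[:quarter], ranked[-quarter:]
        for worker in bottom:
            source = random.choice(top)
            worker["weights"] = copy.deepcopy(source["weights"])   # exploit: copy a better worker
            worker["hparams"] = {name: value * random.choice([1 - perturb, 1 + perturb])
                                 for name, value in source["hparams"].items()}  # explore: perturb
        return population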

11. Large Product Key Memory for Pretrained Language Models [PDF] Back to Contents
  Gyuwan Kim, Tae-Hwan Jung
Abstract: Product key memory (PKM) proposed by Lample et al. (2019) enables to improve prediction accuracy by increasing model capacity efficiently with insignificant computational overhead. However, their empirical application is only limited to causal language modeling. Motivated by the recent success of pretrained language models (PLMs), we investigate how to incorporate large PKM into PLMs that can be finetuned for a wide variety of downstream NLP tasks. We define a new memory usage metric, and careful observation using this metric reveals that most memory slots remain outdated during the training of PKM-augmented models. To train better PLMs by tackling this issue, we propose simple but effective solutions: (1) initialization from the model weights pretrained without memory and (2) augmenting PKM by addition rather than replacing a feed-forward network. We verify that both of them are crucial for the pretraining of PKM-augmented PLMs, enhancing memory utilization and downstream performance. Code and pretrained weights are available at this https URL.
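
A product key memory (Lample et al., 2019) scores each half of the query against a small set of sub-keys, so that only k x k candidate slots are examined instead of the full memory. The numpy sketch below is a simplified single-query, single-head version with assumed shapes; batching, multi-head queries and the exact normalisation are omitted.

    import numpy as np

    def product_key_lookup(query, subkeys1, subkeys2, values, k=4):
        # query: (d,), subkeys1/subkeys2: (n, d/2), values: (n * n, d_v)
        half = query.shape[-1] // 2
        q1, q2 = query[:half], query[half:]
        s1, s2 = subkeys1 @ q1, subkeys2 @ q2                 # scores for each half
        i1, i2 = np.argsort(-s1)[:k], np.argsort(-s2)[:k]     # top-k indices per half
        # a candidate slot is a pair (i, j); its score is the sum of the half scores
        candidates = sorted(((s1[i] + s2[j], i * len(subkeys2) + j)
                             for i in i1 for j in i2), reverse=True)[:k]
        scores = np.array([s for s, _ in candidates])
        slots = [idx for _, idx in candidates]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                              # softmax over the selected slots
        return weights @ values[slots]                        # weighted sum of memory values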

12. A Co-Interactive Transformer for Joint Slot Filling and Intent Detection [PDF] Back to Contents
  Libo Qin, Tailu Liu, Wanxiang Che, Bingbing Kang, Sendong Zhao, Ting Liu
Abstract: Intent detection and slot filling are two main tasks for building a spoken language understanding (SLU) system. The two tasks are closely related and the information of one task can be utilized in the other task. Previous studies either model the two tasks separately or only consider the single information flow from intent to slot. None of the prior approaches model the bidirectional connection between the two tasks simultaneously. In this paper, we propose a Co-Interactive Transformer to consider the cross-impact between the two tasks. Instead of adopting the self-attention mechanism in vanilla Transformer, we propose a co-interactive module to consider the cross-impact by building a bidirectional connection between the two related tasks. In addition, the proposed co-interactive module can be stacked to incrementally enhance each other with mutual features. The experimental results on two public datasets (SNIPS and ATIS) show that our model achieves the state-of-the-art performance with considerable improvements (+3.4% and +0.9% on overall acc). Extensive experiments empirically verify that our model successfully captures the mutual interaction knowledge.

13. What Can We Do to Improve Peer Review in NLP? [PDF] Back to Contents
  Anna Rogers, Isabelle Augenstein
Abstract: Peer review is our best tool for judging the quality of conference submissions, but it is becoming increasingly spurious. We argue that a part of the problem is that the reviewers and area chairs face a poorly defined task forcing apples-to-oranges comparisons. There are several potential ways forward, but the key difficulty is creating the incentives and mechanisms for their consistent implementation in the NLP community.

14. Two are Better than One: Joint Entity and Relation Extraction with Table-Sequence Encoders [PDF] Back to Contents
  Jue Wang, Wei Lu
Abstract: Named entity recognition and relation extraction are two important fundamental problems. Joint learning algorithms have been proposed to solve both tasks simultaneously, and many of them cast the joint task as a table-filling problem. However, they typically focused on learning a single encoder (usually learning representation in the form of a table) to capture information required for both tasks within the same space. We argue that it can be beneficial to design two distinct encoders to capture such two different types of information in the learning process. In this work, we propose the novel table-sequence encoders where two different encoders -- a table encoder and a sequence encoder -- are designed to help each other in the representation learning process. Our experiments confirm the advantages of having two encoders over one encoder. On several standard datasets, our model shows significant improvements over existing approaches.

15. Extracting a Knowledge Base of Mechanisms from COVID-19 Papers [PDF] Back to Contents
  Aida Amini, Tom Hope, David Wadden, Madeleine van Zuylen, Eric Horvitz, Roy Schwartz, Hannaneh Hajishirzi
Abstract: The urgency of mitigating COVID-19 has spawned a large and diverse body of scientific literature that is challenging for researchers to navigate. This explosion of information has stimulated interest in automated tools to help identify useful knowledge. We have pursued the use of methods for extracting diverse forms of mechanism relations from the natural language of scientific papers. We seek to identify concepts in COVID-19 and related literature which represent activities, functions, associations and causal relations, ranging from cellular processes to economic impacts. We formulate a broad, coarse-grained schema targeting mechanism relations between open, free-form entities. Our approach strikes a balance between expressivity and breadth that supports generalization across diverse concepts. We curate a dataset of scientific papers annotated according to our novel schema. Using an information extraction model trained on this new corpus, we construct a knowledge base (KB) of 2M mechanism relations, which we make publicly available. Our model is able to extract relations at an F1 at least twice that of baselines such as open IE or related scientific IE systems. We conduct experiments examining the ability of our system to retrieve relevant information on viral mechanisms of action, and on applications of AI to COVID-19 research. In both cases, our system identifies relevant information from our automatically-constructed knowledge base with high precision.

16. On the importance of pre-training data volume for compact language models [PDF] Back to Contents
  Vincent Micheli, Martin D'Hoffschmidt, François Fleuret
Abstract: Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models are obtained with as little as 100 MB of text. In addition, we show that past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.

17. TextSETTR: Label-Free Text Style Extraction and Tunable Targeted Restyling [PDF] Back to Contents
  Parker Riley, Noah Constant, Mandy Guo, Girish Kumar, David Uthus, Zarana Parekh
Abstract: We present a novel approach to the problem of text style transfer. Unlike previous approaches that use parallel or non-parallel labeled data, our technique removes the need for labels entirely, relying instead on the implicit connection in style between adjacent sentences in unlabeled text. We show that T5 (Raffel et al., 2019), a strong pretrained text-to-text model, can be adapted to extract a style vector from arbitrary text and use this vector to condition the decoder to perform style transfer. As the resulting learned style vector space encodes many facets of textual style, we recast transfers as "targeted restyling" vector operations that adjust specific attributes of the input text while preserving others. When trained over unlabeled Amazon reviews data, our resulting TextSETTR model is competitive on sentiment transfer, even when given only four exemplars of each class. Furthermore, we demonstrate that a single model trained on unlabeled Common Crawl data is capable of transferring along multiple dimensions including dialect, emotiveness, formality, politeness, and sentiment.

18. An Empirical Study on Model-agnostic Debiasing Strategies for Robust Natural Language Inference [PDF] Back to Contents
  Tianyu Liu, Xin Zheng, Xiaoan Ding, Baobao Chang, Zhifang Sui
Abstract: The prior work on natural language inference (NLI) debiasing mainly targets at one or few known biases while not necessarily making the models more robust. In this paper, we focus on the model-agnostic debiasing strategies and explore how to (or is it possible to) make the NLI models robust to multiple distinct adversarial attacks while keeping or even strengthening the models' generalization power. We firstly benchmark prevailing neural NLI models including pretrained ones on various adversarial datasets. We then try to combat distinct known biases by modifying a mixture of experts (MoE) ensemble method and show that it's nontrivial to mitigate multiple NLI biases at the same time, and that model-level ensemble method outperforms MoE ensemble method. We also perform data augmentation including text swap, word substitution and paraphrase and prove its efficiency in combating various (though not all) adversarial attacks at the same time. Finally, we investigate several methods to merge heterogeneous training data (1.35M) and perform model ensembling, which are straightforward but effective to strengthen NLI models.

19. Detect All Abuse! Toward Universal Abusive Language Detection Models [PDF] Back to Contents
  Kunze Wang, Dong Lu, Soyeon Caren Han, Siqu Long, Josiah Poon
Abstract: Online abusive language detection (ALD) has become a societal issue of increasing importance in recent years. Several previous works in online ALD focused on solving a single abusive language problem in a single domain, like Twitter, and have not been successfully transferable to the general ALD task or domain. In this paper, we introduce a new generic ALD framework, MACAS, which is capable of addressing several types of ALD tasks across different domains. Our generic framework covers multi-aspect abusive language embeddings that represent the target and content aspects of abusive language and applies a textual graph embedding that analyses the user's linguistic behaviour. Then, we propose and use the cross-attention gate flow mechanism to embrace multiple aspects of abusive language. Quantitative and qualitative evaluation results show that our ALD algorithm rivals or exceeds the six state-of-the-art ALD algorithms across seven ALD datasets covering multiple aspects of abusive language and different online community domains.

20. Improving Long-Tail Relation Extraction with Collaborating Relation-Augmented Attention [PDF] Back to Contents
  Yang Li, Tao Shen, Guodong Long, Jing Jiang, Tianyi Zhou, Chengqi Zhang
Abstract: Wrong labeling problem and long-tail relations are two main challenges caused by distant supervision in relation extraction. Recent works alleviate the wrong labeling by selective attention via multi-instance learning, but cannot well handle long-tail relations even if hierarchies of the relations are introduced to share knowledge. In this work, we propose a novel neural network, Collaborating Relation-augmented Attention (CoRA), to handle both the wrong labeling and long-tail relations. Particularly, we first propose relation-augmented attention network as base model. It operates on sentence bag with a sentence-to-relation attention to minimize the effect of wrong labeling. Then, facilitated by the proposed base model, we introduce collaborating relation features shared among relations in the hierarchies to promote the relation-augmenting process and balance the training data for long-tail relations. Besides the main training objective to predict the relation of a sentence bag, an auxiliary objective is utilized to guide the relation-augmenting process for a more accurate bag-level representation. In the experiments on the popular benchmark dataset NYT, the proposed CoRA improves the prior state-of-the-art performance by a large margin in terms of Precision@N, AUC and Hits@K. Further analyses verify its superior capability in handling long-tail relations in contrast to the competitors.

21. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [PDF] Back to Contents
  Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht
Abstract: Given a simple request (e.g., Put a washed apple in the kitchen fridge), humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld (Côté et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, visual scene understanding, and so forth).

22. Improving Attention Mechanism with Query-Value Interaction [PDF] Back to Contents
  Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang
Abstract: Attention mechanism has played critical roles in various state-of-the-art NLP models such as Transformer and BERT. It can be formulated as a ternary function that maps the input queries, keys and values into an output by using a summation of values weighted by the attention weights derived from the interactions between queries and keys. Similar with query-key interactions, there is also inherent relatedness between queries and values, and incorporating query-value interactions has the potential to enhance the output by learning customized values according to the characteristics of queries. However, the query-value interactions are ignored by existing attention methods, which may be not optimal. In this paper, we propose to improve the existing attention mechanism by incorporating query-value interactions. We propose a query-value interaction function which can learn query-aware attention values, and combine them with the original values and attention weights to form the final output. Extensive experiments on four datasets for different tasks show that our approach can consistently improve the performance of many attention-based models by incorporating query-value interactions.
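
The proposal can be pictured against the usual ternary attention function: the output is still an attention-weighted sum, but each value is first mixed with a query-conditioned transform so that query-value interactions also shape the output. The numpy sketch below uses an assumed, simplified parameterisation (Wq, Wv and the mixing weight beta stand in for learned parameters) and is not the paper's exact formulation.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention_with_query_value_interaction(Q, K, V, Wq, Wv, beta=0.5):
        # Q: (n_q, d), K: (n_k, d), V: (n_k, d_v), Wq: (d, d_v), Wv: (d_v, d_v)
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))         # standard query-key attention
        query_aware_values = np.tanh(Q @ Wq)[:, None, :] * (V @ Wv)[None, :, :]
        mixed_values = beta * V[None, :, :] + (1 - beta) * query_aware_values
        return np.einsum("qk,qkd->qd", weights, mixed_values)     # weighted sum per query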

23. Assessing Phrasal Representation and Composition in Transformers [PDF] Back to Contents
  Lang Yu, Allyson Ettinger
Abstract: Deep transformer models have pushed performance on NLP tasks to new limits, suggesting sophisticated treatment of complex linguistic inputs, such as phrases. However, we have limited understanding of how these models handle representation of phrases, and whether this reflects sophisticated composition of phrase meaning like that done by humans. In this paper, we present systematic analysis of phrasal representations in state-of-the-art pre-trained transformers. We use tests leveraging human judgments of phrase similarity and meaning shift, and compare results before and after control of word overlap, to tease apart lexical effects versus composition effects. We find that phrase representation in these models relies heavily on word content, with little evidence of nuanced composition. We also identify variations in phrase representation quality across models, layers, and representation types, and make corresponding recommendations for usage of representations from these models.

24. Discriminatively-Tuned Generative Classifiers for Robust Natural Language Inference [PDF] Back to Contents
  Xiaoan Ding, Tianyu Liu, Baobao Chang, Zhifang Sui, Kevin Gimpel
Abstract: While discriminative neural network classifiers are generally preferred, recent work has shown advantages of generative classifiers in term of data efficiency and robustness. In this paper, we focus on natural language inference (NLI). We propose GenNLI, a generative classifier for NLI tasks, and empirically characterize its performance by comparing it to five baselines, including discriminative models and large-scale pretrained language representation models like BERT. We explore training objectives for discriminative fine-tuning of our generative classifiers, showing improvements over log loss fine-tuning from prior work . In particular, we find strong results with a simple unbounded modification to log loss, which we call the "infinilog loss". Our experiments show that GenNLI outperforms both discriminative and pretrained baselines across several challenging NLI experimental settings, including small training sets, imbalanced label distributions, and label noise.

25. Generalizable and Explainable Dialogue Generation via Explicit Action Learning [PDF] Back to Contents
  Xinting Huang, Jianzhong Qi, Yu Sun, Rui Zhang
Abstract: Response generation for task-oriented dialogues implicitly optimizes two objectives at the same time: task completion and language quality. Conditioned response generation serves as an effective approach to separately and better optimize these two objectives. Such an approach relies on system action annotations which are expensive to obtain. To alleviate the need of action annotations, latent action learning is introduced to map each utterance to a latent representation. However, this approach is prone to over-dependence on the training data, and the generalization capability is thus restricted. To address this issue, we propose to learn natural language actions that represent utterances as a span of words. This explicit action representation promotes generalization via the compositional structure of language. It also enables an explainable generation process. Our proposed unsupervised approach learns a memory component to summarize system utterances into a short span of words. To further promote a compact action representation, we propose an auxiliary task that restores state annotations as the summarized dialogue context using the memory component. Our proposed approach outperforms latent action baselines on MultiWOZ, a benchmark multi-domain dataset.

26. Infusing Disease Knowledge into BERT for Health Question Answering, Medical Inference and Disease Name Recognition [PDF] Back to Contents
  Yun He, Ziwei Zhu, Yin Zhang, Qin Chen, James Caverlee
Abstract: Knowledge of a disease includes information of various aspects of the disease, such as signs and symptoms, diagnosis and treatment. This disease knowledge is critical for many health-related and biomedical tasks, including consumer health question answering, medical language inference and disease name recognition. While pre-trained language models like BERT have shown success in capturing syntactic, semantic, and world knowledge from text, we find they can be further complemented by specific information like knowledge of symptoms, diagnoses, treatments, and other disease aspects. Hence, we integrate BERT with disease knowledge for improving these important tasks. Specifically, we propose a new disease knowledge infusion training procedure and evaluate it on a suite of BERT models including BERT, BioBERT, SciBERT, ClinicalBERT, BlueBERT, and ALBERT. Experiments over the three tasks show that these models can be enhanced in nearly all cases, demonstrating the viability of disease knowledge infusion. For example, accuracy of BioBERT on consumer health question answering is improved from 68.29% to 72.09%, while new SOTA results are observed in two datasets. We make our data and code freely available.

27. Multi-hop Inference for Question-driven Summarization [PDF] Back to Contents
  Yang Deng, Wenxuan Zhang, Wai Lam
Abstract: Question-driven summarization has been recently studied as an effective approach to summarizing the source document to produce concise but informative answers for non-factoid questions. In this work, we propose a novel question-driven abstractive summarization method, Multi-hop Selective Generator (MSG), to incorporate multi-hop reasoning into question-driven summarization and, meanwhile, provide justifications for the generated summaries. Specifically, we jointly model the relevance to the question and the interrelation among different sentences via a human-like multi-hop inference module, which captures important sentences for justifying the summarized answer. A gated selective pointer generator network with a multi-view coverage mechanism is designed to integrate diverse information from different perspectives. Experimental results show that the proposed method consistently outperforms state-of-the-art methods on two non-factoid QA datasets, namely WikiHow and PubMedQA.

28. Shallow-to-Deep Training for Neural Machine Translation [PDF] Back to Contents
  Bei Li, Ziyang Wang, Hui Liu, Yufan Jiang, Quan Du, Tong Xiao, Huizhen Wang, Jingbo Zhu
Abstract: Deep encoders have been proven to be effective in improving neural machine translation (NMT) systems, but training an extremely deep encoder is time consuming. Moreover, why deep models help NMT is an open question. In this paper, we investigate the behavior of a well-tuned deep Transformer system. We find that stacking layers is helpful in improving the representation ability of NMT models and adjacent layers perform similarly. This inspires us to develop a shallow-to-deep training method that learns deep models by stacking shallow models. In this way, we successfully train a Transformer system with a 54-layer encoder. Experimental results on WMT'16 English-German and WMT'14 English-French translation tasks show that it is 1.4x faster than training from scratch, and achieves BLEU scores of 30.33 and 43.29 on the two tasks. The code is publicly available at this https URL.
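
The finding that adjacent layers behave similarly suggests growing the encoder by re-using already-trained layers to initialise new ones. The sketch below shows one assumed way to perform that copy-and-stack step; the actual growth schedule and copying strategy in the paper may differ.

    import copy

    def deepen_encoder(trained_layers, target_depth):
        # trained_layers: layer stack of an already-trained shallow encoder
        new_stack = list(trained_layers)
        while len(new_stack) < target_depth:
            for layer in trained_layers:
                if len(new_stack) >= target_depth:
                    break
                new_stack.append(copy.deepcopy(layer))   # warm-start each added layer
        return new_stack                                  # training then continues on the deeper stack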

29. Leveraging Discourse Rewards for Document-Level Neural Machine Translation [PDF] Back to Contents
  Inigo Jauregi Unanue, Nazanin Esmaili, Gholamreza Haffari, Massimo Piccardi
Abstract: Document-level machine translation focuses on the translation of entire documents from a source to a target language. It is widely regarded as a challenging task since the translation of the individual sentences in the document needs to retain aspects of the discourse at document level. However, document-level translation models are usually not trained to explicitly ensure discourse quality. Therefore, in this paper we propose a training approach that explicitly optimizes two established discourse metrics, lexical cohesion (LC) and coherence (COH), by using a reinforcement learning objective. Experiments over four different language pairs and three translation domains have shown that our training approach has been able to achieve more cohesive and coherent document translations than other competitive approaches, yet without compromising the faithfulness to the reference translation. In the case of the Zh-En language pair, our method has achieved an improvement of 2.46 percentage points (pp) in LC and 1.17 pp in COH over the runner-up, while at the same time improving 0.63 pp in BLEU score and 0.47 pp in F_BERT.

30. Learning to Fuse Sentences with Transformers for Summarization [PDF] Back to Contents
  Logan Lebanoff, Franck Dernoncourt, Doo Soon Kim, Lidan Wang, Walter Chang, Fei Liu
Abstract: The ability to fuse sentences is highly attractive for summarization systems because it is an essential step to produce succinct abstracts. However, to date, summarizers can fail on fusing sentences. They tend to produce few summary sentences by fusion or generate incorrect fusions that lead the summary to fail to retain the original meaning. In this paper, we explore the ability of Transformers to fuse sentences and propose novel algorithms to enhance their ability to perform sentence fusion by leveraging the knowledge of points of correspondence between sentences. Through extensive experiments, we investigate the effects of different design choices on Transformer's performance. Our findings highlight the importance of modeling points of correspondence between sentences for effective sentence fusion.

31. PARADE: A New Dataset for Paraphrase Identification Requiring Computer Science Domain Knowledge [PDF] Back to Contents
  Yun He, Zhuoer Wang, Yin Zhang, Ruihong Huang, James Caverlee
Abstract: We present a new benchmark dataset called PARADE for paraphrase identification that requires specialized domain knowledge. PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge, as well as non-paraphrases that overlap greatly at the lexical and syntactic level but are not semantically equivalent based on this domain knowledge. Experiments show that both state-of-the-art neural models and non-expert human annotators have poor performance on PARADE. For example, BERT after fine-tuning achieves an F1 score of 0.709, which is much lower than its performance on other paraphrase identification datasets. PARADE can serve as a resource for researchers interested in testing models that incorporate domain knowledge. We make our data and code freely available.

32. A Cascade Approach to Neural Abstractive Summarization with Content Selection and Fusion [PDF] Back to Contents
  Logan Lebanoff, Franck Dernoncourt, Doo Soon Kim, Walter Chang, Fei Liu
Abstract: We present an empirical study in favor of a cascade architecture to neural text summarization. Summarization practices vary widely but few other than news summarization can provide a sufficient amount of training data enough to meet the requirement of end-to-end neural abstractive systems which perform content selection and surface realization jointly to generate abstracts. Such systems also pose a challenge to summarization evaluation, as they force content selection to be evaluated along with text generation, yet evaluation of the latter remains an unsolved problem. In this paper, we present empirical results showing that the performance of a cascaded pipeline that separately identifies important content pieces and stitches them together into a coherent text is comparable to or outranks that of end-to-end systems, whereas a pipeline architecture allows for flexible content selection. We finally discuss how we can take advantage of a cascaded pipeline in neural text summarization and shed light on important directions for future research.

33. Don't Parse, Insert: Multilingual Semantic Parsing with Insertion Based Decoding [PDF] Back to Contents
  Qile Zhu, Haidar Khan, Saleh Soltan, Stephen Rawls, Wael Hamza
Abstract: Semantic parsing is one of the key components of natural language understanding systems. A successful parse transforms an input utterance to an action that is easily understood by the system. Many algorithms have been proposed to solve this problem, from conventional rule-based or statistical slot-filling systems to shift-reduce based neural parsers. For complex parsing tasks, the state-of-the-art method is based on autoregressive sequence to sequence models to generate the parse directly. This model is slow at inference time, generating parses in O(n) decoding steps (n is the length of the target sequence). In addition, we demonstrate that this method performs poorly in zero-shot cross-lingual transfer learning settings. In this paper, we propose a non-autoregressive parser which is based on the insertion transformer to overcome these two issues. Our approach 1) speeds up decoding by 3x while outperforming the autoregressive model and 2) significantly improves cross-lingual transfer in the low-resource setting by 37% compared to the autoregressive baseline. We test our approach on three well-known monolingual datasets: ATIS, SNIPS and TOP. For cross-lingual semantic parsing, we use the MultiATIS++ and the multilingual TOP datasets.

34. Learning to Recombine and Resample Data for Compositional Generalization [PDF] Back to Contents
  Ekin Akyürek, Afra Feyza Akyürek, Jacob Andreas
Abstract: Flexible neural models outperform grammar- and automaton-based counterparts on a variety of sequence modeling tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data - particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (e.g. grammars or automata) essential in these settings. Here we present a family of learned data augmentation schemes that support a large category of compositional generalizations without appeal to latent symbolic structure. Our approach to data augmentation has two components: recombination of original training examples via a prototype-based generative model and resampling of generated examples to encourage extrapolation. Training an ordinary neural sequence model on a dataset augmented with recombined and resampled examples significantly improves generalization in two language processing problems, instruction following (SCAN) and morphological analysis (Sigmorphon 2018), where our approach enables learning of new constructions and tenses from as few as eight initial examples.
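A rough sketch of the recombination idea on SCAN-like data follows (illustrative only; the paper induces the lexical structure with a prototype-based generative model rather than the hand-written lexicon assumed below, and it additionally resamples the generated examples to encourage extrapolation).

```python
# Recombination-style augmentation: treat one example as a prototype and swap
# in a primitive taken from another example to synthesize new pairs.
import itertools

# toy lexicon linking primitive commands to primitive actions (assumed known here)
LEX = {"jump": "JUMP", "walk": "WALK", "run": "RUN"}

train = [
    ("jump twice", "JUMP JUMP"),
    ("walk twice", "WALK WALK"),
    ("run and jump", "RUN JUMP"),
]


def recombine(proto, donor):
    """Substitute the donor's primitives into the prototype's template."""
    cmd_p, act_p = proto
    cmd_d, _ = donor
    prims_p = [w for w in cmd_p.split() if w in LEX]
    prims_d = [w for w in cmd_d.split() if w in LEX]
    if len(prims_p) != 1:                       # keep the toy case simple
        return []
    return [(cmd_p.replace(prims_p[0], p), act_p.replace(LEX[prims_p[0]], LEX[p]))
            for p in prims_d]


augmented = set()
for proto, donor in itertools.permutations(train, 2):
    augmented.update(recombine(proto, donor))
print(sorted(augmented - set(train)))   # yields e.g. ('run twice', 'RUN RUN')
```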

35. Optimizing Transformers with Approximate Computing for Faster, Smaller and more Accurate NLP Models [PDF] 返回目录
  Amrit Nagarajan, Sanchari Sen, Jacob R. Stevens, Anand Raghunathan
Abstract: Transformer models have garnered a lot of interest in recent years by delivering state-of-the-art performance in a range of Natural Language Processing (NLP) tasks. However, these models can have over a hundred billion parameters, presenting very high computational and memory requirements. We address this challenge through Approximate Computing, specifically targeting the use of Transformers in NLP tasks. Transformers are typically pre-trained and subsequently specialized for specific tasks through transfer learning. Based on the observation that pre-trained Transformers are often over-parameterized for several downstream NLP tasks, we propose a framework to create smaller, faster and in some cases more accurate models. The key cornerstones of the framework are a Significance Analysis (SA) method that identifies components in a pre-trained Transformer that are less significant for a given task, and techniques to approximate the less significant components. Our approximations include pruning of blocks, attention heads and weight groups, quantization of less significant weights and a low-complexity sign-matching based attention mechanism. Our framework can be adapted to produce models that are faster, smaller and/or more accurate, depending on the user's constraints. We apply our framework to seven Transformer models, including optimized models like DistilBERT and Q8BERT, and three downstream tasks. We demonstrate that our framework produces models that are up to 4x faster and up to 14x smaller (with less than 0.5% relative accuracy degradation), or up to 5.5% more accurate with simultaneous improvements of up to 9.83x in model size or 2.94x in speed.
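The prune-and-approximate loop can be sketched schematically as below (a toy magnitude-based significance score and uniform quantization stand in for the paper's Significance Analysis and approximation techniques; the "model" is a dictionary of random head weights).

```python
# Schematic significance-analysis-then-approximate loop: score components,
# prune the least significant, quantize the less significant survivors.
import numpy as np

rng = np.random.default_rng(0)
# pretend the model's prunable components are attention heads with weight matrices
heads = {f"layer{l}.head{h}": rng.normal(size=(8, 8)) for l in range(2) for h in range(4)}


def significance(weight, activations):
    """Toy significance score: average output magnitude on a calibration batch
    (a real method might use loss sensitivity instead)."""
    return float(np.abs(activations @ weight).mean())


def quantize(weight, bits=4):
    """Uniform quantization of a weight matrix to the given bit width."""
    scale = np.abs(weight).max() / (2 ** (bits - 1) - 1)
    return np.round(weight / scale) * scale


calib = rng.normal(size=(32, 8))              # stand-in calibration activations
scores = {name: significance(w, calib) for name, w in heads.items()}
keep = sorted(scores, key=scores.get, reverse=True)[: len(heads) // 2]

approx_model = {}
for name, w in heads.items():
    if name not in keep:
        continue                               # pruned entirely
    # quantize the less significant half of the kept heads, keep the rest exact
    approx_model[name] = w if name in keep[: len(keep) // 2] else quantize(w)

print(f"kept {len(approx_model)}/{len(heads)} heads")
```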

36. Adaptive Self-training for Few-shot Neural Sequence Labeling [PDF] 返回目录
  Yaqing Wang, Subhabrata Mukherjee, Haoda Chu, Yuancheng Tu, Ming Wu, Jing Gao, Ahmed Hassan Awadallah
Abstract: Neural sequence labeling is an important technique employed for many Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), slot tagging for dialog systems and semantic parsing. Large-scale pre-trained language models obtain very good performance on these tasks when fine-tuned on large amounts of task-specific labeled data. However, such large-scale labeled datasets are difficult to obtain for several tasks and domains due to the high cost of human annotation as well as privacy and data access constraints for sensitive user applications. This is exacerbated for sequence labeling tasks requiring such annotations at token level. In this work, we develop techniques to address the label scarcity challenge for neural sequence labeling models. Specifically, we develop self-training and meta-learning techniques for few-shot training of neural sequence taggers, namely MetaST. While self-training serves as an effective mechanism to learn from large amounts of unlabeled data, meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels. Extensive experiments on six benchmark datasets, including two massive multilingual NER datasets and four slot tagging datasets for task-oriented dialog systems, demonstrate the effectiveness of our method, with around 10% improvement over state-of-the-art systems in the 10-shot setting.
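A stripped-down sketch of self-training with confidence-weighted pseudo-labels follows (MetaST instead learns the re-weighting with meta-learning and operates on token-level sequence labeling; a simple sentence-level classifier and synthetic data are assumed here).

```python
# Self-training sketch: a teacher labels unlabeled data, pseudo-labels are
# down-weighted by confidence, and a student is retrained on the union.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(20, 5))                    # few-shot labeled set
y_lab = (X_lab[:, 0] > 0).astype(int)
X_unl = rng.normal(size=(500, 5))                   # unlabeled pool

teacher = LogisticRegression().fit(X_lab, y_lab)
for _ in range(3):                                  # a few self-training rounds
    proba = teacher.predict_proba(X_unl)
    pseudo = proba.argmax(axis=1)
    conf = proba.max(axis=1)                        # confidence as sample weight
    X_all = np.vstack([X_lab, X_unl])
    y_all = np.concatenate([y_lab, pseudo])
    w_all = np.concatenate([np.ones(len(y_lab)), conf])   # noisy labels count less
    student = LogisticRegression().fit(X_all, y_all, sample_weight=w_all)
    teacher = student                               # student becomes the next teacher

print("accuracy on the labeled set:", teacher.score(X_lab, y_lab))
```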

37. Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank [PDF] 返回目录
  Eleftheria Briakou, Marine Carpuat
Abstract: Detecting fine-grained differences in content conveyed in different languages matters for cross-lingual NLP and multilingual corpora analysis, but it is a challenging machine learning problem since annotation is expensive and hard to scale. This work improves the prediction and annotation of fine-grained semantic divergences. We introduce a training strategy for multilingual BERT models by learning to rank synthetic divergent examples of varying granularity. We evaluate our models on the Rationalized English-French Semantic Divergences, a new dataset released with this work, consisting of English-French sentence-pairs annotated with semantic divergence classes and token-level rationales. Learning to rank helps detect fine-grained sentence-level divergences more accurately than a strong sentence-level similarity model, while token-level predictions have the potential of further distinguishing between coarse and fine-grained divergences.
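The ranking objective can be illustrated with a short sketch (random vectors stand in for multilingual BERT encodings of sentence pairs; the scorer, margin value, and data split into "coarse" and "fine" divergences are illustrative assumptions).

```python
# Learning-to-rank sketch: a scorer is trained with a margin ranking loss so
# that synthetic coarse-grained divergences score higher than fine-grained ones.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 32
scorer = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MarginRankingLoss(margin=1.0)
opt = torch.optim.Adam(scorer.parameters(), lr=1e-2)

# each row stands for the encoding of an (English, French) sentence pair;
# "coarse" rows are synthetic pairs made more divergent than their "fine" twins
coarse = torch.randn(64, dim) + 0.5
fine = torch.randn(64, dim) - 0.5

for step in range(200):
    s_coarse = scorer(coarse).squeeze(1)
    s_fine = scorer(fine).squeeze(1)
    # target = +1: the first argument should receive the higher score
    loss = loss_fn(s_coarse, s_fine, torch.ones(len(coarse)))
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final ranking loss:", float(loss))
```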

38. Exposing Shallow Heuristics of Relation Extraction Models with Challenge Data [PDF] 返回目录
  Shachar Rosenman, Alon Jacovi, Yoav Goldberg
Abstract: The process of collecting and annotating training data may introduce distribution artifacts which may limit the ability of models to learn correct generalization behavior. We identify failure modes of SOTA relation extraction (RE) models trained on TACRED, which we attribute to limitations in the data annotation process. We collect and annotate a challenge set we call Challenging RE (CRE), based on naturally occurring corpus examples, to benchmark this behavior. Our experiments with four state-of-the-art RE models show that they have indeed adopted shallow heuristics that do not generalize to the challenge-set data. Further, we find that an alternative question answering modeling approach performs significantly better than the SOTA models on the challenge set, despite worse overall TACRED performance. By adding some of the challenge data as training examples, the performance of the model improves. Finally, we provide concrete suggestions on how to improve RE data collection to alleviate this behavior.

39. Cross-Thought for Sentence Encoder Pre-training [PDF] 返回目录
  Shuohang Wang, Yuwei Fang, Siqi Sun, Zhe Gan, Yu Cheng, Jing Jiang, Jingjing Liu
Abstract: In this paper, we propose Cross-Thought, a novel approach to pre-training sequence encoder, which is instrumental in building reusable sequence embeddings for large-scale NLP tasks such as question answering. Instead of using the original signals of full sentences, we train a Transformer-based sequence encoder over a large set of short sequences, which allows the model to automatically select the most useful information for predicting masked words. Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders trained with continuous sentence signals as well as traditional masked language modeling baselines. Our proposed approach also achieves new state of the art on HotpotQA (full-wiki setting) by improving intermediate information retrieval performance.

40. A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks [PDF] 返回目录
  Nikunj Saunshi, Sadhika Malladi, Sanjeev Arora
Abstract: Autoregressive language models pretrained on large corpora have been successful at solving downstream tasks, even with zero-shot usage. However, there is little theoretical justification for their success. This paper considers the following questions: (1) Why should learning the distribution of natural language help with downstream classification tasks? (2) Why do features learned using language modeling help solve downstream tasks with linear classifiers? For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as next word prediction tasks, thus making language modeling a meaningful pretraining task. For (2), we analyze properties of the cross-entropy objective to show that $\epsilon$-optimal language models in cross-entropy (log-perplexity) learn features that are $\mathcal{O}(\sqrt{\epsilon})$-good on natural linear classification tasks, thus demonstrating mathematically that doing well on language modeling can be beneficial for downstream tasks. We perform experiments to verify assumptions and validate theoretical results. Our theoretical insights motivate a simple alternative to the cross-entropy objective that performs well on some linear classification tasks.
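Hypothesis (1) can be illustrated with a toy example in which a crude language model scores two candidate continuations, turning sentiment classification into next-word prediction (the bag-of-words "LM" and tiny corpus below are purely illustrative, not the paper's setup).

```python
# Toy illustration: classify a review by comparing the language model's score
# for continuing it with "good" versus "bad".
from collections import Counter

corpus = [
    ("the food was", "good"), ("the plot was", "good"),
    ("the food was", "bad"), ("the service was", "bad"),
    ("the acting was", "good"), ("the ending was", "bad"),
]
# count how often each context word precedes each continuation
counts = {"good": Counter(), "bad": Counter()}
for ctx, nxt in corpus:
    counts[nxt].update(ctx.split())


def p_next(word, context):
    """Unnormalized p(next = word | context) under a crude bag-of-words LM
    with add-1 smoothing."""
    return sum(counts[word][w] + 1 for w in context.split())


def classify(review):
    return "positive" if p_next("good", review) >= p_next("bad", review) else "negative"


print(classify("the acting was"))   # leans positive under this toy corpus
```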

41. Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations [PDF] 返回目录
  Wanrong Zhu, Xin Eric Wang, Pradyumna Narayana, Kazoo Sone, Sugato Basu, William Yang Wang
Abstract: A major challenge in visually grounded language generation is to build robust benchmark datasets and models that can generalize well in real-world settings. To do this, it is critical to ensure that our evaluation protocols are correct, and benchmarks are reliable. In this work, we set forth to design a set of experiments to understand an important but often ignored problem in visually grounded language generation: given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance? Empirically, we study several multi-reference datasets and corresponding vision-and-language tasks. We show that it is of paramount importance to report variance in experiments; that human-generated references could vary drastically in different datasets/tasks, revealing the nature of each task; that metric-wise, CIDEr has shown systematically larger variances than others. Our evaluations on reference-per-instance shed light on the design of reliable datasets in the future.
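The kind of variance reporting the paper advocates can be sketched by resampling references and measuring how the score moves (a simple token-overlap F1 stands in for CIDEr or BLEU, and the example data is invented).

```python
# Estimate the variability of a generation score under reference resampling.
import random


def overlap_f1(hyp, ref):
    h, r = set(hyp.split()), set(ref.split())
    if not h or not r:
        return 0.0
    p, rec = len(h & r) / len(h), len(h & r) / len(r)
    return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)


def score(hyp, refs):
    return max(overlap_f1(hyp, r) for r in refs)     # best match over references


hyp = "a man rides a brown horse"
refs = [
    "a man is riding a horse",
    "a person on a brown horse",
    "someone rides through a field",
    "a horse carries a rider",
]

random.seed(0)
samples = []
for _ in range(1000):                                # bootstrap over references
    subset = random.choices(refs, k=len(refs))
    samples.append(score(hyp, subset))
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"score = {mean:.3f} +/- {var ** 0.5:.3f} (std over reference resamples)")
```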

42. Zero-Shot Stance Detection: A Dataset and Model using Generalized Topic Representations [PDF] 返回目录
  Emily Allaway, Kathleen McKeown
Abstract: Stance detection is an important component of understanding hidden influences in everyday life. Since there are thousands of potential topics to take a stance on, most with little to no training data, we focus on zero-shot stance detection: classifying stance from no training examples. In this paper, we present a new dataset for zero-shot stance detection that captures a wider range of topics and lexical variation than in previous datasets. Additionally, we propose a new model for stance detection that implicitly captures relationships between topics using generalized topic representations and show that this model improves performance on a number of challenging linguistic phenomena.

43. MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics [PDF] 返回目录
  Anthony Chen, Gabriel Stanovsky, Sameer Singh, Matt Gardner
Abstract: Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations. MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation. Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension, LERC, to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute Pearson points on held-out annotations. When we evaluate robustness on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement. MOCHA presents a challenging problem for developing accurate and robust generative reading comprehension metrics.
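Training a learned metric on human judgement scores can be sketched as a small regression problem (hand-crafted overlap features and invented judgements stand in for LERC's fine-tuned encoder and the actual MOCHA data).

```python
# Learned-metric sketch: fit a regressor that maps (candidate, reference)
# features to a human judgement score, then score a new candidate.
from sklearn.linear_model import Ridge


def features(candidate, reference):
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    inter = len(c & r)
    return [inter / max(len(c), 1), inter / max(len(r), 1), abs(len(c) - len(r))]


# (candidate answer, reference answer, human score on a 1-5 scale): toy data
judgements = [
    ("in 1912", "the year 1912", 5),
    ("1912", "the year 1912", 4),
    ("in the early 1900s", "the year 1912", 2),
    ("he was a sailor", "the year 1912", 1),
]
X = [features(c, r) for c, r, _ in judgements]
y = [s for _, _, s in judgements]
metric = Ridge().fit(X, y)

print(round(metric.predict([features("the year was 1912", "the year 1912")])[0], 2))
```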

44. MuSeM: Detecting Incongruent News Headlines using Mutual Attentive Semantic Matching [PDF] 返回目录
  Rahul Mishra, Piyush Yadav, Remi Calizzano, Markus Leippold
Abstract: Measuring the congruence between two texts has several useful applications, such as detecting the prevalent deceptive and misleading news headlines on the web. Many works have proposed machine-learning-based solutions, such as text similarity between the headline and body text, to detect the incongruence. Text-similarity-based methods fail to perform well due to inherent challenges such as the relative length mismatch between a news headline and its body content, and their non-overlapping vocabulary. On the other hand, more recent works that use headline-guided attention to learn a headline-derived contextual representation of the news body also produce a convoluted overall representation due to the length of the news body. This paper proposes a method that uses inter-mutual attention-based semantic matching between the original and synthetically generated headlines, which uses the differences between all pairs of word embeddings of the words involved. The paper also investigates two more variations of our method, which use concatenation and dot-products of the word embeddings of the words of the original and synthetic headlines. We observe that the proposed method significantly outperforms prior art on two publicly available datasets.
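The pairwise embedding-difference matching can be sketched as follows (random vectors replace real word embeddings, mean pooling replaces the paper's mutual attention, and the headlines are invented).

```python
# Match an original headline against a synthetic one via all pairwise
# word-embedding differences, pooled into a single congruence score.
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=16) for w in
         "government announces new tax cuts celebrity shocking secret".split()}


def embed(headline):
    return np.stack([vocab[w] for w in headline.split()])


def match_score(original, synthetic):
    a, b = embed(original), embed(synthetic)
    diffs = a[:, None, :] - b[None, :, :]                  # (len_a, len_b, dim)
    return float(-np.linalg.norm(diffs, axis=-1).mean())   # higher = more congruent


original = "government announces new tax cuts"
print(match_score(original, "government announces tax cuts"))   # similar wording
print(match_score(original, "celebrity shocking secret"))       # incongruent
```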

45. Combining Deep Learning and String Kernels for the Localization of Swiss German Tweets [PDF] 返回目录
  Mihaela Gaman, Radu Tudor Ionescu
Abstract: In this work, we introduce the methods proposed by the UnibucKernel team for solving the Social Media Variety Geolocation task featured in the 2020 VarDial Evaluation Campaign. We address only the second subtask, which targets a data set composed of nearly 30 thousand Swiss German Jodels. The dialect identification task is about accurately predicting the latitude and longitude of test samples. We frame the task as a double regression problem, employing a variety of machine learning approaches to predict both latitude and longitude. From simple models for regression, such as Support Vector Regression, to deep neural networks, such as Long Short-Term Memory networks and character-level convolutional neural networks, and, finally, to ensemble models based on meta-learners, such as XGBoost, our interest is focused on approaching the problem from a few different perspectives, in an attempt to minimize the prediction error. With the same goal in mind, we also considered many types of features, from high-level features, such as BERT embeddings, to low-level features, such as character n-grams, which are known to provide good results in dialect identification. Our empirical results indicate that the handcrafted model based on string kernels outperforms the deep learning approaches. Nevertheless, our best performance is given by the ensemble model that combines both handcrafted and deep learning models.
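The double-regression setup with character n-gram features can be sketched with scikit-learn (toy Swiss German snippets and made-up coordinates; the paper's string kernels, neural models, and ensembling are not reproduced here).

```python
# Double regression for geolocation: character n-gram TF-IDF features feed two
# regressors, one for latitude and one for longitude.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

texts = [
    "grüezi mitenand, so schöns wätter hüt",
    "hoi zäme, bi grad am bahnhof",
    "merci vilmal für de tipp",
    "chunnt öpper no id stadt?",
]
coords = [[47.37, 8.54], [47.05, 8.30], [46.95, 7.45], [47.56, 7.59]]  # lat, lon (invented)

features = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
X = features.fit_transform(texts)
model = MultiOutputRegressor(SVR()).fit(X, coords)

print(model.predict(features.transform(["so schöns wätter am bahnhof"])))
```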

46. SRLGRN: Semantic Role Labeling Graph Reasoning Network [PDF] 返回目录
  Chen Zheng, Parisa Kordjamshidi
Abstract: This work deals with the challenge of learning and reasoning over multi-hop question answering (QA). We propose a graph reasoning network based on the semantic structure of the sentences to learn cross-paragraph reasoning paths and find the supporting facts and the answer jointly. The proposed graph is a heterogeneous document-level graph that contains nodes of type sentence (question, title, and other sentences), and per-sentence semantic role labeling sub-graphs that contain arguments as nodes and predicates as edges. Incorporating the argument types, the argument phrases, and the semantics of the edges originating from SRL predicates into the graph encoder helps both in finding the reasoning paths and in making them explainable. Our proposed approach shows competitive performance on the HotpotQA distractor-setting benchmark compared to the recent state-of-the-art models.

47. Characterizing the Value of Information in Medical Notes [PDF] 返回目录
  Chao-Chun Hsu, Shantanu Karnwal, Sendhil Mullainathan, Ziad Obermeyer, Chenhao Tan
Abstract: Machine learning models depend on the quality of input data. As electronic health records are widely adopted, the amount of data in health care is growing, along with complaints about the quality of medical notes. We use two prediction tasks, readmission prediction and in-hospital mortality prediction, to characterize the value of information in medical notes. We show that as a whole, medical notes only provide additional predictive power over structured information in readmission prediction. We further propose a probing framework to select parts of notes that enable more accurate predictions than using all notes, despite that the selected information leads to a distribution shift from the training data ("all notes"). Finally, we demonstrate that models trained on the selected valuable information achieve even better predictive performance, with only 6.8% of all the tokens for readmission prediction.

48. Dense Relational Image Captioning via Multi-task Triple-Stream Networks [PDF] 返回目录
  Dong-Jin Kim, Tae-Hyun oh, Jinsoo Choi, In So Kweon
Abstract: We introduce dense relational captioning, a novel image captioning task which aims to generate multiple captions with respect to relational information between objects in a visual scene. Relational captioning provides explicit descriptions of each relationship between object combinations. This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding based on relationships, e.g., relational proposal generation. For relational understanding between objects, the part of speech (POS, i.e., subject-object-predicate categories) can provide valuable prior information to guide the causal sequence of words in a caption. We train our framework not only to generate captions but also to predict the POS of each word. To this end, we propose the multi-task triple-stream network (MTTSNet), which consists of three recurrent units, one responsible for each POS category, trained by jointly predicting the correct caption and POS for each word. In addition, we found that the performance of MTTSNet can be improved by modulating the object embeddings with an explicit relational module. We demonstrate that our proposed model can generate more diverse and richer captions, via extensive experimental analysis on large-scale datasets and several metrics. We additionally extend the analysis to an ablation study and to applications in holistic image captioning, scene graph generation, and retrieval tasks.

49. Text-based RL Agents with Commonsense Knowledge: New Challenges, Environments and Baselines [PDF] 返回目录
  Keerthiram Murugesan, Mattia Atzeni, Pavan Kapanipathi, Pushkar Shukla, Sadhana Kumaravel, Gerald Tesauro, Kartik Talamadupula, Mrinmaya Sachan, Murray Campbell
Abstract: Text-based games have emerged as an important test-bed for Reinforcement Learning (RL) research, requiring RL agents to combine grounded language understanding with sequential decision making. In this paper, we examine the problem of infusing RL agents with commonsense knowledge. Such knowledge would allow agents to efficiently act in the world by pruning out implausible actions, and to perform look-ahead planning to determine how current actions might affect future world states. We design a new text-based gaming environment called TextWorld Commonsense (TWC) for training and evaluating RL agents with a specific kind of commonsense knowledge about objects, their attributes, and affordances. We also introduce several baseline RL agents which track the sequential context and dynamically retrieve the relevant commonsense knowledge from ConceptNet. We show that agents which incorporate commonsense knowledge in TWC perform better, while acting more efficiently. We conduct user-studies to estimate human performance on TWC and show that there is ample room for future improvement.

50. Latent linguistic embedding for cross-lingual text-to-speech and voice conversion [PDF] 返回目录
  Hieu-Thi Luong, Junichi Yamagishi
Abstract: As the recently proposed voice cloning system, NAUTILUS, is capable of cloning unseen voices using untranscribed speech, we investigate the feasibility of using it to develop a unified cross-lingual TTS/VC system. Cross-lingual speech generation is the scenario in which speech utterances are generated with the voices of target speakers in a language not spoken by them originally. This type of system is not simply cloning the voice of the target speaker, but essentially creating a new voice that can be considered better than the original under a specific framing. By using a well-trained English latent linguistic embedding to create a cross-lingual TTS and VC system for several German, Finnish, and Mandarin speakers included in the Voice Conversion Challenge 2020, we show that our method not only creates cross-lingual VC with high speaker similarity but also can be seamlessly used for cross-lingual TTS without having to perform any extra steps. However, the subjective evaluations of perceived naturalness seemed to vary between target speakers, which is one aspect for future improvement.

51. Domain Adversarial Neural Networks for Dysarthric Speech Recognition [PDF] 返回目录
  Dominika Woszczyk, Stavros Petridis, David Millard
Abstract: Speech recognition systems have improved dramatically over the last few years; however, their performance degrades significantly for accented or impaired speech. This work explores domain adversarial neural networks (DANN) for speaker-independent speech recognition on the UAS dataset of dysarthric speech. The classification task on 10 spoken digits is performed using an end-to-end CNN taking raw audio as input. The results are compared to a speaker-adaptive (SA) model as well as speaker-dependent (SD) and multi-task learning (MTL) models. The experiments conducted in this paper show that DANN achieves an absolute recognition rate of 74.91% and outperforms the baseline by 12.18%. Additionally, the DANN model achieves results comparable to the SA model's recognition rate of 77.65%. We also observe that when labelled dysarthric speech data is available, DANN and MTL perform similarly, but when it is not, DANN performs better than MTL.
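The central DANN component, a gradient reversal layer, can be sketched in a few lines (layer sizes, data, and the single training step below are placeholders, not the paper's end-to-end CNN on the UAS data).

```python
# Gradient reversal for domain-adversarial training: the feature extractor is
# trained to classify digits while confusing a speaker/domain discriminator.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None      # reverse (and scale) the gradient


feature_net = nn.Sequential(nn.Linear(40, 64), nn.ReLU())      # stand-in encoder
digit_head = nn.Linear(64, 10)                                  # task classifier
domain_head = nn.Linear(64, 2)                                  # domain classifier
opt = torch.optim.Adam(list(feature_net.parameters()) +
                       list(digit_head.parameters()) +
                       list(domain_head.parameters()), lr=1e-3)
xent = nn.CrossEntropyLoss()

x = torch.randn(16, 40)                     # fake acoustic features
y_digit = torch.randint(0, 10, (16,))
y_domain = torch.randint(0, 2, (16,))

feats = feature_net(x)
loss = xent(digit_head(feats), y_digit) \
     + xent(domain_head(GradReverse.apply(feats, 1.0)), y_domain)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```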
