Contents
4. Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [PDF] Abstract
10. Compose Like Humans: Jointly Improving the Coherence and Novelty for Modern Chinese Poetry Generation [PDF] Abstract
11. What-if I ask you to explain: Explaining the effects of perturbations in procedural text [PDF] Abstract
17. From SPMRL to NMRL: What Did We Learn (and Unlearn) in a Decade of Parsing Morphologically-Rich Languages (MRLs)? [PDF] Abstract
24. A New Data Normalization Method to Improve Dialogue Generation by Minimizing Long Tail Effect [PDF] Abstract
25. Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning [PDF] Abstract
27. Unsupervised Alignment-based Iterative Evidence Retrieval for Multi-hop Question Answering [PDF] Abstract
28. On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs [PDF] Abstract
29. On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [PDF] Abstract
30. Influence Paths for Characterizing Subject-Verb Number Agreement in LSTM Language Models [PDF] Abstract
34. Correcting the Autocorrect: Context-Aware Typographical Error Correction via Training Data Augmentation [PDF] Abstract
39. Neural Data-to-Text Generation via Jointly Learning the Segmentation and Correspondence [PDF] Abstract
42. An Accurate Model for Predicting the (Graded) Effect of Context in Word Similarity Based on Bert [PDF] Abstract
43. Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction [PDF] Abstract
54. Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain Dialogue State Tracking [PDF] Abstract
62. Language Models as an Alternative Evaluator of Word Order Hypotheses: A Case Study in Japanese [PDF] Abstract
73. Can BERT Reason? Logically Equivalent Probes for Evaluating the Inference Capabilities of Language Models [PDF] Abstract
86. Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen [PDF] Abstract
91. Are Emojis Emotional? A Study to Understand the Association between Emojis and Emotions [PDF] Abstract
95. Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models [PDF] Abstract
102. An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction [PDF] Abstract
103. Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates [PDF] Abstract
107. A Joint Framework for Inductive Representation Learning and Explainable Reasoning in Knowledge Graphs [PDF] Abstract
110. From Zero to Hero: On the Limitations of Zero-Shot Cross-Lingual Transfer with Multilingual Transformers [PDF] Abstract
112. Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work? [PDF] Abstract
122. Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition [PDF] Abstract
129. Gender Gap in Natural Language Processing Research: Disparities in Authorship and Citations [PDF] Abstract
136. Stochastic Neighbor Embedding of Multimodal Relational Data for Image-Text Simultaneous Visualization [PDF] Abstract
Abstracts
1. What is Learned in Visually Grounded Neural Syntax Acquisition [PDF] Back to Contents
Noriyuki Kojima, Hadar Averbuch-Elor, Alexander M. Rush, Yoav Artzi
Abstract: Visual features are a promising signal for learning bootstrap textual models. However, blackbox learning models make it difficult to isolate the specific contribution of visual components. In this analysis, we consider the case study of the Visually Grounded Neural Syntax Learner (Shi et al., 2019), a recent approach for learning syntax from a visual training signal. By constructing simplified versions of the model, we isolate the core factors that yield the model's strong performance. Contrary to what the model might be capable of learning, we find significantly less expressive versions produce similar predictions and perform just as well, or even better. We also find that a simple lexical signal of noun concreteness plays the main role in the model's predictions as opposed to more complex syntactic reasoning.
2. Fast and Robust Unsupervised Contextual Biasing for Speech Recognition [PDF] Back to Contents
Young Mo Kang, Yingbo Zhou
Abstract: Automatic speech recognition (ASR) systems are becoming a ubiquitous technology. Although their accuracy is closing the gap with human-level performance under certain settings, one area that can further improve is incorporating user-specific information or context to bias predictions. A common framework is to dynamically construct a small language model from the provided contextual mini-corpus and interpolate its score with the main language model during the decoding process. Here we propose an alternative approach that does not entail an explicit contextual language model. Instead, we derive the bias score for every word in the system vocabulary from the training corpus. The method is unique in that 1) it does not require meta-data or class-label annotation for the context or the training corpus; 2) the bias score is proportional to the word's log-probability, so it not only biases toward the provided context but is also robust against irrelevant context (e.g. when the user mis-specifies it or when a tight scope is hard to quantify); and 3) the bias score for the entire vocabulary is pre-determined during the training stage, eliminating computationally expensive language model construction during inference. We show significant improvement in recognition accuracy when the relevant context is available. Additionally, we demonstrate that the proposed method exhibits high tolerance to false-triggering errors in the presence of irrelevant context.
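The biasing mechanism above boils down to a per-word score that can be precomputed from corpus statistics and added at decoding time. Below is a minimal Python sketch of that idea, assuming unigram counts stand in for the training corpus and taking the boost proportional to the word's negative log-probability (one plausible reading of property 2); the function names and the lam weight are illustrative, not the paper's implementation.

import math
from collections import Counter

def train_bias_scores(corpus_tokens):
    # Precompute a score for every vocabulary word from its
    # training-corpus log-probability (done once, before inference).
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}

def biased_logprob(base_logprob, word, context_words, bias_scores, lam=1.0):
    # Boost only words in the user-provided context; rare context words
    # (more negative log-probability) receive a larger boost, and words
    # outside the context are untouched, which is what makes the method
    # robust to irrelevant or mis-specified context.
    if word in context_words:
        return base_logprob + lam * (-bias_scores.get(word, 0.0))
    return base_logprob

In a real decoder, the adjusted log-probability would simply replace the base score inside the beam-search loop.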
3. Evaluating Explanation Methods for Neural Machine Translation [PDF] Back to Contents
Jierui Li, Lemao Liu, Huayang Li, Guanlin Li, Guoping Huang, Shuming Shi
Abstract: Recently many efforts have been devoted to interpreting black-box NMT models, but little progress has been made on metrics to evaluate explanation methods. Word Alignment Error Rate can be used as such a metric that matches human understanding; however, it cannot measure explanation methods on those target words that are not aligned to any source word. This paper thereby makes an initial attempt to evaluate explanation methods from an alternative viewpoint. To this end, it proposes a principled metric based on fidelity with regard to the predictive behavior of the NMT model. As the exact computation of this metric is intractable, we employ an efficient approach as its approximation. On six standard translation tasks, we quantitatively evaluate several explanation methods in terms of the proposed metric and reveal some valuable findings for these explanation methods in our experiments.
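Fidelity-style metrics generally ask whether the source words an explanation ranks highest actually drive the model's prediction. The sketch below illustrates that generic idea with a hypothetical model.prob(target, source) interface; the paper's metric and its approximation are more principled than this occlusion-based proxy.

def fidelity_drop(model, src_tokens, tgt_word, attribution, k=1):
    # Mask the k source tokens the explanation scores highest and
    # measure how much the probability of the predicted target word
    # drops; a larger drop suggests a more faithful explanation.
    base = model.prob(tgt_word, src_tokens)
    ranked = sorted(range(len(src_tokens)),
                    key=lambda i: attribution[i], reverse=True)
    top_k = set(ranked[:k])
    masked = ["<mask>" if i in top_k else t
              for i, t in enumerate(src_tokens)]
    return base - model.prob(tgt_word, masked)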
4. Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [PDF] Back to Contents
Arjun R Akula, Spandana Gella, Yaser Al-Onaizan, Song-Chun Zhu, Siva Reddy
Abstract: Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure, i.e., words are enough to identify the target object and the word order doesn't matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which doesn't. Additionally, we create an out-of-distribution dataset Ref-Adv by asking crowdworkers to perturb in-domain examples such that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and are 12% to 23% lower in performance than the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at this https URL
5. A Tale of a Probe and a Parser [PDF] Back to Contents
Rowan Hall Maudslay, Josef Valvoda, Tiago Pimentel, Adina Williams, Ryan Cotterell
Abstract: Measuring what linguistic information is encoded in neural models of language has become popular in NLP. Researchers approach this enterprise by training "probes" - supervised models designed to extract linguistic structure from another model's output. One such probe is the structural probe (Hewitt and Manning, 2019), designed to quantify the extent to which syntactic information is encoded in contextualised word representations. The structural probe has a novel design, unattested in the parsing literature, the precise benefit of which is not immediately obvious. To explore whether syntactic probes would do better to make use of existing techniques, we compare the structural probe to a more traditional parser with an identical lightweight parameterisation. The parser outperforms structural probe on UUAS in seven of nine analysed languages, often by a substantial amount (e.g. by 11.1 points in English). Under a second less common metric, however, there is the opposite trend - the structural probe outperforms the parser. This begs the question: which metric should we prefer?
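UUAS, the first metric mentioned, is simple enough to state in a few lines: the fraction of gold dependency edges recovered when direction is ignored. A small sketch:

def uuas(gold_edges, pred_edges):
    # Edges are (head, dependent) pairs; direction is ignored.
    gold = {frozenset(e) for e in gold_edges}
    pred = {frozenset(e) for e in pred_edges}
    return len(gold & pred) / len(gold)

# uuas([(0, 1), (1, 2)], [(1, 0), (2, 3)]) -> 0.5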
6. Code and Named Entity Recognition in StackOverflow [PDF] Back to Contents
Jeniya Tabassum, Mounica Maddela, Wei Xu, Alan Ritter
Abstract: There is an increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming related questions written by 8.5 million users. Meanwhile, there is still a lack of fundamental NLP techniques for identifying code tokens or software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We also present the SoftNER model that combines contextual information with domain specific knowledge using an attention network. The code token recognizer combined with an entity segmentation model we proposed, consistently improves the performance of the named entity tagger. Our proposed SoftNER tagger outperforms the BiLSTM-CRF model with an absolute increase of +9.73 F-1 score on StackOverflow data.
7. The Paradigm Discovery Problem [PDF] Back to Contents
Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, Nizar Habash
Abstract: This work treats the paradigm discovery problem (PDP), the task of learning an inflectional morphological system from unannotated sentences. We formalize the PDP and develop evaluation metrics for judging systems. Using currently available resources, we construct datasets for the task. We also devise a heuristic benchmark for the PDP and report empirical results on five diverse languages. Our benchmark system first makes use of word embeddings and string similarity to cluster forms by cell and by paradigm. Then, we bootstrap a neural transducer on top of the clustered data to predict words to realize the empty paradigm slots. An error analysis of our system suggests clustering by cell across different inflection classes is the most pressing challenge for future work. Our code and data are available for public use.
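The benchmark's first step, clustering forms by cell and paradigm with word embeddings plus string similarity, can be sketched as a combined distance over both signals. The greedy clustering, the alpha mix, and the threshold below are illustrative assumptions; the paper's system is more involved.

import difflib
import numpy as np

def combined_distance(w1, v1, w2, v2, alpha=0.5):
    # Embedding distance captures distributional similarity (same cell);
    # string distance captures shared stem material (same lexeme).
    cos = 1.0 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    string = 1.0 - difflib.SequenceMatcher(None, w1, w2).ratio()
    return alpha * cos + (1.0 - alpha) * string

def greedy_clusters(words, vectors, threshold=0.35):
    clusters = []  # (exemplar word, exemplar vector, members)
    for w, v in zip(words, vectors):
        for ew, ev, members in clusters:
            if combined_distance(w, v, ew, ev) < threshold:
                members.append(w)
                break
        else:
            clusters.append((w, v, [w]))
    return [members for _, _, members in clusters]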
8. From Arguments to Key Points: Towards Automatic Argument Summarization [PDF] Back to Contents
Roy Bar-Haim, Lilach Eden, Roni Friedman, Yoav Kantor, Dan Lahav, Noam Slonim
Abstract: Generating a concise summary from a large collection of arguments on a given topic is an intriguing yet understudied problem. We propose to represent such summaries as a small set of talking points, termed "key points", each scored according to its salience. We show, by analyzing a large dataset of crowd-contributed arguments, that a small number of key points per topic is typically sufficient for covering the vast majority of the arguments. Furthermore, we found that a domain expert can often predict these key points in advance. We study the task of argument-to-key point mapping, and introduce a novel large-scale dataset for this task. We report empirical results for an extensive set of experiments with this dataset, showing promising performance.
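A natural baseline for the argument-to-key-point mapping task is to embed both sides and match on cosine similarity, leaving arguments unmatched below a threshold. The sketch below uses the sentence-transformers library; the model name and threshold are assumptions, not the paper's system.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def map_arguments(arguments, key_points, threshold=0.6):
    arg_emb = encoder.encode(arguments, convert_to_tensor=True)
    kp_emb = encoder.encode(key_points, convert_to_tensor=True)
    sims = util.cos_sim(arg_emb, kp_emb)  # shape (n_arguments, n_key_points)
    mapping = {}
    for i, arg in enumerate(arguments):
        j = int(sims[i].argmax())
        # Arguments whose best key point scores below the threshold
        # stay unmatched, mirroring the task's partial coverage.
        mapping[arg] = key_points[j] if float(sims[i][j]) >= threshold else None
    return mapping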
9. Reward Constrained Interactive Recommendation with Natural Language Feedback [PDF] Back to Contents
Ruiyi Zhang, Tong Yu, Yilin Shen, Hongxia Jin, Changyou Chen, Lawrence Carin
Abstract: Text-based interactive recommendation provides richer user feedback and has demonstrated advantages over traditional interactive recommender systems. However, recommendations can easily violate preferences of users from their past natural-language feedback, since the recommender needs to explore new items for further improvement. To alleviate this issue, we propose a novel constraint-augmented reinforcement learning (RL) framework to efficiently incorporate user preferences over time. Specifically, we leverage a discriminator to detect recommendations violating user historical preference, which is incorporated into the standard RL objective of maximizing expected cumulative future rewards. Our proposed framework is general and is further extended to the task of constrained text generation. Empirical results show that the proposed method yields consistent improvement relative to standard RL methods.
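The constraint-augmented objective can be summarized as the usual expected return minus a discriminator penalty. A one-function sketch, with lam as an assumed trade-off weight:

def constrained_reward(task_reward, p_violation, lam=1.0):
    # p_violation: the discriminator's probability that the recommended
    # item violates the user's historical natural-language preferences.
    # Policy-gradient training then maximizes this shaped reward as usual.
    return task_reward - lam * p_violation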
10. Compose Like Humans: Jointly Improving the Coherence and Novelty for Modern Chinese Poetry Generation [PDF] Back to Contents
Lei Shen, Xiaoyu Guo, Meng Chen
Abstract: Chinese poetry is an important part of worldwide culture, and its classical and modern sub-branches are quite different. The former is a unique genre with strict constraints, while the latter is very flexible in length, optional in rhyme, and similar to modern poetry in other languages. It therefore demands more effort to control coherence and improve novelty. In this paper, we propose a generate-retrieve-then-refine paradigm to jointly improve the coherence and novelty. In the first stage, a draft is generated given keywords (i.e., topics) only. The second stage produces a "refining vector" from retrieval lines. At last, we take into consideration both the draft and the "refining vector" to generate a new poem. The draft provides future sentence-level information for a line to be generated. Meanwhile, the "refining vector" points out the direction of refinement based on an impressive-words detection mechanism, which can learn good patterns from references and then create new ones via an insertion operation. Experimental results on a collected large-scale modern Chinese poetry dataset show that our proposed approach can not only generate more coherent poems, but also improve their diversity and novelty.
11. What-if I ask you to explain: Explaining the effects of perturbations in procedural text [PDF] Back to Contents
Dheeraj Rajagopal, Niket Tandon, Peter Clark, Bhavana Dalvi, Eduard Hovy
Abstract: We address the task of explaining the effects of perturbations in procedural text, an important test of process comprehension. Consider a passage describing a rabbit's life-cycle: humans can easily explain the effect on the rabbit population if a female rabbit becomes ill -- i.e., the female rabbit would not become pregnant, and as a result not have babies leading to a decrease in rabbit population. We present QUARTET, a system that constructs such explanations from paragraphs, by modeling the explanation task as a multitask learning problem. QUARTET provides better explanations (based on the sentences in the procedural text) compared to several strong baselines on a recent process comprehension benchmark. We also present a surprising secondary effect: our model also achieves a new SOTA with a 7% absolute F1 improvement on a downstream QA task. This illustrates that good explanations do not have to come at the expense of end task performance.
12. To Test Machine Comprehension, Start by Defining Comprehension [PDF] Back to Contents
Jesse Dunietz, Gregory Burnham, Akash Bharadwaj, Jennifer Chu-Carroll, Owen Rambow, David Ferrucci
Abstract: Many tasks aim to measure machine reading comprehension (MRC), often focusing on question types presumed to be difficult. Rarely, however, do task designers start by considering what systems should in fact comprehend. In this paper we make two key contributions. First, we argue that existing approaches do not adequately define comprehension; they are too unsystematic about what content is tested. Second, we present a detailed definition of comprehension -- a "Template of Understanding" -- for a widely useful class of texts, namely short narratives. We then conduct an experiment that strongly suggests existing systems are not up to the task of narrative understanding as we define it.
13. Towards A Sign Language Gloss Representation Of Modern Standard Arabic [PDF] Back to Contents
Salma El Anigri, Mohammed Majid Himmi, Abdelhak Mahmoudi
Abstract: Over 5% of the world's population (466 million people) has disabling hearing loss; 34 million of them are children. They can be hard of hearing or deaf. Deaf people mostly have profound hearing loss, which implies very little or no hearing. Around the world, deaf people often communicate using a sign language, with gestures of both hands and facial expressions. A sign language is a full-fledged natural language with its own grammar and lexicon. Therefore, there is a need for translation models from and to sign languages. In this work, we are interested in the translation of Modern Standard Arabic (MSAr) into sign language. We generate a gloss representation from MSAr that extracts the features mandatory for generating animation signs. Our approach locates the most pertinent features that maintain the meaning of the input Arabic sentence.
14. Using Context in Neural Machine Translation Training Objectives [PDF] Back to Contents
Danielle Saunders, Felix Stahlberg, Bill Byrne
Abstract: We present Neural Machine Translation (NMT) training using document-level metrics with batch-level documents. Previous sequence-objective approaches to NMT training focus exclusively on sentence-level metrics like sentence BLEU which do not correspond to the desired evaluation metric, typically document BLEU. Meanwhile research into document-level NMT training focuses on data or model architecture rather than training procedure. We find that each of these lines of research has a clear space in it for the other, and propose merging them with a scheme that allows a document-level evaluation metric to be used in the NMT training objective. We first sample pseudo-documents from sentence samples. We then approximate the expected document BLEU gradient with Monte Carlo sampling for use as a cost function in Minimum Risk Training (MRT). This two-level sampling procedure gives NMT performance gains over sequence MRT and maximum-likelihood training. We demonstrate that training is more robust for document-level metrics than with sequence metrics. We further demonstrate improvements on NMT with TER and Grammatical Error Correction (GEC) using GLEU, both metrics used at the document level for evaluations.
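The two-level sampling procedure can be sketched as: sample pseudo-documents, renormalize their (scaled) log-probabilities into a distribution q, and minimize the expected cost E_q[1 - BLEU]. The sketch below uses sacrebleu for document BLEU and an assumed MRT smoothness constant alpha; the interfaces are illustrative, not the paper's code.

import torch
import sacrebleu

def mrt_loss(sample_logprobs, sampled_docs, reference_docs, alpha=5e-3):
    # sampled_docs / reference_docs: lists of pseudo-documents, each a
    # list of sentence strings; sample_logprobs: per-sample log-prob
    # tensors from the NMT model (these carry the gradient).
    costs = torch.tensor([
        1.0 - sacrebleu.corpus_bleu(doc, [ref]).score / 100.0
        for doc, ref in zip(sampled_docs, reference_docs)
    ])
    q = torch.softmax(alpha * torch.stack(sample_logprobs), dim=0)
    return (q * costs).sum()  # risk; gradients flow through q only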
15. Introducing the VoicePrivacy Initiative [PDF] Back to Contents
Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, Massimiliano Todisco
Abstract: The VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges. In this paper, we formulate the voice anonymization task selected for the VoicePrivacy 2020 Challenge and describe the datasets used for system development and evaluation. We also present the attack models and the associated objective and subjective evaluation metrics. We introduce two anonymization baselines and report objective evaluation results.
16. The Sensitivity of Language Models and Humans to Winograd Schema Perturbations [PDF] Back to Contents
Mostafa Abdou, Vinit Ravishankar, Maria Barrett, Yonatan Belinkov, Desmond Elliott, Anders Søgaard
Abstract: Large-scale pretrained language models are the major driving force behind recent improvements in performance on the Winograd Schema Challenge, a widely employed test of common sense reasoning ability. We show, however, with a new diagnostic dataset, that these models are sensitive to linguistic perturbations of the Winograd examples that minimally affect human understanding. Our results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones. Overall, humans are correct more often than out-of-the-box models, and the models are sometimes right for the wrong reasons. Finally, we show that fine-tuning on a large, task-specific dataset can offer a solution to these issues.
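Probing an LM's preference between a Winograd variant and its perturbation reduces to comparing sentence log-probabilities. A minimal sketch with GPT-2 via the transformers library (the choice of model and the sum-log-prob scoring are assumptions, not the paper's exact setup):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_logprob(sentence):
    # labels=input_ids makes the model return the mean next-token NLL,
    # which we convert back to a total log-probability.
    ids = tok(sentence, return_tensors="pt").input_ids
    mean_nll = lm(ids, labels=ids).loss
    return -mean_nll.item() * (ids.size(1) - 1)

# A model is stable under a perturbation (say, a synonym replacement)
# if the candidate it prefers in the original sentence still scores
# higher in the perturbed one.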
17. From SPMRL to NMRL: What Did We Learn (and Unlearn) in a Decade of Parsing Morphologically-Rich Languages (MRLs)? [PDF] Back to Contents
Reut Tsarfaty, Dan Bareket, Stav Klein, Amit Seker
Abstract: It has been exactly a decade since the first establishment of SPMRL, a research initiative unifying multiple research efforts to address the peculiar challenges of Statistical Parsing for Morphologically-Rich Languages (MRLs). Here we reflect on parsing MRLs in that decade, highlight the solutions and lessons learned for the architectural, modeling and lexical challenges in the pre-neural era, and argue that similar challenges re-emerge in neural architectures for MRLs. We then aim to offer a climax, suggesting that incorporating symbolic ideas proposed in SPMRL terms into nowadays neural architectures has the potential to push NLP for MRLs to a new level. We sketch strategies for designing Neural Models for MRLs (NMRL), and showcase preliminary support for these strategies via investigating the task of multi-tagging in Hebrew, a morphologically rich, high-fusion language.
18. DoQA -- Accessing Domain-Specific FAQs via Conversational QA [PDF] Back to Contents
Jon Ander Campos, Arantxa Otegi, Aitor Soroa, Jan Deriu, Mark Cieliebak, Eneko Agirre
Abstract: The goal of this work is to build conversational Question Answering (QA) interfaces for the large body of domain-specific information available in FAQ sites. We present DoQA, a dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing. Compared to previous work, DoQA comprises well-defined information needs, leading to more coherent and natural conversations with fewer factoid questions, and is multi-domain. In addition, we introduce a more realistic information retrieval (IR) scenario where the system needs to find the answer in any of the FAQ documents. The results of an existing, strong system show that, thanks to transfer learning from a Wikipedia QA dataset and fine-tuning on a single FAQ domain, it is possible to build high-quality conversational QA systems for FAQs without in-domain training data. The good results carry over into the more challenging IR scenario. In both cases, there is still ample room for improvement, as indicated by the higher human upper bound.
19. NLP in FinTech Applications: Past, Present and Future [PDF] Back to Contents
Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen
Abstract: Financial Technology (FinTech) has been one of the most rapidly rising worldwide topics in the past five years, according to FinTech statistics from Google Trends. In this position paper, we focus on research applying natural language processing (NLP) technologies in the finance domain. Our goal is to indicate where we stand now and provide a blueprint for future research. We go through the application scenarios from three aspects: Know Your Customer (KYC), Know Your Product (KYP), and Satisfy Your Customer (SYC). Both formal documents and informal textual data are analyzed to understand corporate and personal customers. Furthermore, we discuss how to dynamically update the features of products from the prospect and risk points of view. Finally, we discuss satisfying the customers in both B2C and C2C business models. After summarizing the past and recent challenges, we highlight several promising future research directions in the trend of FinTech and the open finance tendency.
20. pyBART: Evidence-based Syntactic Transformations for IE [PDF] Back to Contents
Aryeh Tiktinsky, Yoav Goldberg, Reut Tsarfaty
Abstract: Syntactic dependencies can be predicted with high accuracy, and are useful for both machine-learned and pattern-based information extraction tasks. However, their utility can be improved. These syntactic dependencies are designed to accurately reflect syntactic relations, and they do not make semantic relations explicit. Therefore, these representations lack many explicit connections between content words, that would be useful for downstream applications. Proposals like English Enhanced UD improve the situation by extending universal dependency trees with additional explicit arcs. However, they are not available to Python users, and are also limited in coverage. We introduce a broad-coverage, data-driven and linguistically sound set of transformations, that makes event-structure and many lexical relations explicit. We present pyBART, an easy-to-use open-source Python library for converting English UD trees either to Enhanced UD graphs or to our representation. The library can work as a standalone package or be integrated within a spaCy NLP pipeline. When evaluated in a pattern-based relation extraction scenario, our representation results in higher extraction scores than Enhanced UD, while requiring fewer patterns.
21. Distributional Discrepancy: A Metric for Unconditional Text Generation [PDF] 返回目录
Ping Cai, Xingyuan Chen, Peng Jin, Hongjun Wang, Tianrui Li
Abstract: The goal of unconditional text generation is to train a model on real sentences so that it generates novel sentences of the same quality and diversity as the training data. However, when different metrics are used to compare such methods, contradictory conclusions are drawn. The difficulty is that sample diversity and sample quality must be taken into account simultaneously when a generative model is evaluated. To solve this issue, a novel metric of distributional discrepancy (DD) is designed to evaluate generators according to the discrepancy between the generated sentences and the real training sentences. A challenge, however, is that DD cannot be computed directly because the distribution of real sentences is unavailable. Thus, we propose a method to estimate DD by training a neural-network-based text classifier. For comparison, three existing metrics, Bilingual Evaluation Understudy (BLEU) versus self-BLEU, language model score versus reverse language model score, and Fréchet Embedding Distance (FED), together with the proposed DD, are used to evaluate two popular generative models, an LSTM and GPT-2, on both synthetic and real data. Experimental results show that DD is much better than the three existing metrics at ranking these generative models.
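As a rough illustration of the classifier-based estimate, the sketch below trains a discriminator on real versus generated sentences and reads its held-out accuracy as a discrepancy proxy; the paper's actual estimator, classifier, and features may differ.

```python
# Toy sketch: estimate distributional discrepancy with a real-vs-generated
# text classifier. Held-out accuracy near 0.5 means the two samples are hard
# to tell apart (low DD); accuracy near 1.0 means they are far apart.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

real = ["the cat sat on the mat", "she quietly closed the door",
        "rain fell through the night", "he cooked dinner for two"]
fake = ["the the cat cat mat", "door the closed she quiet",
        "night rain rain fell fell", "dinner two he the cooked"]

texts, labels = real + fake, [1] * len(real) + [0] * len(fake)
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
dd_proxy = 2 * abs(clf.score(X_te, y_te) - 0.5)  # 0 = identical, 1 = disjoint
print(dd_proxy)
```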
22. WikiUMLS: Aligning UMLS to Wikipedia via Cross-lingual Neural Ranking [PDF] 返回目录
Afshin Rahimi, Timothy Baldwin, Karin Verspoor
Abstract: We present our work on aligning the Unified Medical Language System (UMLS) to Wikipedia, to facilitate manual alignment of the two resources. We propose a cross-lingual neural reranking model to match a UMLS concept with a Wikipedia page, which achieves a recall@1 of 71%, a substantial improvement of 20% over word- and char-level BM25, enabling manual alignment with minimal effort. We release our resources, including ranked Wikipedia pages for 700k UMLS concepts, and WikiUMLS, a dataset for training and evaluation of alignment models between UMLS and Wikipedia. This will provide easier access to Wikipedia for health professionals, patients, and NLP systems, including in multilingual settings.
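For reference, the recall@1 figure simply measures how often the gold Wikipedia page is ranked first for a concept; a generic sketch of that computation (not the authors' evaluation code, and with made-up example data):

```python
# Generic recall@k over reranker output: `rankings` maps each concept to its
# ranked Wikipedia pages, `gold` maps it to the correct page.
def recall_at_k(rankings, gold, k=1):
    hits = sum(gold[c] in pages[:k] for c, pages in rankings.items())
    return hits / len(rankings)

rankings = {"acetylsalicylic acid": ["Aspirin", "Ibuprofen"],
            "flu": ["Fever", "Influenza"]}
gold = {"acetylsalicylic acid": "Aspirin", "flu": "Influenza"}
print(recall_at_k(rankings, gold, k=1))  # 0.5
```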
23. Improving Adversarial Text Generation by Modeling the Distant Future [PDF] 返回目录
Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Dinghan Shen, Guoyin Wang, Zheng Wen, Lawrence Carin
Abstract: Auto-regressive text generation models usually focus on local fluency, and may cause inconsistent semantic meaning in long text generation. Further, automatically generating words with similar semantics is challenging, and hand-crafted linguistic rules are difficult to apply. We consider a text planning scheme and present a model-based imitation-learning approach to alleviate the aforementioned issues. Specifically, we propose a novel guider network to focus on the generative process over a longer horizon, which can assist next-word prediction and provide intermediate rewards for generator optimization. Extensive experiments demonstrate that the proposed method leads to improved performance.
24. A New Data Normalization Method to Improve Dialogue Generation by Minimizing Long Tail Effect [PDF] 返回目录
Zhiqiang Zhan, Zifeng Hou, Yang Zhang
Abstract: Recent neural models have shown significant progress in dialogue generation. Most generation models are based on language models. However, due to the Long Tail Phenomenon in linguistics, trained models tend to generate words that appear frequently in the training data, leading to monotonous responses. To address this issue, we analyze a large corpus from Wikipedia and propose three frequency-based data normalization methods. We conduct extensive experiments based on transformers and three datasets collected from social media, subtitles, and an industrial application, respectively. Experimental results demonstrate significant improvements in the diversity and informativeness (defined as the numbers of nouns and verbs) of generated responses. More specifically, unigram and bigram diversity increase by 2.6%-12.6% and 2.2%-18.9% on the three datasets, respectively, and informativeness, i.e. the numbers of nouns and verbs, increases by 4.0%-7.0% and 1.4%-12.1%, respectively. Additionally, their simplicity and effectiveness enable our methods to be adapted to different generation models without much extra computational cost.
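The diversity numbers quoted above are typically "distinct-n" style ratios; one common variant of that measurement is sketched below (the paper's exact definition may differ in detail).

```python
# distinct-n: unique n-grams divided by total n-grams across generated
# responses; higher values mean less monotonous output.
def distinct_n(responses, n):
    total, unique = 0, set()
    for r in responses:
        toks = r.split()
        grams = list(zip(*(toks[i:] for i in range(n))))
        total += len(grams)
        unique.update(grams)
    return len(unique) / max(total, 1)

responses = ["i do not know", "i do not think so", "see you tomorrow then"]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```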
25. Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning [PDF] 返回目录
Liyan Xu, Julien Hogan, Rachel E. Patzer, Jinho D. Choi
Abstract: This paper presents a reinforcement learning approach to extracting noise from long clinical documents for the task of readmission prediction after kidney transplant. We face the challenge of developing robust models on a small dataset in which each document may consist of over 10K tokens full of noise, including tabular text and task-irrelevant sentences. We first experiment with four types of encoders to empirically decide the best document representation, and then apply reinforcement learning to remove noisy text from the long documents, modeling the noise extraction process as a sequential decision problem. Our results show that the old bag-of-words encoder outperforms deep learning-based encoders on this task, and that reinforcement learning improves upon the baseline while pruning out 25% of the text segments. Our analysis shows that reinforcement learning is able to identify both typical noisy tokens and task-specific noisy text.
26. Robust Encodings: A Framework for Combating Adversarial Typos [PDF] 返回目录
Erik Jones, Robin Jia, Aditi Raghunathan, Percy Liang
Abstract: Despite excellent performance on many tasks, NLP systems are easily fooled by small adversarial perturbations of inputs. Existing procedures to defend against such perturbations are either (i) heuristic in nature and susceptible to stronger attacks or (ii) provide guaranteed robustness to worst-case attacks, but are incompatible with state-of-the-art models like BERT. In this work, we introduce robust encodings (RobEn): a simple framework that confers guaranteed robustness, without making compromises on model architecture. The core component of RobEn is an encoding function, which maps sentences to a smaller, discrete space of encodings. Systems using these encodings as a bottleneck confer guaranteed robustness with standard training, and the same encodings can be used across multiple tasks. We identify two desiderata to construct robust encoding functions: perturbations of a sentence should map to a small set of encodings (stability), and models using encodings should still perform well (fidelity). We instantiate RobEn to defend against a large family of adversarial typos. Across six tasks from GLUE, our instantiation of RobEn paired with BERT achieves an average robust accuracy of 71.3% against all adversarial typos in the family considered, while previous work using a typo-corrector achieves only 35.3% accuracy against a simple greedy attack.
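The stability requirement can be pictured with a toy encoding function: every typo variant of a word maps to one canonical token, so all perturbations of a sentence collapse to the same encoding before the model sees it. The clusters below are invented for illustration; the paper builds them from a typo model rather than by hand.

```python
# Toy stable encoding: tokens map to a canonical cluster representative,
# unknown tokens to a shared OOV bucket. Any typo inside a cluster yields an
# identical encoded sentence (stability); clusters must stay fine-grained
# enough that the task remains solvable on encodings alone (fidelity).
CLUSTERS = {
    "good": "good", "goood": "good", "gpod": "good",
    "movie": "movie", "moive": "movie", "movei": "movie",
}

def encode(sentence: str) -> str:
    return " ".join(CLUSTERS.get(t, "<oov>") for t in sentence.lower().split())

assert encode("goood moive") == encode("good movie")
print(encode("goood moive"))
```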
27. Unsupervised Alignment-based Iterative Evidence Retrieval for Multi-hop Question Answering [PDF] 返回目录
Vikas Yadav, Steven Bethard, Mihai Surdeanu
Abstract: Evidence retrieval is a critical stage of question answering (QA), necessary not only to improve performance but also to explain the decisions of the corresponding QA method. We introduce a simple, fast, and unsupervised iterative evidence retrieval method, which relies on three ideas: (a) an unsupervised alignment approach that soft-aligns questions and answers with justification sentences using only GloVe embeddings, (b) an iterative process that reformulates queries to focus on terms not covered by existing justifications, and (c) a stopping criterion that terminates retrieval when the terms in the given question and candidate answers are covered by the retrieved justifications. Despite its simplicity, our approach outperforms all previous methods (including supervised ones) on the evidence selection task on two datasets: MultiRC and QASC. When these evidence sentences are fed into a RoBERTa answer classification component, we achieve state-of-the-art QA performance on these two datasets.
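The soft-alignment score in idea (a) can be sketched compactly: each query term aligns to its most similar sentence term by cosine similarity over word vectors (GloVe in the paper; tiny made-up vectors here so the snippet is self-contained).

```python
# Sketch of the soft-alignment retrieval score. The iterative step (b) would
# then re-query using terms whose best alignment is still weak, i.e. not yet
# "covered" by the justifications retrieved so far.
import numpy as np

EMB = {
    "river": np.array([0.9, 0.1]), "water": np.array([0.8, 0.3]),
    "flows": np.array([0.2, 0.9]), "moves": np.array([0.3, 0.8]),
    "rock":  np.array([0.5, 0.5]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_score(query_terms, sentence_terms):
    score = 0.0
    for q in query_terms:
        if q in EMB and any(s in EMB for s in sentence_terms):
            score += max(cos(EMB[q], EMB[s]) for s in sentence_terms if s in EMB)
    return score

print(alignment_score(["river", "flows"], ["water", "moves", "rock"]))
```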
28. On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs [PDF] 返回目录
Adina Williams, Ryan Cotterell, Lawrence Wolf-Sonkin, Damián Blasi, Hanna Wallach
Abstract: We use large-scale corpora in six different gendered languages, along with tools from NLP and information theory, to test whether there is a relationship between the grammatical genders of inanimate nouns and the adjectives used to describe those nouns. For all six languages, we find that there is a statistically significant relationship. We also find that there are statistically significant relationships between the grammatical genders of inanimate nouns and the verbs that take those nouns as direct objects, as indirect objects, and as subjects. We defer a deeper investigation of these relationships for future work.
29. On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [PDF] 返回目录
Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, Steffen Eger
Abstract: Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish "translationese", i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity based metrics with target-side language modeling. In segment-level MT evaluation, our best metric surpasses the reference-based BLEU by 5.7 correlation points.
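The first remedy, post-hoc re-alignment of the vector spaces, can be sketched as an orthogonal Procrustes fit on a small seed dictionary of source/target embedding pairs; random vectors stand in for M-BERT or LASER outputs here, so treat this purely as the shape of the computation.

```python
# Orthogonal Procrustes: find orthogonal W minimizing ||X W - Y||_F on a seed
# dictionary, then score a translation pair by cosine similarity after
# mapping the source side through W.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                  # "source-language" vectors
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # hidden rotation between spaces
Y = X @ Q + 0.01 * rng.normal(size=X.shape)     # "target-language" vectors

U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt                                      # closed-form solution

def xsim(src_vec, tgt_vec):
    a = src_vec @ W
    return float(a @ tgt_vec / (np.linalg.norm(a) * np.linalg.norm(tgt_vec)))

print(xsim(X[0], Y[0]))  # near 1.0 after re-alignment
```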
30. Influence Paths for Characterizing Subject-Verb Number Agreement in LSTM Language Models [PDF] 返回目录
Kaiji Lu, Piotr Mardziel, Klas Leino, Matt Fredrikson, Anupam Datta
Abstract: LSTM-based recurrent neural networks are the state-of-the-art for many natural language processing (NLP) tasks. Despite their performance, it is unclear whether, or how, LSTMs learn structural features of natural languages such as subject-verb number agreement in English. Lacking this understanding, the generality of LSTM performance on this task and their suitability for related tasks remains uncertain. Further, errors cannot be properly attributed to a lack of structural capability, training data omissions, or other exceptional faults. We introduce *influence paths*, a causal account of structural properties as carried by paths across gates and neurons of a recurrent neural network. The approach refines the notion of influence (the subject's grammatical number has influence on the grammatical number of the subsequent verb) into a set of gate or neuron-level paths. The set localizes and segments the concept (e.g., subject-verb agreement), its constituent elements (e.g., the subject), and related or interfering elements (e.g., attractors). We exemplify the methodology on a widely-studied multi-layer LSTM language model, demonstrating its accounting for subject-verb number agreement. The results offer both a finer and a more complete view of an LSTM's handling of this structural aspect of the English language than prior results based on diagnostic classifiers and ablation.
31. Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction [PDF] 返回目录
Cristina España-Bonet, Alberto Barrón-Cedeño, Lluís Màrquez
Abstract: We propose an automatic language-independent graph-based method to build à-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopaedia's category graph and can produce both monolingual and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph-based model outperforms a retrieval-based approach and reaches an average precision of 84% on in-domain articles. As manual evaluations are costly, we introduce the concept of "domainness" and design several automatic metrics to account for the quality of the collections. Our best metric for domainness shows a strong correlation with the human-judged precision, representing a reasonable automatic alternative to assess the quality of domain-specific corpora. We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities. WikiTailor makes obtaining multilingual in-domain data from the Wikipedia easy.
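The category-graph exploration can be pictured as a bounded walk from a user-chosen domain category, collecting article members along the way. The toy graph below is invented for illustration; the paper's model also scores categories rather than taking every descendant.

```python
# Bounded breadth-first walk over a (toy) Wikipedia category graph:
# category -> (subcategories, member articles).
from collections import deque

CATEGORY_GRAPH = {
    "Medicine": (["Cardiology", "Oncology"], ["Medicine"]),
    "Cardiology": ([], ["Heart", "Arrhythmia"]),
    "Oncology": ([], ["Tumor"]),
}

def collect_articles(root, max_depth=2):
    seen, queue, articles = {root}, deque([(root, 0)]), []
    while queue:
        cat, depth = queue.popleft()
        subcats, members = CATEGORY_GRAPH.get(cat, ([], []))
        articles.extend(members)
        if depth < max_depth:
            for sub in subcats:
                if sub not in seen:
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return articles

print(collect_articles("Medicine"))  # ['Medicine', 'Heart', 'Arrhythmia', 'Tumor']
```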
32. Similarity Analysis of Contextual Word Representation Models [PDF] 返回目录
John M. Wu, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, James Glass
Abstract: This paper investigates contextual word representation models from the lens of similarity analysis. Given a collection of trained models, we measure the similarity of their internal representations and attention. Critically, these models come from vastly different architectures. We use existing and novel similarity measures that aim to gauge the level of localization of information in the deep models, and facilitate the investigation of which design factors affect model similarity, without requiring any external linguistic annotation. The analysis reveals that models within the same family are more similar to one another, as may be expected. Surprisingly, different architectures have rather similar representations, but different individual neurons. We also observed differences in information localization in lower and higher layers and found that higher layers are more affected by fine-tuning on downstream tasks.
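One concrete instance of such a representation-similarity measure is linear CKA, shown below; the paper studies several existing and novel measures, so treat this as illustrative rather than as their exact metric.

```python
# Linear CKA between two activation matrices (examples x dimensions). It is
# invariant to orthogonal transforms and isotropic scaling, which makes it
# usable for comparing representations across different architectures.
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
print(linear_cka(A, A @ Q))                        # ~1.0: same information, rotated
print(linear_cka(A, rng.normal(size=(2000, 16))))  # near 0: unrelated
```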
33. Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward [PDF] 返回目录
Luyang Huang, Lingfei Wu, Lu Wang
Abstract: Sequence-to-sequence models for abstractive summarization have been studied extensively, yet the generated summaries commonly suffer from fabricated content, and are often found to be near-extractive. We argue that, to address these issues, the summarizer should acquire semantic interpretation over input, e.g., via structured representation, to allow the generation of more informative summaries. In this paper, we present ASGARD, a novel framework for Abstractive Summarization with Graph-Augmentation and semantic-driven RewarD. We propose the use of dual encoders---a sequential document encoder and a graph-structured encoder---to maintain the global context and local characteristics of entities, complementing each other. We further design a reward based on a multiple choice cloze test to drive the model to better capture entity interactions. Results show that our models produce significantly higher ROUGE scores than a variant without knowledge graph as input on both New York Times and CNN/Daily Mail datasets. We also obtain better or comparable performance compared to systems that are fine-tuned from large pretrained language models. Human judges further rate our model outputs as more informative and containing fewer unfaithful errors.
34. Correcting the Autocorrect: Context-Aware Typographical Error Correction via Training Data Augmentation [PDF] 返回目录
Kshitij Shah, Gerard de Melo
Abstract: In this paper, we explore the artificial generation of typographical errors based on real-world statistics. We first draw on a small set of annotated data to compute spelling error statistics. These are then invoked to introduce errors into substantially larger corpora. The generation methodology allows us to generate particularly challenging errors that require context-aware error detection. We use it to create a set of English language error detection and correction datasets. Finally, we examine the effectiveness of machine learning models for detecting and correcting errors based on this data. The datasets are available at this http URL
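The statistics-driven error injection can be sketched as sampling an edit type with probabilities estimated from annotated data; the numbers below are placeholders, not the paper's statistics, and the paper's method additionally conditions on context to produce real-word errors.

```python
# Minimal sketch of frequency-driven typo injection over character edits.
import random

ERROR_PROBS = {"substitute": 0.4, "delete": 0.25, "insert": 0.2, "transpose": 0.15}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def corrupt(word, rng=random):
    if len(word) < 2:
        return word
    op = rng.choices(list(ERROR_PROBS), weights=ERROR_PROBS.values())[0]
    i = rng.randrange(len(word) - 1)
    if op == "substitute":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]  # transpose

random.seed(0)
print(corrupt("language"))
```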
35. Out of the Echo Chamber: Detecting Countering Debate Speeches [PDF] 返回目录
Matan Orbach, Yonatan Bilu, Assaf Toledo, Dan Lahav, Michal Jacovi, Ranit Aharonov, Noam Slonim
Abstract: An educated and informed consumption of media content has become a challenge in modern times. With the shift from traditional news outlets to social media and similar venues, a major concern is that readers are becoming encapsulated in "echo chambers" and may fall prey to fake news and disinformation, lacking easy access to dissenting views. We suggest a novel task aiming to alleviate some of these concerns -- that of detecting articles that most effectively counter the arguments -- and not just the stance -- made in a given text. We study this problem in the context of debate speeches. Given such a speech, we aim to identify, from among a set of speeches on the same topic and with an opposing stance, the ones that directly counter it. We provide a large dataset of 3,685 such speeches (in English), annotated for this relation, which hopefully would be of general interest to the NLP community. We explore several algorithms addressing this task, and while some are successful, all fall short of expert human performance, suggesting room for further research. All data collected during this work is freely available for research.
36. Let Me Choose: From Verbal Context to Font Selection [PDF] 返回目录
Amirreza Shirani, Franck Dernoncourt, Jose Echevarria, Paul Asente, Nedim Lipka, Thamar Solorio
Abstract: In this paper, we aim to learn associations between visual attributes of fonts and the verbal context of the texts they are typically applied to. Compared to related work leveraging the surrounding visual context, we choose to focus only on the input text as this can enable new applications for which the text is the only visual element in the document. We introduce a new dataset, containing examples of different topics in social media posts and ads, labeled through crowd-sourcing. Due to the subjective nature of the task, multiple fonts might be perceived as acceptable for an input text, which makes this problem challenging. To this end, we investigate different end-to-end models to learn label distributions on crowd-sourced data and capture inter-subjectivity across all annotations.
37. Emergence of Syntax Needs Minimal Supervision [PDF] 返回目录
Raphaël Bailly, Kata Gábor
Abstract: This paper is a theoretical contribution to the debate on the learnability of syntax from a corpus without explicit syntax-specific guidance. Our approach originates in the observable structure of a corpus, which we use to define and isolate grammaticality (syntactic information) and meaning/pragmatics information. We describe the formal characteristics of an autonomous syntax and show that it becomes possible to search for syntax-based lexical categories with a simple optimization process, without any prior hypothesis on the form of the model.
38. Transformer-based End-to-End Question Generation [PDF] 返回目录
Luis Enrico Lopez, Diane Kathryn Cruz, Jan Christian Blaise Cruz, Charibeth Cheng
Abstract: Question Generation (QG) is an important task in Natural Language Processing (NLP) that involves generating questions automatically when given a context paragraph. While many techniques exist for the task of QG, they employ complex model architectures, extensive features, and additional mechanisms to boost model performance. In this work, we show that transformer-based finetuning techniques can be used to create robust question generation systems using only a single pretrained language model, without the use of additional mechanisms, answer metadata, and extensive features. Our best model outperforms previous more complex RNN-based Seq2Seq models, with an 8.62 and a 14.27 increase in METEOR and ROUGE_L scores, respectively. We show that it also performs on par with Seq2Seq models that employ answer-awareness and other special mechanisms, despite being only a single-model system. We analyze how various factors affect the model's performance, such as input data formatting, the length of the context paragraphs, and the use of answer-awareness. In addition, we also look into the modes of failure that the model experiences and identify the reasons why it fails.
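The single-model recipe amounts to linearizing (answer, context) into one sequence and letting a pretrained LM continue it with a question. The sketch below shows that framing with an off-the-shelf GPT-2; the separator tokens are illustrative assumptions, not the paper's exact scheme, and an untuned model will ramble, so treat this purely as the input/output format.

```python
# Framing QG as plain language modeling: one sequence carries the answer,
# the context, and (during fine-tuning) the target question.
# Requires the transformers package; fine-tune first for sensible questions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "The Eiffel Tower was completed in 1889."
answer = "1889"
prompt = f"answer: {answer} context: {context} question:"

ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```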
39. Neural Data-to-Text Generation via Jointly Learning the Segmentation and Correspondence [PDF] 返回目录
Xiaoyu Shen, Ernie Chang, Hui Su, Jie Zhou, Dietrich Klakow
Abstract: The neural attention model has achieved great success in data-to-text generation tasks. Though usually excelling at producing fluent text, it suffers from problems of missing information, repetition, and "hallucination". Due to the black-box nature of the neural attention architecture, avoiding these problems in a systematic way is non-trivial. To address this concern, we propose to explicitly segment target text into fragment units and align them with their data correspondences. The segmentation and correspondence are jointly learned as latent variables without any human annotations. We further impose a soft statistical constraint to regularize the segmental granularity. The resulting architecture maintains the same expressive power as neural attention models, while being able to generate fully interpretable outputs with several times less computational cost. On both E2E and WebNLG benchmarks, we show that the proposed model consistently outperforms its neural attention counterparts.
40. A Two-Stage Masked LM Method for Term Set Expansion [PDF] 返回目录
Guy Kushilevitz, Shaul Markovitch, Yoav Goldberg
Abstract: We tackle the task of Term Set Expansion (TSE): given a small seed set of example terms from a semantic class, finding more members of that class. The task is of great practical utility, and also of theoretical utility as it requires generalization from few examples. Previous approaches to the TSE task can be characterized as either distributional or pattern-based. We harness the power of neural masked language models (MLM) and propose a novel TSE algorithm, which combines the pattern-based and distributional approaches. Due to the small size of the seed set, fine-tuning methods are not effective, calling for more creative use of the MLM. The gist of the idea is to use the MLM to first mine for informative patterns with respect to the seed set, and then to obtain more members of the seed class by generalizing these patterns. Our method outperforms state-of-the-art TSE algorithms. Implementation is available at: this https URL guykush/TermSetExpansion-MPB/
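The core trick can be reproduced in a few lines with a fill-mask pipeline: instantiate a coordination pattern with the seed terms and read candidate class members off the mask distribution. The pattern and model choice below are illustrative; the paper's pattern mining and scoring are more elaborate.

```python
# Querying a masked LM with a seed-instantiated pattern; candidates for the
# semantic class fall out of the [MASK] predictions. Requires transformers.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
seeds = ["paris", "london"]
pattern = f"cities such as {seeds[0]}, {seeds[1]} and [MASK]."
for pred in fill(pattern, top_k=5):
    print(pred["token_str"], round(pred["score"], 3))
```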
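A minimal sketch of the pattern-plus-MLM idea (illustrative only, not the authors' released implementation): the coordination templates, the seed set, and the vote-by-confidence scoring below are assumptions, and an off-the-shelf BERT stands in for the paper's MLM.

```python
# Illustrative sketch: expand a seed set by querying a masked LM with
# simple patterns built around the seed terms.
from collections import Counter
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token

seeds = ["paris", "london", "berlin"]
# Hypothetical coordination patterns that place a seed next to the mask.
templates = ["{seed}, {mask} and other cities.",
             "cities such as {seed} and {mask}."]

votes = Counter()
for seed in seeds:
    for tpl in templates:
        for pred in fill_mask(tpl.format(seed=seed, mask=MASK), top_k=10):
            cand = pred["token_str"].strip().lower()
            if cand not in seeds:
                votes[cand] += pred["score"]  # aggregate MLM confidence

print(votes.most_common(5))  # highest-scoring candidate class members
```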
41. A Position Aware Decay Weighted Network for Aspect based Sentiment Analysis [PDF] 返回目录
Avinash Madasu, Vijjini Anvesh Rao
Abstract: Aspect Based Sentiment Analysis (ABSA) is the task of identifying the sentiment polarity of a text given another text segment or aspect. In ABSA, a text can have multiple sentiments depending upon each aspect. Aspect Term Sentiment Analysis (ATSA) is a subtask of ABSA, in which aspect terms are contained within the given sentence. Most existing approaches proposed for ATSA incorporate aspect information through a separate subnetwork, thereby overlooking the advantage of the aspect terms' presence within the sentence. In this paper, we propose a model that leverages the positional information of the aspect. The proposed model introduces a decay mechanism based on position. This decay function weights the contribution of input words for ABSA: the contribution of a word declines the farther it is positioned from the aspect terms in the sentence. Performance is measured on two standard datasets from SemEval 2014 Task 4, and comparison with recent architectures demonstrates the effectiveness of the proposed model.
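The decay idea can be illustrated in a few lines (a sketch under assumed choices: the exponential form and the rate lam are hypothetical, and random vectors stand in for learned token representations):

```python
# Sketch: weight each token's contribution by a decay that falls off with
# distance from the aspect term, then pool the weighted vectors.
import numpy as np

tokens = ["the", "battery", "life", "is", "great", "but", "screen", "dim"]
aspect_span = (1, 3)                       # "battery life" at positions 1..2
hidden = np.random.randn(len(tokens), 64)  # stand-in token embeddings

def decay_weights(n, span, lam=0.5):
    """Weight 1.0 inside the aspect span, exponential decay outside."""
    start, end = span
    dist = np.array([0 if start <= i < end
                     else min(abs(i - start), abs(i - (end - 1)))
                     for i in range(n)])
    return np.exp(-lam * dist)

w = decay_weights(len(tokens), aspect_span)
sentence_vec = (w[:, None] * hidden).sum(axis=0) / w.sum()  # weighted pool
print(np.round(w, 3))
```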
42. An Accurate Model for Predicting the (Graded) Effect of Context in Word Similarity Based on Bert [PDF] 返回目录
Wei Bao, Hongshu Che, Jiandong Zhang
Abstract: Natural Language Processing (NLP) has been widely used in semantic analysis in recent years. Our paper mainly discusses a methodology to analyze the effect that context has on human perception of similar words, which is the third task of SemEval 2020. We apply several methods to calculate the distance between two embedding vectors generated by Bidirectional Encoder Representations from Transformers (BERT). Our team won 1st place in the Finnish language track of subtask 1 and 2nd place in the English track of subtask 1.
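The core measurement is easy to sketch (assumptions: bert-base-uncased, a target word that maps to a single wordpiece, and the last hidden layer as the contextual embedding):

```python
# Sketch: compare a word's contextual BERT embedding across two contexts
# via cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def word_vector(sentence, word):
    enc = tok(sentence, return_tensors="pt")
    wid = tok.convert_tokens_to_ids(word)        # ok for single-piece words
    pos = (enc["input_ids"][0] == wid).nonzero()[0, 0]
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # (1, seq_len, 768)
    return hidden[0, pos]

v1 = word_vector("he sat on the bank of the river", "bank")
v2 = word_vector("she deposited cash at the bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0).item())
```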
43. Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction [PDF] 返回目录
Masahiro Kaneko, Masato Mita, Shun Kiyono, Jun Suzuki, Kentaro Inui
Abstract: This paper investigates how to effectively incorporate a pre-trained masked language model (MLM), such as BERT, into an encoder-decoder (EncDec) model for grammatical error correction (GEC). The answer to this question is not as straightforward as one might expect, because the previous common methods for incorporating an MLM into an EncDec model have potential drawbacks when applied to GEC. For example, the distribution of the inputs to a GEC model can be considerably different (erroneous, clumsy, etc.) from that of the corpora used for pre-training MLMs; however, this issue is not addressed in the previous methods. Our experiments show that our proposed method, where we first fine-tune an MLM with a given GEC corpus and then use the output of the fine-tuned MLM as additional features in the GEC model, maximizes the benefit of the MLM. The best-performing model achieves state-of-the-art performance on the BEA-2019 and CoNLL-2014 benchmarks. Our code is publicly available at: this https URL.
44. How Does Selective Mechanism Improve Self-Attention Networks? [PDF] 返回目录
Xinwei Geng, Longyue Wang, Xing Wang, Bing Qin, Ting Liu, Zhaopeng Tu
Abstract: Self-attention networks (SANs) with a selective mechanism have produced substantial improvements in various NLP tasks by concentrating on a subset of input words. However, the underlying reasons for their strong performance have not been well explained. In this paper, we bridge the gap by assessing the strengths of selective SANs (SSANs), which are implemented with a flexible and universal Gumbel-Softmax. Experimental results on several representative NLP tasks, including natural language inference, semantic role labelling, and machine translation, show that SSANs consistently outperform the standard SANs. Through well-designed probing experiments, we empirically validate that the improvement of SSANs can be attributed in part to mitigating two commonly-cited weaknesses of SANs: word order encoding and structure modeling. Specifically, the selective mechanism improves SANs by paying more attention to content words that contribute to the meaning of the sentence. The code and data are released at this https URL.
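A toy rendering of the selective mechanism (a simplification, not the paper's full SSAN): a straight-through Gumbel-Softmax gate makes a discrete keep/drop choice per token while remaining differentiable.

```python
# Sketch: per-token Gumbel-Softmax gating of dot-product attention.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 6, 32)                  # (batch, words, dim)
gate_logits = torch.nn.Linear(32, 2)(x)    # per-token keep/drop scores

# hard=True gives discrete 0/1 selections in the forward pass while keeping
# gradients via the softmax relaxation (straight-through estimator).
keep = F.gumbel_softmax(gate_logits, tau=1.0, hard=True)[..., 0]  # (1, 6)

scores = x @ x.transpose(1, 2) / 32 ** 0.5  # plain dot-product attention
scores = scores.masked_fill(keep.unsqueeze(1) == 0, float("-inf"))
attn = torch.softmax(scores, dim=-1)
out = attn @ x                              # attends only to kept tokens
print(keep)  # a toy run; a real model would guard against empty selections
```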
45. Efficient Second-Order TreeCRF for Neural Dependency Parsing [PDF] 返回目录
Yu Zhang, Zhenghua Li, Min Zhang
Abstract: In the deep learning (DL) era, parsing models have been greatly simplified with little loss in performance, thanks to the remarkable capability of multi-layer BiLSTMs in context representation. As the most popular graph-based dependency parser due to its high efficiency and performance, the biaffine parser directly scores single dependencies under the arc-factorization assumption, and adopts a very simple local token-wise cross-entropy training loss. This paper for the first time presents a second-order TreeCRF extension to the biaffine parser. For a long time, the complexity and inefficiency of the inside-outside algorithm have hindered the popularity of TreeCRF. To address this issue, we propose an effective way to batchify the inside and Viterbi algorithms for direct large matrix operations on GPUs, and to avoid the complex outside algorithm via efficient back-propagation. Experiments and analysis on 27 datasets from 13 languages clearly show that techniques developed before the DL era, such as structural learning (global TreeCRF loss) and high-order modeling, are still useful, and can further boost parsing performance over the state-of-the-art biaffine parser, especially for partially annotated training data. We release our code at this https URL.
46. Unsupervised Morphological Paradigm Completion [PDF] 返回目录
Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya D. McCarthy, Katharina Kann
Abstract: We propose the task of unsupervised morphological paradigm completion. Given only raw text and a lemma list, the task consists of generating the morphological paradigms, i.e., all inflected forms, of the lemmas. From a natural language processing (NLP) perspective, this is a challenging unsupervised task, and high-performing systems have the potential to improve tools for low-resource languages or to assist linguistic annotators. From a cognitive science perspective, this can shed light on how children acquire morphological knowledge. We further introduce a system for the task, which generates morphological paradigms via the following steps: (i) EDIT TREE retrieval, (ii) additional lemma retrieval, (iii) paradigm size discovery, and (iv) inflection generation. We perform an evaluation on 14 typologically diverse languages. Our system outperforms trivial baselines with ease and, for some languages, even obtains a higher accuracy than minimally supervised systems.
47. Towards Faithful Neural Table-to-Text Generation with Content-Matching Constraints [PDF] 返回目录
Zhenyi Wang, Xiaoyang Wang, Bang An, Dong Yu, Changyou Chen
Abstract: Text generation from a knowledge base aims to translate knowledge triples to natural language descriptions. Most existing methods ignore the faithfulness between a generated text description and the original table, leading to generated information that goes beyond the content of the table. In this paper, for the first time, we propose a novel Transformer-based generation framework to achieve the goal. The core techniques in our method to enforce faithfulness include a new table-text optimal-transport matching loss and a table-text embedding similarity loss based on the Transformer model. Furthermore, to evaluate faithfulness, we propose a new automatic metric specialized to the table-to-text generation problem. We also provide detailed analysis on each component of our model in our experiments. Automatic and human evaluations show that our framework can significantly outperform state-of-the-art by a large margin.
48. Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation [PDF] 返回目录
Tianlu Wang, Xi Victoria Lin, Nazneen Fatema Rajani, Bryan McCann, Vicente Ordonez, Caiming Xiong
Abstract: Word embeddings derived from human-generated corpora inherit strong gender bias which can be further amplified by downstream models. Some commonly adopted debiasing approaches, including the seminal Hard Debias algorithm, apply post-processing procedures that project pre-trained word embeddings into a subspace orthogonal to an inferred gender subspace. We discover that semantic-agnostic corpus regularities such as word frequency captured by the word embeddings negatively impact the performance of these algorithms. We propose a simple but effective technique, Double Hard Debias, which purifies the word embeddings against such corpus regularities prior to inferring and removing the gender subspace. Experiments on three bias mitigation benchmarks show that our approach preserves the distributional semantics of the pre-trained word embeddings while reducing gender bias to a significantly larger degree than prior approaches.
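The shared projection step can be sketched as follows (toy vectors; the actual method estimates the gender direction from definitional word pairs, and Double-Hard Debias first strips frequency-related principal components, crudely approximated here by dropping the top PC):

```python
# Sketch of the projection step shared by Hard Debias and Double-Hard Debias.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(50) for w in ["he", "she", "doctor", "nurse"]}

# 1. (Double-Hard step, simplified) drop the top principal component of the
#    embedding matrix, which tends to encode frequency information.
E = np.stack(list(emb.values()))
E = E - E.mean(0)
u = np.linalg.svd(E, full_matrices=False)[2][0]   # top principal direction
emb = {w: v - (v @ u) * u for w, v in emb.items()}

# 2. Hard Debias: project vectors onto the complement of the gender direction.
g = emb["he"] - emb["she"]
g = g / np.linalg.norm(g)
debiased = {w: v - (v @ g) * g for w, v in emb.items()}

print(debiased["doctor"] @ g)  # ~0: no remaining component along gender axis
```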
49. On the Inference Calibration of Neural Machine Translation [PDF] 返回目录
Shuo Wang, Zhaopeng Tu, Shuming Shi, Yang Liu
Abstract: Confidence calibration, which aims to make model predictions equal to the true correctness measures, is important for neural machine translation (NMT) because it is able to offer useful indicators of translation errors in the generated output. While prior studies have shown that NMT models trained with label smoothing are well-calibrated on the ground-truth training data, we find that miscalibration still remains a severe challenge for NMT during inference due to the discrepancy between training and inference. By carefully designing experiments on three language pairs, our work provides in-depth analyses of the correlation between calibration and translation performance as well as linguistic properties of miscalibration and reports a number of interesting findings that might help humans better analyze, understand and improve NMT models. Based on these observations, we further propose a new graduated label smoothing method that can improve both inference calibration and translation performance.
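For reference, plain label smoothing looks like the sketch below; the paper's graduated variant, as described in the abstract, adjusts the smoothing rather than keeping eps constant.

```python
# Sketch: cross-entropy with constant-eps label smoothing.
import torch
import torch.nn.functional as F

def smoothed_nll(logits, target, eps=0.1):
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # gold term
    uniform = -logp.mean(dim=-1)                              # smoothing term
    return ((1 - eps) * nll + eps * uniform).mean()

logits = torch.randn(4, 10)               # (tokens, vocab)
target = torch.randint(0, 10, (4,))
print(smoothed_nll(logits, target))
```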
50. Bootstrapping Techniques for Polysynthetic Morphological Analysis [PDF] 返回目录
William Lane, Steven Bird
Abstract: Polysynthetic languages have exceptionally large and sparse vocabularies, thanks to the number of morpheme slots and combinations in a word. This complexity, together with a general scarcity of written data, poses a challenge to the development of natural language technologies. To address this challenge, we offer linguistically-informed approaches for bootstrapping a neural morphological analyzer, and demonstrate its application to Kunwinjku, a polysynthetic Australian language. We generate data from a finite state transducer to train an encoder-decoder model. We improve the model by "hallucinating" missing linguistic structure into the training data, and by resampling from a Zipf distribution to simulate a more natural distribution of morphemes. The best model accounts for all instances of reduplication in the test set and achieves an accuracy of 94.7% overall, a 10 percentage point improvement over the FST baseline. This process demonstrates the feasibility of bootstrapping a neural morph analyzer from minimal resources.
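The Zipf resampling step admits a compact sketch (the morpheme inventory and the exponent a below are hypothetical):

```python
# Sketch: instead of sampling morphemes uniformly from an FST, draw their
# ranks from a Zipf distribution so synthetic data mimics natural skew.
import numpy as np

rng = np.random.default_rng(0)
morphemes = [f"morph{i}" for i in range(100)]  # hypothetical inventory, by rank

def zipf_sample(n, a=1.3):
    ranks = rng.zipf(a, size=4 * n)             # heavy-tailed ranks, 1-based
    ranks = ranks[ranks <= len(morphemes)][:n]  # reject out-of-inventory draws
    return [morphemes[r - 1] for r in ranks]

sample = zipf_sample(20)
print(sample[:5])  # low-rank morphemes dominate, as in natural text
```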
51. How Can We Accelerate Progress Towards Human-like Linguistic Generalization? [PDF] 返回目录
Tal Linzen
Abstract: This position paper describes and critiques the Pretraining-Agnostic Identically Distributed (PAID) evaluation paradigm, which has become a central tool for measuring progress in natural language understanding. This paradigm consists of three stages: (1) pre-training of a word prediction model on a corpus of arbitrary size; (2) fine-tuning (transfer learning) on a training set representing a classification task; (3) evaluation on a test set drawn from the same distribution as that training set. This paradigm favors simple, low-bias architectures, which, first, can be scaled to process vast amounts of data, and second, can capture the fine-grained statistical properties of a particular data set, regardless of whether those properties are likely to generalize to examples of the task outside the data set. This contrasts with humans, who learn language from several orders of magnitude less data than the systems favored by this evaluation paradigm, and generalize to new tasks in a consistent way. We advocate for supplementing or replacing PAID with paradigms that reward architectures that generalize as quickly and robustly as humans.
52. Improving Non-autoregressive Neural Machine Translation with Monolingual Data [PDF] 返回目录
Jiawei Zhou, Phillip Keung
Abstract: Non-autoregressive (NAR) neural machine translation is usually done via knowledge distillation from an autoregressive (AR) model. Under this framework, we leverage large monolingual corpora to improve the NAR model's performance, with the goal of transferring the AR model's generalization ability while preventing overfitting. On top of a strong NAR baseline, our experimental results on the WMT14 En-De and WMT16 En-Ro news translation tasks confirm that monolingual data augmentation consistently improves the performance of the NAR model to approach the teacher AR model's performance, yields comparable or better results than the best non-iterative NAR methods in the literature and helps reduce overfitting in the training process.
53. Clue: Cross-modal Coherence Modeling for Caption Generation [PDF] 返回目录
Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, Matthew Stone
Abstract: We use coherence relations inspired by computational models of discourse to study the information needs and goals of image captioning. Using an annotation protocol specifically devised for capturing image-caption coherence relations, we annotate 10,000 instances from publicly-available image-caption pairs. We introduce a new task for learning inferences in imagery and text, coherence relation prediction, and show that these coherence annotations can be exploited to learn relation classifiers as an intermediary step, and also train coherence-aware, controllable image captioning models. The results show a dramatic improvement in the consistency and quality of the generated captions with respect to information needs specified via coherence relations.
54. Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain Dialogue State Tracking [PDF] 返回目录
Giovanni Campagna, Agata Foryciarz, Mehrad Moradshahi, Monica S. Lam
Abstract: Zero-shot transfer learning for multi-domain dialogue state tracking can allow us to handle new domains without incurring the high cost of data acquisition. This paper proposes a new zero-shot transfer learning technique for dialogue state tracking where the in-domain training data are all synthesized from an abstract dialogue model and the ontology of the domain. We show that data augmentation through synthesized data can improve the accuracy of zero-shot learning for both the TRADE model and the BERT-based SUMBT model on the MultiWOZ 2.1 dataset. We show that training the SUMBT model with only synthesized in-domain data can reach about 2/3 of the accuracy obtained with the full training dataset. We improve the zero-shot learning state of the art on average across domains by 21%.
55. Rationalizing Medical Relation Prediction from Corpus-level Statistics [PDF] 返回目录
Zhen Wang, Jennifer Lee, Simon Lin, Huan Sun
Abstract: Nowadays, the interpretability of machine learning models is becoming increasingly important, especially in the medical domain. Aiming to shed some light on how to rationalize medical relation prediction, we present a new interpretable framework inspired by existing theories on how human memory works, e.g., theories of recall and recognition. Given the corpus-level statistics, i.e., a global co-occurrence graph of a clinical text corpus, to predict the relations between two entities, we first recall rich contexts associated with the target entities, and then recognize relational interactions between these contexts to form model rationales, which will contribute to the final prediction. We conduct experiments on a real-world public clinical dataset and show that our framework can not only achieve competitive predictive performance against a comprehensive list of neural baseline models, but also present rationales to justify its prediction. We further collaborate with medical experts deeply to verify the usefulness of our model rationales for clinical decision making.
56. Improving Truthfulness of Headline Generation [PDF] 返回目录
Kazuki Matsumaru, Sho Takase, Naoaki Okazaki
Abstract: Most studies on abstractive summarization report ROUGE scores between system and reference summaries. However, we have a concern about the truthfulness of generated summaries: whether all facts of a generated summary are mentioned in the source text. This paper explores improving the truthfulness in headline generation on two popular datasets. Analyzing headlines generated by the state-of-the-art encoder-decoder model, we show that the model sometimes generates untruthful headlines. We conjecture that one of the reasons lies in untruthful supervision data used for training the model. In order to quantify the truthfulness of article-headline pairs, we consider the textual entailment of whether an article entails its headline. After confirming quite a few untruthful instances in the datasets, this study hypothesizes that removing untruthful instances from the supervision data may remedy the problem of the untruthful behaviors of the model. Building a binary classifier that predicts an entailment relation between an article and its headline, we filter out untruthful instances from the supervision data. Experimental results demonstrate that the headline generation model trained on filtered supervision data shows no clear difference in ROUGE scores but remarkable improvements in automatic and manual evaluations of the generated headlines.
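A sketch of the entailment-based filtering (using an off-the-shelf MNLI model as a stand-in for the binary classifier the paper trains):

```python
# Sketch: keep an article-headline pair only if the article is predicted
# to entail the headline.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name).eval()

def entails(article, headline):
    enc = tok(article, headline, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred = nli(**enc).logits.argmax(-1).item()
    return nli.config.id2label[pred] == "ENTAILMENT"

pairs = [("The company reported record quarterly profits on Tuesday.",
          "Company posts record profits"),
         ("The company reported record quarterly profits on Tuesday.",
          "Company files for bankruptcy")]
filtered = [p for p in pairs if entails(*p)]  # drop untruthful supervision
print(len(filtered))
```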
57. Single Model Ensemble using Pseudo-Tags and Distinct Vectors [PDF] 返回目录
Ryosuke Kuwabara, Jun Suzuki, Hideki Nakayama
Abstract: Model ensemble techniques often increase task performance in neural networks; however, they require increased time, memory, and management effort. In this study, we propose a novel method that replicates the effects of a model ensemble with a single model. Our approach creates K-virtual models within a single parameter space using K-distinct pseudo-tags and K-distinct vectors. Experiments on text classification and sequence labeling tasks on several datasets demonstrate that our method emulates or outperforms a traditional model ensemble with 1/K-times fewer parameters.
58. Predicting Performance for Natural Language Processing Tasks [PDF] 返回目录
Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, Graham Neubig
Abstract: Given the complexity of combinations of tasks, languages, and domains in natural language processing (NLP) research, it is computationally prohibitive to exhaustively test newly proposed models on each possible experimental setting. In this work, we attempt to explore the possibility of gaining plausible judgments of how well an NLP model can perform under an experimental setting, without actually training or testing the model. To do so, we build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input. Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures, outperforming reasonable baselines as well as human experts. Going further, we outline how our predictor can be used to find a small subset of representative experiments that should be run in order to obtain plausible predictions for all other experimental settings.
59. A language score based output selection method for multilingual speech recognition [PDF] 返回目录
Van Huy Nguyen, Thi Quynh Khanh Dinh, Truong Thinh Nguyen, Dang Khoa Mac
Abstract: The quality of a multilingual speech recognition system can be improved by adaptation methods if the input language is specified. For systems that can accept multilingual inputs, the popular approach is to apply a language identifier to the input and then switch or configure decoders in the next step, or to use one more subsequence model to select the output from a set of candidates. Motivated by the goal of reducing latency for real-time applications, in this paper a language model rescoring method is first applied to produce all possible candidates for the target languages; then a simple score is proposed to automatically select the output without any identifier model or language specification of the input language. The main point is that this score can be estimated simply and automatically on-the-fly, so that the whole decoding pipeline is simpler and more compact. Experimental results showed that this method can achieve the same quality as when the input language is specified. In addition, we present an English and Vietnamese End-to-End model designed not only to deal with the problem of cross-lingual speakers but also as a solution to improve the accuracy of words borrowed from English in Vietnamese.
60. ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation [PDF] 返回目录
Lifu Tu, Richard Yuanzhe Pang, Sam Wiseman, Kevin Gimpel
Abstract: We propose to train a non-autoregressive machine translation model to minimize the energy defined by a pretrained autoregressive model. In particular, we view our non-autoregressive translation system as an inference network (Tu and Gimpel, 2018) trained to minimize the autoregressive teacher energy. This contrasts with the popular approach of training a non-autoregressive model on a distilled corpus consisting of the beam-searched outputs of such a teacher model. Our approach, which we call ENGINE (ENerGy-based Inference NEtworks), achieves state-of-the-art non-autoregressive results on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets, approaching the performance of autoregressive models.
61. Sources of Transfer in Multilingual Named Entity Recognition [PDF] 返回目录
David Mueller, Nicholas Andrews, Mark Dredze
Abstract: Named-entities are inherently multilingual, and annotations in any given language may be limited. This motivates us to consider polyglot named-entity recognition (NER), where one model is trained using annotated data drawn from more than one language. However, a straightforward implementation of this simple idea does not always work in practice: naive training of NER models using annotated data drawn from multiple languages consistently underperforms models trained on monolingual data alone, despite having access to more training data. The starting point of this paper is a simple solution to this problem, in which polyglot models are fine-tuned on monolingual data to consistently and significantly outperform their monolingual counterparts. To explain this phenomenon, we explore the sources of multilingual transfer in polyglot NER models and examine the weight structure of polyglot models compared to their monolingual counterparts. We find that polyglot models efficiently share many parameters across languages and that fine-tuning may utilize a large number of those parameters.
62. Language Models as an Alternative Evaluator of Word Order Hypotheses: A Case Study in Japanese [PDF] 返回目录
Tatsuki Kuribayashi, Takumi Ito, Jun Suzuki, Kentaro Inui
Abstract: We examine a methodology using neural language models (LMs) for analyzing the word order of language. This LM-based method has the potential to overcome the difficulties existing methods face, such as the propagation of preprocessor errors in count-based methods. In this study, we explore whether the LM-based method is valid for analyzing the word order. As a case study, this study focuses on Japanese due to its complex and flexible word order. To validate the LM-based method, we test (i) parallels between LMs and human word order preference, and (ii) consistency of the results obtained using the LM-based method with previous linguistic studies. Through our experiments, we tentatively conclude that LMs display sufficient word order knowledge for usage as an analysis tool. Finally, using the LM-based method, we demonstrate the relationship between the canonical word order and topicalization, which had yet to be analyzed by large-scale experiments.
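As a concrete illustration of the LM-based methodology, the sketch below scores two orderings of the same content words with an off-the-shelf causal LM and treats the log-probability difference as an order preference. It uses English GPT-2 purely for convenience; the paper's experiments use Japanese LMs, so the model and sentences here are assumptions.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def total_logprob(text):
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss     # mean NLL per predicted token
        return -loss.item() * (ids.size(1) - 1) # total log-probability

    canonical = "She handed the letter to the clerk."
    scrambled = "She handed to the clerk the letter."
    print(total_logprob(canonical) > total_logprob(scrambled))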
63. Generalized Entropy Regularization or: There's Nothing Special about Label Smoothing [PDF] 返回目录
Clara Meister, Elizabeth Salesky, Ryan Cotterell
Abstract: Prior work has explored directly regularizing the output distributions of probabilistic models to alleviate peaky (i.e. over-confident) predictions, a common sign of overfitting. This class of techniques, of which label smoothing is one, has a connection to entropy regularization. Despite the consistent success of label smoothing across architectures and data sets in language generation tasks, two problems remain open: (1) there is little understanding of the underlying effects entropy regularizers have on models, and (2) the full space of entropy regularization techniques is largely unexplored. We introduce a parametric family of entropy regularizers, which includes label smoothing as a special case, and use it to gain a better understanding of the relationship between the entropy of a model and its performance on language generation tasks. We also find that variance in model performance can be explained largely by the resulting entropy of the model. Lastly, we find that label smoothing provably does not allow for sparsity in an output distribution, an undesirable property for language generation models, and therefore advise the use of other entropy regularization methods in its place.
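The connection to entropy regularization can be made explicit. With a one-hot target q, the uniform distribution u over the vocabulary, and model distribution p_\theta, label smoothing with weight \epsilon minimizes (a standard identity, given here as background to the abstract's claim rather than as the paper's general family):

    L_{\mathrm{LS}}(\theta) = H\big((1-\epsilon)\, q + \epsilon\, u,\; p_\theta\big)
                            = (1-\epsilon)\, H(q, p_\theta) + \epsilon\, \big[ D_{\mathrm{KL}}(u \,\|\, p_\theta) + H(u) \big]

where H denotes cross-entropy. Since H(u) is constant in \theta, label smoothing is ordinary cross-entropy plus a D_{KL}(u || p_\theta) penalty, one fixed member of the family of entropy regularizers the paper parameterizes. This view also explains the sparsity result: D_{KL}(u || p_\theta) diverges whenever p_\theta assigns zero probability to any token, so smoothed models can never produce sparse output distributions.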
64. DQI: Measuring Data Quality in NLP [PDF] 返回目录
Swaroop Mishra, Anjana Arunkumar, Bhavdeep Sachdeva, Chris Bryan, Chitta Baral
Abstract: Neural language models have achieved human level performance across several NLP datasets. However, recent studies have shown that these models are not truly learning the desired task; rather, their high performance is attributed to overfitting using spurious biases, which suggests that the capabilities of AI systems have been over-estimated. We introduce a generic formula for Data Quality Index (DQI) to help dataset creators create datasets free of such unwanted biases. We evaluate this formula using a recently proposed approach for adversarial filtering, AFLite. We propose a new data creation paradigm using DQI to create higher quality data. The data creation paradigm consists of several data visualizations to help data creators (i) understand the quality of data and (ii) visualize the impact of the created data instance on the overall quality. It also has a couple of automation methods to (i) assist data creators and (ii) make the model more robust to adversarial attacks. We use DQI along with these automation methods to renovate biased examples in SNLI. We show that models trained on the renovated SNLI dataset generalize better to out-of-distribution tasks. Renovation results in reduced model performance, exposing a large gap with respect to human performance. DQI systematically helps in creating harder benchmarks using active learning. Our work takes the process of dynamic dataset creation forward, wherein datasets evolve together with the evolving state of the art, therefore serving as a means of benchmarking the true progress of AI.
65. Social Biases in NLP Models as Barriers for Persons with Disabilities [PDF] 返回目录
Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, Stephen Denuyl
Abstract: Building equitable and inclusive NLP technologies demands consideration of whether and how social attitudes are represented in ML models. In particular, representations encoded in models often inadvertently perpetuate undesirable social biases from the data on which they are trained. In this paper, we present evidence of such undesirable biases towards mentions of disability in two different English language models: toxicity prediction and sentiment analysis. Next, we demonstrate that the neural embeddings that are the critical first step in most NLP pipelines similarly contain undesirable biases towards mentions of disability. We end by highlighting topical biases in the discourse about disability which may contribute to the observed model biases; for instance, gun violence, homelessness, and drug addiction are over-represented in texts discussing mental illness.
66. MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech [PDF] 返回目录
Jakob Drachmann Havtorn, Jan Latko, Joakim Edin, Lasse Borgholt, Lars Maaløe, Lorenzo Belgrano, Nicolai Frost Jakobsen, Regitze Sdun, Željko Agić
Abstract: We address a challenging and practical task of labeling questions in speech in real time during telephone calls to emergency medical services in English, which embeds within a broader decision support system for emergency call-takers. We propose a novel multimodal approach to real-time sequence labeling in speech. Our model treats speech and its own textual representation as two separate modalities or views, as it jointly learns from streamed audio and its noisy transcription into text via automatic speech recognition. Our results show significant gains of jointly learning from the two modalities when compared to text or audio only, under adverse noise and limited volume of training data. The results generalize to medical symptoms detection where we observe a similar pattern of improvements with multimodal learning.
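One way to picture the two-view setup: per-step audio features and per-step (ASR) text features are encoded separately, concatenated, and classified frame by frame. This is a deliberately simplified fusion sketch with illustrative dimensions, not the authors' architecture.

    import torch

    class MultimodalTagger(torch.nn.Module):
        # Simplified two-modality fusion: audio frames and aligned ASR text
        # features are encoded separately, concatenated, and labeled per step.
        def __init__(self, d_audio, d_text, d_hid, n_labels):
            super().__init__()
            self.audio_enc = torch.nn.Linear(d_audio, d_hid)
            self.text_enc = torch.nn.Linear(d_text, d_hid)
            self.classifier = torch.nn.Linear(2 * d_hid, n_labels)

        def forward(self, audio, text):        # both: (batch, steps, dim)
            h = torch.cat([torch.relu(self.audio_enc(audio)),
                           torch.relu(self.text_enc(text))], dim=-1)
            return self.classifier(h)          # per-step question-label logits

    logits = MultimodalTagger(40, 300, 128, 5)(
        torch.randn(2, 50, 40), torch.randn(2, 50, 300))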
67. Teaching Machine Comprehension with Compositional Explanations [PDF] 返回目录
Qinyuan Ye, Xiao Huang, Xiang Ren
Abstract: Advances in extractive machine reading comprehension (MRC) rely heavily on the collection of large scale human-annotated training data (in the form of "question-paragraph-answer span"). A single question-answer example provides limited supervision, while an explanation in natural language describing human's deduction process may generalize to many other questions that share similar solution patterns. In this paper, we focus on "teaching" machines on reading comprehension with (a small number of) natural language explanations. We propose a data augmentation framework that exploits the compositional nature of explanations to rapidly create pseudo-labeled data for training downstream MRC models. Structured variables and rules are extracted from each explanation and formulated into a neural module teacher, which employs softened neural modules and combinatorial search to handle linguistic variations and overcome sparse coverage. The proposed work is particularly effective when limited annotation effort is available, and achieved a practicable F1 score of 59.80% with supervision from 52 explanations on the SQuAD dataset.
68. Treebank Embedding Vectors for Out-of-domain Dependency Parsing [PDF] 返回目录
Joachim Wagner, James Barry, Jennifer Foster
Abstract: A recent advance in monolingual dependency parsing is the idea of a treebank embedding vector, which allows all treebanks for a particular language to be used as training data while at the same time allowing the model to prefer training data from one treebank over others and to select the preferred treebank at test time. We build on this idea by 1) introducing a method to predict a treebank vector for sentences that do not come from a treebank used in training, and 2) exploring what happens when we move away from predefined treebank embedding vectors during test time and instead devise tailored interpolations. We show that 1) there are interpolated vectors that are superior to the predefined ones, and 2) treebank vectors can be predicted with sufficient accuracy, for nine out of ten test languages, to match the performance of an oracle approach that knows the most suitable predefined treebank embedding for the test set.
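The test-time interpolation is straightforward to sketch: predefined treebank vectors are mixed with tailored weights for an out-of-domain sentence. The treebank names, dimensionality, and weights below are illustrative assumptions, not values from the paper.

    import numpy as np

    # Hypothetical treebank embedding vectors learned during training.
    treebank_vecs = {
        "en_ewt": np.array([0.9, 0.1, 0.0]),
        "en_gum": np.array([0.2, 0.7, 0.1]),
    }

    def interpolate(weights):
        # Tailored interpolation for an out-of-domain sentence: a convex
        # combination of the predefined treebank vectors.
        total = sum(weights.values())
        return sum(w * treebank_vecs[name] for name, w in weights.items()) / total

    # E.g., input judged 70% news-like and 30% web-like:
    vec = interpolate({"en_ewt": 0.7, "en_gum": 0.3})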
69. A Simple Language Model for Task-Oriented Dialogue [PDF] 返回目录
Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, Richard Socher
Abstract: Task-oriented dialogue is often decomposed into three tasks: understanding user input, deciding actions, and generating a response. This allows for dedicated models for each sub-task, but we find a simple, unified approach leads to state-of-the-art performance across multiple settings on the MultiWOZ dataset. SimpleTOD is a simple approach to task-oriented dialogue that uses a single causal language model trained on all sub-tasks recast as a single sequence prediction problem. This allows SimpleTOD to fully leverage transfer learning from pre-trained, open domain, causal language models such as GPT-2. SimpleTOD improves over the prior state-of-the-art by 1.22 points in joint goal accuracy for dialogue state tracking. SimpleTOD also improves all three metrics used to evaluate action and response generation in the most complete setting for task-oriented dialog systems: inform rate by 8.1 points, success rate by 9.7 points, and BLEU by 23.5 points.
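The "single sequence" recasting is easy to picture. The sketch below flattens one dialogue turn into a training string for a causal LM; the delimiter tokens and example values are illustrative, not the paper's exact special tokens.

    # Sketch of SimpleTOD-style sequence construction.
    def make_training_sequence(context, belief, actions, response):
        return " ".join([
            "<|context|>", context,
            "<|belief|>", belief,
            "<|action|>", actions,
            "<|response|>", response,
        ])

    seq = make_training_sequence(
        "user: i need a cheap hotel in the north",
        "hotel price cheap, hotel area north",
        "hotel inform choice, hotel request stars",
        "there are 5 cheap hotels in the north. any star preference?",
    )
    # A causal LM (e.g. GPT-2) is fine-tuned on such strings; at test time it is
    # prompted with <|context|> ... and generates belief, actions, and response.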
70. KinGDOM: Knowledge-Guided DOMain adaptation for sentiment analysis [PDF] 返回目录
Deepanway Ghosal, Devamanyu Hazarika, Navonil Majumder, Abhinaba Roy, Soujanya Poria, Rada Mihalcea
Abstract: Cross-domain sentiment analysis has received significant attention in recent years, prompted by the need to combat the domain gap between different applications that make use of sentiment analysis. In this paper, we take a novel perspective on this task by exploring the role of external commonsense knowledge. We introduce a new framework, KinGDOM, which utilizes the ConceptNet knowledge graph to enrich the semantics of a document by providing both domain-specific and domain-general background concepts. These concepts are learned by training a graph convolutional autoencoder that leverages inter-domain concepts in a domain-invariant manner. Conditioning a popular domain-adversarial baseline method with these learned concepts helps improve its performance over state-of-the-art approaches, demonstrating the efficacy of our proposed framework.
71. Measuring and Reducing Non-Multifact Reasoning in Multi-hop Question Answering [PDF] 返回目录
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal
Abstract: The measurement of true progress in multihop question-answering has been muddled by the strong ability of models to exploit artifacts and other reasoning shortcuts. Models can produce the correct answer, and even independently identify the supporting facts, without necessarily connecting the information between the facts. This defeats the purpose of building multihop QA datasets. We make three contributions towards addressing this issue. First, we formalize this form of disconnected reasoning and propose contrastive support sufficiency as a better test of multifact reasoning. To this end, we introduce an automated sufficiency-based dataset transformation that considers all possible partitions of supporting facts, capturing disconnected reasoning. Second, we develop a probe to measure how much can a model cheat (via non-multifact reasoning) on existing tests and our sufficiency test. Third, we conduct experiments using a transformer based model (XLNet), demonstrating that the sufficiency transform not only reduces the amount of non-multifact reasoning in this model by 6.5% but is also harder to cheat -- a non-multifact model sees a 20.8% (absolute) reduction in score compared to previous metrics.
72. Visually Grounded Continual Learning of Compositional Semantics [PDF] 返回目录
Xisen Jin, Junyi Du, Xiang Ren
Abstract: Children's language acquisition from the visual world is a real-world example of continual learning from dynamic and evolving environments; yet we lack a realistic setup to study neural networks' capability in human-like language acquisition. In this paper, we propose a realistic setup by simulating children's language acquisition process. We formulate language acquisition as a masked language modeling task where the model visits a stream of data with continuously shifting distribution. Our training and evaluation encode two important challenges in human's language learning, namely the continual learning and the compositionality. We show the performance of existing continual learning algorithms is far from satisfactory. We also study the interactions between memory based continual learning algorithms and compositional generalization and conclude that overcoming overfitting and compositional overfitting may be crucial for a good performance in our problem setup. Our code and data can be found at this https URL.
73. Can BERT Reason? Logically Equivalent Probes for Evaluating the Inference Capabilities of Language Models [PDF] 返回目录
Pei Zhou, Rahul Khanna, Bill Yuchen Lin, Daniel Ho, Xiang Ren, Jay Pujara
Abstract: Pre-trained language models (PTLM) have greatly improved performance on commonsense inference benchmarks, however, it remains unclear whether they share a human's ability to consistently make correct inferences under perturbations. Prior studies of PTLMs have found inference deficits, but have failed to provide a systematic means of understanding whether these deficits are due to low inference abilities or poor inference robustness. In this work, we address this gap by developing a procedure that allows for the systematized probing of both PTLMs' inference abilities and robustness. Our procedure centers around the methodical creation of logically-equivalent, but syntactically-different sets of probes, of which we create a corpus of 14,400 probes coming from 60 logically-equivalent sets that can be used to probe PTLMs in three task settings. We find that despite the recent success of large PTLMs on commonsense benchmarks, their performances on our probes are no better than random guessing (even with fine-tuning) and are heavily dependent on biases--the poor overall performance, unfortunately, inhibits us from studying robustness. We hope our approach and initial probe set will assist future work in improving PTLMs' inference abilities, while also providing a probing set to test robustness under several linguistic variations--code and data will be released.
74. ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning [PDF] 返回目录
Michael Boratko, Xiang Lorraine Li, Rajarshi Das, Tim O'Gorman, Dan Le, Andrew McCallum
Abstract: Given questions regarding some prototypical situation -- such as "Name something that people usually do before they leave the house for work?" -- a human can easily answer them via acquired experiences. There can be multiple right answers for such questions, with some more common for a situation than others. This paper introduces a new question answering dataset for training and evaluating common-sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played in a long-running international trivia game show -- Family Feud. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers. We also propose an open-domain task where a model has to output a ranked list of answers, ideally covering all prototypical answers for a question. On evaluating our dataset with various competitive state-of-the-art models, we find there is a significant gap between the best model and human performance on a number of evaluation metrics.
75. Exploring and Predicting Transferability across NLP Tasks [PDF] 返回目录
Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, Mohit Iyyer
Abstract: Recent advances in NLP demonstrate the effectiveness of training large-scale language models and transferring them to downstream tasks. Can fine-tuning these models on tasks other than language modeling further improve performance? In this paper, we conduct an extensive study of the transferability between 33 NLP tasks across three broad classes of problems (text classification, question answering, and sequence labeling). Our results show that transfer learning is more beneficial than previously thought, especially when target task data is scarce, and can improve performance even when the source task is small or differs substantially from the target task (e.g., part-of-speech tagging transfers well to the DROP QA dataset). We also develop task embeddings that can be used to predict the most transferable source tasks for a given target task, and we validate their effectiveness in experiments controlled for source and target data size. Overall, our experiments reveal that factors such as source data size, task and domain similarity, and task complexity all play a role in determining transferability.
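The prediction step can be sketched as nearest-neighbor search in task-embedding space: given embeddings for candidate source tasks and a target task, rank sources by similarity. Cosine similarity and the task names below are illustrative assumptions; the paper's exact prediction procedure may differ.

    import numpy as np

    def rank_source_tasks(target_emb, source_embs):
        # Higher cosine similarity -> source task predicted to transfer better.
        def cos(a, b):
            return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return sorted(source_embs,
                      key=lambda name: cos(target_emb, source_embs[name]),
                      reverse=True)

    rng = np.random.default_rng(0)
    sources = {"mnli": rng.random(8), "squad": rng.random(8), "cola": rng.random(8)}
    print(rank_source_tasks(rng.random(8), sources))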
76. BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA [PDF] 返回目录
Nora Kassner, Hinrich Schütze
Abstract: Khandelwal et al. (2020) show that a k-nearest-neighbor (kNN) component improves language modeling performance. We use this idea for open domain question answering (QA). To improve the recall of facts stated in the training text, we combine BERT (Devlin et al., 2019) with a kNN search over a large corpus. Our contributions are as follows. i) We outperform BERT on cloze-style QA by large margins without any further training. ii) We show that BERT often identifies the correct response category (e.g., central European city), but only kNN recovers the factually correct answer (e.g., "Vienna").
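A minimal sketch of the kNN component, assuming neighbors have already been retrieved from the corpus: neighbor tokens are turned into a distribution weighted by retrieval distance and interpolated with BERT's masked-token distribution. The softmax-over-negative-distances weighting and the interpolation weight are common choices, not necessarily the paper's exact ones.

    import numpy as np

    def knn_distribution(distances, neighbor_token_ids, vocab_size):
        # Closer retrieved contexts contribute more mass to their answer token.
        w = np.exp(-np.asarray(distances, dtype=float))
        w /= w.sum()
        p = np.zeros(vocab_size)
        for wi, tid in zip(w, neighbor_token_ids):
            p[tid] += wi
        return p

    def combine(p_bert, p_knn, lam=0.3):
        # Final prediction interpolates the two distributions (lam is tunable).
        return lam * p_knn + (1.0 - lam) * p_bert

    p_knn = knn_distribution([0.2, 0.5, 0.9], [7, 7, 3], vocab_size=10)
    p_bert = np.full(10, 0.1)
    print(int(combine(p_bert, p_knn).argmax()))   # kNN evidence picks token 7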
77. Synthesizer: Rethinking Self-Attention in Transformer Models [PDF] 返回目录
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
Abstract: The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. Our experimental results show that Synthesizer is competitive against vanilla Transformer models across a range of tasks, including MT (EnDe, EnFr), language modeling (LM1B), abstractive summarization (CNN/Dailymail), dialogue generation (PersonaChat) and Multi-task language understanding (GLUE, SuperGLUE).
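The Dense variant of this idea fits in a few lines: per-token features are projected straight to a row of attention scores, so no query-key dot product is ever formed. This single-head sketch (sizes illustrative) follows the paper's Dense Synthesizer; the Random variant would instead learn the length-by-length score matrix directly as a parameter.

    import torch

    class DenseSynthesizer(torch.nn.Module):
        # Single-head Dense Synthesizer sketch: each token's features are
        # projected directly to a row of attention scores (no Q-K products).
        def __init__(self, d_model, max_len):
            super().__init__()
            self.score = torch.nn.Sequential(
                torch.nn.Linear(d_model, d_model),
                torch.nn.ReLU(),
                torch.nn.Linear(d_model, max_len),
            )
            self.value = torch.nn.Linear(d_model, d_model)

        def forward(self, x):                       # x: (batch, length, d_model)
            L = x.size(1)
            attn = torch.softmax(self.score(x)[:, :, :L], dim=-1)  # (B, L, L)
            return attn @ self.value(x)             # (B, L, d_model)

    out = DenseSynthesizer(d_model=64, max_len=128)(torch.randn(2, 10, 64))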
78. Hard-Coded Gaussian Attention for Neural Machine Translation [PDF] 返回目录
Weiqiu You, Simeng Sun, Mohit Iyyer
Abstract: Recent work has questioned the importance of the Transformer's multi-headed attention for achieving high translation quality. We push further in this direction by developing a "hard-coded" attention variant without any learned parameters. Surprisingly, replacing all learned self-attention heads in the encoder and decoder with fixed, input-agnostic Gaussian distributions minimally impacts BLEU scores across four different language pairs. However, additionally hard-coding cross attention (which connects the decoder to the encoder) significantly lowers BLEU, suggesting that it is more important than self-attention. Much of this BLEU drop can be recovered by adding just a single learned cross attention head to an otherwise hard-coded Transformer. Taken as a whole, our results offer insight into which components of the Transformer are actually important, which we hope will guide future work into the development of simpler and more efficient attention-based models.
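The hard-coded head itself has no learned parameters: each position attends with a fixed Gaussian centered on a nearby position. A sketch, where sigma and the offset (e.g. -1 for the previous token) are illustrative hyperparameters:

    import torch

    def hard_coded_attention(values, sigma=1.0, offset=-1):
        # Fixed, input-agnostic self-attention: position i attends to position
        # j with weight from a Gaussian centered at i + offset.
        L = values.size(1)
        pos = torch.arange(L, dtype=torch.float)
        dist = pos[None, :] - (pos[:, None] + offset)   # (L, L)
        attn = torch.softmax(-dist.pow(2) / (2 * sigma**2), dim=-1)
        return attn @ values                            # broadcasts over batch

    out = hard_coded_attention(torch.randn(2, 10, 64))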
79. ESPRIT: Explaining Solutions to Physical Reasoning Tasks [PDF] 返回目录
Nazneen Fatema Rajani, Rui Zhang, Yi Chern Tan, Stephan Zheng, Jeremy Weiss, Aadit Vyas, Abhijit Gupta, Caiming Xiong, Richard Socher, Dragomir Radev
Abstract: Neural networks lack the ability to reason about qualitative physics and so cannot generalize to scenarios and tasks unseen during training. We propose ESPRIT, a framework for commonsense reasoning about qualitative physics in natural language that generates interpretable descriptions of physical events. We use a two-step approach of first identifying the pivotal physical events in an environment and then generating natural language descriptions of those events using a data-to-text approach. Our framework learns to generate explanations of how the physical simulation will causally evolve so that an agent or a human can easily reason about a solution using those interpretable descriptions. Human evaluations indicate that ESPRIT produces crucial fine-grained details and has high coverage of physical concepts compared to even human annotations. Dataset, code and documentation are available at this https URL.
80. RMM: A Recursive Mental Model for Dialog Navigation [PDF] 返回目录
Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, Jianfeng Gao
Abstract: Fluent communication requires understanding your audience. In the new collaborative task of Vision-and-Dialog Navigation, one agent must ask questions and follow instructive answers, while the other must provide those answers. We introduce the first true dialog navigation agents in the literature which generate full conversations, and introduce the Recursive Mental Model (RMM) to conduct these dialogs. RMM dramatically improves generated language questions and answers by recursively propagating reward signals to find the question expected to elicit the best answer, and the answer expected to elicit the best navigation. Additionally, we provide baselines for future work to build on when investigating the unique challenges of embodied visual agents that not only interpret instructions but also ask questions in natural language.
81. Obtaining Faithful Interpretations from Compositional Neural Networks [PDF] 返回目录
Sanjay Subramanian, Ben Bogin, Nitish Gupta, Tomer Wolfson, Sameer Singh, Jonathan Berant, Matt Gardner
Abstract: Neural module networks (NMNs) are a popular approach for modeling compositionality: they achieve high accuracy when applied to problems in language and vision, while reflecting the compositional structure of the problem in the network architecture. However, prior work implicitly assumed that the structure of the network modules, describing the abstract reasoning process, provides a faithful explanation of the model's reasoning; that is, that all modules perform their intended behaviour. In this work, we propose and conduct a systematic evaluation of the intermediate outputs of NMNs on NLVR2 and DROP, two datasets which require composing multiple reasoning steps. We find that the intermediate outputs differ from the expected output, illustrating that the network structure does not provide a faithful explanation of model behaviour. To remedy that, we train the model with auxiliary supervision and propose particular choices for module architecture that yield much better faithfulness, at a minimal cost to accuracy.
82. Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance? [PDF] 返回目录
Abhilasha Ravichander, Yonatan Belinkov, Eduard Hovy
Abstract: Much recent attention has been devoted to analyzing sentence representations learned by neural encoders, through the paradigm of 'probing' tasks. This is often motivated by an interest to understand the information a model uses to make its decision. However, to what extent is the information encoded in a sentence representation actually used for the task which the encoder is trained on? In this work, we examine this probing paradigm through a case-study in Natural Language Inference, showing that models learn to encode linguistic properties even when not needed for a task. We identify that pre-trained word embeddings play a considerable role in encoding these properties rather than the training task itself, highlighting the importance of careful controls when designing probing experiments. Through a set of controlled synthetic tasks, we demonstrate models can encode these properties considerably above chance-level even when distributed as random noise, calling into question the interpretation of absolute claims on probing tasks.
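The probing recipe the abstract critiques is easy to make concrete. Below is a minimal sketch, assuming scikit-learn, where synthetic vectors stand in for frozen sentence-encoder embeddings; the noise-only condition is the kind of control the authors argue probing studies need before claiming task relevance.

```python
# A minimal probing-classifier sketch (not the paper's code). Synthetic
# vectors stand in for frozen sentence embeddings from a real encoder.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 2000, 128
labels = rng.integers(0, 2, size=n)        # a binary linguistic property

noise = rng.normal(size=(n, dim))          # control: property not encoded
signal = noise.copy()
signal[:, 0] += 2.0 * (labels - 0.5)       # property linearly encoded

for name, X in [("noise-only", noise), ("property-encoded", signal)]:
    Xtr, Xte, ytr, yte = train_test_split(X, labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    print(name, "probe accuracy:", round(probe.score(Xte, yte), 3))
```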
83. A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos [PDF] 返回目录
Frank F. Xu, Lei Ji, Botian Shi, Junyi Du, Graham Neubig, Yonatan Bisk, Nan Duan
Abstract: Procedural knowledge, which we define as concrete information about the sequence of actions that go into performing a particular procedure, plays an important role in understanding real-world tasks and actions. Humans often learn this knowledge from instructional text and video, and in this paper we aim to perform automatic extraction of this knowledge in a similar way. As a concrete step in this direction, we propose the new task of inferring procedures in a structured form (a data structure containing verbs and arguments) from multimodal instructional video contents and their corresponding transcripts. We first create a manually annotated, large evaluation dataset including over 350 instructional cooking videos along with over 15,000 English sentences in transcripts spanning over 89 recipes. We analyze the challenges posed by this task and dataset through experiments with baselines based on unsupervised segmentation, semantic role labeling, and visual action detection. The dataset and code will be publicly available at this https URL.
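The abstract's target output, "a data structure containing verbs and arguments", might look roughly like the sketch below; the field names are our own invention, not the benchmark's actual schema.

```python
# A hypothetical rendering of one structured procedure step; fields are
# illustrative only, not the dataset's schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProcedureStep:
    verb: str                                   # the action, e.g., "chop"
    arguments: List[str] = field(default_factory=list)

recipe = [ProcedureStep("chop", ["the onions", "finely"]),
          ProcedureStep("saute", ["the onions", "in olive oil"])]
for step in recipe:
    print(step)
```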
84. AVA: an Automatic eValuation Approach to Question Answering Systems [PDF] 返回目录
Thuy Vu, Alessandro Moschitti
Abstract: We introduce AVA, an automatic evaluation approach for Question Answering which, given a set of questions associated with Gold Standard answers, can estimate system Accuracy. AVA uses Transformer-based language models to encode question, answer, and reference text. This allows for effectively measuring the similarity between the reference and an automatic answer, biased towards the question semantics. To design, train and test AVA, we built multiple large training, development, and test sets on both public and industrial benchmarks. Our innovative solutions achieve up to 74.7% in F1 score in predicting human judgement for single answers. Additionally, AVA can be used to evaluate the overall system Accuracy with an RMSE ranging from 0.02 to 0.09, depending on the availability of multiple references.
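As a rough sketch of the interface AVA implements (scoring a candidate answer against a gold reference, conditioned on the question), the toy function below substitutes TF-IDF cosine similarity for the paper's Transformer encoders; it assumes scikit-learn and is not the authors' model.

```python
# Toy stand-in for AVA's core measurement; TF-IDF replaces the paper's
# Transformer encoders to keep the sketch dependency-light.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ava_like_score(question: str, reference: str, candidate: str) -> float:
    # Bias toward the question semantics by prepending it to both texts.
    docs = [question + " " + reference, question + " " + candidate]
    tfidf = TfidfVectorizer().fit_transform(docs)
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

print(ava_like_score("Who wrote Hamlet?",
                     "Hamlet was written by William Shakespeare.",
                     "William Shakespeare wrote the play Hamlet."))
```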
85. A Girl Has A Name: Detecting Authorship Obfuscation [PDF] 返回目录
Asad Mahmood, Zubair Shafiq, Padmini Srinivasan
Abstract: Authorship attribution aims to identify the author of a text based on the stylometric analysis. Authorship obfuscation, on the other hand, aims to protect against authorship attribution by modifying a text's style. In this paper, we evaluate the stealthiness of state-of-the-art authorship obfuscation methods under an adversarial threat model. An obfuscator is stealthy to the extent an adversary finds it challenging to detect whether or not a text modified by the obfuscator is obfuscated - a decision that is key to the adversary interested in authorship attribution. We show that the existing authorship obfuscation methods are not stealthy as their obfuscated texts can be identified with an average F1 score of 0.87. The reason for the lack of stealthiness is that these obfuscators degrade text smoothness, as ascertained by neural language models, in a detectable manner. Our results highlight the need to develop stealthy authorship obfuscation methods that can better protect the identity of an author seeking anonymity.
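The smoothness signal the authors point to can be approximated with language-model perplexity. A minimal sketch, assuming the HuggingFace transformers package and GPT-2; the paper's detector is a trained classifier, not this exact score.

```python
# Scoring text smoothness with GPT-2 perplexity (an approximation of the
# signal described in the abstract, not the authors' detector).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss    # mean token cross-entropy
    return float(torch.exp(loss))

print(perplexity("The quick brown fox jumps over the lazy dog."))
print(perplexity("Dog lazy the over jumps fox brown quick the."))  # less smooth
```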
86. Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen [PDF] 返回目录
Yixin Cao, Ruihao Shui, Liangming Pan, Min-Yen Kan, Zhiyuan Liu, Tat-Seng Chua
Abstract: The curse of knowledge can impede communication between experts and laymen. We propose a new task of expertise style transfer and contribute a manually annotated dataset with the goal of alleviating such cognitive biases. Solving this task not only simplifies the professional language, but also improves the accuracy and expertise level of laymen descriptions using simple words. This is a challenging task, unaddressed in previous work, as it requires the models to have expert intelligence in order to modify text with a deep understanding of domain knowledge and structures. We establish the benchmark performance of five state-of-the-art models for style transfer and text simplification. The results demonstrate a significant gap between machine and human performance. We also discuss the challenges of automatic evaluation, to provide insights into future research directions. The dataset is publicly available at this https URL.
87. UnifiedQA: Crossing Format Boundaries With a Single QA System [PDF] 返回目录
Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, Hannaneh Hajishirzi
Abstract: Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UnifiedQA, that performs surprisingly well across 17 QA datasets spanning 4 diverse formats. UnifiedQA performs on par with 9 different models that were trained on individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UnifiedQA performs surprisingly well, showing strong generalization from its out-of-format training data. Finally, simply fine-tuning this pre-trained QA model into specialized models results in a new state of the art on 6 datasets, establishing UnifiedQA as a strong starting point for building QA systems.
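The key move is serializing every QA format into one plain-text input for a single text-to-text model. The function below is our illustration of that idea, not UnifiedQA's verbatim serialization format.

```python
# Illustrative format unification: every QA style becomes one input string.
def to_unified_input(question, context=None, choices=None):
    parts = [question.strip()]
    if choices:                     # multiple choice: enumerate the options
        letters = "ABCDEFGH"
        parts.append(" ".join(f"({letters[i]}) {c}"
                              for i, c in enumerate(choices)))
    if context:                     # extractive/abstractive: append passage
        parts.append(context.strip())
    return "\n".join(parts)

print(to_unified_input("What is the capital of France?",
                       context="France's capital and largest city is Paris."))
print(to_unified_input("Which is a mammal?",
                       choices=["trout", "whale", "gecko"]))
```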
88. Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [PDF] 返回目录
Jieyu Zhao, Subhabrata Mukherjee, Saghar Hosseini, Kai-Wei Chang, Ahmed Hassan Awadallah
Abstract: Multilingual representations embed words from many languages into a single semantic space such that words with similar meanings are close to each other regardless of the language. These embeddings have been widely used in various settings, such as cross-lingual transfer, where a natural language processing (NLP) model trained on one language is deployed to another language. While the cross-lingual transfer techniques are powerful, they carry gender bias from the source to target languages. In this paper, we study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications. We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations from both the intrinsic and extrinsic perspectives. Experimental results show that the magnitude of bias in the multilingual representations changes differently when we align the embeddings to different target spaces and that the alignment direction can also have an influence on the bias in transfer learning. We further provide recommendations for using the multilingual word representations for downstream tasks.
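Intrinsic bias in embedding spaces is commonly quantified with association tests over attribute word sets. The WEAT-style score below, computed on dummy vectors, conveys the general flavor; the paper proposes its own intrinsic and extrinsic measures.

```python
# A WEAT-style association score on dummy vectors (illustrative only).
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def association(word_vec, attrs_a, attrs_b):
    # Mean similarity gap between two attribute sets (e.g., male vs. female).
    return (np.mean([cos(word_vec, a) for a in attrs_a])
            - np.mean([cos(word_vec, b) for b in attrs_b]))

rng = np.random.default_rng(0)
career = rng.normal(size=16)                   # e.g., vector for "engineer"
male_terms = [rng.normal(size=16) for _ in range(3)]
female_terms = [rng.normal(size=16) for _ in range(3)]
print("bias score:", round(association(career, male_terms, female_terms), 4))
```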
89. DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering [PDF] 返回目录
Qingqing Cao, Harsh Trivedi, Aruna Balasubramanian, Niranjan Balasubramanian
Abstract: Transformer-based QA models use input-wide self-attention -- i.e. across both the question and the input passage -- at all layers, causing them to be slow and memory-intensive. It turns out that we can get by without input-wide self-attention at all layers, especially in the lower layers. We introduce DeFormer, a decomposed transformer, which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers. This allows for question-independent processing of the input text representations, which in turn enables pre-computing passage representations reducing runtime compute drastically. Furthermore, because DeFormer is largely similar to the original model, we can initialize DeFormer with the pre-training weights of a standard transformer, and directly fine-tune on the target QA dataset. We show DeFormer versions of BERT and XLNet can be used to speed up QA by over 4.3x and with simple distillation-based losses they incur only a 1% drop in accuracy. We open source the code at this https URL.
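Structurally, the decomposition can be sketched in a few lines of PyTorch: the lower layers encode question and passage separately (so passage representations can be pre-computed and cached), and only the upper layers run input-wide self-attention. The layer counts and dimensions below are arbitrary; this is a schematic, not the released implementation.

```python
# Schematic DeFormer-style encoder split: decomposed lower layers, joint
# upper layers. Sizes are arbitrary toy values.
import torch
import torch.nn as nn

d_model, n_layers, k_lower = 64, 4, 2
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    for _ in range(n_layers)
)

def deformer_like_encode(question, passage):
    q, p = question, passage
    for layer in layers[:k_lower]:      # decomposed: no cross-attention yet,
        q, p = layer(q), layer(p)       # so p could be cached per passage
    x = torch.cat([q, p], dim=1)        # joint: input-wide self-attention
    for layer in layers[k_lower:]:
        x = layer(x)
    return x

out = deformer_like_encode(torch.randn(1, 12, d_model),
                           torch.randn(1, 80, d_model))
print(out.shape)  # torch.Size([1, 92, 64])
```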
90. Robust and Interpretable Grounding of Spatial References with Relation Networks [PDF] 返回目录
Tsung-Yen Yang, Karthik Narasimhan
Abstract: Handling spatial references in natural language is a key challenge in tasks like autonomous navigation and robotic manipulation. Recent work has investigated various neural architectures for learning multi-modal representations of spatial concepts that generalize well across a variety of observations and text instructions. In this work, we develop accurate models for understanding spatial references in text that are also robust and interpretable. We design a text-conditioned relation network whose parameters are dynamically computed with a cross-modal attention module to capture fine-grained spatial relations between entities. Our experiments across three different prediction tasks demonstrate the effectiveness of our model compared to existing state-of-the-art systems. Our model is robust to both observational and instructional noise, and lends itself to easy interpretation through visualization of intermediate outputs.
91. Are Emojis Emotional? A Study to Understand the Association between Emojis and Emotions [PDF] 返回目录
Abu Shoeb, Gerard de Melo
Abstract: Given the growing ubiquity of emojis in language, there is a need for methods and resources that shed light on their meaning and communicative role. One conspicuous aspect of emojis is their use to convey affect in ways that may otherwise be non-trivial to achieve. In this paper, we seek to explore the connection between emojis and emotions by means of a new dataset consisting of human-solicited association ratings. We additionally conduct experiments to assess to what extent such associations can be inferred from existing data, such that similar associations can be predicted for a larger set of emojis. Our experiments show that this succeeds when high-quality word-level information is available.
92. Design Challenges for Low-resource Cross-lingual Entity Linking [PDF] 返回目录
Xingyu Fu, Weijia Shi, Zian Zhao, Xiaodong Yu, Dan Roth
Abstract: Cross-lingual Entity Linking (XEL) grounds mentions of entities that appear in a foreign (source) language text into an English (target) knowledge base (KB) such as Wikipedia. XEL consists of two steps: candidate generation, which retrieves a list of candidate entities for each mention, followed by candidate ranking. XEL methods have been successful on high-resource languages, but generally perform poorly on low-resource languages due to lack of supervision. In this paper, we show a thorough analysis on existing low-resource XEL methods, especially on their candidate generation methods and limitations. We observed several interesting findings: 1. They are heavily limited by the Wikipedia bilingual resource coverage. 2. They perform better on Wikipedia text than on real-world text such as news or twitter. In this paper, we claim that, under the low-resource language setting, outside-Wikipedia cross-lingual resources are essential. To prove this argument, we propose a simple but effective zero-shot framework, CogCompXEL, that complements current methods by utilizing query log mapping files from online search engines. CogCompXEL outperforms current state-of-the-art models on almost all 25 languages of the LORELEI dataset, achieving an absolute average increase of 25% in gold candidate recall.
93. Connecting the Dots: A Knowledgeable Path Generator for Commonsense Question Answering [PDF] 返回目录
Peifeng Wang, Nanyun Peng, Pedro Szekely, Xiang Ren
Abstract: Commonsense question answering (QA) requires the modeling of general background knowledge about how the world operates and how entities interact with each other. Prior works leveraged manually curated commonsense knowledge graphs to help commonsense reasoning and demonstrated their effectiveness. However, these knowledge graphs are incomplete and thus may not contain the necessary knowledge for answering the questions. In this paper, we propose to learn a multi-hop knowledge path generator to generate structured evidence dynamically according to the questions. Our generator uses a pre-trained language model as the backbone, leveraging a large amount of unstructured knowledge stored in the language model to supplement the incompleteness of the knowledge base. The experiments on two commonsense QA datasets demonstrate the effectiveness of our method, which improves over strong baselines significantly and also provides human interpretable explanations for the predictions.
94. An Imitation Game for Learning Semantic Parsers from User Interaction [PDF] 返回目录
Ziyu Yao, Yiqi Tang, Wen-tau Yih, Huan Sun, Yu Su
Abstract: Despite widely successful applications, bootstrapping and fine-tuning semantic parsers remain tedious processes, with challenges such as costly data annotation and privacy risks. In this paper, we suggest an alternative, human-in-the-loop methodology for learning semantic parsers directly from users. A semantic parser should be introspective of its uncertainties and prompt for user demonstration when uncertain. In doing so, it also gets to imitate user behavior and continue improving itself autonomously, with the hope that eventually it may become as good as the user at interpreting their questions. To combat the sparsity of demonstration, we propose a novel annotation-efficient imitation learning algorithm, which iteratively collects new datasets by mixing demonstrated states and confident predictions, and re-trains the semantic parser in a Dataset Aggregation fashion (Ross et al., 2011). We provide a theoretical analysis of its cost bound and also empirically demonstrate its promising performance on the text-to-SQL problem.
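The loop the abstract describes is DAgger-like: act, ask the user (the expert) for a demonstration when uncertain, aggregate, retrain. The toy below is fully invented (a trivial parity "parsing" task) and only illustrates the control flow, not the authors' algorithm or its cost bound.

```python
# A toy Dataset-Aggregation-style loop: query the expert on uncertain states,
# grow one dataset, retrain. Task and components are invented for illustration.
import random

rng = random.Random(0)

def expert(state):                  # the "user" always demonstrates correctly
    return state % 2

policy = {}                         # "parser": memorized state -> action

def predict(state):
    if state in policy:
        return policy[state], 1.0   # confident: seen during training
    return rng.randint(0, 1), 0.0   # uncertain: random guess

dataset = []
for it in range(3):
    for state in range(20):
        action, confidence = predict(state)
        if confidence < 0.5:        # introspective: ask for a demonstration
            dataset.append((state, expert(state)))
    policy = dict(dataset)          # "retrain" on the aggregated dataset
    acc = sum(predict(s)[0] == expert(s) for s in range(20)) / 20
    print(f"iteration {it}: accuracy {acc:.2f}")
```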
95. Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models [PDF] 返回目录
Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, Xiang Ren
Abstract: Recent works show that pre-trained masked language models, such as BERT, possess certain linguistic and commonsense knowledge. However, it remains to be seen what types of commonsense knowledge these models have access to. In this vein, we propose to study whether numerical commonsense knowledge (commonsense knowledge that provides an understanding of the numeric relations between entities) can be induced from pre-trained masked language models, and to what extent this access to knowledge is robust against adversarial examples. To study this, we introduce a probing task with a diagnostic dataset, NumerSense, containing 3,145 masked-word-prediction probes. Surprisingly, our experiments and analysis reveal that: (1) BERT and its stronger variant RoBERTa perform poorly on our dataset prior to any fine-tuning; (2) fine-tuning with distant supervision does improve performance; and (3) the best distantly supervised model still performs poorly compared to humans (47.8% vs. 96.3%).
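The probe format is easy to reproduce with an off-the-shelf masked LM. The snippet assumes the HuggingFace transformers package and bert-base-uncased (the checkpoint choice is ours):

```python
# Probing numerical commonsense with a masked-word prediction, in the style
# of the dataset described above.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Birds have [MASK] legs.", top_k=3):
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
# A model with robust numerical commonsense should rank "two" highest.
```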
96. Opportunistic Decoding with Timely Correction for Simultaneous Translation [PDF] 返回目录
Renjie Zheng, Mingbo Ma, Baigong Zheng, Kaibo Liu, Liang Huang
Abstract: Simultaneous translation has many important application scenarios and has recently attracted much attention from both academia and industry. Most existing frameworks, however, have difficulty balancing translation quality against latency; i.e., the decoding policy is usually either too aggressive or too conservative. We propose an opportunistic decoding technique with timely correction ability, which always (over-)generates a certain amount of extra words at each step to keep the audience on track with the latest information. At the same time, it also corrects, in a timely fashion, the mistakes in the previously overgenerated words when observing more source context, to ensure high translation quality. Experiments show that our technique achieves a substantial reduction in latency and up to a +3.1 BLEU increase, with a revision rate under 8%, in Chinese-to-English and English-to-Chinese translation.
97. Generating Derivational Morphology with BERT [PDF] 返回目录
Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze
Abstract: Can BERT generate derivationally complex words? We present the first study investigating this question. We find that BERT with a derivational classification layer outperforms an LSTM-based model. Furthermore, our experiments show that the input segmentation crucially impacts BERT's derivational knowledge, both during training and inference.
98. Contrastive Self-Supervised Learning for Commonsense Reasoning [PDF] 返回目录
Tassilo Klein, Moin Nabi
Abstract: We propose a self-supervised method to solve Pronoun Disambiguation and Winograd Schema Challenge problems. Our approach exploits the characteristic structure of training corpora related to so-called "trigger" words, which are responsible for flipping the answer in pronoun disambiguation. We achieve such commonsense reasoning by constructing pair-wise contrastive auxiliary predictions. To this end, we leverage a mutual exclusive loss regularized by a contrastive margin. Our architecture is based on the recently introduced transformer networks, BERT, that exhibits strong performance on many NLP benchmarks. Empirical results show that our method alleviates the limitation of current supervised approaches for commonsense reasoning. This study opens up avenues for exploiting inexpensive self-supervision to achieve performance gain in commonsense reasoning tasks.
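One way to read the pair-wise contrastive construction: for a twin sentence pair whose answers must flip with the trigger word, penalize the model unless its preferred candidate differs across the pair by a margin. The PyTorch loss below follows that reading and is not the authors' exact formulation.

```python
# Schematic contrastive-margin loss over candidate scores for a twin pair;
# the mutual-exclusivity reading is our interpretation of the abstract.
import torch
import torch.nn.functional as F

def contrastive_margin_loss(scores_a, scores_b, margin=0.4):
    # scores_a / scores_b: scores for (candidate1, candidate2) in the two
    # twin sentences; the winning candidate should flip between them.
    gap_a = scores_a[0] - scores_a[1]   # candidate1 should win in sentence A
    gap_b = scores_b[1] - scores_b[0]   # candidate2 should win in sentence B
    return F.relu(margin - gap_a) + F.relu(margin - gap_b)

print(contrastive_margin_loss(torch.tensor([0.9, 0.1]),
                              torch.tensor([0.2, 0.8])))  # tensor(0.)
```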
99. Benchmarking Multimodal Regex Synthesis with Complex Structures [PDF] 返回目录
Xi Ye, Qiaochu Chen, Isil Dillig, Greg Durrett
Abstract: Existing datasets for regular expression (regex) generation from natural language are limited in complexity; compared to regex tasks that users post on StackOverflow, the regexes in these datasets are simple, and the language used to describe them is not diverse. We introduce StructuredRegex, a new regex synthesis dataset differing from prior ones in three aspects. First, to obtain structurally complex and realistic regexes, we generate the regexes using a probabilistic grammar with pre-defined macros observed from real-world StackOverflow posts. Second, to obtain linguistically diverse natural language descriptions, we show crowdworkers abstract depictions of the underlying regex and ask them to describe the pattern they see, rather than having them paraphrase synthetic language. Third, we augment each regex example with a collection of strings that are and are not matched by the ground truth regex, similar to how real users give examples. Our quantitative and qualitative analysis demonstrates the advantages of StructuredRegex over prior datasets. Further experimental results using various multimodal synthesis techniques highlight the challenge presented by our dataset, including non-local constraints and multi-modal inputs.
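The dataset-construction step (sampling structurally complex regexes from a probabilistic grammar with macros) can be sketched as follows; the grammar, fragments, and weights here are invented for illustration.

```python
# Sampling regexes from a tiny probabilistic grammar (invented rules).
import random
import re

RULES = {
    "REGEX": [(["PIECE"], 0.3), (["PIECE", "REGEX"], 0.7)],
    "PIECE": [(["[0-9]{2,4}"], 0.4), (["[A-Z]+"], 0.3), (["(x|y)?"], 0.3)],
}

def sample(symbol, rng):
    if symbol not in RULES:             # terminal: a concrete regex fragment
        return symbol
    productions = [p for p, _ in RULES[symbol]]
    weights = [w for _, w in RULES[symbol]]
    expansion = rng.choices(productions, weights=weights)[0]
    return "".join(sample(s, rng) for s in expansion)

pattern = sample("REGEX", random.Random(0))
print("sampled regex:", pattern)
re.compile(pattern)                     # sanity check: it is a valid regex
```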
100. On Faithfulness and Factuality in Abstractive Summarization [PDF] 返回目录
Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald
Abstract: It is well known that the standard likelihood training and approximate decoding objectives in neural text generation models lead to less human-like responses for open-ended tasks such as language modeling and story generation. In this paper we have analyzed limitations of these models for abstractive document summarization and found that these models are highly prone to hallucinate content that is unfaithful to the input document. We conducted a large scale human evaluation of several neural abstractive summarization systems to better understand the types of hallucinations they produce. Our human annotators found substantial amounts of hallucinated content in all model generated summaries. However, our analysis does show that pretrained models are better summarizers not only in terms of raw metrics, i.e., ROUGE, but also in generating faithful and factual summaries as evaluated by humans. Furthermore, we show that textual entailment measures better correlate with faithfulness than standard metrics, potentially leading the way to automatic evaluation metrics as well as training and decoding criteria.
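Their entailment-based measurement can be approximated with an off-the-shelf NLI model: treat the source document as the premise and the summary as the hypothesis, then read off the entailment probability. The checkpoint and protocol below are our assumptions, not the paper's exact setup.

```python
# Approximating a faithfulness score with a pretrained NLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

source = "The company reported a five percent rise in quarterly revenue."
summary = "Quarterly revenue grew by five percent."

inputs = tok(source, summary, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
labels = ["contradiction", "neutral", "entailment"]  # this model's label order
print({l: round(float(p), 3) for l, p in zip(labels, probs)})
```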
101. GenericsKB: A Knowledge Base of Generic Statements [PDF] 返回目录
Sumithra Bhakthavatsalam, Chloe Anastasiades, Peter Clark
Abstract: We present a new resource for the NLP community, namely a large (3.5M+ sentence) knowledge base of *generic statements*, e.g., "Trees remove carbon dioxide from the atmosphere", collected from multiple corpora. This is the first large resource to contain *naturally occurring* generic sentences, as opposed to extracted or crowdsourced triples, and thus is rich in high-quality, general, semantically complete statements. All GenericsKB sentences are annotated with their topical term, surrounding context (sentences), and a (learned) confidence. We also release GenericsKB-Best (1M+ sentences), containing the best-quality generics in GenericsKB augmented with selected, synthesized generics from WordNet and ConceptNet. In tests on two existing datasets requiring multihop reasoning (OBQA and QASC), we find using GenericsKB can result in higher scores and better explanations than using a much larger corpus. This demonstrates that GenericsKB can be a useful resource for NLP applications, as well as providing data for linguistic studies of generics and their semantics. GenericsKB is available at this https URL.
102. An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction [PDF] 返回目录
Bhargavi Paranjape, Mandar Joshi, John Thickstun, Hannaneh Hajishirzi, Luke Zettlemoyer
Abstract: Decisions of complex language understanding models can be rationalized by limiting their inputs to a relevant subsequence of the original text. A rationale should be as concise as possible without significantly degrading task performance, but this balance can be difficult to achieve in practice. In this paper, we show that it is possible to better manage this trade-off by optimizing a bound on the Information Bottleneck (IB) objective. Our fully unsupervised approach jointly learns an explainer that predicts sparse binary masks over sentences, and an end-task predictor that considers only the extracted rationale. Using IB, we derive a learning objective that allows direct control of mask sparsity levels through a tunable sparse prior. Experiments on ERASER benchmark tasks demonstrate significant gains over norm-minimization techniques for both task performance and agreement with human rationales. Furthermore, we find that in the semi-supervised setting, a modest amount of gold rationales (25% of training examples) closes the gap with a model that uses the full input. Code: this https URL
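The conciseness control described above can be sketched as a loss with two parts: the end-task loss on predictions made from the extracted rationale, plus a KL term that pulls the explainer's per-sentence keep probabilities toward a sparse Bernoulli prior pi. This toy PyTorch version fixes shapes and module choices by assumption; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ib_rationale_loss(sent_logits, task_logits, labels, pi=0.2, beta=1.0):
    """sent_logits: (batch, n_sent) explainer scores over sentences;
    task_logits: (batch, n_classes) predictions made from the masked input;
    pi: prior keep-rate controlling conciseness; beta: IB trade-off weight."""
    eps = 1e-8
    q = torch.sigmoid(sent_logits)  # posterior keep probabilities per sentence
    # KL( Bernoulli(q) || Bernoulli(pi) ), the sparsity pressure
    kl = q * torch.log((q + eps) / pi) + (1 - q) * torch.log((1 - q + eps) / (1 - pi))
    task_loss = F.cross_entropy(task_logits, labels)  # end-task predictor loss
    return task_loss + beta * kl.mean()

if __name__ == "__main__":
    sent_logits = torch.randn(4, 10)   # 4 documents, 10 sentences each
    task_logits = torch.randn(4, 3)    # 3-way end task
    labels = torch.randint(0, 3, (4,))
    print(ib_rationale_loss(sent_logits, task_logits, labels).item())
```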
103. Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates [PDF] 返回目录
Katherine A. Keith, David Jensen, Brendan O'Connor
Abstract: Many applications of computational social science aim to infer causal conclusions from non-experimental data. Such observational data often contains confounders, variables that influence both potential causes and potential effects. Unmeasured or latent confounders can bias causal estimates, and this has motivated interest in measuring potential confounders from observed text. For example, an individual's entire history of social media posts or the content of a news article could provide a rich measurement of multiple confounders. Yet, methods and applications for this problem are scattered across different communities and evaluation practices are inconsistent. This review is the first to gather and categorize these examples and provide a guide to data-processing and evaluation decisions. Despite increased attention on adjusting for confounding using text, there are still many open problems, which we highlight in this paper.
104. Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering [PDF] 返回目录
Yanlin Feng, Xinyue Chen, Bill Yuchen Lin, Peifeng Wang, Jun Yan, Xiang Ren
Abstract: While fine-tuning pre-trained language models (PTLMs) has yielded strong results on a range of question answering (QA) benchmarks, these methods still suffer in cases where external knowledge is needed to infer the right answer. Existing work on augmenting QA models with external knowledge (e.g., knowledge graphs) either struggles to model multi-hop relations efficiently, or lacks transparency into the model's prediction rationale. In this paper, we propose a novel knowledge-aware approach that equips PTLMs with a multi-hop relational reasoning module, named multi-hop graph relation networks (MHGRN). It performs multi-hop, multi-relational reasoning over subgraphs extracted from external knowledge graphs. The proposed reasoning module unifies path-based reasoning methods and graph neural networks to achieve better interpretability and scalability. We also empirically show its effectiveness and scalability on the CommonsenseQA and OpenbookQA datasets, and interpret its behaviors with case studies. In particular, MHGRN achieves state-of-the-art performance (76.5% accuracy) on the official CommonsenseQA test set.
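A heavily simplified sketch of the multi-hop, multi-relational message passing at the heart of such a module: relation-specific linear maps, K propagation steps, and residual updates. The real MHGRN additionally scores whole relational paths and uses attention, which this sketch omits.

```python
import torch
import torch.nn as nn

class MultiHopRelNet(nn.Module):
    def __init__(self, dim: int, n_relations: int, k_hops: int = 3):
        super().__init__()
        # One linear map per relation type; messages are typed by relation.
        self.rel = nn.ModuleList(nn.Linear(dim, dim, bias=False)
                                 for _ in range(n_relations))
        self.k_hops = k_hops

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """h: (n_nodes, dim) node features; adj: (n_relations, n_nodes, n_nodes)
        0/1 adjacency, one slice per relation."""
        for _ in range(self.k_hops):
            msg = sum(a @ w(h) for a, w in zip(adj, self.rel))
            h = torch.relu(h + msg)  # residual update per hop
        return h

if __name__ == "__main__":
    h = torch.randn(5, 32)
    adj = torch.randint(0, 2, (3, 5, 5)).float()
    print(MultiHopRelNet(dim=32, n_relations=3)(h, adj).shape)  # (5, 32)
```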
105. Syntactic Question Abstraction and Retrieval for Data-Scarce Semantic Parsing [PDF] 返回目录
Wonseok Hwang, Jinyeong Yim, Seunghyun Park, Minjoon Seo
Abstract: Deep learning approaches to semantic parsing require a large amount of labeled data, but annotating complex logical forms is costly. Here, we propose Syntactic Question Abstraction and Retrieval (SQAR), a method to build a neural semantic parser that translates a natural language (NL) query to a SQL logical form (LF) with fewer than 1,000 annotated examples. SQAR first retrieves a logical pattern from the training data by computing the similarity between NL queries, and then grounds lexical information in the retrieved pattern to generate the final LF. We validate SQAR by training models on various small subsets of the WikiSQL training data, achieving up to 4.9% higher LF accuracy compared to the previous state-of-the-art models on the WikiSQL test set. We also show that by using query similarity to retrieve logical patterns, SQAR can leverage a paraphrasing dataset, achieving up to 5.9% higher LF accuracy compared to the case where SQAR is trained using only WikiSQL data. In contrast to a simple pattern classification approach, SQAR can generate unseen logical patterns upon the addition of new examples without re-training the model. We also discuss an ideal way to create cost-efficient and robust training datasets when the data distribution can be approximated under a data-hungry setting.
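The retrieve-then-ground recipe can be sketched as nearest-neighbor search over training queries followed by reuse of the retrieved SQL pattern. The bag-of-words cosine similarity below is a placeholder assumption; SQAR uses learned query representations and a separate grounding step to fill the pattern's lexical slots.

```python
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_pattern(query: str, train_pairs):
    """train_pairs: list of (nl_query, sql_pattern); returns best pattern."""
    return max(train_pairs, key=lambda p: bow_cosine(query, p[0]))[1]

if __name__ == "__main__":
    print(retrieve_pattern(
        "how many players are older than 30",
        [("how many teams are in the league", "SELECT COUNT(*) FROM {table}"),
         ("what is the name of the tallest player",
          "SELECT {col} FROM {table} ORDER BY {col2} DESC LIMIT 1")],
    ))
```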
106. Spatial Dependency Parsing for 2D Document Understanding [PDF] 返回目录
Wonseok Hwang, Jinyeong Yim, Seunghyun Park, Sohee Yang, Minjoon Seo
Abstract: Information Extraction (IE) for document images is often approached as a BIO tagging problem, where the model sequentially goes through and classifies each recognized input token into one of the information categories. However, such a problem setup has two inherent limitations: (1) it can only extract a flat list of information, and (2) it assumes that the input data is serialized, often by a simple rule-based script. Nevertheless, real-world documents often contain hierarchical information in the form of two-dimensional language data in which the serialization can be highly non-trivial. To tackle these issues, we propose SPADE♠ (SPatial DEpendency parser), an end-to-end spatial dependency parser that is serializer-free and capable of modeling an arbitrary number of information layers, making it suitable for parsing structure-rich documents such as receipts and multimodal documents such as name cards. We show that SPADE♠ outperforms the previous BIO tagging-based approach on the name card parsing task and achieves comparable performance on the receipt parsing task. In particular, when the receipt images lie on a non-flat manifold representing the physical distortion of receipt paper in the real world, SPADE♠ outperforms the tagging-based method by a large margin of 25.8%, highlighting its strong performance on spatially complex documents.
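One way to read "serializer-free" is that every token carries its own 2D coordinates, so dependencies can be scored pairwise without first fixing a reading order. The bilinear-style scorer below is an illustrative assumption, not SPADE's exact parameterization.

```python
import torch
import torch.nn as nn

class SpatialDepScorer(nn.Module):
    """Scores directed dependencies between tokens placed on a 2D page."""

    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim + 2, dim)  # token features + (x, y) center
        self.dep = nn.Linear(dim + 2, dim)

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        """feats: (n, dim) token embeddings; coords: (n, 2) normalized centers.
        Returns an (n, n) matrix where score[i, j] reads 'j depends on i'."""
        x = torch.cat([feats, coords], dim=-1)
        return self.head(x) @ self.dep(x).t()

if __name__ == "__main__":
    scorer = SpatialDepScorer(dim=16)
    print(scorer(torch.randn(7, 16), torch.rand(7, 2)).shape)  # (7, 7)
```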
107. A Joint Framework for Inductive Representation Learning and Explainable Reasoning in Knowledge Graphs [PDF] 返回目录
Rajarshi Bhowmik, Gerard de Melo
Abstract: Despite their large-scale coverage, existing cross-domain knowledge graphs invariably suffer from inherent incompleteness and sparsity, necessitating link prediction that requires inferring a target entity, given a source entity and a query relation. Recent approaches can broadly be classified into two categories: embedding-based approaches and path-based approaches. In contrast to embedding-based approaches, which operate in an uninterpretable latent semantic vector space of entities and relations, path-based approaches operate in the symbolic space, making the inference process explainable. However, traditionally, these approaches are studied with static snapshots of the knowledge graphs, severely restricting their applicability for dynamic knowledge graphs with newly emerging entities. To overcome this issue, we propose an inductive representation learning framework that is able to learn representations of previously unseen entities. Our method finds reasoning paths between source and target entities, thereby making the link prediction for unseen entities interpretable and providing support evidence for the inferred link.
108. We Need to Talk About Random Splits [PDF] 返回目录
Anders Søgaard, Sebastian Ebert, Joost Bastings, Katja Filippova
Abstract: Gorman and Bedrick (2019) recently argued for using random splits rather than standard splits in NLP experiments. We argue that random splits, like standard splits, lead to overly optimistic performance estimates. In some cases, even worst-case splits under-estimate the error observed on new samples of in-domain data, i.e., the data that models should minimally generalize to at test time. This proves wrong the common conjecture that bias can be corrected for by re-weighting data (Shimodaira, 2000; Shah et al., 2020). Instead of using multiple random splits, we propose that future benchmarks instead include multiple, independent test sets.
109. Using Noisy Self-Reports to Predict Twitter User Demographics [PDF] 返回目录
Zach Wood-Doughty, Paiheng Xu, Xiao Liu, Mark Dredze
Abstract: Computational social science studies often contextualize content analysis within standard demographics. Since demographics are unavailable on many social media platforms (e.g. Twitter) numerous studies have inferred demographics automatically. Despite many studies presenting proof of concept inference of race and ethnicity, training of practical systems remains elusive since there are few annotated datasets. Existing datasets are small, inaccurate, or fail to cover the four most common racial and ethnic groups in the United States. We present a method to identify self-reports of race and ethnicity from Twitter profile descriptions. Despite errors inherent in automated supervision, we produce models with good performance when measured on gold standard self-report survey data. The result is a reproducible method for creating large-scale training resources for race and ethnicity.
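A toy sketch of the self-report mining step: high-precision templates over profile descriptions. The single pattern and tiny category list below are purely illustrative and far narrower than the paper's annotation pipeline.

```python
import re

# One high-precision template; real pipelines use many variants plus
# downstream filtering and validation against survey data.
SELF_REPORT = re.compile(
    r"\bI\s*(?:am|'m)\s+(?:an?\s+)?(black|white|asian|latinx|latina|latino|hispanic)\b",
    re.IGNORECASE,
)

def extract_self_report(profile_description: str):
    m = SELF_REPORT.search(profile_description)
    return m.group(1).lower() if m else None

if __name__ == "__main__":
    print(extract_self_report("Proud dad. I'm a Black engineer."))  # black
    print(extract_self_report("Engineer. Coffee first."))           # None
```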
110. From Zero to Hero: On the Limitations of Zero-Shot Cross-Lingual Transfer with Multilingual Transformers [PDF] 返回目录
Anne Lauscher, Vinit Ravishankar, Ivan Vulić, Goran Glavaš
Abstract: Massively multilingual transformers pretrained with language modeling objectives (e.g., mBERT, XLM-R) have become a de facto default transfer paradigm for zero-shot cross-lingual transfer in NLP, offering unmatched transfer performance. Current downstream evaluations, however, verify their efficacy predominantly in transfer settings involving languages with sufficient amounts of pretraining data, and with lexically and typologically close languages. In this work, we analyze their limitations and show that cross-lingual transfer via massively multilingual transformers, much like transfer via cross-lingual word embeddings, is substantially less effective in resource-lean scenarios and for distant languages. Our experiments, encompassing three lower-level tasks (POS tagging, dependency parsing, NER), as well as two high-level semantic tasks (NLI, QA), empirically correlate transfer performance with linguistic similarity between the source and target languages, but also with the size of pretraining corpora of target languages. We also demonstrate a surprising effectiveness of inexpensive few-shot transfer (i.e., fine-tuning on a few target-language instances after fine-tuning in the source) across the board. This suggests that additional research efforts should be invested to reach beyond the limiting zero-shot conditions.
111. KLEJ: Comprehensive Benchmark for Polish Language Understanding [PDF] 返回目录
Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik
Abstract: In recent years, a series of Transformer-based models unlocked major improvements in general natural language understanding (NLU) tasks. Such a fast pace of research would not be possible without general NLU benchmarks, which allow for a fair comparison of the proposed methods. However, such benchmarks are available only for a handful of languages. To alleviate this issue, we introduce a comprehensive multi-task benchmark for the Polish language understanding, accompanied by an online leaderboard. It consists of a diverse set of tasks, adopted from existing datasets for named entity recognition, question-answering, textual entailment, and others. We also introduce a new sentiment analysis task for the e-commerce domain, named Allegro Reviews (AR). To ensure a common evaluation scheme and promote models that generalize to different NLU tasks, the benchmark includes datasets from varying domains and applications. Additionally, we release HerBERT, a Transformer-based model trained specifically for the Polish language, which has the best average performance and obtains the best results for three out of nine tasks. Finally, we provide an extensive evaluation, including several standard baselines and recently proposed, multilingual Transformer-based models.
112. Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work? [PDF] 返回目录
Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, Samuel R. Bowman
Abstract: While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations. We further evaluate all trained models with 25 probing tasks meant to reveal the specific skills that drive transfer. We observe that intermediate tasks requiring high-level inference and reasoning abilities tend to work best. We also observe that target task performance is strongly correlated with higher-level abilities such as coreference resolution. However, we fail to observe more granular correlations between probing and target task performance, highlighting the need for further work on broad-coverage probing benchmarks. We also observe evidence that the forgetting of knowledge learned during pretraining may limit our analysis, highlighting the need for further work on transfer learning methods in these settings.
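Schematically, the recipe under study is two rounds of fine-tuning with a shared encoder: first on a data-rich intermediate task, then on the target task with a fresh head. In the sketch below, `encoder`, the heads, and the loaders are placeholders, not a specific library API.

```python
import torch

def finetune(encoder, head, loader, epochs: int = 3, lr: float = 2e-5):
    """One round of supervised fine-tuning; returns the updated modules."""
    opt = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(head(encoder(x)), y)
            loss.backward()
            opt.step()
    return encoder, head

# Stage 1: data-rich intermediate task; Stage 2: target task, same encoder.
# encoder, _ = finetune(encoder, intermediate_head, intermediate_loader)
# encoder, target_head = finetune(encoder, target_head, target_loader)
```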
113. Predicting Declension Class from Form and Meaning [PDF] 返回目录
Adina Williams, Tiago Pimentel, Arya D. McCarthy, Hagen Blix, Eleanor Chodroff, Ryan Cotterell
Abstract: The noun lexica of many natural languages are divided into several declension classes with characteristic morphological properties. Class membership is far from deterministic, but the phonological form of a noun and/or its meaning can often provide imperfect clues. Here, we investigate the strength of those clues. More specifically, we operationalize this by measuring how much information, in bits, we can glean about declension class from knowing the form and/or meaning of nouns. We know that form and meaning are often also indicative of grammatical gender---which, as we quantitatively verify, can itself share information with declension class---so we also control for gender. We find for two Indo-European languages (Czech and German) that form and meaning respectively share significant amounts of information with class (and contribute additional information above and beyond gender). The three-way interaction between class, form, and meaning (given gender) is also significant. Our study is important for two reasons: First, we introduce a new method that provides additional quantitative support for a classic linguistic finding that form and meaning are relevant for the classification of nouns into declensions. Secondly, we show not only that individual declensions classes vary in the strength of their clues within a language, but also that these variations themselves vary across languages. The code is publicly available at this https URL.
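The quantity being measured is the mutual information, in bits, between a noun's declension class C and a feature of its form or meaning F, i.e. I(C; F) = H(C) − H(C | F). A plug-in estimate from co-occurrence counts can be sketched as follows; the data here are toy, not the paper's corpora.

```python
from collections import Counter
import math

def mutual_information(pairs) -> float:
    """pairs: iterable of (declension_class, feature) observations.
    Returns the plug-in estimate of I(C; F) in bits."""
    pairs = list(pairs)
    n = len(pairs)
    c_counts = Counter(c for c, _ in pairs)
    f_counts = Counter(f for _, f in pairs)
    joint = Counter(pairs)
    return sum(
        (n_cf / n) * math.log2((n_cf / n) / ((c_counts[c] / n) * (f_counts[f] / n)))
        for (c, f), n_cf in joint.items()
    )

if __name__ == "__main__":
    toy = [("I", "-a"), ("I", "-a"), ("II", "-us"), ("II", "-us"), ("II", "-a")]
    print(f"I(C; F) = {mutual_information(toy):.3f} bits")  # ~0.42 bits here
```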
114. Minimally Supervised Categorization of Text with Metadata [PDF] 返回目录
Yu Zhang, Yu Meng, Jiaxin Huang, Frank F. Xu, Xuan Wang, Jiawei Han
Abstract: Document categorization, which aims to assign a topic label to each document, plays a fundamental role in a wide variety of applications. Despite the success of existing studies in conventional supervised document classification, they are less concerned with two real problems: (1) *the presence of metadata*: in many domains, text is accompanied by various additional information such as authors and tags. Such metadata serve as compelling topic indicators and should be leveraged into the categorization framework; (2) *label scarcity*: labeled training samples are expensive to obtain in some cases, where categorization needs to be performed using only a small set of annotated data. In recognition of these two challenges, we propose MetaCat, a minimally supervised framework to categorize text with metadata. Specifically, we develop a generative process describing the relationships between words, documents, labels, and metadata. Guided by the generative model, we embed text and metadata into the same semantic space to encode heterogeneous signals. Then, based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity. We conduct a thorough evaluation on a wide range of datasets. Experimental results prove the effectiveness of MetaCat over many competitive baselines.
115. Probing Text Models for Common Ground with Visual Representations [PDF] 返回目录
Gabriel Ilharco, Rowan Zellers, Ali Farhadi, Hannaneh Hajishirzi
Abstract: Vision, as a central component of human perception, plays a fundamental role in shaping natural language. To better understand how text models are connected to our visual perceptions, we propose a method for examining the similarities between neural representations extracted from words in text and objects in images. Our approach uses a lightweight probing model that learns to map language representations of concrete words to the visual domain. We find that representations from models trained on purely textual data, such as BERT, can be nontrivially mapped to those of a vision model. Such mappings generalize to object categories that were never seen by the probe during training, unlike mappings learned from permuted or random representations. Moreover, we find that the context surrounding objects in sentences greatly impacts performance. Finally, we show that humans significantly outperform all examined models, suggesting considerable room for improvement in representation learning and grounding.
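The "lightweight probing model" can be as simple as a ridge-regularized linear map from word embeddings to visual-object embeddings, fit in closed form. Dimensions and the random data below are illustrative only; the paper evaluates such probes on held-out object categories.

```python
import numpy as np

def fit_linear_probe(T: np.ndarray, V: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """T: (n, d_text) word vectors; V: (n, d_vis) visual vectors.
    Returns W minimizing ||T W - V||^2 + lam * ||W||^2 (ridge, closed form)."""
    d = T.shape[1]
    return np.linalg.solve(T.T @ T + lam * np.eye(d), T.T @ V)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, V = rng.normal(size=(100, 16)), rng.normal(size=(100, 8))
    W = fit_linear_probe(T, V)
    print(((T @ W - V) ** 2).mean())  # training MSE of the probe
```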
116. Multi-Dimensional Gender Bias Classification [PDF] 返回目录
Emily Dinan, Angela Fan, Ledell Wu, Jason Weston, Douwe Kiela, Adina Williams
Abstract: Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shed light on offensive language in terms of genderedness.
117. A Controllable Model of Grounded Response Generation [PDF] 返回目录
Zeqiu Wu, Michel Galley, Chris Brockett, Yizhe Zhang, Xiang Gao, Chris Quirk, Rik Koncel-Kedziorski, Jianfeng Gao, Hannaneh Hajishirzi, Mari Ostendorf, Bill Dolan
Abstract: Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process. This control is essential to ensure that users' semantic intents are satisfied and to impose a degree of specificity on generated outputs. Attempts to boost informativeness alone come at the expense of factual accuracy, as attested by GPT-2's propensity to "hallucinate" facts. While this may be mitigated by access to background knowledge, there is scant guarantee of relevance and informativeness in generated responses. We propose a framework that we call controllable grounded response generation (CGRG), in which lexical control phrases are either provided by a user or automatically extracted by a content planner from dialogue context and grounding knowledge. Quantitative and qualitative results show that, using this framework, a GPT-2 based model trained on a conversation-like Reddit dataset outperforms strong generation baselines.
118. Learning an Unreferenced Metric for Online Dialogue Evaluation [PDF] 返回目录
Koustuv Sinha, Prasanna Parthasarathi, Jasmine Wang, Ryan Lowe, William L. Hamilton, Joelle Pineau
Abstract: Evaluating the quality of a dialogue interaction between two agents is a difficult task, especially in open-domain chit-chat style dialogue. There have been recent efforts to develop automatic dialogue evaluation metrics, but most of them do not generalize to unseen datasets and/or need a human-generated reference response during inference, making it infeasible for online evaluation. Here, we propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances, and leverages the temporal transitions that exist between them. We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
119. Multi-scale Transformer Language Models [PDF] 返回目录
Sandeep Subramanian, Ronan Collobert, Marc'Aurelio Ranzato, Y-Lan Boureau
Abstract: We investigate multi-scale transformer language models that learn representations of text at multiple scales, and present three different architectures that have an inductive bias to handle the hierarchical nature of language. Experiments on large-scale language modeling benchmarks empirically demonstrate favorable likelihood vs memory footprint trade-offs, e.g. we show that it is possible to train a hierarchical variant with 30 layers that has 23% smaller memory footprint and better perplexity, compared to a vanilla transformer with less than half the number of layers, on the Toronto BookCorpus. We analyze the advantages of learned representations at multiple scales in terms of memory footprint, compute time, and perplexity, which are particularly appealing given the quadratic scaling of transformers' run time and memory usage with respect to sequence length.
120. Evaluating Robustness to Input Perturbations for Neural Machine Translation [PDF] 返回目录
Xing Niu, Prashant Mathur, Georgiana Dinu, Yaser Al-Onaizan
Abstract: Neural Machine Translation (NMT) models are sensitive to small perturbations in the input. Robustness to such perturbations is typically measured using translation quality metrics such as BLEU on the noisy input. This paper proposes additional metrics which measure the relative degradation and changes in translation when small perturbations are added to the input. We focus on a class of models employing subword regularization to address robustness and perform extensive evaluations of these models using the robustness measures proposed. Results show that our proposed metrics reveal a clear trend of improved robustness to perturbations when subword regularization methods are used.
121. Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset [PDF] 返回目录
Xiang Yue, Bernal Jimenez Gutierrez, Huan Sun
Abstract: Machine reading comprehension has made great progress in recent years owing to large-scale annotated datasets. In the clinical domain, however, creating such datasets is quite difficult due to the domain expertise required for annotation. Recently, Pampari et al. (EMNLP'18) tackled this issue by using expert-annotated question templates and existing i2b2 annotations to create emrQA, the first large-scale dataset for question answering (QA) based on clinical notes. In this paper, we provide an in-depth analysis of this dataset and the clinical reading comprehension (CliniRC) task. From our qualitative analysis, we find that (i) emrQA answers are often incomplete, and (ii) emrQA questions are often answerable without using domain knowledge. From our quantitative experiments, surprising results include that (iii) using a small sampled subset (5%-20%), we can obtain roughly equal performance compared to the model trained on the entire dataset, (iv) this performance is close to human expert's performance, and (v) BERT models do not beat the best performing base model. Following our analysis of the emrQA, we further explore two desired aspects of CliniRC systems: the ability to utilize clinical domain knowledge and to generalize to unseen questions and contexts. We argue that both should be considered when creating future datasets.
122. Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition [PDF] 返回目录
Hu Hu, Rui Zhao, Jinyu Li, Liang Lu, Yifan Gong
Abstract: Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research due to its advantage of being capable of online streaming speech recognition. However, RNN-T training is made difficult by its huge memory requirements and complicated neural structure. A common solution to ease RNN-T training is to employ a connectionist temporal classification (CTC) model along with an RNN language model (RNNLM) to initialize the RNN-T parameters. In this work, we conversely leverage external alignments to seed the RNN-T model. Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively. Evaluated on 65,000 hours of anonymized Microsoft production data with personally identifiable information removed, our proposed methods can obtain significant improvement. In particular, the encoder pre-training solution achieved a 10% and an 8% relative word error rate reduction when compared with random initialization and the widely used CTC+RNNLM initialization strategy, respectively. Our solutions also significantly reduce the RNN-T model latency from the baseline.
123. When BERT Plays the Lottery, All Tickets Are Winning [PDF] 返回目录
Sai Prasanna, Anna Rogers, Anna Rumshisky
Abstract: Much of the recent success in NLP is due to the large Transformer-based models such as BERT (Devlin et al., 2019). However, these models have been shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis. For fine-tuned BERT, we show that (a) it is possible to find a subnetwork of elements that achieves performance comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. However, the "bad" subnetworks can be fine-tuned separately to achieve only slightly worse performance than the "good" ones, indicating that most weights in the pre-trained BERT are potentially useful. We also show that the "good" subnetworks vary considerably across GLUE tasks, opening up the possibilities to learn what knowledge BERT actually uses at inference time.
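A minimal sketch of the subnetwork setup, assuming per-head importance scores are available (random placeholders below; in practice they would be estimated from the fine-tuning task, e.g., via gradients or ablation):

```python
import numpy as np

rng = np.random.default_rng(0)
importance = rng.random((12, 12))                  # layers x heads, assumed
good_mask = importance >= np.median(importance)    # the "good" subnetwork
bad_mask = ~good_mask                              # the "bad" subnetwork
# The paper's finding: fine-tuning with either mask applied to the heads
# recovers most of the full model's performance -- all tickets are winning.
print(int(good_mask.sum()), int(bad_mask.sum()))
```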
124. POINTER: Constrained Text Generation via Insertion-based Generative Pre-training [PDF] 返回目录
Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, Bill Dolan
Abstract: Large-scale pre-trained language models, such as BERT and GPT-2, have achieved excellent performance in language representation learning and free-form text generation. However, these models cannot be directly employed to generate text under specified lexical constraints. To address this challenge, we present POINTER, a simple yet novel insertion-based approach for hard-constrained text generation. The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner. This procedure is recursively applied until a sequence is completed. The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable. Since our training objective resembles the objective of masked language modeling, BERT can be naturally utilized for initialization. We pre-train our model with the proposed progressive insertion-based objective on a 12GB Wikipedia dataset, and fine-tune it on downstream hard-constrained generation tasks. Non-autoregressive decoding yields a logarithmic time complexity during inference time. Experimental results on both News and Yelp datasets demonstrate that POINTER achieves state-of-the-art performance on constrained text generation. We intend to release the pre-trained model to facilitate future research.
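The progressive insertion procedure can be pictured as a simple loop over rounds, as in the sketch below; `propose` is a hypothetical stand-in for the learned model, whereas POINTER itself fills every gap in parallel with a BERT-initialized network.

```python
def generate(constraints, propose, max_rounds=10):
    """Grow a sentence from keyword constraints by repeated insertion."""
    tokens = list(constraints)
    for _ in range(max_rounds):
        inserted, out = False, []
        for i, tok in enumerate(tokens):
            out.append(tok)
            new = propose(tokens, i)   # token to insert after slot i, or None
            if new is not None:
                out.append(new)
                inserted = True
        tokens = out
        if not inserted:               # no slot proposed anything: done
            break
    return tokens

# Trivial stand-in "model": insert "and" once between the two keywords.
print(generate(["cats", "dogs"],
               lambda t, i: "and" if t[i:i + 2] == ["cats", "dogs"] else None))
```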
125. GoEmotions: A Dataset of Fine-Grained Emotions [PDF] 返回目录
Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, Sujith Ravi
Abstract: Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks. We introduce GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral. We demonstrate the high quality of the annotations via Principal Preserved Component Analysis. We conduct transfer learning experiments with existing emotion benchmarks to show that our dataset generalizes well to other domains and different emotion taxonomies. Our BERT-based model achieves an average F1-score of .46 across our proposed taxonomy, leaving much room for improvement.
126. A Probabilistic Generative Model for Typographical Analysis of Early Modern Printing [PDF] 返回目录
Kartik Goyal, Chris Dyer, Christopher Warren, Max G'Sell, Taylor Berg-Kirkpatrick
Abstract: We propose a deep and interpretable probabilistic generative model to analyze glyph shapes in printed Early Modern documents. We focus on clustering extracted glyph images into underlying templates in the presence of multiple confounding sources of variance. Our approach introduces a neural editor model that first generates well-understood printing phenomena like spatial perturbations from template parameters via interpretable latent variables, and then modifies the result by generating a non-interpretable latent vector responsible for inking variations, jitter, noise from the archiving process, and other unforeseen phenomena associated with Early Modern printing. Critically, by introducing an inference network whose input is restricted to the visual residual between the observation and the interpretably-modified template, we are able to control and isolate what the vector-valued latent variable captures. We show that our approach outperforms rigid interpretable clustering baselines (Ocular) and overly-flexible deep generative models (VAE) alike on the task of completely unsupervised discovery of typefaces in mixed-font documents.
127. Does Visual Self-Supervision Improve Learning of Speech Representations? [PDF] 返回目录
Abhinav Shukla, Stavros Petridis, Maja Pantic
Abstract: Self-supervised learning has attracted plenty of recent research interest. However, most works are typically unimodal and there has been limited work that studies the interaction between audio and visual modalities for self-supervised learning. This work (1) investigates visual self-supervision via face reconstruction to guide the learning of audio representations; (2) proposes two audio-only self-supervision approaches for speech representation learning; (3) shows that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features that are more robust in noisy conditions; (4) shows that self-supervised pretraining leads to a superior weight initialization, which is especially useful to prevent overfitting and lead to faster model convergence on smaller sized datasets. We evaluate our audio representations for emotion and speech recognition, achieving state of the art performance for both problems. Our results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative speech representations.
128. On Systematically Building a Controlled Natural Language for Functional Requirements [PDF] 返回目录
Alvaro Veizaga, Mauricio Alferez, Damiano Torre, Mehrdad Sabetzadeh, Lionel Briand
Abstract: [Context] Natural language (NL) is pervasive in software requirements specifications (SRSs). However, despite its popularity and widespread use, NL is highly prone to quality issues such as vagueness, ambiguity, and incompleteness. Controlled natural languages (CNLs) have been proposed as a way to prevent quality problems in requirements documents, while maintaining the flexibility to write and communicate requirements in an intuitive and universally understood manner. [Objective] In collaboration with an industrial partner from the financial domain, we systematically develop and evaluate a CNL, named Rimay, intended at helping analysts write functional requirements. [Method] We rely on Grounded Theory for building Rimay and follow well-known guidelines for conducting and reporting industrial case study research. [Results] Our main contributions are: (1) a qualitative methodology to systematically define a CNL for functional requirements; this methodology is general and applicable to information systems beyond the financial domain, (2) a CNL grammar to represent functional requirements; this grammar is derived from our experience in the financial domain, but should be applicable, possibly with adaptations, to other information-system domains, and (3) an empirical evaluation of our CNL (Rimay) through an industrial case study. Our contributions draw on 15 representative SRSs, collectively containing 3215 NL requirements statements from the financial domain. [Conclusion] Our evaluation shows that Rimay is expressive enough to capture, on average, 88% (405 out of 460) of the NL requirements statements in four previously unseen SRSs from the financial domain.
129. Gender Gap in Natural Language Processing Research: Disparities in Authorship and Citations [PDF] 返回目录
Saif M. Mohammad
Abstract: Disparities in authorship and citations across genders can have substantial adverse consequences not just on the disadvantaged gender, but also on the field of study as a whole. In this work, we examine female first author percentages and the citations to their papers in Natural Language Processing. We find that only about 29% of first authors are female and only about 25% of last authors are female. Notably, this percentage has not improved since the mid 2000s. We also show that, on average, female first authors are cited less than male first authors, even when controlling for experience and area of research. We hope that recording citation and participation gaps across demographic groups will improve awareness of gender gaps and encourage more inclusiveness and fairness in research.
130. Extracting Entities and Topics from News and Connecting Criminal Records [PDF] 返回目录
Quang Pham, Marija Stanojevic, Zoran Obradovic
Abstract: The goal of this paper is to summarize methodologies used in extracting entities and topics from a database of criminal records and from a database of newspapers. Statistical models have successfully been used in studying the topics of roughly 300,000 New York Times articles. In addition, these models have also been used to successfully analyze entities related to people, organizations, and places (D Newman, 2006). Additionally, analytical approaches, especially in hotspot mapping, were used in some studies with the aim of predicting crime locations and circumstances in the future, and those approaches have been tested quite successfully (S Chainey, 2008). Based on the two notions above, this research was performed with the intention of applying data science techniques to analyze a large amount of data, select valuable intelligence, cluster violations by their type of crime, and create a crime graph that changes through time. In this research, the task was to download criminal datasets from Kaggle and a collection of news articles from the Kaggle and EAGER project databases, and then to merge these datasets into one general dataset. The most important goal of this project was applying statistical and natural language processing methods to extract entities and topics as well as to group similar data points into correct clusters, in order to better understand public data about U.S.-related crimes.
131. Understanding and Improving Information Transfer in Multi-Task Learning [PDF] 返回目录
Sen Wu, Hongyang R. Zhang, Christopher Ré
Abstract: We investigate multi-task learning approaches that use a shared feature representation for all tasks. To better understand the transfer of task information, we study an architecture with a shared module for all tasks and a separate output module for each task. We study the theory of this setting on linear and ReLU-activated models. Our key observation is that whether or not tasks' data are well-aligned can significantly affect the performance of multi-task learning. We show that misalignment between task data can cause negative transfer (or hurt performance) and provide sufficient conditions for positive transfer. Inspired by the theoretical insights, we show that aligning tasks' embedding layers leads to performance gains for multi-task training and transfer learning on the GLUE benchmark and sentiment analysis tasks; for example, we obtain a 2.35% GLUE score average improvement on 5 GLUE tasks over BERT-LARGE using our alignment method. We also design an SVD-based task reweighting scheme and show that it improves the robustness of multi-task training on a multi-label image dataset.
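The studied setting, one shared module feeding a separate output module per task, fits in a few lines; the shapes, ReLU activation, and the two example tasks below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.standard_normal((64, 32))            # shared module
heads = {"task_a": rng.standard_normal((32, 3)),    # 3-way classifier head
         "task_b": rng.standard_normal((32, 2))}    # binary classifier head

def forward(x, task):
    h = np.maximum(x @ W_shared, 0.0)   # shared ReLU representation
    return h @ heads[task]              # task-specific output module

x = rng.standard_normal((5, 64))        # a batch of 5 examples
print(forward(x, "task_a").shape, forward(x, "task_b").shape)
```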
132. Quantifying Attention Flow in Transformers [PDF] 返回目录
Samira Abnar, Willem Zuidema
Abstract: In the Transformer model, "self-attention" combines information from attended embeddings into the representation of the focal embedding in the next layer. Thus, across layers of the Transformer, information originating from different tokens gets increasingly mixed. This makes attention weights unreliable as explanation probes. In this paper, we consider the problem of quantifying this flow of information through self-attention. We propose two methods for approximating the attention to input tokens given attention weights, attention rollout and attention flow, as post hoc methods when we use attention weights as the relative relevance of the input tokens. We show that these methods give complementary views on the flow of information, and compared to raw attention, both yield higher correlations with importance scores of input tokens obtained using an ablation method and input gradients.
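Attention rollout, as described, mixes each layer's attention with the identity to account for residual connections and then composes the layers by matrix multiplication. A small sketch, assuming head-averaged attention matrices are given:

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of (seq_len, seq_len) head-averaged matrices."""
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for A in attentions:                               # bottom to top layer
        A_res = 0.5 * A + 0.5 * np.eye(n)              # add residual flow
        A_res /= A_res.sum(axis=-1, keepdims=True)     # re-normalize rows
        rollout = A_res @ rollout                      # compose with layers below
    return rollout                                     # token-to-token flow

rng = np.random.default_rng(0)
layers = [rng.dirichlet(np.ones(6), size=6) for _ in range(4)]  # toy attention
print(attention_rollout(layers).round(2))
```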
133. Examining Citations of Natural Language Processing Literature [PDF] 返回目录
Saif M. Mohammad
Abstract: We extracted information from the ACL Anthology (AA) and Google Scholar (GS) to examine trends in citations of NLP papers. We explore questions such as: how well cited are papers of different types (journal articles, conference papers, demo papers, etc.)? How well cited are papers from different areas within NLP? Notably, we show that only about 56% of the papers in AA are cited ten or more times. CL Journal has the most cited papers, but its citation dominance has lessened in recent years. On average, long papers get almost three times as many citations as short papers; and papers on sentiment classification, anaphora resolution, and entity recognition have the highest median citations. The analyses presented here, and the associated dataset of NLP papers mapped to citations, have a number of uses including: understanding how the field is growing and quantifying the impact of different types of papers.
134. SEEK: Segmented Embedding of Knowledge Graphs [PDF] 返回目录
Wentao Xu, Shun Zheng, Liang He, Bin Shao, Jian Yin, Tie-Yan Liu
Abstract: In recent years, knowledge graph embedding has become a hot research topic in artificial intelligence and plays increasingly vital roles in various downstream applications, such as recommendation and question answering. However, existing methods for knowledge graph embedding cannot make a proper trade-off between model complexity and model expressiveness, which leaves them still far from satisfactory. To mitigate this problem, we propose a lightweight modeling framework that can achieve highly competitive relational expressiveness without increasing the model complexity. Our framework focuses on the design of scoring functions and highlights two critical characteristics: 1) facilitating sufficient feature interactions; 2) preserving both the symmetry and antisymmetry properties of relations. It is noteworthy that, owing to the general and elegant design of the scoring functions, our framework can incorporate many famous existing methods as special cases. Moreover, extensive experiments on public benchmarks demonstrate the efficiency and effectiveness of our framework. Source codes and data can be found at \url{this https URL}.
135. Enhancing Text-based Reinforcement Learning Agents with Commonsense Knowledge [PDF] 返回目录
Keerthiram Murugesan, Mattia Atzeni, Pushkar Shukla, Mrinmaya Sachan, Pavan Kapanipathi, Kartik Talamadupula
Abstract: In this paper, we consider the recent trend of evaluating progress on reinforcement learning technology by using text-based environments and games as evaluation environments. This reliance on text brings advances in natural language processing into the ambit of these agents, with a recurring thread being the use of external knowledge to mimic and better human-level performance. We present one such instantiation of agents that use commonsense knowledge from ConceptNet to show promising performance on two text-based environments.
136. Stochastic Neighbor Embedding of Multimodal Relational Data for Image-Text Simultaneous Visualization [PDF] 返回目录
Morihiro Mizutani, Akifumi Okuno, Geewook Kim, Hidetoshi Shimodaira
Abstract: Multimodal relational data analysis has become of increasing importance in recent years, for exploring across different domains of data, such as images and their text tags obtained from social networking services (e.g., Flickr). A variety of data analysis methods have been developed for visualization; to give an example, t-Stochastic Neighbor Embedding (t-SNE) computes low-dimensional feature vectors so that their similarities keep those of the observed data vectors. However, t-SNE is designed only for a single domain of data but not for multimodal data; this paper aims at visualizing multimodal relational data consisting of data vectors in multiple domains with relations across these vectors. By extending t-SNE, we herein propose Multimodal Relational Stochastic Neighbor Embedding (MR-SNE), that (1) first computes augmented relations, where we observe the relations across domains and compute those within each of domains via the observed data vectors, and (2) jointly embeds the augmented relations to a low-dimensional space. Through visualization of Flickr and Animal with Attributes 2 datasets, proposed MR-SNE is compared with other graph embedding-based approaches; MR-SNE demonstrates the promising performance.
137. Learning Collaborative Agents with Rule Guidance for Knowledge Graph Reasoning [PDF] 返回目录
Deren Lei, Gangrong Jiang, Xiaotao Gu, Kexuan Sun, Yuning Mao, Xiang Ren
Abstract: Walk-based models have shown their unique advantages in knowledge graph (KG) reasoning by achieving state-of-the-art performance while allowing for explicit visualization of the decision sequence. However, the sparse reward signals offered by the KG during a traversal are often insufficient to guide a sophisticated reinforcement learning (RL) model. An alternate approach to KG reasoning is using traditional symbolic methods (e.g., rule induction), which achieve high precision without learning but are hard to generalize due to the limitation of symbolic representation. In this paper, we propose to fuse these two paradigms to get the best of both worlds. Our method leverages high-quality rules generated by symbolic-based methods to provide reward supervision for walk-based agents. Due to the structure of symbolic rules with their entity variables, we can separate our walk-based agent into two sub-agents thus allowing for additional efficiency. Experiments on public datasets demonstrate that walk-based models can benefit from rule guidance significantly.
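One way to picture the fusion is as reward shaping: the sparse terminal reward from the KG is augmented with a bonus when the agent's walk matches an induced rule. The path-matching test below is a deliberately crude, hypothetical stand-in for the paper's mechanism.

```python
def shaped_reward(path, hit_answer, rules, bonus=0.5):
    """Sparse KG reward plus a bonus for walks that follow a known rule."""
    base = 1.0 if hit_answer else 0.0
    return base + (bonus if tuple(path) in rules else 0.0)

# Assumed rule: nationality can follow from born_in composed with located_in.
rules = {("born_in", "located_in")}
print(shaped_reward(["born_in", "located_in"], hit_answer=True, rules=rules))
```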
138. Low-Dimensional Hyperbolic Knowledge Graph Embeddings [PDF] 返回目录
Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, Christopher Ré
Abstract: Knowledge graph (KG) embeddings learn low-dimensional representations of entities and relations to predict missing facts. KGs often exhibit hierarchical and logical patterns which must be preserved in the embedding space. For hierarchical data, hyperbolic embedding methods have shown promise for high-fidelity and parsimonious representations. However, existing hyperbolic embedding methods do not account for the rich logical patterns in KGs. In this work, we introduce a class of hyperbolic KG embedding models that simultaneously capture hierarchical and logical patterns. Our approach combines hyperbolic reflections and rotations with attention to model complex relational patterns. Experimental results on standard KG benchmarks show that our method improves over previous Euclidean- and hyperbolic-based efforts by up to 6.1% in mean reciprocal rank (MRR) in low dimensions. Furthermore, we observe that different geometric transformations capture different types of relations while attention-based transformations generalize to multiple relations. In high dimensions, our approach yields new state-of-the-art MRRs of 49.6% on WN18RR and 57.7% on YAGO3-10.
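The hyperbolic ingredient can be sketched with the Poincaré-ball distance that such models use to score how close a (relation-transformed) head embedding lands to the tail; the reflections, rotations, and attention of the paper's models are omitted here.

```python
import numpy as np

def poincare_distance(u, v):
    """Distance in the Poincare ball (all points must have norm < 1)."""
    duv = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u * u)) * (1.0 - np.sum(v * v))
    return float(np.arccosh(1.0 + 2.0 * duv / denom))

head = np.array([0.1, 0.2])            # toy 2-d entity embeddings
tail = np.array([0.3, -0.1])
print(poincare_distance(head, tail))   # smaller distance -> more plausible
```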