
[arXiv Papers] Computation and Language 2020-10-08

Contents

1. Understanding Clinical Trial Reports: Extracting Medical Entities and Their Relations [PDF] Abstract
2. Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion [PDF] Abstract
3. Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic Parsing [PDF] Abstract
4. Galileo at SemEval-2020 Task 12: Multi-lingual Learning for Offensive Language Identification using Pre-trained Language Models [PDF] Abstract
5. Exploring the Role of Argument Structure in Online Debate Persuasion [PDF] Abstract
6. What Can We Learn from Collective Human Opinions on Natural Language Inference Data? [PDF] Abstract
7. Inductive Entity Representations from Text via Link Prediction [PDF] Abstract
8. TeaForN: Teacher-Forcing with N-grams [PDF] Abstract
9. Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation [PDF] Abstract
10. ELMo and BERT in semantic change detection for Russian [PDF] Abstract
11. Learning a Cost-Effective Annotation Policy for Question Answering [PDF] Abstract
12. "I'd rather just go to bed": Understanding Indirect Answers [PDF] Abstract
13. Analogies minus analogy test: measuring regularities in word embeddings [PDF] Abstract
14. WER we are and WER we think we are [PDF] Abstract
15. Cross-lingual Extended Named Entity Classification of Wikipedia Articles [PDF] Abstract
16. Dual Reconstruction: a Unifying Objective for Semi-Supervised Neural Machine Translation [PDF] Abstract
17. Why do you think that? Exploring Faithful Sentence-Level Rationales Without Supervision [PDF] Abstract
18. Toward Stance-based Personas for Opinionated Dialogues [PDF] Abstract
19. Improving QA Generalization by Concurrent Modeling of Multiple Biases [PDF] Abstract
20. COMETA: A Corpus for Medical Entity Linking in the Social Media [PDF] Abstract
21. ZEST: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization [PDF] Abstract
22. Learning to Explain: Datasets and Models for Identifying Valid Reasoning Chains in Multihop Question-Answering [PDF] Abstract
23. Narrative Text Generation with a Latent Discrete Plan [PDF] Abstract
24. Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction [PDF] Abstract
25. Exploring and Evaluating Attributes, Values, and Structures for Entity Alignment [PDF] Abstract
26. Transformer-GCRF: Recovering Chinese Dropped Pronouns with General Conditional Random Fields [PDF] Abstract
27. Unsupervised Evaluation for Question Answering with Transformers [PDF] Abstract
28. Like hiking? You probably enjoy nature: Persona-grounded Dialog with Commonsense Expansions [PDF] Abstract
29. Rank and run-time aware compression of NLP Applications [PDF] Abstract
30. Theedhum Nandrum@Dravidian-CodeMix-FIRE2020: A Sentiment Polarity Classifier for YouTube Comments with Code-switching between Tamil, Malayalam and English [PDF] Abstract
31. Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages [PDF] Abstract
32. Multilingual Knowledge Graph Completion via Ensemble Knowledge Transfer [PDF] Abstract
33. Knowledge-enriched, Type-constrained and Grammar-guided Question Generation over Knowledge Bases [PDF] Abstract
34. A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction [PDF] Abstract
35. Fortifying Toxic Speech Detectors Against Veiled Toxicity [PDF] Abstract
36. OpenIE6: Iterative Grid Labeling and Coordination Analysis for Open Information Extraction [PDF] Abstract
37. Unsupervised Parsing via Constituency Tests [PDF] Abstract
38. Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [PDF] Abstract
39. Improving Context Modeling in Neural Topic Segmentation [PDF] Abstract
40. A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions [PDF] Abstract
41. VCDM: Leveraging Variational Bi-encoding and Deep Contextualized Word Representations for Improved Definition Modeling [PDF] Abstract
42. DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling [PDF] Abstract
43. Knowledge-aware Method for Confusing Charge Prediction [PDF] Abstract
44. WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization [PDF] Abstract
45. Is the Best Better? Bayesian Statistical Model Comparison for Natural Language Processing [PDF] Abstract
46. Beyond [CLS] through Ranking by Generation [PDF] Abstract
47. RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text [PDF] Abstract
48. Anubhuti -- An annotated dataset for emotional analysis of Bengali short stories [PDF] Abstract
49. A Survey on Recognizing Textual Entailment as an NLP Evaluation [PDF] Abstract
50. Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers [PDF] Abstract
51. Resource-Enhanced Neural Model for Event Argument Extraction [PDF] Abstract
52. On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment [PDF] Abstract
53. Exploring BERT's Sensitivity to Lexical Cues using Tests from Semantic Priming [PDF] Abstract
54. GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [PDF] Abstract
55. Fact Extraction and VERification -- The FEVER case: An Overview [PDF] Abstract
56. Compositional Demographic Word Embeddings [PDF] Abstract
57. Plug and Play Autoencoders for Conditional Text Generation [PDF] Abstract
58. Are "Undocumented Workers" the Same as "Illegal Aliens"? Disentangling Denotation and Connotation in Vector Spaces [PDF] Abstract
59. Supervised Seeded Iterated Learning for Interactive Language Learning [PDF] Abstract
60. A Self-supervised Approach for Semantic Indexing in the Context of COVID-19 Pandemic [PDF] Abstract
61. TeMP: Temporal Message Passing for Temporal Knowledge Graph Completion [PDF] Abstract
62. Slice-Aware Neural Ranking [PDF] Abstract
63. Program Enhanced Fact Verification with Verbalization and Graph Attention Network [PDF] Abstract
64. Weakly-Supervised Feature Learning via Text and Image Matching [PDF] Abstract
65. Digital Voicing of Silent Speech [PDF] Abstract
66. Learning to Represent Image and Text with Denotation Graph [PDF] Abstract

Abstracts

1. Understanding Clinical Trial Reports: Extracting Medical Entities and Their Relations [PDF] Back to contents
  Benjamin E. Nye, Jay DeYoung, Eric Lehman, Ani Nenkova, Iain J. Marshall, Byron C. Wallace
Abstract: The best evidence concerning comparative treatment effectiveness comes from clinical trials, the results of which are reported in unstructured articles. Medical experts must manually extract information from articles to inform decision-making, which is time-consuming and expensive. Here we consider the end-to-end task of both (a) extracting treatments and outcomes from full-text articles describing clinical trials (entity identification) and, (b) inferring the reported results for the former with respect to the latter (relation extraction). We introduce new data for this task, and evaluate models that have recently achieved state-of-the-art results on similar tasks in Natural Language Processing. We then propose a new method motivated by how trial results are typically presented that outperforms these purely data-driven baselines. Finally, we run a fielded evaluation of the model with a non-profit seeking to identify existing drugs that might be re-purposed for cancer, showing the potential utility of end-to-end evidence extraction systems.

2. Probabilistic Case-based Reasoning for Open-World Knowledge Graph Completion [PDF] Back to contents
  Rajarshi Das, Ameya Godbole, Nicholas Monath, Manzil Zaheer, Andrew McCallum
Abstract: A case-based reasoning (CBR) system solves a new problem by retrieving `cases' that are similar to the given problem. If such a system can achieve high accuracy, it is appealing owing to its simplicity, interpretability, and scalability. In this paper, we demonstrate that such a system is achievable for reasoning in knowledge-bases (KBs). Our approach predicts attributes for an entity by gathering reasoning paths from similar entities in the KB. Our probabilistic model estimates the likelihood that a path is effective at answering a query about the given entity. The parameters of our model can be efficiently computed using simple path statistics and require no iterative optimization. Our model is non-parametric, growing dynamically as new entities and relations are added to the KB. On several benchmark datasets our approach significantly outperforms other rule learning approaches and performs comparably to state-of-the-art embedding-based approaches. Furthermore, we demonstrate the effectiveness of our model in an "open-world" setting where new entities arrive in an online fashion, significantly outperforming state-of-the-art approaches and nearly matching the best offline method. Code available at this https URL
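The path-statistics idea is concrete enough to sketch. Below is a minimal toy version in Python: the KB triples, the restriction to single-hop paths, and the precision estimate are illustrative assumptions, not the authors' implementation (which gathers multi-hop paths from entities similar to the query entity).

```python
from collections import defaultdict

# Toy KB: (head, relation, tail) triples -- illustrative data only.
triples = [
    ("paris", "capital_of", "france"), ("paris", "located_in", "france"),
    ("berlin", "capital_of", "germany"), ("berlin", "located_in", "germany"),
    ("rome", "located_in", "italy"), ("lyon", "located_in", "france"),
]
kb = defaultdict(lambda: defaultdict(set))
for h, r, t in triples:
    kb[h][r].add(t)

def path_precision(path_rel, query_rel):
    """How often following `path_rel` answers `query_rel`, counted over
    entities that have both relations -- the 'simple path statistics'
    that replace iterative optimization in the abstract."""
    hits = total = 0
    for rels in kb.values():
        if path_rel in rels and query_rel in rels:
            total += 1
            hits += bool(rels[path_rel] & rels[query_rel])
    return hits / total if total else 0.0

def predict(entity, query_rel):
    """Rank candidate tails by the summed precision of the (single-hop)
    paths that reach them from `entity`."""
    scores = defaultdict(float)
    for path_rel in kb[entity]:
        if path_rel == query_rel:
            continue
        w = path_precision(path_rel, query_rel)
        for t in kb[entity][path_rel]:
            scores[t] += w
    return max(scores, key=scores.get) if scores else None

print(predict("rome", "capital_of"))  # -> "italy", via located_in
```

Because adding a triple only updates the counts, the model grows with the KB exactly as the abstract's non-parametric claim suggests.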

3. Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic Parsing [PDF] Back to contents
  Xilun Chen, Asish Ghoshal, Yashar Mehdad, Luke Zettlemoyer, Sonal Gupta
Abstract: Task-oriented semantic parsing is a critical component of virtual assistants, which is responsible for understanding the user's intents (set reminder, play music, etc.). Recent advances in deep learning have enabled several approaches to successfully parse more complex queries (Gupta et al., 2018; Rongali et al.,2020), but these models require a large amount of annotated training data to parse queries on new domains (e.g. reminder, music). In this paper, we focus on adapting task-oriented semantic parsers to low-resource domains, and propose a novel method that outperforms a supervised neural model at a 10-fold data reduction. In particular, we identify two fundamental factors for low-resource domain adaptation: better representation learning and better training techniques. Our representation learning uses BART (Lewis et al., 2019) to initialize our model which outperforms encoder-only pre-trained representations used in previous work. Furthermore, we train with optimization-based meta-learning (Finn et al., 2017) to improve generalization to low-resource domains. This approach significantly outperforms all baseline methods in the experiments on a newly collected multi-domain task-oriented semantic parsing dataset (TOPv2), which we release to the public.

4. Galileo at SemEval-2020 Task 12: Multi-lingual Learning for Offensive Language Identification using Pre-trained Language Models [PDF] Back to contents
  Shuohuan Wang, Jiaxiang Liu, Xuan Ouyang, Yu Sun
Abstract: This paper describes Galileo's performance in SemEval-2020 Task 12 on detecting and categorizing offensive language in social media. For Offensive Language Identification, we proposed a multi-lingual method using Pre-trained Language Models, ERNIE and XLM-R. For offensive language categorization, we proposed a knowledge distillation method trained on soft labels generated by several supervised models. Our team participated in all three sub-tasks. In Sub-task A - Offensive Language Identification, we ranked first in terms of average F1 scores in all languages. We are also the only team which ranked among the top three across all languages. We also took the first place in Sub-task B - Automatic Categorization of Offense Types and Sub-task C - Offence Target Identification.

5. Exploring the Role of Argument Structure in Online Debate Persuasion [PDF] Back to contents
  Jialu Li, Esin Durmus, Claire Cardie
Abstract: Online debate forums provide users a platform to express their opinions on controversial topics while being exposed to opinions from diverse set of viewpoints. Existing work in Natural Language Processing (NLP) has shown that linguistic features extracted from the debate text and features encoding the characteristics of the audience are both critical in persuasion studies. In this paper, we aim to further investigate the role of discourse structure of the arguments from online debates in their persuasiveness. In particular, we use the factor graph model to obtain features for the argument structure of debates from an online debating platform and incorporate these features to an LSTM-based model to predict the debater that makes the most convincing arguments. We find that incorporating argument structure features play an essential role in achieving the better predictive performance in assessing the persuasiveness of the arguments in online debates.

6. What Can We Learn from Collective Human Opinions on Natural Language Inference Data? [PDF] Back to contents
  Yixin Nie, Xiang Zhou, Mohit Bansal
Abstract: Despite the subjective nature of many NLP tasks, most NLU evaluations have focused on using the majority label with presumably high agreement as the ground truth. Less attention has been paid to the distribution of human opinions. We collect ChaosNLI, a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS in oft-used NLI evaluation sets. This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive-NLI. Analysis reveals that: (1) high human disagreement exists in a noticeable amount of examples in these datasets; (2) the state-of-the-art models lack the ability to recover the distribution over human labels; (3) models achieve near-perfect accuracy on the subset of data with a high level of human agreement, whereas they can barely beat a random guess on the data with low levels of human agreement, which compose most of the common errors made by state-of-the-art models on the evaluation sets. This questions the validity of improving model performance on old metrics for the low-agreement part of evaluation datasets. Hence, we argue for a detailed examination of human agreement in future data collection efforts, and evaluating model outputs against the distribution over collective human opinions. The ChaosNLI dataset and experimental scripts are available at this https URL

7. Inductive Entity Representations from Text via Link Prediction [PDF] Back to contents
  Daniel Daza, Michael Cochez, Paul Groth
Abstract: We present a method for learning representations of entities, that uses a Transformer-based architecture as an entity encoder, and link prediction training on a knowledge graph with textual entity descriptions. We demonstrate that our approach can be applied effectively for link prediction in different inductive settings involving entities not seen during training, outperforming related state-of-the-art methods (22% MRR improvement on average). We provide evidence that the learned representations transfer to other tasks that do not require fine-tuning the entity encoder. In an entity classification task we obtain an average improvement of 16% accuracy compared with baselines that also employ pre-trained models. For an information retrieval task, significant improvements of up to 8.8% in NDCG@10 were obtained for natural language queries.
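The training setup is ordinary link prediction, except that entity vectors come from a text encoder over descriptions rather than a lookup table, which is what makes unseen entities scorable. A minimal sketch, where the TransE-style translation score and the `encoder` interface are illustrative assumptions (the paper evaluates several scoring functions):

```python
import torch

def triple_score(head_desc, rel_vec, tail_desc, encoder):
    """Score (h, r, t) with text-encoded entities: encode the two entity
    descriptions, then apply a TransE-style translation distance.
    encoder: assumed callable mapping a string to a 1-D torch vector."""
    h, t = encoder(head_desc), encoder(tail_desc)
    return -torch.norm(h + rel_vec - t)

# Because entities are encoded from their descriptions, a brand-new
# entity only needs a description at inference time -- this is what
# makes the representations inductive.
```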

8. TeaForN: Teacher-Forcing with N-grams [PDF] Back to contents
  Sebastian Goodman, Nan Ding, Radu Soricut
Abstract: Sequence generation models trained with teacher-forcing suffer from issues related to exposure bias and lack of differentiability across timesteps. Our proposed method, Teacher-Forcing with N-grams (TeaForN), addresses both these problems directly, through the use of a stack of N decoders trained to decode along a secondary time axis that allows model parameter updates based on N prediction steps. TeaForN can be used with a wide class of decoder architectures and requires minimal modifications from a standard teacher-forcing setup. Empirically, we show that TeaForN boosts generation quality on one Machine Translation benchmark, WMT 2014 English-French, and two News Summarization benchmarks, CNN/Dailymail and Gigaword.
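The n-step lookahead objective lends itself to a short sketch. Note the hedge: the paper trains a stack of N decoders along a secondary time axis, while the toy loss below reuses a single decoder cell and feeds back the expected embedding of its own softmax output. `decoder_step` and `embed` are assumed interfaces, not the authors' code.

```python
import torch.nn.functional as F

def teaforn_style_loss(decoder_step, h0, embed, targets, n=2):
    """Teacher-forcing with an n-step lookahead: from every gold prefix,
    let the decoder continue on its own soft predictions for up to n
    extra steps and penalize each one against the gold n-gram.
    decoder_step(x_emb, h) -> (logits over vocab, next hidden state)
    embed: an nn.Embedding; targets: (batch, T) gold token ids."""
    T = targets.size(1)
    loss, h = 0.0, h0
    x = embed(targets[:, 0])
    for t in range(1, T):
        logits, h = decoder_step(x, h)
        loss = loss + F.cross_entropy(logits, targets[:, t])
        probs, h_k = F.softmax(logits, dim=-1), h
        for k in range(1, min(n, T - t)):
            x_soft = probs @ embed.weight      # expected (differentiable) embedding
            logits_k, h_k = decoder_step(x_soft, h_k)
            loss = loss + F.cross_entropy(logits_k, targets[:, t + k])
            probs = F.softmax(logits_k, dim=-1)
        x = embed(targets[:, t])               # main axis stays teacher-forced
    return loss / (T - 1)
```

The inner loop is what exposes the model to its own predictions during training, which is the bias the abstract says plain teacher forcing lacks.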

9. Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation [PDF] Back to contents
  Valentin Barriere, Alexandra Balahur
Abstract: Tweets are specific text data when compared to general text. Although sentiment analysis over tweets has become very popular in the last decade for English, it is still difficult to find huge annotated corpora for non-English languages. The recent rise of the transformer models in Natural Language Processing allows to achieve unparalleled performances in many tasks, but these models need a consequent quantity of text to adapt to the tweet domain. We propose the use of a multilingual transformer model, that we pre-train over English tweets and apply data-augmentation using automatic translation to adapt the model to non-English languages. Our experiments in French, Spanish, German and Italian suggest that the proposed technique is an efficient way to improve the results of the transformers over small corpora of tweets in a non-English language.

10. ELMo and BERT in semantic change detection for Russian [PDF] Back to contents
  Julia Rodina, Yuliya Trofimova, Andrey Kutuzov, Ekaterina Artemova
Abstract: We study the effectiveness of contextualized embeddings for the task of diachronic semantic change detection for Russian language data. Evaluation test sets consist of Russian nouns and adjectives annotated based on their occurrences in texts created in pre-Soviet, Soviet and post-Soviet time periods. ELMo and BERT architectures are compared on the task of ranking Russian words according to the degree of their semantic change over time. We use several methods for aggregation of contextualized embeddings from these architectures and evaluate their performance. Finally, we compare unsupervised and supervised techniques in this task.

11. Learning a Cost-Effective Annotation Policy for Question Answering [PDF] Back to contents
  Bernhard Kratzwald, Stefan Feuerriegel, Huan Sun
Abstract: State-of-the-art question answering (QA) relies upon large amounts of training data for which labeling is time consuming and thus expensive. For this reason, customizing QA systems is challenging. As a remedy, we propose a novel framework for annotating QA datasets that entails learning a cost-effective annotation policy and a semi-supervised annotation scheme. The latter reduces the human effort: it leverages the underlying QA system to suggest potential candidate annotations. Human annotators then simply provide binary feedback on these candidates. Our system is designed such that past annotations continuously improve the future performance and thus overall annotation cost. To the best of our knowledge, this is the first paper to address the problem of annotating questions with minimal annotation cost. We compare our framework against traditional manual annotations in an extensive set of experiments. We find that our approach can reduce up to 21.1% of the annotation cost.

12. "I'd rather just go to bed": Understanding Indirect Answers [PDF] 返回目录
  Annie Louis, Dan Roth, Filip Radlinski
Abstract: We revisit a pragmatic inference problem in dialog: understanding indirect responses to questions. Humans can interpret 'I'm starving.' in response to 'Hungry?', even without direct cue words such as 'yes' and 'no'. In dialog systems, allowing natural responses rather than closed vocabularies would be similarly beneficial. However, today's systems are only as sensitive to these pragmatic moves as their language model allows. We create and release the first large-scale English language corpus 'Circa' with 34,268 (polar question, indirect answer) pairs to enable progress on this task. The data was collected via elaborate crowdsourcing, and contains utterances with yes/no meaning, as well as uncertain, middle-ground, and conditional responses. We also present BERT-based neural models to predict such categories for a question-answer pair. We find that while transfer learning from entailment works reasonably, performance is not yet sufficient for robust dialog. Our models reach 82-88% accuracy for a 4-class distinction, and 74-85% for 6 classes.

13. Analogies minus analogy test: measuring regularities in word embeddings [PDF] Back to contents
  Louis Fournier, Emmanuel Dupoux, Ewan Dunbar
Abstract: Vector space models of words have long been claimed to capture linguistic regularities as simple vector translations, but problems have been raised with this claim. We decompose and empirically analyze the classic arithmetic word analogy test, to motivate two new metrics that address the issues with the standard test, and which distinguish between class-wise offset concentration (similar directions between pairs of words drawn from different broad classes, such as France--London, China--Ottawa, ...) and pairing consistency (the existence of a regular transformation between correctly-matched pairs such as France:Paris::China:Beijing). We show that, while the standard analogy test is flawed, several popular word embeddings do nevertheless encode linguistic regularities.
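The first proposed metric, class-wise offset concentration, amounts to asking how parallel the pair offsets are. A minimal NumPy sketch with toy vectors (the function name and data are illustrative, not the paper's code):

```python
import numpy as np

def offset_concentration(pairs, emb):
    """Mean pairwise cosine similarity between normalized offset vectors
    emb[b] - emb[a]; values near 1 mean all pairs share one direction."""
    offs = np.stack([emb[b] - emb[a] for a, b in pairs])
    offs /= np.linalg.norm(offs, axis=1, keepdims=True)
    sims = offs @ offs.T
    n = len(pairs)
    return (sims.sum() - n) / (n * (n - 1))  # mean of off-diagonal entries

emb = {w: np.random.randn(50) for w in ["france", "paris", "china", "beijing"]}
print(offset_concentration([("france", "paris"), ("china", "beijing")], emb))
```

Pairing consistency would additionally check that each offset maps a word to its own partner rather than to another pair's, a regularity that offset concentration alone cannot detect.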

14. WER we are and WER we think we are [PDF] Back to contents
  Piotr Szymański, Piotr Żelasko, Mikolaj Morzy, Adrian Szymczak, Marzena Żyła-Hoppe, Joanna Banaszczak, Lukasz Augustyniak, Jan Mizgajski, Yishay Carmiel
Abstract: Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and HUB'05 public benchmark. We show that WERs are significantly higher than the best reported results. We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high quality annotations for training and testing of robust ASR systems.

15. Cross-lingual Extended Named Entity Classification of Wikipedia Articles [PDF] Back to contents
  Viet Bui, Phuong Le-Hong
Abstract: The this http URL team participated in the SHINRA2020-ML subtask of the NTCIR-15 SHINRA task. This paper describes our method to solving the problem and discusses the official results. Our method focuses on learning cross-lingual representations, both on the word level and document level for page classification. We propose a three-stage approach including multilingual model pre-training, monolingual model fine-tuning and cross-lingual voting. Our system is able to achieve the best scores for 25 out of 30 languages; and its accuracy gaps to the best performing systems of the other five languages are relatively small.

16. Dual Reconstruction: a Unifying Objective for Semi-Supervised Neural Machine Translation [PDF] Back to contents
  Weijia Xu, Xing Niu, Marine Carpuat
Abstract: While Iterative Back-Translation and Dual Learning effectively incorporate monolingual training data in neural machine translation, they use different objectives and heuristic gradient approximation strategies, and have not been extensively compared. We introduce a novel dual reconstruction objective that provides a unified view of Iterative Back-Translation and Dual Learning. It motivates a theoretical analysis and controlled empirical study on German-English and Turkish-English tasks, which both suggest that Iterative Back-Translation is more effective than Dual Learning despite its relative simplicity.

17. Why do you think that? Exploring Faithful Sentence-Level Rationales Without Supervision [PDF] Back to contents
  Max Glockner, Ivan Habernal, Iryna Gurevych
Abstract: Evaluating the trustworthiness of a model's prediction is essential for differentiating between `right for the right reasons' and `right for the wrong reasons'. Identifying textual spans that determine the target label, known as faithful rationales, usually relies on pipeline approaches or reinforcement learning. However, such methods either require supervision and thus costly annotation of the rationales or employ non-differentiable models. We propose a differentiable training-framework to create models which output faithful rationales on a sentence level, by solely applying supervision on the target task. To achieve this, our model solves the task based on each rationale individually and learns to assign high scores to those which solved the task best. Our evaluation on three different datasets shows competitive results compared to a standard BERT blackbox while exceeding a pipeline counterpart's performance in two cases. We further exploit the transparent decision-making process of these models to prefer selecting the correct rationales by applying direct supervision, thereby boosting the performance on the rationale-level.

18. Toward Stance-based Personas for Opinionated Dialogues [PDF] Back to contents
  Thomas Scialom, Serra Sinem Tekiroglu, Jacopo Staiano, Marco Guerini
Abstract: In the context of chit-chat dialogues it has been shown that endowing systems with a persona profile is important to produce more coherent and meaningful conversations. Still, the representation of such personas has thus far been limited to a fact-based representation (e.g. "I have two cats."). We argue that these representations remain superficial w.r.t. the complexity of human personality. In this work, we propose to make a step forward and investigate stance-based persona, trying to grasp more profound characteristics, such as opinions, values, and beliefs to drive language generation. To this end, we introduce a novel dataset allowing to explore different stance-based persona representations and their impact on claim generation, showing that they are able to grasp abstract and profound aspects of the author persona.

19. Improving QA Generalization by Concurrent Modeling of Multiple Biases [PDF] Back to contents
  Mingzhu Wu, Nafise Sadat Moosavi, Andreas Rücklé, Iryna Gurevych
Abstract: Existing NLP datasets contain various biases that models can easily exploit to achieve high performances on the corresponding evaluation sets. However, focusing on dataset-specific biases limits their ability to learn more generalizable knowledge about the task from more general data patterns. In this paper, we investigate the impact of debiasing methods for improving generalization and propose a general framework for improving the performance on both in-domain and out-of-domain datasets by concurrent modeling of multiple biases in the training data. Our framework weights each example based on the biases it contains and the strength of those biases in the training data. It then uses these weights in the training objective so that the model relies less on examples with high bias weights. We extensively evaluate our framework on extractive question answering with training data from various domains with multiple biases of different strengths. We perform the evaluations in two different settings, in which the model is trained on a single domain or multiple domains simultaneously, and show its effectiveness in both settings compared to state-of-the-art debiasing methods.
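The weighting idea can be sketched as a weighted cross-entropy. The specific weight `1 - p_bias(gold)` is a common recipe from the debiasing literature used here as an illustrative stand-in; the paper's framework combines weights from multiple biases and their strengths in the training data.

```python
import torch
import torch.nn.functional as F

def bias_weighted_loss(logits, labels, bias_probs):
    """Weighted cross-entropy: examples that a bias-only model already
    solves (high probability on the gold label) contribute less.
    bias_probs: (batch, classes) predictions of a bias-only model,
    e.g. averaged over several bias models for concurrent modeling."""
    p_gold = bias_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    weights = 1.0 - p_gold                        # strong bias -> low weight
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (weights * per_example).mean()
```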

20. COMETA: A Corpus for Medical Entity Linking in the Social Media [PDF] Back to contents
  Marco Basaldella, Fangyu Liu, Ehsan Shareghi, Nigel Collier
Abstract: Whilst there has been growing progress in Entity Linking (EL) for general language, existing datasets fail to address the complex nature of health terminology in layman's language. Meanwhile, there is a growing need for applications that can understand the public's voice in the health domain. To address this we introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT, a widely-used medical knowledge graph. Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality, that to the best of our knowledge has not been met by any of the existing resources in the field. Through benchmark experiments on 20 EL baselines from string- to neural-based models we shed light on the ability of these systems to perform complex inference on entities and concepts under 2 challenging evaluation scenarios. Our experimental results on COMETA illustrate that no golden bullet exists and even the best mainstream techniques still have a significant performance gap to fill, while the best solution relies on combining different views of data.

21. ZEST: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization [PDF] Back to contents
  Tzuf Paz-Argaman, Yuval Atzmon, Gal Chechik, Reut Tsarfaty
Abstract: We study the problem of recognizing visual entities from the textual descriptions of their classes. Specifically, given birds' images with free-text descriptions of their species, we learn to classify images of previously-unseen species based on specie descriptions. This setup has been studied in the vision community under the name zero-shot learning from text, focusing on learning to transfer knowledge about visual aspects of birds from seen classes to previously-unseen ones. Here, we suggest focusing on the textual description and distilling from the description the most relevant information to effectively match visual features to the parts of the text that discuss them. Specifically, (1) we propose to leverage the similarity between species, reflected in the similarity between text descriptions of the species. (2) we derive visual summaries of the texts, i.e., extractive summaries that focus on the visual features that tend to be reflected in images. We propose a simple attention-based model augmented with the similarity and visual summaries components. Our empirical results consistently and significantly outperform the state-of-the-art on the largest benchmarks for text-based zero-shot learning, illustrating the critical importance of texts for zero-shot image-recognition.

22. Learning to Explain: Datasets and Models for Identifying Valid Reasoning Chains in Multihop Question-Answering [PDF] Back to contents
  Harsh Jhamtani, Peter Clark
Abstract: Despite the rapid progress in multihop question-answering (QA), models still have trouble explaining why an answer is correct, with limited explanation training data available to learn from. To address this, we introduce three explanation datasets in which explanations formed from corpus facts are annotated. Our first dataset, eQASC, contains over 98K explanation annotations for the multihop question answering dataset QASC, and is the first that annotates multiple candidate explanations for each answer. The second dataset eQASC-perturbed is constructed by crowd-sourcing perturbations (while preserving their validity) of a subset of explanations in QASC, to test consistency and generalization of explanation prediction models. The third dataset eOBQA is constructed by adding explanation annotations to the OBQA dataset to test generalization of models trained on eQASC. We show that this data can be used to significantly improve explanation quality (+14% absolute F1 over a strong retrieval baseline) using a BERT-based classifier, but still behind the upper bound, offering a new challenge for future research. We also explore a delexicalized chain representation in which repeated noun phrases are replaced by variables, thus turning them into generalized reasoning chains (for example: "X is a Y" AND "Y has Z" IMPLIES "X has Z"). We find that generalized chains maintain performance while also being more robust to certain perturbations.

23. Narrative Text Generation with a Latent Discrete Plan [PDF] Back to contents
  Harsh Jhamtani, Taylor Berg-Kirkpatrick
Abstract: Past work on story generation has demonstrated the usefulness of conditioning on a generation plan to generate coherent stories. However, these approaches have used heuristics or off-the-shelf models to first tag training stories with the desired type of plan, and then train generation models in a supervised fashion. In this paper, we propose a deep latent variable model that first samples a sequence of anchor words, one per sentence in the story, as part of its generative process. During training, our model treats the sequence of anchor words as a latent variable and attempts to induce anchoring sequences that help guide generation in an unsupervised fashion. We conduct experiments with several types of sentence decoder distributions: left-to-right and non-monotonic, with different degrees of restriction. Further, since we use amortized variational inference to train our model, we introduce two corresponding types of inference network for predicting the posterior on anchor words. We conduct human evaluations which demonstrate that the stories produced by our model are rated better in comparison with baselines which do not consider story plans, and are similar or better in quality relative to baselines which use external supervision for plans. Additionally, the proposed model gets favorable scores when evaluated on perplexity, diversity, and control of story via discrete plan.

24. Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction [PDF] Back to contents
  Mengyun Chen, Tao Ge, Xingxing Zhang, Furu Wei, Ming Zhou
Abstract: We propose a novel language-independent approach to improve the efficiency for Grammatical Error Correction (GEC) by dividing the task into two subtasks: Erroneous Span Detection (ESD) and Erroneous Span Correction (ESC). ESD identifies grammatically incorrect text spans with an efficient sequence tagging model. Then, ESC leverages a seq2seq model to take the sentence with annotated erroneous spans as input and only outputs the corrected text for these spans. Experiments show our approach performs comparably to conventional seq2seq approaches in both English and Chinese GEC benchmarks with less than 50% time cost for inference.
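The efficiency argument is architectural: a cheap tagger screens every sentence, and the expensive seq2seq model runs only on sentences with detected spans, generating only the replacements. A sketch of that control flow, with hypothetical `esd_tagger` and `esc_model` interfaces:

```python
def correct(sentence, esd_tagger, esc_model):
    """Two-stage GEC sketch with hypothetical interfaces:
    esd_tagger(tokens) -> list of (start, end) erroneous token spans
    esc_model(text)    -> list of replacement strings, one per span."""
    tokens = sentence.split()
    spans = esd_tagger(tokens)
    if not spans:                        # fast path: skip seq2seq entirely
        return sentence
    marked = []
    for i, tok in enumerate(tokens):     # annotate spans for the corrector
        if any(s == i for s, _ in spans):
            marked.append("<e>")
        marked.append(tok)
        if any(e == i + 1 for _, e in spans):
            marked.append("</e>")
    fixes = esc_model(" ".join(marked))  # decodes only the span replacements
    out, prev = [], 0
    for (s, e), fix in zip(spans, fixes):
        out += tokens[prev:s] + [fix]
        prev = e
    return " ".join(out + tokens[prev:])

# e.g. correct("He go to school", lambda t: [(1, 2)], lambda m: ["goes"])
# -> "He goes to school"
```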

25. Exploring and Evaluating Attributes, Values, and Structures for Entity Alignment [PDF] Back to contents
  Zhiyuan Liu, Yixin Cao, Liangming Pan, Juanzi Li, Zhiyuan Liu, Tat-Seng Chua
Abstract: Entity alignment (EA) aims at building a unified Knowledge Graph (KG) of rich content by linking the equivalent entities from various KGs. GNN-based EA methods present promising performances by modeling the KG structure defined by relation triples. However, attribute triples can also provide crucial alignment signal but have not been well explored yet. In this paper, we propose to utilize an attributed value encoder and partition the KG into subgraphs to model the various types of attribute triples efficiently. Besides, the performances of current EA methods are overestimated because of the name-bias of existing EA datasets. To make an objective evaluation, we propose a hard experimental setting where we select equivalent entity pairs with very different names as the test set. Under both the regular and hard settings, our method achieves significant improvements ($5.10\%$ on average Hits@$1$ in DBP$15$k) over $12$ baselines in cross-lingual and monolingual datasets. Ablation studies on different subgraphs and a case study about attribute types further demonstrate the effectiveness of our method. Source code and data can be found at this https URL.

26. Transformer-GCRF: Recovering Chinese Dropped Pronouns with General Conditional Random Fields [PDF] Back to contents
  Jingxuan Yang, Kerui Xu, Jun Xu, Si Li, Sheng Gao, Jun Guo, Ji-Rong Wen, Nianwen Xue
Abstract: Pronouns are often dropped in Chinese conversations and recovering the dropped pronouns is important for NLP applications such as Machine Translation. Existing approaches usually formulate this as a sequence labeling task of predicting whether there is a dropped pronoun before each token and its type. Each utterance is considered to be a sequence and labeled independently. Although these approaches have shown promise, labeling each utterance independently ignores the dependencies between pronouns in neighboring utterances. Modeling these dependencies is critical to improving the performance of dropped pronoun recovery. In this paper, we present a novel framework that combines the strength of Transformer network with General Conditional Random Fields (GCRF) to model the dependencies between pronouns in neighboring utterances. Results on three Chinese conversation datasets show that the Transformer-GCRF model outperforms the state-of-the-art dropped pronoun recovery models. Exploratory analysis also demonstrates that the GCRF did help to capture the dependencies between pronouns in neighboring utterances, thus contributes to the performance improvements.

27. Unsupervised Evaluation for Question Answering with Transformers [PDF] Back to contents
  Lukas Muttenthaler, Isabelle Augenstein, Johannes Bjerva
Abstract: It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalize. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labeled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model's answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in the semi-automatic development of QA datasets

28. Like hiking? You probably enjoy nature: Persona-grounded Dialog with Commonsense Expansions [PDF] Back to contents
  Bodhisattwa Prasad Majumder, Harsh Jhamtani, Taylor Berg-Kirkpatrick, Julian McAuley
Abstract: Existing persona-grounded dialog models often fail to capture simple implications of given persona descriptions, something which humans are able to do seamlessly. For example, state-of-the-art models cannot infer that interest in hiking might imply love for nature or longing for a break. In this paper, we propose to expand available persona sentences using existing commonsense knowledge bases and paraphrasing resources to imbue dialog models with access to an expanded and richer set of persona descriptions. Additionally, we introduce fine-grained grounding on personas by encouraging the model to make a discrete choice among persona sentences while synthesizing a dialog response. Since such a choice is not observed in the data, we model it using a discrete latent random variable and use variational learning to sample from hundreds of persona expansions. Our model outperforms competitive baselines on the PersonaChat dataset in terms of dialog quality and diversity while achieving persona-consistent and controllable dialog generation.

29. Rank and run-time aware compression of NLP Applications [PDF] Back to contents
  Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, Matthew Mattina
Abstract: Sequence model based NLP applications can be large. Yet, many applications that benefit from them run on small devices with very limited compute and storage capabilities, while still having run-time constraints. As a result, there is a need for a compression technique that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper proposes a new compression technique called Hybrid Matrix Factorization that achieves this dual objective. HMF improves low-rank matrix factorization (LMF) techniques by doubling the rank of the matrix using an intelligent hybrid-structure leading to better accuracy than LMF. Further, by preserving dense matrices, it leads to faster inference run-time than pruning or structure matrix based compression technique. We evaluate the impact of this technique on 5 NLP benchmarks across multiple tasks (Translation, Intent Detection, Language Modeling) and show that for similar accuracy values and compression factors, HMF can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.
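One plausible reading of the hybrid structure (keep a slice of the weight matrix dense, factor only the remainder with a low-rank product) is sketched below; this is an interpretation of the abstract's description, not the authors' exact construction.

```python
import numpy as np

def hybrid_factorize(W, r):
    """Approximate W by stacking r dense rows on top of a rank-r SVD fit
    of the remaining rows. The reconstruction can reach rank ~2r at
    roughly the parameter cost of a plain rank-r factorization, which is
    one way to read the 'doubling the rank' claim (sketch only)."""
    dense, rest = W[:r], W[r:]
    U, s, Vt = np.linalg.svd(rest, full_matrices=False)
    low_rank = (U[:, :r] * s[:r]) @ Vt[:r]   # best rank-r fit of `rest`
    return np.vstack([dense, low_rank])

W = np.random.randn(256, 128)
print(np.linalg.matrix_rank(hybrid_factorize(W, 16)))  # up to 32
```

The dense block also stays a plain matrix multiply at inference, consistent with the abstract's point that preserving dense matrices avoids the run-time penalties of pruning or structured matrices.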

30. Theedhum Nandrum@Dravidian-CodeMix-FIRE2020: A Sentiment Polarity Classifier for YouTube Comments with Code-switching between Tamil, Malayalam and English [PDF] Back to contents
  BalaSundaraRaman Lakshmanan, Sanjeeth Kumar Ravindranath
Abstract: Theedhum Nandrum is a sentiment polarity detection system using two approaches--a Stochastic Gradient Descent (SGD) Classifier and a Recurrent Neural Network (RNN) Classifier. Our approach utilises language features like the use of emoji, choice of scripts and code mixing which appeared quite marked in the datasets specified for the Dravidian Codemix - FIRE 2020 task. The hyperparameters for the SGD were tuned using GridSearchCV on the training data supplied. Our system was ranked 4th in Tamil-English with a weighted average F1 score of 0.62 and 9th in Malayalam-English with a score of 0.65. Our code is published in github at this https URL.

31. Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages [PDF] Back to contents
  Michael A. Hedderich, David Adelani, Dawei Zhu, Jesujoba Alabi, Udia Markus, Dietrich Klakow
Abstract: Multilingual transformer models like mBERT and XLM-RoBERTa have obtained great improvements for many NLP tasks on a variety of languages. However, recent works also showed that results from high-resource languages could not be easily transferred to realistic, low-resource scenarios. In this work, we study trends in performance for different amounts of available resources for the three African languages Hausa, isiXhosa and Yorùbá on both NER and topic classification. We show that in combination with transfer learning or distant supervision, these models can achieve with as little as 10 or 100 labeled sentences the same performance as baselines with much more supervised training data. However, we also find settings where this does not hold. Our discussions and additional experiments on assumptions such as time and hardware restrictions highlight challenges and opportunities in low-resource learning.

32. Multilingual Knowledge Graph Completion via Ensemble Knowledge Transfer [PDF] Back to contents
  Xuelu Chen, Muhao Chen, Changjun Fan, Ankith Uppunda, Yizhou Sun, Carlo Zaniolo
Abstract: Predicting missing facts in a knowledge graph (KG) is a crucial task in knowledge base construction and reasoning, and it has been the subject of much research in recent works using KG embeddings. While existing KG embedding approaches mainly learn and predict facts within a single KG, a more plausible solution would benefit from the knowledge in multiple language-specific KGs, considering that different KGs have their own strengths and limitations on data quality and coverage. This is quite challenging, since the transfer of knowledge among multiple independently maintained KGs is often hindered by the insufficiency of alignment information and the inconsistency of described facts. In this paper, we propose KEnS, a novel framework for embedding learning and ensemble knowledge transfer across a number of language-specific KGs. KEnS embeds all KGs in a shared embedding space, where the association of entities is captured based on self-learning. Then, KEnS performs ensemble inference to combine prediction results from embeddings of multiple language-specific KGs, for which multiple ensemble techniques are investigated. Experiments on five real-world language-specific KGs show that KEnS consistently improves state-of-the-art methods on KG completion, via effectively identifying and leveraging complementary knowledge.

33. Knowledge-enriched, Type-constrained and Grammar-guided Question Generation over Knowledge Bases [PDF] Back to contents
  Sheng Bi, Xiya Cheng, Yuan-Fang Li, Yongzhen Wang, Guilin Qi
Abstract: Question generation over knowledge bases (KBQG) aims at generating natural-language questions about a subgraph, i.e. a set of (connected) triples. Two main challenges still face the current crop of encoder-decoder-based methods, especially on small subgraphs: (1) low diversity and poor fluency due to the limited information contained in the subgraphs, and (2) semantic drift due to the decoder's oblivion of the semantics of the answer entity. We propose an innovative knowledge-enriched, type-constrained and grammar-guided KBQG model, named KTG, to addresses the above challenges. In our model, the encoder is equipped with auxiliary information from the KB, and the decoder is constrained with word types during QG. Specifically, entity domain and description, as well as relation hierarchy information are considered to construct question contexts, while a conditional copy mechanism is incorporated to modulate question semantics according to current word types. Besides, a novel reward function featuring grammatical similarity is designed to improve both generative richness and syntactic correctness via reinforcement learning. Extensive experiments show that our proposed model outperforms existing methods by a significant margin on two widely-used benchmark datasets SimpleQuestion and PathQuestion.

34. A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction [PDF] Back to contents
  Masato Mita, Shun Kiyono, Masahiro Kaneko, Jun Suzuki, Kentaro Inui
Abstract: Existing approaches for grammatical error correction (GEC) largely rely on supervised learning with manually created GEC datasets. However, there has been little focus on verifying and ensuring the quality of the datasets, and on how lower-quality data might affect GEC performance. We indeed found that there is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected. To address this, we designed a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models, and outperformed strong denoising baseline methods. We further applied task-specific techniques and achieved state-of-the-art performance on the CoNLL-2014, JFLEG, and BEA-2019 benchmarks. We then analyzed the effect of the proposed denoising method, and found that our approach leads to improved coverage of corrections and facilitated fluency edits which are reflected in higher recall and overall performance.

35. Fortifying Toxic Speech Detectors Against Veiled Toxicity [PDF] Back to contents
  Xiaochuang Han, Yulia Tsvetkov
Abstract: Modern toxic speech detectors are incompetent in recognizing disguised offensive language, such as adversarial attacks that deliberately avoid known toxic lexicons, or manifestations of implicit bias. Building a large annotated dataset for such veiled toxicity can be very expensive. In this work, we propose a framework aimed at fortifying existing toxic speech detectors without a large labeled corpus of veiled toxicity. Just a handful of probing examples are used to surface orders of magnitude more disguised offenses. We augment the toxic speech detector's training data with these discovered offensive examples, thereby making it more robust to veiled toxicity while preserving its utility in detecting overt toxicity.

36. OpenIE6: Iterative Grid Labeling and Coordination Analysis for Open Information Extraction [PDF] 返回目录
  Keshav Kolluru, Vaibhav Adlakha, Samarth Aggarwal, Mausam, Soumen Chakrabarti
Abstract: A recent state-of-the-art neural open information extraction (OpenIE) system generates extractions iteratively, requiring repeated encoding of partial outputs. This comes at a significant computational cost. On the other hand, sequence labeling approaches for OpenIE are much faster, but worse in extraction quality. In this paper, we bridge this trade-off by presenting an iterative labeling-based system that establishes a new state of the art for OpenIE, while extracting 10x faster. This is achieved through a novel Iterative Grid Labeling (IGL) architecture, which treats OpenIE as a 2-D grid labeling task. We improve its performance further by applying coverage (soft) constraints on the grid at training time. Moreover, on observing that the best OpenIE systems falter at handling coordination structures, our OpenIE system also incorporates a new coordination analyzer built with the same IGL architecture. This IGL based coordination analyzer helps our OpenIE system handle complicated coordination structures, while also establishing a new state of the art on the task of coordination analysis, with a 12.3 pts improvement in F1 over previous analyzers. Our OpenIE system, OpenIE6, beats the previous systems by as much as 4 pts in F1, while being much faster.
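
The 2-D grid view is easy to make concrete: each row of the grid is one extraction over the same token sequence, so all extractions are labeled in parallel rather than re-encoded one at a time. A toy sketch (the S/R/O/N tag names are illustrative, not the system's actual label set):

```python
# Toy illustration of OpenIE as 2-D grid labeling.
tokens = ["Obama", "visited", "Paris", "and", "Rome"]
grid = [
    ["S", "R", "O", "N", "N"],  # (Obama; visited; Paris)
    ["S", "R", "N", "N", "O"],  # (Obama; visited; Rome)
]

def decode(grid, tokens):
    # Read each row back into a (subject, relation, object) triple.
    for row in grid:
        span = {lab: " ".join(t for t, l in zip(tokens, row) if l == lab)
                for lab in ("S", "R", "O")}
        print((span["S"], span["R"], span["O"]))

decode(grid, tokens)
```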

37. Unsupervised Parsing via Constituency Tests [PDF] 返回目录
  Steven Cao, Nikita Kitaev, Dan Klein
Abstract: We propose a method for unsupervised parsing based on the linguistic notion of a constituency test. One type of constituency test involves modifying the sentence via some transformation (e.g. replacing the span with a pronoun) and then judging the result (e.g. checking if it is grammatical). Motivated by this idea, we design an unsupervised parser by specifying a set of transformations and using an unsupervised neural acceptability model to make grammaticality decisions. To produce a tree given a sentence, we score each span by aggregating its constituency test judgments, and we choose the binary tree with the highest total score. While this approach already achieves performance in the range of current methods, we further improve accuracy by fine-tuning the grammaticality model through a refinement procedure, where we alternate between improving the estimated trees and improving the grammaticality model. The refined model achieves 62.8 F1 on the Penn Treebank test set, an absolute improvement of 7.6 points over the previous best published result.
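
The tree-selection step is classical dynamic programming. In the sketch below, a random span scorer stands in for the aggregated constituency-test judgments of the trained grammaticality model; CKY-style search then returns the binary tree with the highest total span score.

```python
# CKY-style selection of the highest-scoring binary tree.
import random

def best_tree(tokens, span_score):
    n = len(tokens)
    best = {}  # (i, j) -> (total score of best subtree, split point)
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            if length == 1:
                best[(i, j)] = (span_score(i, j), None)
                continue
            s, k = max(((best[(i, m)][0] + best[(m, j)][0], m)
                        for m in range(i + 1, j)), key=lambda t: t[0])
            best[(i, j)] = (s + span_score(i, j), k)

    def build(i, j):
        if j - i == 1:
            return tokens[i]
        k = best[(i, j)][1]
        return (build(i, k), build(k, j))

    return build(0, n)

random.seed(0)
print(best_tree("the cat sat on the mat".split(), lambda i, j: random.random()))
```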

38. Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [PDF] 返回目录
  Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, Lei Li
Abstract: We investigate the following question for machine translation (MT): can we develop a single universal MT model to serve as the common seed and obtain derivative and improved models on arbitrary language pairs? We propose mRASP, an approach to pre-train a universal multilingual neural machine translation model. Our key idea in mRASP is its novel technique of random aligned substitution, which brings words and phrases with similar meanings across multiple languages closer in the representation space. We pre-train an mRASP model on 32 language pairs jointly, using only public datasets. The model is then fine-tuned on downstream language pairs to obtain specialized MT models. We carry out extensive experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs. Experimental results demonstrate that mRASP achieves significant performance improvements compared to directly training on those target pairs. This is the first work to verify that multiple low-resource language pairs can be utilized to improve rich-resource MT. Surprisingly, mRASP is even able to improve translation quality on exotic languages that never occur in the pre-training corpus. Code, data, and pre-trained models are available at this https URL.

39. Improving Context Modeling in Neural Topic Segmentation [PDF] 返回目录
  Linzi Xing, Brad Hackinen, Giuseppe Carenini, Francesco Trebbi
Abstract: Topic segmentation is critical in key NLP tasks, and recent works favor highly effective neural supervised approaches. However, current neural solutions are arguably limited in how they model context. In this paper, we enhance a segmenter based on a hierarchical attention BiLSTM network to better model context, by adding a coherence-related auxiliary task and restricted self-attention. Our optimized segmenter outperforms SOTA approaches when trained and tested on three datasets. We also demonstrate the robustness of our proposed model in a domain transfer setting by training a model on a large-scale dataset and testing it on four challenging real-world benchmarks. Furthermore, we apply our proposed strategy to two other languages (German and Chinese), and show its effectiveness in multilingual scenarios.

40. A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions [PDF] 返回目录
  Takuma Udagawa, Takato Yamazaki, Akiko Aizawa
Abstract: Recent models achieve promising results in visually grounded dialogues. However, existing datasets often contain undesirable biases and lack sophisticated linguistic analyses, which make it difficult to understand how well current models recognize their precise linguistic structures. To address this problem, we make two design choices: first, we focus on OneCommon Corpus (Udagawa and Aizawa, 2019, 2020), a simple yet challenging common grounding dataset which contains minimal bias by design. Second, we analyze their linguistic structures based on spatial expressions and provide comprehensive and reliable annotation for 600 dialogues. We show that our annotation captures important linguistic structures including predicate-argument structure, modification and ellipsis. In our experiments, we assess the model's understanding of these structures through reference resolution. We demonstrate that our annotation can reveal both the strengths and weaknesses of baseline models in essential levels of detail. Overall, we propose a novel framework and resource for investigating fine-grained language understanding in visually grounded dialogues.

41. VCDM: Leveraging Variational Bi-encoding and Deep Contextualized Word Representations for Improved Definition Modeling [PDF] 返回目录
  Machel Reid, Edison Marrese-Taylor, Yutaka Matsuo
Abstract: In this paper, we tackle the task of definition modeling, where the goal is to learn to generate definitions of words and phrases. Existing approaches for this task are discriminative, combining distributional and lexical semantics in an implicit rather than direct way. To tackle this issue we propose a generative model for the task, introducing a continuous latent variable to explicitly model the underlying relationship between a phrase used within a context and its definition. We rely on variational inference for estimation and leverage contextualized word embeddings for improved performance. Our approach is evaluated on four existing challenging benchmarks with the addition of two new datasets, "Cambridge" and the first non-English corpus "Robert", which we release to complement our empirical study. Our Variational Contextual Definition Modeler (VCDM) achieves state-of-the-art performance in terms of automatic and human evaluation metrics, demonstrating the effectiveness of our approach.

42. DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling [PDF] 返回目录
  Jiecao Chen, Liu Yang, Karthik Raman, Michael Bendersky, Jung-Jung Yeh, Yun Zhou, Marc Najork, Danyang Cai, Ehsan Emadzadeh
Abstract: Pre-trained models like BERT (Devlin et al., 2018) have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering. However, deploying these models in real systems is highly non-trivial due to their exorbitant computational costs. A common remedy to this is knowledge distillation (Hinton et al., 2015), leading to faster inference. However -- as we show here -- existing works are not optimized for dealing with pairs (or tuples) of texts. Consequently, they are either not scalable or demonstrate subpar performance. In this work, we propose DiPair - a novel framework for distilling fast and accurate models on text pair tasks. Coupled with an end-to-end training strategy, DiPair is both highly scalable and offers improved quality-speed tradeoffs. Empirical studies conducted on both academic and real-world e-commerce benchmarks demonstrate the efficacy of the proposed approach with speedups of over 350x and minimal quality drop relative to the cross-attention teacher BERT model.

43. Knowledge-aware Method for Confusing Charge Prediction [PDF] 返回目录
  Xiya Cheng, Sheng Bi, Guilin Qi, Yongzhen Wang
Abstract: The automatic charge prediction task aims to determine the final charges based on fact descriptions of criminal cases, which is a vital application of legal assistant systems. Conventional works usually depend on fact descriptions to predict charges while ignoring legal schematic knowledge, which makes it difficult to distinguish confusing charges. In this paper, we propose a knowledge-attentive neural network model, which introduces legal schematic knowledge about charges and exploits its hierarchical representation as discriminative features to differentiate confusing charges. Our model takes the textual fact description as input and learns fact representation through a graph convolutional network. A legal schematic knowledge transformer is utilized to generate crucial knowledge representations oriented to the legal schematic knowledge at both the schema and charge levels. We apply a knowledge matching network to effectively incorporate charge information into the fact and learn knowledge-aware fact representation. Finally, we use the knowledge-aware fact representation for charge prediction. We create two real-world datasets, and experimental results show that our proposed model can outperform other state-of-the-art baselines on accuracy and F1 score, especially on dealing with confusing charges.

44. WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization [PDF] 返回目录
  Faisal Ladhak, Esin Durmus, Claire Cardie, Kathleen McKeown
Abstract: We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of crosslingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct crosslingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost efficient during inference.
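
The image-based alignment amounts to a join on image identifiers: a step in one language edition is paired with the step in another edition that illustrates the same image. A toy sketch with invented file names and steps:

```python
# Toy sketch of gold alignment via shared how-to step images (invented data).
en_steps = {"img_001.jpg": "Whisk the eggs.", "img_002.jpg": "Heat the pan."}
es_steps = {"img_001.jpg": "Bate los huevos.", "img_002.jpg": "Calienta la sartén."}

aligned = [(en_steps[img], es_steps[img]) for img in en_steps if img in es_steps]
print(aligned)  # cross-lingual step pairs grounded by the same image
```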

45. Is the Best Better? Bayesian Statistical Model Comparison for Natural Language Processing [PDF] 返回目录
  Piotr Szymański, Kyle Gorman
Abstract: Recent work raises concerns about the use of standard splits to compare natural language processing models. We propose a Bayesian statistical model comparison technique which uses k-fold cross-validation across multiple data sets to estimate the likelihood that one model will outperform the other, or that the two will produce practically equivalent results. We use this technique to rank six English part-of-speech taggers across two data sets and three evaluation metrics.
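
To convey the flavor of such a comparison (the paper's actual model is more elaborate), the sketch below turns per-fold score differences into posterior probabilities that model A beats model B, that B beats A, or that the two are practically equivalent; all numbers are invented.

```python
# Toy Bayesian comparison from per-fold score differences: a normal
# approximation to the posterior of the mean difference is sampled, and a
# region of practical equivalence (ROPE) of +/-0.5 points separates
# "practically equivalent" outcomes.
import math
import random

def posterior_probs(diffs, rope=0.5, draws=20000, seed=0):
    random.seed(seed)
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    se = sd / math.sqrt(n)
    left = inside = right = 0
    for _ in range(draws):
        mu = random.gauss(mean, se)  # posterior draw for the mean difference
        if mu > rope:
            right += 1
        elif mu < -rope:
            left += 1
        else:
            inside += 1
    return left / draws, inside / draws, right / draws

fold_diffs = [0.8, 1.1, 0.4, 0.9, 0.7]  # accuracy differences, A minus B
p_b, p_eq, p_a = posterior_probs(fold_diffs)
print(f"P(B better)={p_b:.3f}  P(equivalent)={p_eq:.3f}  P(A better)={p_a:.3f}")
```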

46. Beyond [CLS] through Ranking by Generation [PDF] 返回目录
  Cicero Nogueira dos Santos, Xiaofei Ma, Ramesh Nallapati, Zhiheng Huang, Bing Xiang
Abstract: Generative models for Information Retrieval, where ranking of documents is viewed as the task of generating a query from a document's language model, were very successful in various IR tasks in the past. However, with the advent of modern deep neural networks, attention has shifted to discriminative ranking functions that model the semantic similarity of documents and queries instead. Recently, deep generative models such as GPT2 and BART have been shown to be excellent text generators, but their effectiveness as rankers has not been demonstrated yet. In this work, we revisit the generative framework for information retrieval and show that our generative approaches are as effective as state-of-the-art semantic similarity-based discriminative models for the answer selection task. Additionally, we demonstrate the effectiveness of unlikelihood losses for IR.
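
The core scoring rule, ranking candidates by the likelihood the generative model assigns to the query given the document, can be demonstrated with a deliberately tiny stand-in model; an add-one-smoothed unigram "LM" replaces GPT2/BART here purely to keep the sketch self-contained.

```python
# Ranking by generation: score each candidate by log P(query | answer).
import math
from collections import Counter

def query_log_likelihood(query, answer, vocab_size=10000):
    counts = Counter(answer.lower().split())
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in query.lower().split())

query = "what causes rain"
answers = [
    "rain is caused by water vapor condensing in clouds",
    "the stock market closed higher today",
]
ranked = sorted(answers, key=lambda a: query_log_likelihood(query, a), reverse=True)
print(ranked[0])  # the on-topic answer wins
```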

47. RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text [PDF] 返回目录
  Liam Dugan, Daphne Ippolito, Arun Kirubarajan, Chris Callison-Burch
Abstract: In recent years, large neural networks for natural language generation (NLG) have made leaps and bounds in their ability to generate fluent text. However, the tasks of evaluating quality differences between NLG systems and understanding how humans perceive the generated text remain both crucial and difficult. In this system demonstration, we present Real or Fake Text (RoFT), a website that tackles both of these challenges by inviting users to try their hand at detecting machine-generated text in a variety of domains. We introduce a novel evaluation task based on detecting the boundary at which a text passage that starts off human-written transitions to being machine-generated. We show preliminary results of using RoFT to evaluate detection of machine-generated news articles.

48. Anubhuti -- An annotated dataset for emotional analysis of Bengali short stories [PDF] 返回目录
  Aditya Pal, Bhaskar Karn
Abstract: Thousands of short stories and articles are being written in many different languages all around the world today. Bengali, or Bangla, is the second most widely spoken language in India after Hindi and is the national language of Bangladesh. This work reports in detail the creation of Anubhuti, the first and largest text corpus for analyzing emotions expressed by writers of Bengali short stories. We explain the data collection methods, the manual annotation process and the resulting high inter-annotator agreement of the dataset, achieved thanks to the linguistic expertise of the annotators and the clear labelling methodology followed. We also address some of the challenges faced in the collection of raw data and in the annotation process of a low-resource language like Bengali. We have verified the performance of our dataset with baseline machine learning models as well as a deep learning model for emotion classification, and have found that these standard models achieve high accuracy and relevant feature selection on Anubhuti. In addition, we also explain how this dataset can be of interest to linguists and data analysts to study the flow of emotions as expressed by writers of Bengali literature.

49. A Survey on Recognizing Textual Entailment as an NLP Evaluation [PDF] 返回目录
  Adam Poliak
Abstract: Recognizing Textual Entailment (RTE) was proposed as a unified evaluation framework to compare semantic understanding of different NLP systems. In this survey paper, we provide an overview of different approaches for evaluating and understanding the reasoning capabilities of NLP systems. We then focus our discussion on RTE by highlighting prominent RTE datasets as well as advances in RTE datasets that focus on specific linguistic phenomena that can be used to evaluate NLP systems on a fine-grained level. We conclude by arguing that when evaluating NLP systems, the community should utilize newly introduced RTE datasets that focus on specific linguistic phenomena.

50. Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers [PDF] 返回目录
  Yimeng Wu, Peyman Passban, Mehdi Rezagholizade, Qun Liu
Abstract: With the growth of computing power, neural machine translation (NMT) models also grow accordingly and become better. However, they also become harder to deploy on edge devices due to memory constraints. To cope with this problem, a common practice is to distill knowledge from a large and accurately-trained teacher network (T) into a compact student network (S). Although knowledge distillation (KD) is useful in most cases, our study shows that existing KD techniques might not be suitable enough for deep NMT engines, so we propose a novel alternative. In our model, besides matching T and S predictions, we have a combinatorial mechanism to inject layer-level supervision from T to S. In this paper, we target low-resource settings and evaluate our translation engines for the Portuguese-English, Turkish-English, and English-German directions. Students trained using our technique have 50% fewer parameters and can still deliver comparable results to those of 12-layer teachers.
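
The combination idea, supervising each student layer with a learned mixture of a group of teacher layers rather than skipping teacher layers outright, can be sketched as follows (PyTorch assumed; the shapes, grouping, and softmax mixture are illustrative choices, not the paper's exact formulation).

```python
# Sketch: each student layer matches a weighted mixture of teacher layers.
import torch
import torch.nn as nn

hid, T, S = 16, 12, 4                 # hidden size, teacher/student depth
group_size = T // S
teacher_states = [torch.randn(8, hid) for _ in range(T)]  # frozen teacher outputs
student_states = [torch.randn(8, hid, requires_grad=True) for _ in range(S)]
mix = nn.Parameter(torch.zeros(S, group_size))            # mixture weights

loss = torch.zeros(())
for s in range(S):
    group = torch.stack(teacher_states[s * group_size:(s + 1) * group_size])
    w = torch.softmax(mix[s], dim=0).view(-1, 1, 1)
    target = (w * group).sum(0)       # combined signal from several teacher layers
    loss = loss + nn.functional.mse_loss(student_states[s], target)
loss.backward()
print(float(loss))
```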

51. Resource-Enhanced Neural Model for Event Argument Extraction [PDF] 返回目录
  Jie Ma, Shuai Wang, Rishita Anubhai, Miguel Ballesteros, Yaser Al-Onaizan
Abstract: Event argument extraction (EAE) aims to identify the arguments of an event and classify the roles that those arguments play. Despite great efforts made in prior work, there remain many challenges: (1) Data scarcity. (2) Capturing the long-range dependency, specifically, the connection between an event trigger and a distant event argument. (3) Integrating event trigger information into candidate argument representation. For (1), we explore using unlabeled data in different ways. For (2), we propose to use a syntax-attending Transformer that can utilize dependency parses to guide the attention mechanism. For (3), we propose a trigger-aware sequence encoder with several types of trigger-dependent sequence representations. We also support argument extraction either from text annotated with gold entities or from plain text. Experiments on the English ACE2005 benchmark show that our approach achieves a new state-of-the-art.

52. On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment [PDF] 返回目录
  Zirui Wang, Zachary C. Lipton, Yulia Tsvetkov
Abstract: Modern multilingual models are trained on concatenated text from multiple languages in hopes of conferring benefits to each (positive transfer), with the most pronounced benefits accruing to low-resource languages. However, recent work has shown that this approach can degrade performance on high-resource languages, a phenomenon known as negative interference. In this paper, we present the first systematic study of negative interference. We show that, contrary to previous belief, negative interference also impacts low-resource languages. While parameters are maximally shared to learn language-universal structures, we demonstrate that language-specific parameters do exist in multilingual models and they are a potential cause of negative interference. Motivated by these observations, we also present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference, by adding language-specific layers as meta-parameters and training them in a manner that explicitly improves shared layers' generalization on all languages. Overall, our results show that negative interference is more common than previously known, suggesting new directions for improving multilingual representations.

53. Exploring BERT's Sensitivity to Lexical Cues using Tests from Semantic Priming [PDF] 返回目录
  Kanishka Misra, Allyson Ettinger, Julia Taylor Rayz
Abstract: Models trained to estimate word probabilities in context have become ubiquitous in natural language processing. How do these models use lexical cues in context to inform their word probabilities? To answer this question, we present a case study analyzing the pre-trained BERT model with tests informed by semantic priming. Using English lexical stimuli that show priming in humans, we find that BERT too shows "priming," predicting a word with greater probability when the context includes a related word versus an unrelated one. This effect decreases as the amount of information provided by the context increases. Follow-up analysis shows BERT to be increasingly distracted by related prime words as context becomes more informative, assigning lower probabilities to related words. Our findings highlight the importance of considering contextual constraint effects when studying word prediction in these models, and highlight possible parallels with human processing.
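
A probe in this spirit is straightforward to run with the Hugging Face transformers library; the sentences and primes below are invented examples, not the paper's stimuli.

```python
# Compare BERT's probability for the same target word under a related
# vs. an unrelated prime (requires: pip install transformers torch).
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for prime in ("doctor", "carrot"):  # related vs. unrelated prime word
    text = f"{prime} . she went to the hospital to see the [MASK] ."
    score = fill(text, targets=["nurse"])[0]["score"]
    print(f"prime={prime}: P(nurse) = {score:.4f}")
```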

54. GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [PDF] 返回目录
  Wasi Uddin Ahmad, Nanyun Peng, Kai-Wei Chang
Abstract: Prevalent approaches in cross-lingual relation and event extraction use graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic representations such that models trained on one language can be applied to other languages. However, GCNs lack in modeling long-range dependencies or disconnected words in the dependency tree. To address this challenge, we propose to utilize the self-attention mechanism where we explicitly fuse structural information to learn the dependencies between words at different syntactic distances. We introduce GATE, a Graph Attention Transformer Encoder, and test its cross-lingual transferability on relation and event extraction tasks. We perform rigorous experiments on the widely used ACE05 dataset that includes three typologically different languages: English, Chinese, and Arabic. The evaluation results show that GATE outperforms three recently proposed methods by a large margin. Our detailed analysis reveals that due to the reliance on syntactic dependencies, GATE produces robust representations that facilitate transfer across languages.
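
The central mechanism, attention explicitly shaped by syntactic distance, can be caricatured in a few lines (PyTorch assumed; the tree distances are hand-set for a four-token toy sentence, and the fixed bias stands in for a learned distance-dependent weighting).

```python
# Caricature of attention conditioned on dependency-tree distance.
import torch

scores = torch.randn(4, 4)                 # raw attention scores
tree_dist = torch.tensor([[0., 1., 2., 3.],
                          [1., 0., 1., 2.],
                          [2., 1., 0., 1.],
                          [3., 2., 1., 0.]])
dist_bias = -0.5 * tree_dist               # damp attention to distant words
attn = torch.softmax(scores + dist_bias, dim=-1)
print(attn)
```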

55. Fact Extraction and VERification -- The FEVER case: An Overview [PDF] 返回目录
  Giannis Bekoulis, Christina Papagiannopoulou, Nikos Deligiannis
Abstract: Fact Extraction and VERification (FEVER) is a recently introduced task which aims to identify the veracity of a given claim based on Wikipedia documents. A lot of methods have been proposed to address this problem which consists of the subtasks of (i) retrieving the relevant documents (and sentences) from Wikipedia and (ii) validating whether the information in the documents supports or refutes a given claim. This task is essential since it can be the building block of applications that require a deep understanding of the language such as fake news detection and medical claim verification. In this paper, we aim to get a better understanding of the challenges in the task by presenting the literature in a structured and comprehensive way. In addition, we describe the proposed methods by analyzing the technical perspectives of the different approaches and discussing the performance results on the FEVER dataset.

56. Compositional Demographic Word Embeddings [PDF] 返回目录
  Charles Welch, Jonathan K. Kummerfeld, Verónica Pérez-Rosas, Rada Mihalcea
Abstract: Word embeddings are usually derived from corpora containing text from many individuals, thus leading to general purpose representations rather than individually personalized representations. While personalized embeddings can be useful to improve language model performance and other language processing tasks, they can only be computed for people with a large amount of longitudinal data, which is not the case for new users. We propose a new form of personalized word embeddings that use demographic-specific word representations derived compositionally from full or partial demographic information for a user (i.e., gender, age, location, religion). We show that the resulting demographic-aware word representations outperform generic word representations on two tasks for English: language modeling and word associations. We further explore the trade-off between the number of available attributes and their relative effectiveness and discuss the ethical implications of using them.

57. Plug and Play Autoencoders for Conditional Text Generation [PDF] 返回目录
  Florian Mai, Nikolaos Pappas, Ivan Montero, Noah A. Smith, James Henderson
Abstract: Text autoencoders are commonly used for conditional generation tasks such as style transfer. We propose methods which are plug and play, where any pretrained autoencoder can be used, and only require learning a mapping within the autoencoder's embedding space, training embedding-to-embedding (Emb2Emb). This reduces the need for labeled training data for the task and makes the training procedure more efficient. Crucial to the success of this method is a loss term for keeping the mapped embedding on the manifold of the autoencoder and a mapping which is trained to navigate the manifold by learning offset vectors. Evaluations on style transfer tasks both with and without sequence-to-sequence supervision show that our method performs better than or comparable to strong baselines while being up to four times faster.
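
A minimal rendition of the setup, with every component simplified relative to the paper's construction (PyTorch assumed): the "autoencoder" is reduced to two frozen linear maps, only the mapping phi is trained, and the manifold term asks that mapped points survive a decode-then-re-encode round trip.

```python
# Sketch of embedding-to-embedding (Emb2Emb-style) training.
import torch
import torch.nn as nn

dim = 16
encode, decode = nn.Linear(32, dim), nn.Linear(dim, 32)  # frozen "autoencoder"
for p in list(encode.parameters()) + list(decode.parameters()):
    p.requires_grad_(False)

phi = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
opt = torch.optim.Adam(phi.parameters(), lr=1e-3)

x, y = torch.randn(64, 32), torch.randn(64, 32)  # toy input/output pairs
for _ in range(100):
    z_hat, z_y = phi(encode(x)), encode(y)
    task = nn.functional.mse_loss(z_hat, z_y)                        # Emb2Emb loss
    manifold = nn.functional.mse_loss(encode(decode(z_hat)), z_hat)  # stay on manifold
    loss = task + 0.1 * manifold
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```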

58. Are "Undocumented Workers" the Same as "Illegal Aliens"? Disentangling Denotation and Connotation in Vector Spaces [PDF] 返回目录
  Albert Webson, Zhizhong Chen, Carsten Eickhoff, Ellie Pavlick
Abstract: In politics, neologisms are frequently invented for partisan objectives. For example, "undocumented workers" and "illegal aliens" refer to the same group of people (i.e., they have the same denotation), but they carry clearly different connotations. Examples like these have traditionally posed a challenge to reference-based semantic theories and led to increasing acceptance of alternative theories (e.g., Two-Factor Semantics) among philosophers and cognitive scientists. In NLP, however, popular pretrained models encode both denotation and connotation as one entangled representation. In this study, we propose an adversarial neural network that decomposes a pretrained representation into independent denotation and connotation representations. For intrinsic interpretability, we show that words with the same denotation but different connotations (e.g., "immigrants" vs. "aliens", "estate tax" vs. "death tax") move closer to each other in denotation space while moving further apart in connotation space. For extrinsic application, we train an information retrieval system with our disentangled representations and show that the denotation vectors improve the viewpoint diversity of document rankings.

59. Supervised Seeded Iterated Learning for Interactive Language Learning [PDF] 返回目录
  Yuchen Lu, Soumye Singhal, Florian Strub, Olivier Pietquin, Aaron Courville
Abstract: Language drift has been one of the major obstacles to train language models through interaction. When word-based conversational agents are trained towards completing a task, they tend to invent their own language rather than leveraging natural language. In recent literature, two general methods partially counter this phenomenon: Supervised Selfplay (S2P) and Seeded Iterated Learning (SIL). While S2P jointly trains interactive and supervised losses to counter the drift, SIL changes the training dynamics to prevent language drift from occurring. In this paper, we first highlight their respective weaknesses, i.e., late-stage training collapses and higher negative likelihood when evaluated on human corpus. Given these observations, we introduce Supervised Seeded Iterated Learning to combine both methods to minimize their respective weaknesses. We then show the effectiveness of our approach in the language-drift translation game.

60. A Self-supervised Approach for Semantic Indexing in the Context of COVID-19 Pandemic [PDF] 返回目录
  Nima Ebadi, Peyman Najafirad
Abstract: The pandemic has accelerated the pace at which COVID-19 scientific papers are published. In addition, the process of manually assigning semantic indexes to these papers by experts is even more time-consuming and overwhelming in the current health crisis. Therefore, there is an urgent need for automatic semantic indexing models which can effectively scale up to newly introduced concepts and to the rapidly evolving distributions of the hyperfocused related literature. In this research, we present a novel semantic indexing approach based on state-of-the-art self-supervised representation learning and transformer encoding, exclusively suited to pandemic crises. We present a case study on a novel dataset that is based on COVID-19 papers published and manually indexed in PubMed. Our study shows that our self-supervised model outperforms the best performing models of BioASQ Task 8a by a micro-F1 score of 0.1 and an LCA-F score of 0.08 on average. Our model also shows superior performance on detecting supplementary concepts, which is quite important when the focus of the literature has drastically shifted towards specific concepts related to the pandemic. Our study sheds light on the main challenges confronting semantic indexing models during a pandemic, namely new domains and drastic changes in their distributions, and, as a superior alternative for such situations, proposes a model founded on approaches which have shown auspicious performance in improving generalization and data efficiency in various NLP tasks. We also show that joint indexing of major Medical Subject Headings (MeSH) and supplementary concepts improves the overall performance.

61. TeMP: Temporal Message Passing for Temporal Knowledge Graph Completion [PDF] 返回目录
  Jiapeng Wu, Meng Cao, Jackie Chi Kit Cheung, William L. Hamilton
Abstract: Inferring missing facts in temporal knowledge graphs (TKGs) is a fundamental and challenging task. Previous works have approached this problem by augmenting methods for static knowledge graphs to leverage time-dependent representations. However, these methods do not explicitly leverage multi-hop structural information and temporal facts from recent time steps to enhance their predictions. Additionally, prior work does not explicitly address the temporal sparsity and variability of entity distributions in TKGs. We propose the Temporal Message Passing (TeMP) framework to address these challenges by combining graph neural networks, temporal dynamics models, data imputation and frequency-based gating techniques. Experiments on standard TKG tasks show that our approach provides substantial gains compared to the previous state of the art, achieving a 10.7% average relative improvement in Hits@10 across three standard benchmarks. Our analysis also reveals important sources of variability both within and across TKG datasets, and we introduce several simple but strong baselines that outperform the prior state of the art in certain settings.

62. Slice-Aware Neural Ranking [PDF] 返回目录
  Gustavo Penha, Claudia Hauff
Abstract: Understanding when and why neural ranking models fail for an IR task via error analysis is an important part of the research cycle. Here we focus on the challenges of (i) identifying categories of difficult instances (a pair of question and response candidates) for which a neural ranker is ineffective, and (ii) improving neural ranking for such instances. To address both challenges we resort to slice-based learning, for which the goal is to improve the effectiveness of neural models for slices (subsets) of data. We address challenge (i) by proposing different slicing functions (SFs) that select slices of the dataset; based on prior work, we heuristically capture different failures of neural rankers. Then, for challenge (ii), we adapt a neural ranking model to learn slice-aware representations, i.e. the adapted model learns to represent the question and responses differently based on the model's prediction of which slices they belong to. Our experimental results (the source code and data are available at this https URL) across three different ranking tasks and four corpora show that slice-based learning improves effectiveness by an average of 2% over a neural ranker that is not slice-aware.

63. Program Enhanced Fact Verification with Verbalization and Graph Attention Network [PDF] 返回目录
  Xiaoyu Yang, Feng Nie, Yufei Feng, Quan Liu, Zhigang Chen, Xiaodan Zhu
Abstract: Performing fact verification based on structured data is important for many real-life applications and is a challenging research problem, particularly when it involves both symbolic operations and informal inference based on language understanding. In this paper, we present a Program-enhanced Verbalization and Graph Attention Network (ProgVGAT) to integrate programs and execution into textual inference models. Specifically, a verbalization with program execution model is proposed to accumulate evidences that are embedded in operations over the tables. Built on that, we construct the graph attention verification networks, which are designed to fuse different sources of evidences from verbalized program execution, program structures, and the original statements and tables, to make the final verification decision. To support the above framework, we propose a program selection module optimized with a new training strategy based on margin loss, to produce more accurate programs, which is shown to be effective in enhancing the final verification results. Experimental results show that the proposed framework achieves the new state-of-the-art performance, a 74.4% accuracy, on the benchmark dataset TABFACT.

64. Weakly-Supervised Feature Learning via Text and Image Matching [PDF] 返回目录
  Gongbo Liang, Connor Greenwell, Yu Zhang, Xiaoqin Wang, Ramakanth Kavuluru, Nathan Jacobs
Abstract: When training deep neural networks for medical image classification, obtaining a sufficient number of manually annotated images is often a significant challenge. We propose to use textual findings, which are routinely written by clinicians during manual image analysis, to help overcome this problem. The key idea is to use a contrastive loss to train image and text feature extractors to recognize if a given image-finding pair is a true match. The learned image feature extractor is then fine-tuned, in a transfer learning setting, for a supervised classification task. This approach makes it possible to train using large datasets because pairs of images and textual findings are widely available in medical records. We evaluate our method on three datasets and find consistent performance improvements. The biggest gains are realized when fewer manually labeled examples are available. In some cases, our method achieves the same performance as the baseline even when using 70%-98% fewer labeled examples.
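
The pair-matching objective can be sketched as a batch-level contrastive loss in which true image-finding pairs sit on the diagonal of a similarity matrix (PyTorch assumed; an InfoNCE-style formulation is used here, which may differ in detail from the paper's loss).

```python
# Batch-level contrastive matching of toy image and text encoders.
import torch
import torch.nn as nn

img_enc, txt_enc = nn.Linear(512, 128), nn.Linear(300, 128)  # stand-in encoders

imgs, texts = torch.randn(32, 512), torch.randn(32, 300)
zi = nn.functional.normalize(img_enc(imgs), dim=1)
zt = nn.functional.normalize(txt_enc(texts), dim=1)
logits = zi @ zt.t() / 0.07          # pairwise cosine similarities / temperature
labels = torch.arange(32)            # i-th image matches i-th finding
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
print(float(loss))
```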

65. Digital Voicing of Silent Speech [PDF] 返回目录
  David Gaddy, Dan Klein
Abstract: In this paper, we consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements that capture muscle impulses. While prior work has focused on training speech synthesis models from EMG collected during vocalized speech, we are the first to train from EMG collected during silently articulated speech. We introduce a method of training on silent EMG by transferring audio targets from vocalized to silent signals. Our method greatly improves intelligibility of audio generated from silent EMG compared to a baseline that only trains with vocalized data, decreasing transcription word error rate from 64% to 4% in one data condition and 88% to 68% in another. To spur further development on this task, we share our new dataset of silent and vocalized facial EMG measurements.

66. Learning to Represent Image and Text with Denotation Graph [PDF] 返回目录
  Bowen Zhang, Hexiang Hu, Vihan Jain, Eugene Ie, Fei Sha
Abstract: Learning to fuse vision and language information and representing them is an important research problem with many applications. Recent progress has leveraged the ideas of pre-training (from language modeling) and attention layers in Transformers to learn representations from datasets containing images aligned with linguistic expressions that describe the images. In this paper, we propose learning representations from a set of implied, visually grounded expressions between image and text, automatically mined from those datasets. In particular, we use denotation graphs to represent how specific concepts (such as sentences describing images) can be linked to abstract and generic concepts (such as short phrases) that are also visually grounded. This type of generic-to-specific relation can be discovered using linguistic analysis tools. We propose methods to incorporate such relations into learning representations. We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations. The representations lead to stronger empirical results on downstream tasks of cross-modal image retrieval, referring expression, and compositional attribute-object recognition. Both our code and the extracted denotation graphs on the Flickr30K and COCO datasets are publicly available at this https URL.
