
[arXiv papers] Computation and Language 2020-05-11

Contents

1. Evidence Inference 2.0: More Data, Better Models [PDF] Abstract
2. Quantum Natural Language Processing on Near-Term Quantum Computers [PDF] Abstract
3. Beyond Accuracy: Behavioral Testing of NLP models with CheckList [PDF] Abstract
4. SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics [PDF] Abstract
5. Literature Triage on Genomic Variation Publications by Knowledge-enhanced Multi-channel CNN [PDF] Abstract
6. Sentiment Analysis Using Simplified Long Short-term Memory Recurrent Neural Networks [PDF] Abstract
7. CAiRE-COVID: A Question Answering and Multi-Document Summarization System for COVID-19 Research [PDF] Abstract
8. Towards Conversational Recommendation over Multi-Type Dialogs [PDF] Abstract
9. Learning to Detect Unacceptable Machine Translations for Downstream Tasks [PDF] Abstract
10. Context-Sensitive Generation Network for Handling Unknown Slot Values in Dialogue State Tracking [PDF] Abstract
11. Detecting East Asian Prejudice on Social Media [PDF] Abstract
12. Distilling Knowledge from Pre-trained Language Models via Text Smoothing [PDF] Abstract
13. Comparative Analysis of Word Embeddings for Capturing Word Similarities [PDF] Abstract
14. Mapping Natural Language Instructions to Mobile UI Action Sequences [PDF] Abstract
15. Phonotactic Complexity and its Trade-offs [PDF] Abstract
16. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization [PDF] Abstract
17. SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization [PDF] Abstract
18. LIIR at SemEval-2020 Task 12: A Cross-Lingual Augmentation Approach for Multilingual Offensive Language Identification [PDF] Abstract
19. A Systematic Assessment of Syntactic Generalization in Neural Language Models [PDF] Abstract
20. Learning to Segment Actions from Observation and Narration [PDF] Abstract
21. On Vocabulary Reliance in Scene Text Recognition [PDF] Abstract
22. Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention [PDF] Abstract
23. Synchronous Bidirectional Learning for Multilingual Lip Reading [PDF] Abstract
24. Data-driven Modelling of Dynamical Systems Using Tree Adjoining Grammar and Genetic Programming [PDF] Abstract

Abstracts

1. Evidence Inference 2.0: More Data, Better Models [PDF] Back to contents
  Jay DeYoung, Eric Lehman, Ben Nye, Iain J. Marshall, Byron C. Wallace
Abstract: How do we most effectively treat a disease or condition? Ideally, we could consult a database of evidence gleaned from clinical trials to answer such questions. Unfortunately, no such database exists; clinical trial results are instead disseminated primarily via lengthy natural language articles. Perusing all such articles would be prohibitively time-consuming for healthcare practitioners; they instead tend to depend on manually compiled systematic reviews of medical literature to inform care. NLP may speed this process up, and eventually facilitate immediate consult of published evidence. The Evidence Inference dataset was recently released to facilitate research toward this end. This task entails inferring the comparative performance of two treatments, with respect to a given outcome, from a particular article (describing a clinical trial) and identifying supporting evidence. For instance: Does this article report that chemotherapy performed better than surgery for five-year survival rates of operable cancers? In this paper, we collect additional annotations to expand the Evidence Inference dataset by 25\%, provide stronger baseline models, systematically inspect the errors that these make, and probe dataset quality. We also release an abstract only (as opposed to full-texts) version of the task for rapid model prototyping. The updated corpus, documentation, and code for new baselines and evaluations are available at this http URL.

2. Quantum Natural Language Processing on Near-Term Quantum Computers [PDF] Back to contents
  Konstantinos Meichanetzidis, Stefano Gogioso, Giovanni De Felice, Nicolò Chiappori, Alexis Toumi, Bob Coecke
Abstract: In this work, we describe a full-stack pipeline for natural language processing on near-term quantum computers, aka QNLP. The language modelling framework we employ is that of compositional distributional semantics (DisCoCat), which extends and complements the compositional structure of pregroup grammars. Within this model, the grammatical reduction of a sentence is interpreted as a diagram, encoding a specific interaction of words according to the grammar. It is this interaction which, together with a specific choice of word embedding, realises the meaning (or "semantics") of a sentence. Building on the formal quantum-like nature of such interactions, we present a method for mapping DisCoCat diagrams to quantum circuits. Our methodology is compatible both with NISQ devices and with established Quantum Machine Learning techniques, paving the way to near-term applications of quantum technology to natural language processing.

3. Beyond Accuracy: Behavioral Testing of NLP models with CheckList [PDF] Back to contents
  Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
Abstract: Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
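The behavioral-testing idea can be sketched as a minimal CheckList-style Minimum Functionality Test (MFT) for negation. The keyword "classifier" below is a hypothetical stand-in for a real sentiment model, not the authors' `checklist` library; it exists only to show how templated test cases expose failures that held-out accuracy hides.

```python
# Toy CheckList-style Minimum Functionality Test (MFT) for negation.
# toy_sentiment is a hypothetical, deliberately naive model used for illustration.

def toy_sentiment(text: str) -> str:
    """Naive classifier: predicts 'pos' whenever a positive word appears."""
    positive = {"good", "great", "excellent"}
    tokens = text.lower().replace(".", "").split()
    return "pos" if any(t in positive for t in tokens) else "neg"

def mft_negation(model) -> float:
    """MFT: negating a positive adjective should yield 'neg'.
    Generates test cases from templates and returns the failure rate."""
    templates = ["The food was not {adj}.", "This is not a {adj} movie."]
    adjectives = ["good", "great", "excellent"]
    cases = [t.format(adj=a) for t in templates for a in adjectives]
    failures = sum(1 for c in cases if model(c) != "neg")
    return failures / len(cases)
```

Because the toy model only checks for positive keywords, it fails every negated case (failure rate 1.0), which is exactly the kind of behavioral bug CheckList is designed to surface.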

4. SentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics [PDF] Back to contents
  Da Yin, Tao Meng, Kai-Wei Chang
Abstract: We propose SentiBERT, a variant of BERT that effectively captures compositional sentiment semantics. The model incorporates contextualized representation with binary constituency parse tree to capture semantic composition. Comprehensive experiments demonstrate that SentiBERT achieves competitive performance on phrase-level sentiment classification. We further demonstrate that the sentiment composition learned from the phrase-level annotations on SST can be transferred to other sentiment analysis tasks as well as related tasks, such as emotion classification tasks. Moreover, we conduct ablation studies and design visualization methods to understand SentiBERT. We show that SentiBERT is better than baseline approaches in capturing negation and the contrastive relation and model the compositional sentiment semantics.

5. Literature Triage on Genomic Variation Publications by Knowledge-enhanced Multi-channel CNN [PDF] Back to contents
  Chenhui Lv, Qian Lu, Xiang Zhang
Abstract: Background: To investigate the correlation between genomic variation and certain diseases or phenotypes, the fundamental task is to screen out the concerning publications from massive literature, which is called literature triage. Some knowledge bases, including UniProtKB/Swiss-Prot and NHGRI-EBI GWAS Catalog are created for collecting concerning publications. These publications are manually curated by experts, which is time-consuming. Moreover, the manual curation of information from literature is not scalable due to the rapidly increasing amount of publications. In order to cut down the cost of literature triage, machine-learning models were adopted to automatically identify biomedical publications. Methods: Comparing to previous studies utilizing machine-learning models for literature triage, we adopt a multi-channel convolutional network to utilize rich textual information and meanwhile bridge the semantic gaps from different corpora. In addition, knowledge embeddings learned from UMLS is also used to provide extra medical knowledge beyond textual features in the process of triage. Results: We demonstrate that our model outperforms the state-of-the-art models over 5 datasets with the help of knowledge embedding and multiple channels. Our model improves the accuracy of biomedical literature triage results. Conclusions: Multiple channels and knowledge embeddings enhance the performance of the CNN model in the task of biomedical literature triage. Keywords: Literature Triage; Knowledge Embedding; Multi-channel Convolutional Network

6. Sentiment Analysis Using Simplified Long Short-term Memory Recurrent Neural Networks [PDF] Back to contents
  Karthik Gopalakrishnan, Fathi M.Salem
Abstract: LSTM or Long Short Term Memory Networks is a specific type of Recurrent Neural Network (RNN) that is very effective in dealing with long sequence data and learning long term dependencies. In this work, we perform sentiment analysis on a GOP Debate Twitter dataset. To speed up training and reduce the computational cost and time, six different parameter reduced slim versions of the LSTM model (slim LSTM) are proposed. We evaluate two of these models on the dataset. The performance of these two LSTM models along with the standard LSTM model is compared. The effect of Bidirectional LSTM Layers is also studied. The work also consists of a study to choose the best architecture, apart from establishing the best set of hyper parameters for different LSTM Models.
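The paper evaluates several parameter-reduced "slim" LSTM variants. As a sketch only, the cell below shows one plausible variant in which the input, forget, and output gates drop their input-weight matrices and depend only on the previous hidden state and a bias; the exact slim variants studied in the paper may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slim_lstm_step(x, h, c, p):
    """One step of a hypothetical 'slim' LSTM cell: the gates i, f, o omit
    their input-weight matrices (saving roughly 3 * d_in * d_h parameters);
    only the candidate cell update still sees the input x."""
    i = sigmoid(p["Ui"] @ h + p["bi"])                       # input gate (slim)
    f = sigmoid(p["Uf"] @ h + p["bf"])                       # forget gate (slim)
    o = sigmoid(p["Uo"] @ h + p["bo"])                       # output gate (slim)
    c_new = f * c + i * np.tanh(p["Wc"] @ x + p["Uc"] @ h + p["bc"])
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

A standard LSTM would add terms `Wi @ x`, `Wf @ x`, and `Wo @ x` inside the three gates; removing them is what makes the variant "slim" while keeping the recurrent dynamics intact.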

7. CAiRE-COVID: A Question Answering and Multi-Document Summarization System for COVID-19 Research [PDF] Back to contents
  Dan Su, Yan Xu, Tiezheng Yu, Farhad Bin Siddique, Elham J. Barezi, Pascale Fung
Abstract: To address the need for refined information in COVID-19 pandemic, we propose a deep learning-based system that uses state-of-the-art natural language processing (NLP) question answering (QA) techniques combined with summarization for mining the available scientific literature. Our system leverages the Information Retrieval (IR) system and QA models to extract relevant snippets from the existing literature given a query. Fluent summaries are also provided to help understand the content in a more efficient way. In this paper, we describe our CAiRE-COVID system architecture and methodology for building the system. To bootstrap the further study, the code for our system is available at this https URL

8. Towards Conversational Recommendation over Multi-Type Dialogs [PDF] Back to contents
  Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, Ting Liu
Abstract: We focus on the study of conversational recommendation in the context of multi-type dialogs, where the bots can proactively and naturally lead a conversation from a non-recommendation dialog (e.g., QA) to a recommendation dialog, taking into account user's interests and feedback. To facilitate the study of this task, we create a human-to-human Chinese dialog dataset DuRecDial (about 10k dialogs, 156k utterances), where there are multiple sequential dialogs for a pair of a recommendation seeker (user) and a recommender (bot). In each dialog, the recommender proactively leads a multi-type dialog to approach recommendation targets and then makes multiple recommendations with rich interaction behavior. This dataset allows us to systematically investigate different parts of the overall problem, e.g., how to naturally lead a dialog, how to interact with users for recommendation. Finally we establish baseline results on DuRecDial for future studies. Dataset and codes are publicly available at this https URL.

9. Learning to Detect Unacceptable Machine Translations for Downstream Tasks [PDF] Back to contents
  Meng Zhang, Xin Jiang, Yang Liu, Qun Liu
Abstract: The field of machine translation has progressed tremendously in recent years. Even though the translation quality has improved significantly, current systems are still unable to produce uniformly acceptable machine translations for the variety of possible use cases. In this work, we put machine translation in a cross-lingual pipeline and introduce downstream tasks to define task-specific acceptability of machine translations. This allows us to leverage parallel data to automatically generate acceptability annotations on a large scale, which in turn help to learn acceptability detectors for the downstream tasks. We conduct experiments to demonstrate the effectiveness of our framework for a range of downstream tasks and translation models.
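The automatic-labeling idea can be sketched as follows: a machine translation is labeled acceptable for a downstream task when the task model produces the same output on the MT as on the reference translation, so parallel data yields labels for free. The downstream classifier here is a hypothetical placeholder, and the paper's actual labeling procedure may be more nuanced.

```python
def acceptability_labels(mt_texts, ref_texts, downstream):
    """Label each machine translation 1 (acceptable) if a downstream task
    model gives the same output on the MT as on the reference translation,
    else 0 -- turning parallel data into large-scale supervision for an
    acceptability detector (simplified sketch of the paper's framework)."""
    return [int(downstream(mt) == downstream(ref))
            for mt, ref in zip(mt_texts, ref_texts)]
```

The resulting binary labels can then be used to train a task-specific acceptability detector that flags unacceptable translations before they reach the downstream model.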

10. Context-Sensitive Generation Network for Handling Unknown Slot Values in Dialogue State Tracking [PDF] Back to contents
  Puhai Yang, Heyan Huang, Xian-Ling Mao
Abstract: As a key component in a dialogue system, dialogue state tracking plays an important role. It is very important for dialogue state tracking to deal with the problem of unknown slot values. As far as we known, almost all existing approaches depend on pointer network to solve the unknown slot value problem. These pointer network-based methods usually have a hidden assumption that there is at most one out-of-vocabulary word in an unknown slot value because of the character of a pointer network. However, often, there are multiple out-of-vocabulary words in an unknown slot value, and it makes the existing methods perform bad. To tackle the problem, in this paper, we propose a novel Context-Sensitive Generation network (CSG) which can facilitate the representation of out-of-vocabulary words when generating the unknown slot value. Extensive experiments show that our proposed method performs better than the state-of-the-art baselines.

11. Detecting East Asian Prejudice on Social Media [PDF] Back to contents
  Bertie Vidgen, Austin Botelho, David Broniatowski, Ella Guest, Matthew Hall, Helen Margetts, Rebekah Tromble, Zeerak Waseem, Scott Hale
Abstract: The outbreak of COVID-19 has transformed societies across the world as governments tackle the health, economic and social costs of the pandemic. It has also raised concerns about the spread of hateful language and prejudice online, especially hostility directed against East Asia. In this paper we report on the creation of a classifier that detects and categorizes social media posts from Twitter into four classes: Hostility against East Asia, Criticism of East Asia, Meta-discussions of East Asian prejudice and a neutral class. The classifier achieves an F1 score of 0.83 across all four classes. We provide our final model (coded in Python), as well as a new 20,000 tweet training dataset used to make the classifier, two analyses of hashtags associated with East Asian prejudice and the annotation codebook. The classifier can be implemented by other researchers, assisting with both online content moderation processes and further research into the dynamics, prevalence and impact of East Asian prejudice online during this global pandemic.
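For reference, an "F1 score of 0.83 across all four classes" suggests an average of per-class F1 scores; whether the paper uses macro or micro averaging is not stated here, so the sketch below assumes macro averaging from per-class confusion counts.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Per-class F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(counts):
    """counts: one (tp, fp, fn) tuple per class; macro F1 is the unweighted
    mean of per-class F1 scores, so rare classes count as much as common ones."""
    return sum(f1_score(*c) for c in counts) / len(counts)
```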

12. Distilling Knowledge from Pre-trained Language Models via Text Smoothing [PDF] Back to contents
  Xing Wu, Yibing Liu, Xiangyang Zhou, Dianhai Yu
Abstract: This paper studies compressing pre-trained language models, like BERT (Devlin et al., 2019), via teacher-student knowledge distillation. Previous works usually force the student model to strictly mimic the smoothed labels predicted by the teacher BERT. As an alternative, we propose a new method for BERT distillation, i.e., asking the teacher to generate smoothed word ids, rather than labels, for teaching the student model in knowledge distillation. We call this kind of method Text Smoothing. Practically, we use the softmax prediction of the Masked Language Model (MLM) in BERT to generate word distributions for given texts and smooth those input texts using the predicted soft word ids. We assume that both the smoothed labels and the smoothed texts can implicitly augment the input corpus, while text smoothing is intuitively more efficient since it can generate more instances in one neural network forward step. Experimental results on GLUE and SQuAD demonstrate that our solution can achieve competitive results compared with existing BERT distillation methods.
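A minimal sketch of the text-smoothing idea: the teacher MLM's softmax over the vocabulary at each position becomes a "soft word id" for the student, so one teacher forward pass implicitly encodes many augmented versions of the input. The tiny logit matrix here is invented for illustration; a real teacher would be a trained BERT MLM head.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-d score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def smooth_text(mlm_logits):
    """mlm_logits: (seq_len, vocab) teacher MLM scores per position.
    Returns (seq_len, vocab) 'soft word ids': one probability distribution
    over the vocabulary per token, used as the student's smoothed input."""
    return np.stack([softmax(row) for row in mlm_logits])
```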

13. Comparative Analysis of Word Embeddings for Capturing Word Similarities [PDF] Back to contents
  Martina Toshevska, Frosina Stojanovska, Jovan Kalajdjieski
Abstract: Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets for word pairs similarities. The research in this paper conducts a correlation analysis between ground truth word similarities and similarities obtained by different word embedding methods.
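The intrinsic evaluation described (correlating model cosine similarities with human word-pair judgments) can be sketched as follows. The three 2-d "embeddings" and human scores are invented placeholders, and the Spearman helper ignores rank ties for brevity.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(a, b):
    """Spearman rank correlation (toy version: no tie handling)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return np.array(r)
    ra, rb = ranks(a) - ranks(a).mean(), ranks(b) - ranks(b).mean()
    return float(ra @ rb / (np.linalg.norm(ra) * np.linalg.norm(rb)))

# Invented 2-d "embeddings" and human similarity judgments for three pairs.
emb = {"king": np.array([1.0, 0.2]),
       "queen": np.array([0.9, 0.3]),
       "apple": np.array([0.1, 1.0])}
pairs = [("king", "queen", 0.9), ("king", "apple", 0.2), ("queen", "apple", 0.3)]
rho = spearman([cosine(emb[a], emb[b]) for a, b, _ in pairs],
               [h for _, _, h in pairs])
```

Here the model's similarity ordering agrees with the human ordering, so the rank correlation is 1.0; on a real benchmark such as WordSim-353 or SimLex-999, rho quantifies how well an embedding space captures human similarity judgments.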

14. Mapping Natural Language Instructions to Mobile UI Action Sequences [PDF] Back to contents
  Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge
Abstract: We present a new problem: grounding natural language instructions to mobile user interface actions, and contribute three new datasets for it. For full task evaluation, we create PIXELHELP, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in HowTo instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces. We use a Transformer to extract action phrase tuples from long-range natural language instructions. A grounding Transformer then contextually represents UI objects using both their content and screen position and connects them to object descriptions. Given a starting screen and instruction, our model achieves 70.59% accuracy on predicting complete ground-truth action sequences in PIXELHELP.

15. Phonotactic Complexity and its Trade-offs [PDF] Back to contents
  Tiago Pimentel, Brian Roark, Ryan Cotterell
Abstract: We present methods for calculating a measure of phonotactic complexity---bits per phoneme---that permits a straightforward cross-linguistic comparison. When given a word, represented as a sequence of phonemic segments such as symbols in the international phonetic alphabet, and a statistical model trained on a sample of word types from the language, we can approximately measure bits per phoneme using the negative log-probability of that word under the model. This simple measure allows us to compare the entropy across languages, giving insight into how complex a language's phonotactics are. Using a collection of 1016 basic concept words across 106 languages, we demonstrate a very strong negative correlation of -0.74 between bits per phoneme and the average length of words.
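The measure itself is simple to state in code: under a phoneme-level language model, the bits-per-phoneme of a word is its negative log2-probability divided by its length. A unigram probability table stands in here for the trained sequence model used in the paper.

```python
import math

def bits_per_phoneme(phonemes, unigram_probs):
    """Average -log2 p(phoneme) over a word, i.e. the word's negative
    log2-probability under a unigram phoneme model divided by its length.
    (The paper trains a sequence model on word types; a unigram table is
    a stand-in for illustration.)"""
    nll_bits = -sum(math.log2(unigram_probs[p]) for p in phonemes)
    return nll_bits / len(phonemes)
```

With p("a") = 0.5 and p("b") = 0.25, the word /ab/ costs (1 + 2) / 2 = 1.5 bits per phoneme; averaging this quantity over a language's basic vocabulary gives the cross-linguistic complexity score that the paper correlates with average word length.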

16. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization [PDF] Back to contents
  Esin Durmus, He He, Mona Diab
Abstract: Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.
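The scoring step of a QA-based faithfulness metric can be sketched as: questions generated from the summary are answered against the source document, and the score is the fraction of matching answers. Question generation and the QA model are replaced here by a hypothetical lookup, and the normalization is deliberately simple.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().strip().split())

def feqa_style_score(summary_qa, answer_from_document):
    """summary_qa: (question, answer-from-summary) pairs generated from the
    summary. answer_from_document: a callable QA model run on the source
    document. Returns the fraction of questions whose document answer
    matches the summary answer; mismatches flag unfaithful content."""
    matches = sum(1 for q, a in summary_qa
                  if normalize(answer_from_document(q)) == normalize(a))
    return matches / len(summary_qa)
```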

17. SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization [PDF] Back to contents
  Yang Gao, Wei Zhao, Steffen Eger
Abstract: We study unsupervised multi-document summarization evaluation metrics, which require neither human-written reference summaries nor human annotations (e.g. preferences, ratings, etc.). We propose SUPERT, which rates the quality of a summary by measuring its semantic similarity with a pseudo reference summary, i.e. selected salient sentences from the source documents, using contextualized embeddings and soft token alignment techniques. Compared to the state-of-the-art unsupervised evaluation metrics, SUPERT correlates better with human ratings by 18-39%. Furthermore, we use SUPERT as rewards to guide a neural-based reinforcement learning summarizer, yielding favorable performance compared to the state-of-the-art unsupervised summarizers. All source code is available at this https URL.
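The soft token alignment at the core of such a metric can be sketched as a greedy best-match over cosine similarities between contextualized token embeddings of the summary and of the pseudo-reference. The embedding arrays below are placeholders; in the real system they would come from a contextualized encoder.

```python
import numpy as np

def soft_align_score(summary_emb, ref_emb):
    """Greedy soft token alignment: each summary token embedding is matched
    to its most similar pseudo-reference token by cosine similarity, and the
    score is the mean best-match similarity. Both inputs are (n_tokens, dim)."""
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sims = s @ r.T                      # (n_summary, n_ref) cosine matrix
    return float(sims.max(axis=1).mean())
```

An identical summary and pseudo-reference score 1.0, and unrelated (orthogonal) token embeddings score 0.0; real summaries fall in between, which is what makes the score usable as a reward for a reinforcement-learning summarizer.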

18. LIIR at SemEval-2020 Task 12: A Cross-Lingual Augmentation Approach for Multilingual Offensive Language Identification [PDF] Back to contents
  Erfan Ghadery, Marie-Francine Moens
Abstract: This paper presents our system entitled `LIIR' for SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2). We have participated in sub-task A for English, Danish, Greek, Arabic, and Turkish languages. We adapt and fine-tune the BERT and Multilingual Bert models made available by Google AI for English and non-English languages respectively. For the English language, we use a combination of two fine-tuned BERT models. For other languages we propose a cross-lingual augmentation approach in order to enrich training data and we use Multilingual BERT to obtain sentence representations. LIIR achieved rank 14/38, 18/47, 24/86, 24/54, and 25/40 in Greek, Turkish, English, Arabic, and Danish languages, respectively.

19. A Systematic Assessment of Syntactic Generalization in Neural Language Models [PDF] Back to contents
  Jennifer Hu, Jon Gauthier, Peng Qian, Ethan Wilcox, Roger P. Levy
Abstract: State-of-the-art neural network models have achieved dizzyingly low perplexity scores on major language modeling benchmarks, but it remains unknown whether optimizing for broad-coverage predictive performance leads to human-like syntactic knowledge. Furthermore, existing work has not provided a clear picture about the model properties required to produce proper syntactic generalizations. We present a systematic evaluation of the syntactic knowledge of neural language models, testing 20 combinations of model types and data sizes on a set of 34 syntactic test suites. We find that model architecture clearly influences syntactic generalization performance: Transformer models and models with explicit hierarchical structure reliably outperform pure sequence models in their predictions. In contrast, we find no clear influence of the scale of training data on these syntactic generalization tests. We also find no clear relation between a model's perplexity and its syntactic generalization performance.

20. Learning to Segment Actions from Observation and Narration [PDF] Back to contents
  Daniel Fried, Jean-Baptiste Alayrac, Phil Blunsom, Chris Dyer, Stephen Clark, Aida Nematzadeh
Abstract: We apply a generative segmental model of task structure, guided by narration, to action segmentation in video. We focus on unsupervised and weakly-supervised settings where no action labels are known during training. Despite its simplicity, our model performs competitively with previous work on a dataset of naturalistic instructional videos. Our model allows us to vary the sources of supervision used in training, and we find that both task structure and narrative language provide large benefits in segmentation quality.

21. On Vocabulary Reliance in Scene Text Recognition [PDF] Back to contents
  Zhaoyi Wan, Jielei Zhang, Liang Zhang, Jiebo Luo, Cong Yao
Abstract: The pursuit of high performance on public benchmarks has been the driving force for research in scene text recognition, and notable progress has been achieved. However, a close investigation reveals a startling fact that the state-of-the-art methods perform well on images with words within vocabulary but generalize poorly to images with words outside vocabulary. We call this phenomenon "vocabulary reliance". In this paper, we establish an analytical framework to conduct an in-depth study on the problem of vocabulary reliance in scene text recognition. Key findings include: (1) Vocabulary reliance is ubiquitous, i.e., all existing algorithms more or less exhibit such characteristic; (2) Attention-based decoders prove weak in generalizing to words outside vocabulary and segmentation-based decoders perform well in utilizing visual features; (3) Context modeling is highly coupled with the prediction layers. These findings provide new insights and can benefit future research in scene text recognition. Furthermore, we propose a simple yet effective mutual learning strategy to allow models of two families (attention-based and segmentation-based) to learn collaboratively. This remedy alleviates the problem of vocabulary reliance and improves the overall scene text recognition performance.

22. Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention [PDF] Back to contents
  Myunghun Jung, Youngmoon Jung, Jahyun Goo, Hoirin Kim
Abstract: Keyword spotting (KWS) and speaker verification (SV) have been studied independently although it is known that acoustic and speaker domains are complementary. In this paper, we propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information. The multi-task network tightly combines sub-networks aiming at performance improvement in challenging conditions such as noisy environments, open-vocabulary KWS, and short-duration SV by introducing novel techniques of connectionist temporal classification (CTC)-based soft voice activity detection (VAD) and global query attention. Frame-level acoustic and speaker information is integrated with phonetically originated weights so that forms a word-level global representation. Then it is used for the aggregation of feature vectors to generate discriminative embeddings. Our proposed approach shows 4.06% and 26.71% relative improvements in equal error rate (EER) compared to the baselines for both tasks. We also present a visualization example and results of ablation experiments.
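One plausible reading of the "CTC-based soft VAD" component is that each frame is weighted by how speech-like it is, taken here as 1 − P(blank) under a CTC output head, and those weights are used to pool frame embeddings into an utterance-level vector. The shapes, the blank index, and the weighting rule below are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_vad_pool(frame_emb, ctc_logits, blank=0):
    """Pool frame embeddings (T, D) into one utterance vector, weighting
    each frame by how 'speech-like' it is: 1 - P(blank) under the CTC head."""
    post = softmax(ctc_logits)            # (T, num_tokens) CTC posteriors
    speech = 1.0 - post[:, blank]         # (T,) soft VAD weights
    w = speech / (speech.sum() + 1e-9)    # normalize to a distribution
    return (w[:, None] * frame_emb).sum(axis=0)
```

Frames the CTC head considers blank (non-speech) contribute almost nothing to the pooled embedding, which is the noise-robustness argument: silence and background frames are softly gated out rather than hard-thresholded.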

23. Synchronous Bidirectional Learning for Multilingual Lip Reading [PDF] 返回目录
  Mingshuang Luo, Shuang Yang, Xilin Chen, Zitao Liu, Shiguang Shan
Abstract: Lip reading has received increasing attention in recent years. This paper focuses on the synergy of multilingual lip reading. There are more than 7,000 languages in the world, which implies that it is impractical to train separate lip reading models by collecting large-scale data per language. Although each language has its own linguistic and pronunciation features, the lip movements of all languages share similar patterns. Based on this idea, in this paper, we try to explore the synergized learning of multilingual lip reading, and further propose a synchronous bidirectional learning(SBL) framework for effective synergy of multilingual lip reading. Firstly, we introduce the phonemes as our modeling units for the multilingual setting. Similar phoneme always leads to similar visual patterns. The multilingual setting would increase both the quantity and the diversity of each phoneme shared among different languages. So the learning for the multilingual target should bring improvement to the prediction of phonemes. Then, a SBL block is proposed to infer the target unit when given its previous and later context. The rules for each specific language which the model itself judges to be is learned in this fill-in-the-blank manner. To make the learning process more targeted at each particular language, we introduce an extra task of predicting the language identity in the learning process. Finally, we perform a thorough comparison on LRW (English) and LRW-1000(Mandarin). The results outperform the existing state of the art by a large margin, and show the promising benefits from the synergized learning of different languages.
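The SBL block's fill-in-the-blank training can be illustrated with a deliberately tiny stand-in: directional phoneme bigram counts that score a masked phoneme from both its left and its right context. The real SBL block is a neural sequence model conditioned on full bidirectional context; this sketch only conveys the idea of inferring a target unit from both sides, and the toy phoneme inventory is invented.

```python
from collections import Counter, defaultdict

def train_bigrams(seqs):
    """Count forward (prev -> cur) and backward (next -> cur) phoneme bigrams."""
    fwd, bwd = defaultdict(Counter), defaultdict(Counter)
    for s in seqs:
        for i in range(len(s)):
            if i > 0:
                fwd[s[i - 1]][s[i]] += 1
            if i < len(s) - 1:
                bwd[s[i + 1]][s[i]] += 1
    return fwd, bwd

def fill_blank(fwd, bwd, left, right):
    """Score candidates for the masked position using both directions,
    in the fill-in-the-blank spirit of the SBL block."""
    scores = Counter()
    for ph, c in fwd[left].items():
        scores[ph] += c
    for ph, c in bwd[right].items():
        scores[ph] += c
    return scores.most_common(1)[0][0] if scores else None
```

Because phonemes are shared across languages, pooling sequences from several languages into `train_bigrams` would enrich the evidence for each phoneme, which mirrors the abstract's argument for why the multilingual setting helps.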

24. Data-driven Modelling of Dynamical Systems Using Tree Adjoining Grammar and Genetic Programming [PDF] 返回目录
  Dhruv Khandelwal, Maarten Schoukens, Roland Tóth
Abstract: State-of-the-art methods for data-driven modelling of non-linear dynamical systems typically involve interactions with an expert user. In order to partially automate the process of modelling physical systems from data, many EA-based approaches have been proposed for model-structure selection, with special focus on non-linear systems. Recently, an approach for data-driven modelling of non-linear dynamical systems using Genetic Programming (GP) was proposed. The novelty of the method was the modelling of noise and the use of Tree Adjoining Grammar to shape the search-space explored by GP. In this paper, we report results achieved by the proposed method on three case studies. Each of the case studies considered here is based on real physical systems. The case studies pose a variety of challenges. In particular, these challenges range over varying amounts of prior knowledge of the true system, amount of data available, the complexity of the dynamics of the system, and the nature of non-linearities in the system. Based on the results achieved for the case studies, we critically analyse the performance of the proposed method.
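As a drastically simplified illustration of searching over model structures, the sketch below performs random search over small expression trees fit to data. The actual method is population-based Genetic Programming whose search space is shaped by a Tree Adjoining Grammar and which models noise explicitly; none of that is captured here, and the operator set, depth limit, and constant range are arbitrary choices for the sketch.

```python
import random

# Candidate structural building blocks (a grammar, crudely flattened).
OPS = [("add", lambda a, b: a + b), ("mul", lambda a, b: a * b)]

def random_expr(depth=2):
    """Grow a random expression tree over the input x and constants."""
    if depth == 0 or random.random() < 0.3:
        return ("x",) if random.random() < 0.6 else ("const", random.uniform(-2, 2))
    name, fn = random.choice(OPS)
    return (name, fn, random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    if expr[0] == "x":
        return x
    if expr[0] == "const":
        return expr[1]
    return expr[1](evaluate(expr[2], x), evaluate(expr[3], x))

def fitness(expr, xs, ys):
    """Sum of squared residuals of the candidate structure on the data."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys))

def search(xs, ys, iters=3000, seed=0):
    """Keep the best structure found over many random draws."""
    random.seed(seed)
    best = random_expr()
    best_f = fitness(best, xs, ys)
    for _ in range(iters):
        cand = random_expr()
        f = fitness(cand, xs, ys)
        if f < best_f:
            best, best_f = cand, f
    return best, best_f
```

A genuine GP would evolve a population with crossover and mutation instead of drawing candidates independently, and the grammar would constrain which trees are even expressible.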

Note: the Chinese abstracts on this page are machine translations.