
[arXiv Papers] Computation and Language 2020-09-17

Contents

1. Text Generation by Learning from Off-Policy Demonstrations [PDF] Abstract
2. CoDEx: A Comprehensive Knowledge Graph Completion Benchmark [PDF] Abstract
3. Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule [PDF] Abstract
4. GLUCOSE: GeneraLized and COntextualized Story Explanations [PDF] Abstract
5. Multilingual Music Genre Embeddings for Effective Cross-Lingual Music Item Annotation [PDF] Abstract
6. Automated Source Code Generation and Auto-completion Using Deep Learning: Comparing and Discussing Current Language-Model-Related Approaches [PDF] Abstract
7. NABU -- Multilingual Graph-based Neural RDF Verbalizer [PDF] Abstract
8. Leveraging Semantic Parsing for Relation Linking over Knowledge Bases [PDF] Abstract
9. Knowledge Graphs for Multilingual Language Translation and Generation [PDF] Abstract
10. Reasoning about Goals, Steps, and Temporal Ordering with WikiHow [PDF] Abstract
11. Parallel Interactive Networks for Multi-Domain Dialogue State Generation [PDF] Abstract
12. Neural Dialogue State Tracking with Temporally Expressive Networks [PDF] Abstract
13. Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [PDF] Abstract
14. UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation [PDF] Abstract
15. Group-wise Contrastive Learning for Neural Dialogue Generation [PDF] Abstract
16. Minimize Exposure Bias of Seq2Seq Models in Joint Entity and Relation Extraction [PDF] Abstract
17. Contextualized Perturbation for Textual Adversarial Attack [PDF] Abstract
18. Are Interpretations Fairly Evaluated? A Definition Driven Pipeline for Post-Hoc Interpretability [PDF] Abstract
19. Graph-to-Sequence Neural Machine Translation [PDF] Abstract
20. Unsupervised Summarization by Jointly Extracting Sentences and Keywords [PDF] Abstract
21. Solomon at SemEval-2020 Task 11: Ensemble Architecture for Fine-Tuned Propaganda Detection in News Articles [PDF] Abstract
22. DDRQA: Dynamic Document Reranking for Open-domain Multi-hop Question Answering [PDF] Abstract
23. Mimic and Conquer: Heterogeneous Tree Structure Distillation for Syntactic NLP [PDF] Abstract
24. Retrofitting Structure-aware Transformer Language Model for End Tasks [PDF] Abstract
25. Tag and Correct: Question aware Open Information Extraction with Two-stage Decoding [PDF] Abstract
26. Asking Complex Questions with Multi-hop Answer-focused Reasoning [PDF] Abstract
27. Arabic Opinion Mining Using a Hybrid Recommender System Approach [PDF] Abstract
28. Grounded Adaptation for Zero-shot Executable Semantic Parsing [PDF] Abstract
29. Pardon the Interruption: An Analysis of Gender and Turn-Taking in U.S. Supreme Court Oral Arguments [PDF] Abstract
30. Multi-span Style Extraction for Generative Reading Comprehension [PDF] Abstract
31. Domain Knowledge Empowered Structured Neural Net for End-to-End Event Temporal Relation Extraction [PDF] Abstract
32. Fast semantic parsing with well-typedness guarantees [PDF] Abstract
33. An information theoretic view on selecting linguistic probes [PDF] Abstract
34. Cascaded Models for Better Fine-Grained Named Entity Recognition [PDF] Abstract
35. Simultaneous Machine Translation with Visual Context [PDF] Abstract
36. Transformer Based Multi-Source Domain Adaptation [PDF] Abstract
37. Deep Learning Approaches for Extracting Adverse Events and Indications of Dietary Supplements from Clinical Text [PDF] Abstract
38. Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News [PDF] Abstract
39. Adoption of Twitter's New Length Limit: Is 280 the New 140? [PDF] Abstract
40. RDF2Vec Light -- A Lightweight Approach for Knowledge Graph Embeddings [PDF] Abstract
41. CogTree: Cognition Tree Loss for Unbiased Scene Graph Generation [PDF] Abstract
42. Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation [PDF] Abstract

Abstracts

1. Text Generation by Learning from Off-Policy Demonstrations [PDF] Back to Contents
  Richard Yuanzhe Pang, He He
Abstract: Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. This paradigm leads to (i) diverse but low-quality samples due to mismatched learning objective and evaluation metric (likelihood vs. quality) and (ii) exposure bias due to mismatched history distributions (gold vs. model-generated). To alleviate these problems, we frame text generation as a reinforcement learning (RL) problem with expert demonstrations (i.e., the training data), where the goal is to maximize quality given model-generated histories. Prior RL approaches to generation often face optimization issues due to the large action space and sparse reward. We propose GOLD (generation by off-policy learning from demonstrations): an algorithm that learns from the off-policy demonstrations by importance weighting and does not suffer from degenerative solutions. We find that GOLD outperforms the baselines according to automatic and human evaluation on summarization, question generation, and machine translation, including attaining state-of-the-art results for CNN/DailyMail summarization. Further, we show that models trained by GOLD are less sensitive to decoding algorithms and the generation quality does not degrade much as the length increases.
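
A minimal sketch of the off-policy idea above, in PyTorch: weight the log-likelihood of each demonstration token by the model's own (detached) probability of that token. The per-token weighting and the absence of an explicit reward term are simplifying assumptions; the exact GOLD objective follows the paper.

```python
import torch
import torch.nn.functional as F

def importance_weighted_demo_loss(logits, target_ids, pad_id):
    """Rough sketch of off-policy learning from demonstrations: the
    log-likelihood of each demonstration token is weighted by the model's
    own (detached) probability of that token. Not the exact GOLD objective."""
    log_probs = F.log_softmax(logits, dim=-1)                                 # (B, T, V)
    token_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)   # (B, T)
    weights = token_logp.detach().exp()                                       # importance weights
    mask = (target_ids != pad_id).float()
    return -(weights * token_logp * mask).sum() / mask.sum()
```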

2. CoDEx: A Comprehensive Knowledge Graph Completion Benchmark [PDF] Back to Contents
  Tara Safavi, Danai Koutra
Abstract: We present CoDEx, a set of knowledge graph Completion Datasets Extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. In terms of scope, CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false. To characterize CoDEx, we contribute thorough empirical analyses and benchmarking experiments. First, we analyze each CoDEx dataset in terms of logical relation patterns. Next, we report baseline link prediction and triple classification results on CoDEx for five extensively tuned embedding models. Finally, we differentiate CoDEx from a popular link prediction benchmark by showing that CoDEx covers more diverse and interpretable content, and contains fewer relation patterns that can be covered by trivial frequency-based rules. Data, code, and pretrained models are available at this https URL.

3. Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule [PDF] Back to Contents
  Shuhei Kurita, Kyunghyun Cho
Abstract: Vision-and-language navigation (VLN) is a task in which an agent is embodied in a realistic 3D environment and follows an instruction to reach the goal node. While most of the previous studies have built and investigated a discriminative approach, we notice that there are in fact two possible approaches to building such a VLN agent: discriminative and generative. In this paper, we design and investigate a generative language-grounded policy which computes the distribution over all possible instructions given action and the transition history. In experiments, we show that the proposed generative approach outperforms the discriminative approach in the Room-2-Room (R2R) dataset, especially in the unseen environments. We further show that the combination of the generative and discriminative policies achieves close to the state-of-the art results in the R2R dataset, demonstrating that the generative and discriminative policies capture the different aspects of VLN.

4. GLUCOSE: GeneraLized and COntextualized Story Explanations [PDF] Back to Contents
  Nasrin Mostafazadeh, Aditya Kalyanpur, Lori Moon, David Buchanan, Lauren Berkowitz, Or Biran, Jennifer Chu-Carroll
Abstract: When humans read or listen, they make implicit commonsense inferences that frame their understanding of what happened and why. As a step toward AI systems that can build similar mental models, we introduce GLUCOSE, a large-scale dataset of implicit commonsense causal knowledge, encoded as causal mini-theories about the world, each grounded in a narrative context. To construct GLUCOSE, we drew on cognitive psychology to identify ten dimensions of causal explanation, focusing on events, states, motivations, and emotions. Each GLUCOSE entry includes a story-specific causal statement paired with an inference rule generalized from the statement. This paper details two concrete contributions: First, we present our platform for effectively crowdsourcing GLUCOSE data at scale, which uses semi-structured templates to elicit causal explanations. Using this platform, we collected 440K specific statements and general rules that capture implicit commonsense knowledge about everyday situations. Second, we show that existing knowledge resources and pretrained language models do not include or readily predict GLUCOSE's rich inferential content. However, when state-of-the-art neural models are trained on this knowledge, they can start to make commonsense inferences on unseen stories that match humans' mental models.

5. Multilingual Music Genre Embeddings for Effective Cross-Lingual Music Item Annotation [PDF] Back to Contents
  Elena V. Epure, Guillaume Salha, Romain Hennequin
Abstract: Annotating music items with music genres is crucial for music recommendation and information retrieval, yet challenging given that music genres are subjective concepts. Recently, in order to explicitly consider this subjectivity, the annotation of music items was modeled as a translation task: predict for a music item its music genres within a target vocabulary or taxonomy (tag system) from a set of music genre tags originating from other tag systems. However, without a parallel corpus, previous solutions could not handle tag systems in other languages, being limited to the English-language only. Here, by learning multilingual music genre embeddings, we enable cross-lingual music genre translation without relying on a parallel corpus. First, we apply compositionality functions on pre-trained word embeddings to represent multi-word tags. Second, we adapt the tag representations to the music domain by leveraging multilingual music genres graphs with a modified retrofitting algorithm. Experiments show that our method: 1) is effective in translating music genres across tag systems in multiple languages (English, French and Spanish); 2) outperforms the previous baseline in an English-language multi-source translation task. We publicly release the new multilingual data and code.
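
The "modified retrofitting algorithm" builds on the standard retrofitting update, which pulls each tag vector toward its graph neighbours while keeping it close to its original embedding. A sketch of that standard update is below, assuming unit edge weights; the paper's multilingual modification is not reproduced here.

```python
import numpy as np

def retrofit(embeddings, graph, n_iters=10, alpha=1.0, beta=1.0):
    """Standard retrofitting update: move each vector toward the average of
    its graph neighbours while staying close to its original embedding.
    embeddings: dict tag -> np.ndarray; graph: dict tag -> list of neighbour tags."""
    new_vecs = {t: v.copy() for t, v in embeddings.items()}
    for _ in range(n_iters):
        for tag, neighbours in graph.items():
            neighbours = [n for n in neighbours if n in new_vecs]
            if tag not in new_vecs or not neighbours:
                continue
            neighbour_sum = sum(new_vecs[n] for n in neighbours)
            new_vecs[tag] = (beta * neighbour_sum + alpha * embeddings[tag]) / (
                beta * len(neighbours) + alpha)
    return new_vecs
```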

6. Automated Source Code Generation and Auto-completion Using Deep Learning: Comparing and Discussing Current Language-Model-Related Approaches [PDF] Back to Contents
  Juan Cruz-Benito, Sanjay Vishwakarma, Francisco Martin-Fernandez, Ismael Faro
Abstract: In recent years, the use of deep learning in language models, text auto-completion, and text generation has made tremendous progress and gained much attention from the research community. Some products and research projects claim that they can generate text that can be interpreted as human-writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is the processing of programming languages. For years, the Machine Learning community has been researching in this Big Code area, pursuing goals like applying different approaches to auto-complete generate, fix, or evaluate code programmed by humans. One of the approaches followed in recent years to pursue these goals is the use of Deep-Learning-enabled language models. Considering the increasing popularity of that approach, we detected a lack of empirical papers that compare different methods and deep learning architectures to create and use language models based on programming code. In this paper, we compare different neural network (NN) architectures like AWD-LSTMs, AWD-QRNNs, and Transformer, while using transfer learning, and different tokenizations to see how they behave in building language models using a Python dataset for code generation and filling mask tasks. Considering the results, we discuss the different strengths and weaknesses of each approach and technique and what lacks do we find to evaluate the language models or apply them in a real programming context while including humans-in-the-loop.

7. NABU -- Multilingual Graph-based Neural RDF Verbalizer [PDF] Back to Contents
  Diego Moussallem, Dwaraknath Gnaneshwar, Thiago Castro Ferreira, Axel-Cyrille Ngonga Ngomo
Abstract: The RDF-to-text task has recently gained substantial attention due to continuous growth of Linked Data. In contrast to traditional pipeline models, recent studies have focused on neural models, which are now able to convert a set of RDF triples into text in an end-to-end style with promising results. However, English is the only language widely targeted. We address this research gap by presenting NABU, a multilingual graph-based neural model that verbalizes RDF data to German, Russian, and English. NABU is based on an encoder-decoder architecture, uses an encoder inspired by Graph Attention Networks and a Transformer as decoder. Our approach relies on the fact that knowledge graphs are language-agnostic and they hence can be used to generate multilingual text. We evaluate NABU in monolingual and multilingual settings on standard benchmarking WebNLG datasets. Our results show that NABU outperforms state-of-the-art approaches on English with 66.21 BLEU, and achieves consistent results across all languages on the multilingual scenario with 56.04 BLEU.

8. Leveraging Semantic Parsing for Relation Linking over Knowledge Bases [PDF] Back to Contents
  Nandana Mihindukulasooriya, Gaetano Rossiello, Pavan Kapanipathi, Ibrahim Abdelaziz, Srinivas Ravishankar, Mo Yu, Alfio Gliozzo, Salim Roukos, Alexander Gray
Abstract: Knowledgebase question answering systems are heavily dependent on relation extraction and linking modules. However, the task of extracting and linking relations from text to knowledgebases faces two primary challenges; the ambiguity of natural language and lack of training data. To overcome these challenges, we present SLING, a relation linking framework which leverages semantic parsing using Abstract Meaning Representation (AMR) and distant supervision. SLING integrates multiple relation linking approaches that capture complementary signals such as linguistic cues, rich semantic representation, and information from the knowledgebase. The experiments on relation linking using three KBQA datasets; QALD-7, QALD-9, and LC-QuAD 1.0 demonstrate that the proposed approach achieves state-of-the-art performance on all benchmarks.

9. Knowledge Graphs for Multilingual Language Translation and Generation [PDF] Back to Contents
  Diego Moussallem
Abstract: The Natural Language Processing (NLP) community has recently seen outstanding progress, catalysed by the release of different Neural Network (NN) architectures. Neural-based approaches have proven effective by significantly increasing the output quality of a large number of automated solutions for NLP tasks (Belinkov and Glass, 2019). Despite these notable advancements, dealing with entities still poses a difficult challenge as they are rarely seen in training data. Entities can be classified into two groups, i.e., proper nouns and common nouns. Proper nouns are also known as Named Entities (NE) and correspond to the name of people, organizations, or locations, e.g., John, WHO, or Canada. Common nouns describe classes of objects, e.g., spoon or cancer. Both types of entities can be found in a Knowledge Graph (KG). Recent work has successfully exploited the contribution of KGs in NLP tasks, such as Natural Language Inference (NLI) (KM et al.,2018) and Question Answering (QA) (Sorokin and Gurevych, 2018). Only a few works had exploited the benefits of KGs in Neural Machine Translation (NMT) when the work presented herein began. Additionally, few works had studied the contribution of KGs to Natural Language Generation (NLG) tasks. Moreover, the multilinguality also remained an open research area in these respective tasks (Young et al., 2018). In this thesis, we focus on the use of KGs for machine translation and the generation of texts to deal with the problems caused by entities and consequently enhance the quality of automatically generated texts.

10. Reasoning about Goals, Steps, and Temporal Ordering with WikiHow [PDF] Back to Contents
  Qing Lyu, Li Zhang, Chris Callison-Burch
Abstract: We propose a suite of reasoning tasks on two types of relations between procedural events: goal-step relations ("learn poses" is a step in the larger goal of "doing yoga") and step-step temporal relations ("buy a yoga mat" typically precedes "learn poses"). We introduce a dataset targeting these two relations based on wikiHow, a website of instructional how-to articles. Our human-validated test set serves as a reliable benchmark for commonsense inference, with a gap of about 10% to 20% between the performance of state-of-the-art transformer models and human performance. Our automatically-generated training set allows models to effectively transfer to out-of-domain tasks requiring knowledge of procedural events, with greatly improved performances on SWAG, Snips, and the Story Cloze Test in zero- and few-shot settings.

11. Parallel Interactive Networks for Multi-Domain Dialogue State Generation [PDF] Back to Contents
  Junfan Chen, Richong Zhang, Yongyi Mao, Jie Xu
Abstract: The dependencies between system and user utterances in the same turn and across different turns are not fully considered in existing multi-domain dialogue state tracking (MDST) models. In this study, we argue that the incorporation of these dependencies is crucial for the design of MDST and propose Parallel Interactive Networks (PIN) to model these dependencies. Specifically, we integrate an interactive encoder to jointly model the in-turn dependencies and cross-turn dependencies. The slot-level context is introduced to extract more expressive features for different slots. And a distributed copy mechanism is utilized to selectively copy words from historical system utterances or historical user utterances. Empirical studies demonstrated the superiority of the proposed PIN model.

12. Neural Dialogue State Tracking with Temporally Expressive Networks [PDF] Back to Contents
  Junfan Chen, Richong Zhang, Yongyi Mao, Jie Xu
Abstract: Dialogue state tracking (DST) is an important part of a spoken dialogue system. Existing DST models either ignore temporal feature dependencies across dialogue turns or fail to explicitly model temporal state dependencies in a dialogue. In this work, we propose Temporally Expressive Networks (TEN) to jointly model the two types of temporal dependencies in DST. The TEN model utilizes the power of recurrent networks and probabilistic graphical models. Evaluating on standard datasets, TEN is demonstrated to be effective in improving the accuracy of turn-level-state prediction and the state aggregation.

13. Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [PDF] Back to Contents
  Alexandra Chronopoulou, Dario Stojanovski, Alexander Fraser
Abstract: Using a language model (LM) pretrained on two languages with large monolingual data in order to initialize an unsupervised neural machine translation (UNMT) system yields state-of-the-art results. When limited data is available for one language, however, this method leads to poor translations. We present an effective approach that reuses an LM that is pretrained only on the high-resource language. The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model. To reuse the pretrained LM, we have to modify its predefined vocabulary, to account for the new language. We therefore propose a novel vocabulary extension method. Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq), yielding more than +8.3 BLEU points for all four translation directions.
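
Extending a pretrained model's vocabulary typically means adding new tokens and resizing the embedding matrix. The snippet below illustrates that general mechanism with the Hugging Face transformers API; the checkpoint name and the new tokens are placeholders, not the vocabulary extension method proposed in the paper.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Hypothetical new-language subwords; in practice they would come from a
# tokenizer trained on monolingual text of the low-resource language.
new_tokens = ["shembull", "makineri"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens")

# Grow the embedding matrix; the new rows are randomly initialized and are
# then learned while fine-tuning on the low-resource language.
model.resize_token_embeddings(len(tokenizer))
```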

14. UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation [PDF] Back to Contents
  Jian Guan, Minlie Huang
Abstract: Despite the success of existing referenced metrics (e.g., BLEU and MoverScore), they correlate poorly with human judgments for open-ended text generation including story or dialog generation because of the notorious one-to-many issue: there are many plausible outputs for the same input, which may differ substantially in literal or semantics from the limited number of given references. To alleviate this issue, we propose UNION, a learnable unreferenced metric for evaluating open-ended story generation, which measures the quality of a generated story without any reference. Built on top of BERT, UNION is trained to distinguish human-written stories from negative samples and recover the perturbation in negative stories. We propose an approach of constructing negative samples by mimicking the errors commonly observed in existing NLG models, including repeated plots, conflicting logic, and long-range incoherence. Experiments on two story datasets demonstrate that UNION is a reliable measure for evaluating the quality of generated stories, which correlates better with human judgments and is more generalizable than existing state-of-the-art metrics.

15. Group-wise Contrastive Learning for Neural Dialogue Generation [PDF] Back to Contents
  Hengyi Cai, Hongshen Chen, Yonghao Song, Zhuoye Ding, Yongjun Bao, Weipeng Yan, Xiaofang Zhao
Abstract: Neural dialogue response generation has gained much popularity in recent years. Maximum Likelihood Estimation (MLE) objective is widely adopted in existing dialogue model learning. However, models trained with MLE objective function are plagued by the low-diversity issue when it comes to the open-domain conversational setting. Inspired by the observation that humans not only learn from the positive signals but also benefit from correcting behaviors of undesirable actions, in this work, we introduce contrastive learning into dialogue generation, where the model explicitly perceives the difference between the well-chosen positive and negative utterances. Specifically, we employ a pretrained baseline model as a reference. During contrastive learning, the target dialogue model is trained to give higher conditional probabilities for the positive samples, and lower conditional probabilities for those negative samples, compared to the reference model. To manage the multi-mapping relations prevailed in human conversation, we augment contrastive dialogue learning with group-wise dual sampling. Extensive experimental results show that the proposed group-wise contrastive learning framework is suited for training a wide range of neural dialogue generation models with very favorable performance over the baseline training approaches.
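
One plausible reading of the contrastive objective, sketched below: push the target model to assign higher sequence probability than the frozen reference to positive responses, and lower than the reference to negative ones. The group-wise grouping and dual sampling of the paper are omitted; this logistic formulation is an assumption for illustration only.

```python
import torch
import torch.nn.functional as F

def contrastive_dialogue_loss(logp_target_pos, logp_ref_pos,
                              logp_target_neg, logp_ref_neg):
    """logp_* are sequence-level log-probabilities of responses under the
    target model and the frozen reference model. Encourage the target to
    beat the reference on positives and fall below it on negatives."""
    pos_term = -F.logsigmoid(logp_target_pos - logp_ref_pos)
    neg_term = -F.logsigmoid(logp_ref_neg - logp_target_neg)
    return (pos_term + neg_term).mean()
```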

16. Minimize Exposure Bias of Seq2Seq Models in Joint Entity and Relation Extraction [PDF] Back to Contents
  Haoran Zhang, Qianying Liu, Aysa Xuemo Fan, Heng Ji, Daojian Zeng, Fei Cheng, Daisuke Kawahara, Sadao Kurohashi
Abstract: Joint entity and relation extraction aims to extract relation triplets from plain text directly. Prior work leverages Sequence-to-Sequence (Seq2Seq) models for triplet sequence generation. However, Seq2Seq enforces an unnecessary order on the unordered triplets and involves a large decoding length associated with error accumulation. These introduce exposure bias, which may cause the models overfit to the frequent label combination, thus deteriorating the generalization. We propose a novel Sequence-to-Unordered-Multi-Tree (Seq2UMTree) model to minimize the effects of exposure bias by limiting the decoding length to three within a triplet and removing the order among triplets. We evaluate our model on two datasets, DuIE and NYT, and systematically study how exposure bias alters the performance of Seq2Seq models. Experiments show that the state-of-the-art Seq2Seq model overfits to both datasets while Seq2UMTree shows significantly better generalization. Our code is available at this https URL .

17. Contextualized Perturbation for Textual Adversarial Attack [PDF] Back to Contents
  Dianqi Li, Yizhe Zhang, Hao Peng, Liqun Chen, Chris Brockett, Ming-Ting Sun, Bill Dolan
Abstract: Adversarial examples expose the vulnerabilities of natural language processing (NLP) models, and can be used to evaluate and improve their robustness. Existing techniques of generating such examples are typically driven by local heuristic rules that are agnostic to the context, often resulting in unnatural and ungrammatical outputs. This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs through a mask-then-infill procedure. CLARE builds on a pre-trained masked language model and modifies the inputs in a context-aware manner. We propose three contextualized perturbations, Replace, Insert and Merge, allowing for generating outputs of varied lengths. With a richer range of available strategies, CLARE is able to attack a victim model more efficiently with fewer edits. Extensive experiments and human evaluation demonstrate that CLARE outperforms the baselines in terms of attack success rate, textual similarity, fluency and grammaticality.
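
The mask-then-infill step, where a pretrained masked language model proposes context-aware substitutions for a masked position, can be illustrated with the Hugging Face fill-mask pipeline; CLARE's Replace/Insert/Merge operations and its attack search are more involved than this sketch.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

sentence = "The movie was <mask> and I would watch it again."
# Context-aware candidate substitutions for the masked position.
for candidate in fill_mask(sentence, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```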

18. Are Interpretations Fairly Evaluated? A Definition Driven Pipeline for Post-Hoc Interpretability [PDF] Back to Contents
  Ninghao Liu, Yunsong Meng, Xia Hu, Tie Wang, Bo Long
Abstract: Recent years have witnessed an increasing number of interpretation methods being developed for improving transparency of NLP models. Meanwhile, researchers also try to answer the question that whether the obtained interpretation is faithful in explaining mechanisms behind model prediction? Specifically, (Jain and Wallace, 2019) proposes that "attention is not explanation" by comparing attention interpretation with gradient alternatives. However, it raises a new question that can we safely pick one interpretation method as the ground-truth? If not, on what basis can we compare different interpretation methods? In this work, we propose that it is crucial to have a concrete definition of interpretation before we could evaluate faithfulness of an interpretation. The definition will affect both the algorithm to obtain interpretation and, more importantly, the metric used in evaluation. Through both theoretical and experimental analysis, we find that although interpretation methods perform differently under a certain evaluation metric, such a difference may not result from interpretation quality or faithfulness, but rather the inherent bias of the evaluation metric.

19. Graph-to-Sequence Neural Machine Translation [PDF] Back to Contents
  Sufeng Duan, Hai Zhao, Rui Wang
Abstract: Neural machine translation (NMT) usually works in a seq2seq learning way by viewing either source or target sentence as a linear sequence of words, which can be regarded as a special case of graph, taking words in the sequence as nodes and relationships between words as edges. In the light of the current NMT models more or less capture graph information among the sequence in a latent way, we present a graph-to-sequence model facilitating explicit graph information capturing. In detail, we propose a graph-based SAN-based NMT model called Graph-Transformer by capturing information of subgraphs of different orders in every layers. Subgraphs are put into different groups according to their orders, and every group of subgraphs respectively reflect different levels of dependency between words. For fusing subgraph representations, we empirically explore three methods which weight different groups of subgraphs of different orders. Results of experiments on WMT14 English-German and IWSLT14 German-English show that our method can effectively boost the Transformer with an improvement of 1.1 BLEU points on WMT14 English-German dataset and 1.0 BLEU points on IWSLT14 German-English dataset.

20. Unsupervised Summarization by Jointly Extracting Sentences and Keywords [PDF] Back to Contents
  Zongyi Li, Xiaoqing Zheng
Abstract: We present RepRank, an unsupervised graph-based ranking model for extractive multi-document summarization in which the similarity between words, sentences, and word-to-sentence can be estimated by the distances between their vector representations in a unified vector space. In order to obtain desirable representations, we propose a self-attention based learning method that represent a sentence by the weighted sum of its word embeddings, and the weights are concentrated to those words hopefully better reflecting the content of a document. We show that salient sentences and keywords can be extracted in a joint and mutual reinforcement process using our learned representations, and prove that this process always converges to a unique solution leading to improvement in performance. A variant of absorbing random walk and the corresponding sampling-based algorithm are also described to avoid redundancy and increase diversity in the summaries. Experiment results with multiple benchmark datasets show that RepRank achieved the best or comparable performance in ROUGE.
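
As a rough illustration of ranking sentences by their representations, the sketch below scores each sentence by its average cosine similarity to all other sentences and ranks by that centrality; RepRank's learned representations, joint keyword extraction, and absorbing-random-walk variant go beyond this.

```python
import numpy as np

def rank_by_centrality(sentence_vectors):
    """sentence_vectors: (n_sentences, dim) array of sentence embeddings.
    Returns sentence indices sorted from most to least central."""
    X = np.asarray(sentence_vectors, dtype=float)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    sim = X @ X.T
    np.fill_diagonal(sim, 0.0)            # ignore self-similarity
    centrality = sim.mean(axis=1)
    return np.argsort(-centrality)
```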

21. Solomon at SemEval-2020 Task 11: Ensemble Architecture for Fine-Tuned Propaganda Detection in News Articles [PDF] Back to Contents
  Mayank Raj, Ajay Jaiswal, Rohit R.R, Ankita Gupta, Sudeep Kumar Sahoo, Vertika Srivastava, Yeon Hyang Kim
Abstract: This paper describes our system (Solomon) details and results of participation in the SemEval 2020 Task 11 "Detection of Propaganda Techniques in News Articles"\cite{DaSanMartinoSemeval20task11}. We participated in Task "Technique Classification" (TC) which is a multi-class classification task. To address the TC task, we used RoBERTa based transformer architecture for fine-tuning on the propaganda dataset. The predictions of RoBERTa were further fine-tuned by class-dependent-minority-class classifiers. A special classifier, which employs dynamically adapted Least Common Sub-sequence algorithm, is used to adapt to the intricacies of repetition class. Compared to the other participating systems, our submission is ranked 4th on the leaderboard.

22. DDRQA: Dynamic Document Reranking for Open-domain Multi-hop Question Answering [PDF] Back to Contents
  Yuyu Zhang, Ping Nie, Arun Ramamurthy, Le Song
Abstract: Open-domain multi-hop question answering (QA) requires to retrieve multiple supporting documents, some of which have little lexical overlap with the question and can only be located by iterative document retrieval. However, multi-step document retrieval often incurs more relevant but non-supporting documents, which dampens the downstream noise-sensitive reader module for answer extraction. To address this challenge, we propose Dynamic Document Reranking (DDR) to iteratively retrieve, rerank and filter documents, and adaptively determine when to stop the retrieval process. DDR employs an entity-linked document graph for multi-document interaction, which boosts up the retrieval performance. Experiments on HotpotQA full wiki setting show that our method achieves more than 7 points higher reranking performance over the previous best retrieval model, and also achieves state-of-the-art question answering performance on the official leaderboard.

23. Mimic and Conquer: Heterogeneous Tree Structure Distillation for Syntactic NLP [PDF] Back to Contents
  Hao Fei, Yafeng Ren, Donghong Ji
Abstract: Syntax has been shown useful for various NLP tasks, while existing work mostly encodes singleton syntactic tree using one hierarchical neural network. In this paper, we investigate a simple and effective method, Knowledge Distillation, to integrate heterogeneous structure knowledge into a unified sequential LSTM encoder. Experimental results on four typical syntax-dependent tasks show that our method outperforms tree encoders by effectively integrating rich heterogeneous structure syntax, meanwhile reducing error propagation, and also outperforms ensemble methods, in terms of both the efficiency and accuracy.

24. Retrofitting Structure-aware Transformer Language Model for End Tasks [PDF] Back to Contents
  Hao Fei, Yafeng Ren, Donghong Ji
Abstract: We consider retrofitting structure-aware Transformer-based language model for facilitating end tasks by proposing to exploit syntactic distance to encode both the phrasal constituency and dependency connection into the language model. A middle-layer structural learning strategy is leveraged for structure integration, accomplished with main semantic task training under multi-task learning scheme. Experimental results show that the retrofitted structure-aware Transformer language model achieves improved perplexity, meanwhile inducing accurate syntactic phrases. By performing structure-aware fine-tuning, our model achieves significant improvements for both semantic- and syntactic-dependent tasks.

25. Tag and Correct: Question aware Open Information Extraction with Two-stage Decoding [PDF] Back to Contents
  Martin Kuo, Yaobo Liang, Lei Ji, Nan Duan, Linjun Shou, Ming Gong, Peng Chen
Abstract: Question Aware Open Information Extraction (Question aware Open IE) takes question and passage as inputs, outputting an answer tuple which contains a subject, a predicate, and one or more arguments. Each field of answer is a natural language word sequence and is extracted from the passage. The semi-structured answer has two advantages which are more readable and falsifiable compared to span answer. There are two approaches to solve this problem. One is an extractive method which extracts candidate answers from the passage with the Open IE model, and ranks them by matching with questions. It fully uses the passage information at the extraction step, but the extraction is independent to the question. The other one is the generative method which uses a sequence to sequence model to generate answers directly. It combines the question and passage as input at the same time, but it generates the answer from scratch, which does not use the facts that most of the answer words come from in the passage. To guide the generation by passage, we present a two-stage decoding model which contains a tagging decoder and a correction decoder. At the first stage, the tagging decoder will tag keywords from the passage. At the second stage, the correction decoder will generate answers based on tagged keywords. Our model could be trained end-to-end although it has two stages. Compared to previous generative models, we generate better answers by generating coarse to fine. We evaluate our model on WebAssertions (Yan et al., 2018) which is a Question aware Open IE dataset. Our model achieves a BLEU score of 59.32, which is better than previous generative methods.

26. Asking Complex Questions with Multi-hop Answer-focused Reasoning [PDF] Back to Contents
  Xiyao Ma, Qile Zhu, Yanlin Zhou, Xiaolin Li, Dapeng Wu
Abstract: Asking questions from natural language text has attracted increasing attention recently, and several schemes have been proposed with promising results by asking the right question words and copy relevant words from the input to the question. However, most state-of-the-art methods focus on asking simple questions involving single-hop relations. In this paper, we propose a new task called multihop question generation that asks complex and semantically relevant questions by additionally discovering and modeling the multiple entities and their semantic relations given a collection of documents and the corresponding answer 1. To solve the problem, we propose multi-hop answer-focused reasoning on the grounded answer-centric entity graph to include different granularity levels of semantic information including the word-level and document-level semantics of the entities and their semantic relations. Through extensive experiments on the HOTPOTQA dataset, we demonstrate the superiority and effectiveness of our proposed model that serves as a baseline to motivate future work.

27. Arabic Opinion Mining Using a Hybrid Recommender System Approach [PDF] Back to Contents
  Fouzi Harrag, Abdulmalik Salman Al-Salman, Alaa Alquahtani
Abstract: Recommender systems nowadays are playing an important role in the delivery of services and information to users. Sentiment analysis (also known as opinion mining) is the process of determining the attitude of textual opinions, whether they are positive, negative or neutral. Data sparsity is representing a big issue for recommender systems because of the insufficiency of user rating or absence of data about users or items. This research proposed a hybrid approach combining sentiment analysis and recommender systems to tackle the problem of data sparsity problems by predicting the rating of products from users reviews using text mining and NLP techniques. This research focuses especially on Arabic reviews, where the model is evaluated using Opinion Corpus for Arabic (OCA) dataset. Our system was efficient, and it showed a good accuracy of nearly 85 percent in predicting rating from reviews

28. Grounded Adaptation for Zero-shot Executable Semantic Parsing [PDF] Back to Contents
  Victor Zhong, Mike Lewis, Sida I. Wang, Luke Zettlemoyer
Abstract: We propose Grounded Adaptation for Zero-shot Executable Semantic Parsing (GAZP) to adapt an existing semantic parser to new environments (e.g. new database schemas). GAZP combines a forward semantic parser with a backward utterance generator to synthesize data (e.g. utterances and SQL queries) in the new environment, then selects cycle-consistent examples to adapt the parser. Unlike data-augmentation, which typically synthesizes unverified examples in the training environment, GAZP synthesizes examples in the new environment whose input-output consistency are verified. On the Spider, Sparc, and CoSQL zero-shot semantic parsing tasks, GAZP improves logical form and execution accuracy of the baseline parser. Our analyses show that GAZP outperforms data-augmentation in the training environment, performance increases with the amount of GAZP-synthesized data, and cycle-consistency is central to successful adaptation.

29. Pardon the Interruption: An Analysis of Gender and Turn-Taking in U.S. Supreme Court Oral Arguments [PDF] Back to Contents
  Haley Lepp, Gina-Anne Levow
Abstract: This study presents a corpus of turn changes between speakers in U.S. Supreme Court oral arguments. Each turn change is labeled on a spectrum of "cooperative" to "competitive" by a human annotator with legal experience in the United States. We analyze the relationship between speech features, the nature of exchanges, and the gender and legal role of the speakers. Finally, we demonstrate that the models can be used to predict the label of an exchange with moderate success. The automatic classification of the nature of exchanges indicates that future studies of turn-taking in oral arguments can rely on larger, unlabeled corpora.

30. Multi-span Style Extraction for Generative Reading Comprehension [PDF] Back to Contents
  Junjie Yang, Zhuosheng Zhang, Hai Zhao
Abstract: Generative machine reading comprehension (MRC) requires a model to generate well-formed answers. For this type of MRC, answer generation method is crucial to the model performance. However, generative models, which are supposed to be the right model for the task, in generally perform poorly. At the same time, single-span extraction models have been proven effective for extractive MRC, where the answer is constrained to a single span in the passage. Nevertheless, they generally suffer from generating incomplete answers or introducing redundant words when applied to the generative MRC. Thus, we extend the single-span extraction method to multi-span, proposing a new framework which enables generative MRC to be smoothly solved as multi-span extraction. Thorough experiments demonstrate that this novel approach can alleviate the dilemma between generative models and single-span models and produce answers with better-formed syntax and semantics. We will open-source our code for the research community.

31. Domain Knowledge Empowered Structured Neural Net for End-to-End Event Temporal Relation Extraction [PDF] Back to Contents
  Rujun Han, Yichao Zhou, Nanyun Peng
Abstract: Extracting event temporal relations is a critical task for information extraction and plays an important role in natural language understanding. Prior systems leverage deep learning and pre-trained language models to improve the performance of the task. However, these systems often suffer from two short-comings: 1) when performing maximum a posteriori (MAP) inference based on neural models, previous systems only used structured knowledge that are assumed to be absolutely correct, i.e., hard constraints; 2) biased predictions on dominant temporal relations when training with a limited amount of data. To address these issues, we propose a framework that enhances deep neural network with distributional constraints constructed by probabilistic domain knowledge. We solve the constrained inference problem via Lagrangian Relaxation and apply it on end-to-end event temporal relation extraction tasks. Experimental results show our framework is able to improve the baseline neural network models with strong statistical significance on two widely used datasets in news and clinical domains.

32. Fast semantic parsing with well-typedness guarantees [PDF] Back to Contents
  Matthias Lindemann, Jonas Groschwitz, Alexander Koller
Abstract: AM dependency parsing is a linguistically principled method for neural semantic parsing with high accuracy across multiple graphbanks. It relies on a type system that models semantic valency but makes existing parsers slow. We describe an A* parser and a transition-based parser for AM dependency parsing which guarantee well-typedness and improve the parsing speed to up to 2200 tokens/s, while maintaining or improving accuracy.

33. An information theoretic view on selecting linguistic probes [PDF] Back to Contents
  Zining Zhu, Frank Rudzicz
Abstract: There is increasing interest in assessing the linguistic knowledge encoded in neural representations. A popular approach is to attach a diagnostic classifier -- or "probe" -- to perform supervised classification from internal representations. However, how to select a good probe is in debate. Hewitt and Liang (2019) showed that a high performance on diagnostic classification itself is insufficient, because it can be attributed to either "the representation being rich in knowledge", or "the probe learning the task", which Pimentel et al. (2020) challenged. We show this dichotomy is valid information-theoretically. In addition, we find that the methods to construct and select good probes proposed by the two papers, *control task* (Hewitt and Liang, 2019) and *control function* (Pimentel et al., 2020), are equivalent - the errors of their approaches are identical (modulo irrelevant terms). Empirically, these two selection criteria lead to results that highly agree with each other.

34. Cascaded Models for Better Fine-Grained Named Entity Recognition [PDF] Back to Contents
  Parul Awasthy, Taesun Moon, Jian Ni, Radu Florian
Abstract: Named Entity Recognition (NER) is an essential precursor task for many natural language applications, such as relation extraction or event extraction. Much of the NER research has been done on datasets with few classes of entity types (e.g. PER, LOC, ORG, MISC), but many real world applications (disaster relief, complex event extraction, law enforcement) can benefit from a larger NER typeset. More recently, datasets were created that have hundreds to thousands of types of entities, sparking new lines of research (Sekine, 2008; Ling and Weld, 2012; Gillick et al., 2014; Choi et al., 2018). In this paper we present a cascaded approach to labeling fine-grained NER, applying to a newly released fine-grained NER dataset that was used in the TAC KBP 2019 evaluation (Ji et al., 2019), inspired by the fact that training data is available for some of the coarse labels. Using a combination of transformer networks, we show that performance can be improved by about 20 F1 absolute, as compared with the straightforward model built on the full fine-grained types, and show that, surprisingly, using coarse-labeled data in three languages leads to an improvement in the English data.

35. Simultaneous Machine Translation with Visual Context [PDF] Back to Contents
  Ozan Caglayan, Julia Ive, Veneta Haralampieva, Pranava Madhyastha, Loïc Barrault, Lucia Specia
Abstract: Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible. The translation thus have to start with an incomplete source text, which is read progressively, creating the need for anticipation. In this paper, we seek to understand whether the addition of visual information can compensate for the missing source context. To this end, we analyse the impact of different multimodal approaches and visual features on state-of-the-art SiMT frameworks. Our results show that visual context is helpful and that visually-grounded models based on explicit object region information are much better than commonly used global features, reaching up to 3 BLEU points improvement under low latency scenarios. Our qualitative analysis illustrates cases where only the multimodal systems are able to translate correctly from English into gender-marked languages, as well as deal with differences in word order such as adjective-noun placement between English and French.

36. Transformer Based Multi-Source Domain Adaptation [PDF] Back to Contents
  Dustin Wright, Isabelle Augenstein
Abstract: In practical machine learning settings, the data on which a model must make predictions often come from a different distribution than the data it was trained on. Here, we investigate the problem of unsupervised multi-source domain adaptation, where a model is trained on labelled data from multiple source domains and must make predictions on a domain for which no labelled data has been seen. Prior work with CNNs and RNNs has demonstrated the benefit of mixture of experts, where the predictions of multiple domain expert classifiers are combined; as well as domain adversarial training, to induce a domain agnostic representation space. Inspired by this, we investigate how such methods can be effectively applied to large pretrained transformer models. We find that domain adversarial training has an effect on the learned representations of these models while having little effect on their performance, suggesting that large transformer-based models are already relatively robust across domains. Additionally, we show that mixture of experts leads to significant performance improvements by comparing several variants of mixing functions, including one novel mixture based on attention. Finally, we demonstrate that the predictions of large pretrained transformer based domain experts are highly homogenous, making it challenging to learn effective functions for mixing their predictions.
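
A minimal sketch of attention-based mixing over per-domain experts, assuming each expert has already produced class logits for the same input; the transformer-based experts and the adversarial training discussed in the paper are not shown.

```python
import torch
import torch.nn as nn

class AttentionMixture(nn.Module):
    """Combine per-domain expert logits with input-dependent attention weights."""
    def __init__(self, hidden_dim, num_experts):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, num_experts)

    def forward(self, pooled_input, expert_logits):
        # pooled_input: (B, hidden_dim); expert_logits: (B, num_experts, num_classes)
        weights = torch.softmax(self.scorer(pooled_input), dim=-1)   # (B, num_experts)
        return (weights.unsqueeze(-1) * expert_logits).sum(dim=1)    # (B, num_classes)
```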

37. Deep Learning Approaches for Extracting Adverse Events and Indications of Dietary Supplements from Clinical Text [PDF] 返回目录
  Yadan Fan, Sicheng Zhou, Yifan Li, Rui Zhang
Abstract: The objective of our work is to demonstrate the feasibility of utilizing deep learning models to extract safety signals related to the use of dietary supplements (DS) from clinical text. Two tasks were performed in this study. For the named entity recognition (NER) task, Bi-LSTM-CRF (Bidirectional Long Short-Term Memory with Conditional Random Fields) and BERT (Bidirectional Encoder Representations from Transformers) models were trained and compared with a CRF model as a baseline to recognize the named entities of DS and Events from clinical notes. In the relation extraction (RE) task, two deep learning models, an attention-based Bi-LSTM and a CNN (Convolutional Neural Network), along with a random forest model, were trained to extract the relations between DS and Events, which were categorized into three classes: positive (i.e., indication), negative (i.e., adverse events), and not related. The best-performing NER and RE models were further applied to clinical notes mentioning 88 DS to discover DS adverse events and indications, which were compared with a DS knowledge base. For the NER task, the deep learning models achieved better performance than CRF, with F1 scores above 0.860. The attention-based Bi-LSTM model performed best on the relation extraction task, with an F1 score of 0.893. When comparing the DS-Event pairs generated by the deep learning models with the DS and Event knowledge base, we found both known and unknown pairs. Deep learning models can detect adverse events and indications of DS in clinical notes, which holds great potential for monitoring the safety of DS use.
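
For readers who want a concrete picture of the RE model class, below is a rough PyTorch sketch of an attention-based Bi-LSTM classifier over the three relation classes; hyperparameters and layer choices are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttnBiLSTMRelationClassifier(nn.Module):
    # Rough sketch of an attention-based Bi-LSTM relation classifier of the kind used
    # for the RE task (hypothetical dimensions; not the authors' implementation).
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.out = nn.Linear(2 * hidden_dim, num_classes)  # positive / negative / not related

    def forward(self, token_ids):                       # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))          # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)     # attention over time steps
        context = (weights * h).sum(dim=1)               # (batch, 2*hidden)
        return self.out(context)                         # class logits

model = AttnBiLSTMRelationClassifier(vocab_size=5000)
print(model(torch.randint(0, 5000, (2, 40))).shape)     # torch.Size([2, 3])
```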

38. Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News [PDF] 返回目录
  Reuben Tan, Kate Saenko, Bryan A. Plummer
Abstract: Large-scale dissemination of disinformation online intended to mislead or deceive the general population is a major societal problem. Rapid progression in image, video, and natural language generative models has only exacerbated this situation and intensified our need for an effective defense mechanism. While existing approaches have been proposed to defend against neural fake news, they are generally constrained to the very limited setting where articles only have text and metadata such as the title and authors. In this paper, we introduce the more realistic and challenging task of defending against machine-generated news that also includes images and captions. To identify the possible weaknesses that adversaries can exploit, we create a NeuralNews dataset composed of 4 different types of generated articles as well as conduct a series of human user study experiments based on this dataset. In addition to the valuable insights gleaned from our user study experiments, we provide a relatively effective approach based on detecting visual-semantic inconsistencies, which will serve as an effective first line of defense and a useful reference for future work in defending against machine-generated disinformation.
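
A minimal sketch of the general idea of visual-semantic inconsistency scoring, assuming image and text encoders that map into a shared embedding space (the encoders and the scoring rule here are hypothetical, not the paper's model):

```python
import numpy as np

def inconsistency_score(image_emb, text_embs):
    # Toy visual-semantic inconsistency score (assumed formulation): embed the image
    # and the article/caption texts in a shared space and flag articles whose best
    # image-text cosine similarity is low. Higher score = more inconsistent.
    image_emb = image_emb / np.linalg.norm(image_emb)
    sims = [float(t @ image_emb) / float(np.linalg.norm(t)) for t in text_embs]
    return 1.0 - max(sims)

rng = np.random.default_rng(0)
print(inconsistency_score(rng.normal(size=512), [rng.normal(size=512) for _ in range(3)]))
```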

39. Adoption of Twitter's New Length Limit: Is 280 the New 140? [PDF] 返回目录
  Kristina Gligorić, Ashton Anderson, Robert West
Abstract: In November 2017, Twitter doubled the maximum allowed tweet length from 140 to 280 characters, a drastic switch on one of the world's most influential social media platforms. In the first long-term study of how the new length limit was adopted by Twitter users, we ask: Does the effect of the new length limit resemble that of the old one? Or did the doubling of the limit fundamentally change how Twitter is shaped by the limited length of posted content? By analyzing Twitter's publicly available 1% sample over a period of around 3 years, we find that, when the length limit was raised from 140 to 280 characters, the prevalence of tweets around 140 characters dropped immediately, while the prevalence of tweets around 280 characters rose steadily for about 6 months. Despite this rise, tweets approaching the length limit have been far less frequent after than before the switch. We find widely different adoption rates across languages and client-device types. The prevalence of tweets around 140 characters before the switch in a given language is strongly correlated with the prevalence of tweets around 280 characters after the switch in the same language, and very long tweets are vastly more popular on Web clients than on mobile clients. Moreover, tweets of around 280 characters after the switch are syntactically and semantically similar to tweets of around 140 characters before the switch, manifesting patterns of message squeezing in both cases. Taken together, these findings suggest that the new 280-character limit constitutes a new, less intrusive version of the old 140-character limit. The length limit remains an important factor that should be considered in all studies using Twitter data.
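
The central measurement, the prevalence of tweets near the length limit, can be illustrated with a toy sketch; the lengths below are made up, and the real study uses Twitter's public 1% sample:

```python
# Toy sketch of the kind of prevalence measurement described above (illustrative only).
# Given tweet character lengths, measure the share of tweets within a window of the limit.

def prevalence_near_limit(lengths, limit, window=10):
    near = sum(1 for n in lengths if limit - window <= n <= limit)
    return near / len(lengths) if lengths else 0.0

before = [12, 138, 140, 95, 139, 60]      # hypothetical pre-switch lengths (limit 140)
after = [12, 250, 280, 95, 139, 278]      # hypothetical post-switch lengths (limit 280)
print(prevalence_near_limit(before, 140), prevalence_near_limit(after, 280))
```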

40. RDF2Vec Light -- A Lightweight Approach for Knowledge Graph Embeddings [PDF] 返回目录
  Jan Portisch, Michael Hladik, Heiko Paulheim
Abstract: Knowledge graph embedding approaches represent the nodes and edges of graphs as mathematical vectors. Current approaches focus on embedding complete knowledge graphs, i.e. all nodes and edges. This leads to very high computational requirements on large graphs such as DBpedia or Wikidata. However, for most downstream application scenarios, only a small subset of concepts is of actual interest. In this paper, we present RDF2Vec Light, a lightweight embedding approach based on RDF2Vec which generates vectors for only a subset of entities. To that end, RDF2Vec Light traverses and processes only a subgraph of the knowledge graph. Our method makes it possible to use embeddings of very large knowledge graphs in scenarios where such embeddings were not feasible before, thanks to a significantly lower runtime and significantly reduced hardware requirements.
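
A minimal sketch of the RDF2Vec Light idea, generating random walks only from the entities of interest and feeding them to word2vec; the tiny graph, walk parameters, and use of gensim (assuming gensim >= 4.0) are illustrative, not the released implementation:

```python
import random
from gensim.models import Word2Vec  # assumes gensim >= 4.0

def random_walks(graph, start_entities, walks_per_entity=10, depth=4):
    # Generate random walks only from the entities of interest, so only the relevant
    # subgraph of the knowledge graph is ever traversed (sketch of the RDF2Vec Light idea).
    # `graph` maps an entity to a list of (predicate, object) pairs.
    walks = []
    for entity in start_entities:
        for _ in range(walks_per_entity):
            walk, node = [entity], entity
            for _ in range(depth):
                edges = graph.get(node, [])
                if not edges:
                    break
                predicate, node = random.choice(edges)
                walk += [predicate, node]
            walks.append(walk)
    return walks

graph = {"dbr:Berlin": [("dbo:country", "dbr:Germany")],
         "dbr:Germany": [("dbo:capital", "dbr:Berlin")]}
walks = random_walks(graph, ["dbr:Berlin"])
model = Word2Vec(sentences=walks, vector_size=64, window=4, sg=1, min_count=1)
print(model.wv["dbr:Berlin"].shape)  # (64,)
```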

41. CogTree: Cognition Tree Loss for Unbiased Scene Graph Generation [PDF] 返回目录
  Jing Yu, Yuan Chai, Yue Hu, Qi Wu
Abstract: Scene graphs are semantic abstraction of images that encourage visual understanding and reasoning. However, the performance of Scene Graph Generation (SGG) is unsatisfactory when faced with biased data in real-world scenarios. Conventional debiasing research mainly studies from the view of data representation, e.g. balancing data distribution or learning unbiased models and representations, ignoring the mechanism that how humans accomplish this task. Inspired by the role of the prefrontal cortex (PFC) in hierarchical reasoning, we analyze this problem from a novel cognition perspective: learning a hierarchical cognitive structure of the highly-biased relationships and navigating that hierarchy to locate the classes, making the tail classes receive more attention in a coarse-to-fine mode. To this end, we propose a novel Cognition Tree (CogTree) loss for unbiased SGG. We first build a cognitive structure CogTree to organize the relationships based on the prediction of a biased SGG model. The CogTree distinguishes remarkably different relationships at first and then focuses on a small portion of easily confused ones. Then, we propose a hierarchical loss specially for this cognitive structure, which supports coarse-to-fine distinction for the correct relationships while progressively eliminating the interference of irrelevant ones. The loss is model-independent and can be applied to various SGG models without extra supervision. The proposed CogTree loss consistently boosts the performance of several state-of-the-art models on the Visual Genome benchmark.
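
To illustrate the coarse-to-fine idea, here is a simplified two-level, tree-structured loss in PyTorch: a coarse term over the probability mass of each group of relationship classes, and a fine term renormalised within the true group only. The grouping and the exact formulation are assumptions for illustration, not the CogTree loss itself.

```python
import torch
import torch.nn.functional as F

def cogtree_style_loss(logits, target, groups):
    # Two-level sketch of a coarse-to-fine, tree-structured loss in the spirit of CogTree
    # (simplified; not the authors' exact formulation). `groups` maps each coarse node to
    # the indices of the fine-grained relationship classes beneath it.
    probs = torch.softmax(logits, dim=-1)                                   # (batch, classes)
    group_list = list(groups.values())
    class_to_group = {c: g for g, idx in enumerate(group_list) for c in idx}
    # Coarse level: cross-entropy over the probability mass of each group.
    group_probs = torch.stack([probs[:, idx].sum(-1) for idx in group_list], dim=-1)
    target_group = torch.tensor([class_to_group[int(t)] for t in target])
    coarse = F.nll_loss(torch.log(group_probs + 1e-9), target_group)
    # Fine level: renormalise within the true group only, ignoring unrelated branches.
    fine = 0.0
    for b, t in enumerate(target.tolist()):
        idx = group_list[class_to_group[t]]
        local = probs[b, idx] / (probs[b, idx].sum() + 1e-9)
        fine = fine - torch.log(local[idx.index(t)] + 1e-9)
    return coarse + fine / len(target)

groups = {"geometric": [0, 1], "possessive": [2, 3], "semantic": [4, 5]}
print(cogtree_style_loss(torch.randn(4, 6), torch.tensor([0, 3, 4, 5]), groups))
```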

42. Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation [PDF] 返回目录
  Insoo Chung, Byeongwook Kim, Yoonjung Choi, Se Jung Kwon, Yongkweon Jeon, Baeseong Park, Sangha Kim, Dongsoo Lee
Abstract: The Transformer is widely used in Neural Machine Translation (NMT). Deploying Transformer models to mobile or edge devices with limited resources is challenging because of the heavy computation and memory overhead during inference. Quantization is an effective technique to address such challenges. Our analysis shows that, for a given number of quantization bits, each block of the Transformer contributes to translation accuracy and inference computation in a different manner. Moreover, even inside an embedding block, each word presents a vastly different contribution. Correspondingly, we propose a mixed-precision quantization strategy to represent Transformer weights with lower bits (e.g. under 3 bits). For example, for each word in an embedding block, we assign different quantization bits based on statistical properties. Our quantized Transformer model achieves an 11.8x smaller model size than the baseline model, with a BLEU degradation of less than 0.5. We achieve an 8.3x reduction in run-time memory footprint and a 3.5x speedup (Galaxy N10+), such that our proposed compression strategy enables efficient on-device NMT.
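
As a toy illustration of assigning different bit widths from a statistical property, the sketch below uniformly quantizes embedding rows, giving frequent words more bits; the frequency rule, bit widths, and uniform quantizer are assumptions, and the paper's scheme is more sophisticated.

```python
import numpy as np

def quantize_row(row, bits):
    # Uniform b-bit quantization of one embedding row (sketch only; the paper's scheme
    # is more elaborate, e.g. non-uniform codebooks and per-block bit decisions).
    levels = 2 ** bits - 1
    lo, hi = row.min(), row.max()
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((row - lo) / step)
    return lo + codes * step  # dequantized row

def mixed_precision_embeddings(weight, word_freqs, high_bits=3, low_bits=1, top_frac=0.1):
    # Hypothetical rule: frequent words get more bits, rare words fewer, illustrating
    # "assign different quantization bits based on statistical properties".
    order = np.argsort(-np.asarray(word_freqs))
    frequent = set(order[: int(len(order) * top_frac)].tolist())
    return np.stack([quantize_row(weight[i], high_bits if i in frequent else low_bits)
                     for i in range(weight.shape[0])])

weight = np.random.randn(1000, 64).astype(np.float32)
freqs = np.random.zipf(1.5, size=1000)
print(mixed_precision_embeddings(weight, freqs).shape)  # (1000, 64)
```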
