Contents
1. Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO [PDF] Abstract
9. On the Evaluation of Contextual Embeddings for Zero-Shot Cross-Lingual Transfer Learning [PDF] Abstract
12. How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking [PDF] Abstract
13. Explicit Representation of the Translation Space: Automatic Paraphrasing for Machine Translation Evaluation [PDF] Abstract
19. Use of Machine Translation to Obtain Labeled Datasets for Resource-Constrained Languages [PDF] Abstract
21. Natural Language Premise Selection: Finding Supporting Statements for Mathematical Text [PDF] Abstract
24. Addressing Zero-Resource Domains Using Document-Level Context in Neural Machine Translation [PDF] Abstract
25. Bridging linguistic typology and multilingual machine translation with multi-view language representations [PDF] Abstract
26. Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! [PDF] Abstract
27. Recipes for Adapting Pre-trained Monolingual and Multilingual Models to Machine Translation [PDF] Abstract
29. Modelling Suspense in Short Stories as Uncertainty Reduction over Neural Representation [PDF] Abstract
32. Multi-Domain Spoken Language Understanding Using Domain- and Task-Aware Parameterization [PDF] Abstract
33. Mind Your Inflections! Improving NLP for Non-Standard English with Base-Inflection Encoding [PDF] Abstract
45. Conditional Augmentation for Aspect Term Extraction via Masked Sequence-to-Sequence Generation [PDF] Abstract
50. Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders [PDF] Abstract
57. Modular Representation Underlies Systematic Generalization in Neural Natural Language Inference Models [PDF] Abstract
60. Can Your Context-Aware MT System Pass the DiP Benchmark Tests? : Evaluation Benchmarks for Discourse Phenomena in Machine Translation [PDF] Abstract
62. Pretraining on Non-linguistic Structure as a Tool for Analyzing Learning Bias in Language Models [PDF] Abstract
63. EnsembleGAN: Adversarial Learning for Retrieval-Generation Ensemble Model on Short-Text Conversation [PDF] Abstract
73. Filtering before Iteratively Referring for Knowledge-Grounded Response Selection in Retrieval-Based Chatbots [PDF] Abstract
75. TextAT: Adversarial Training for Natural Language Understanding with Token-Level Perturbation [PDF] Abstract
78. Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations [PDF] Abstract
83. A Supervised Word Alignment Method based on Cross-Language Span Prediction using Multilingual BERT [PDF] Abstract
84. Instance-Based Learning of Span Representations: A Case Study through Named Entity Recognition [PDF] Abstract
87. "The Boating Store Had Its Best Sail Ever": Pronunciation-attentive Contextualized Pun Recognition [PDF] Abstract
97. MuSe 2020 -- The First International Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop [PDF] Abstract
Abstracts
1. Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO [PDF] back to contents
Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, Yinfei Yang
Abstract: Image captioning datasets have proven useful for multimodal representation learning, and a common evaluation paradigm based on multimodal retrieval has emerged. Unfortunately, datasets have only limited cross-modal associations: images are not paired with others, captions are only paired with others that describe the same image, there are no negative associations and there are missing positive cross-modal associations. This undermines retrieval evaluation and limits research into how inter-modality learning impacts intra-modality tasks. To address this gap, we create the \textit{Crisscrossed Captions} (CxC) dataset, extending MS-COCO with new semantic similarity judgments for \textbf{247,315} intra- and inter-modality pairs. We provide baseline model performance results for both retrieval and correlations with human rankings, emphasizing both intra- and inter-modality learning.
2. WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context [PDF] back to contents
Anna Breit, Artem Revenko, Kiamehr Rezaee, Mohammad Taher Pilehvar, Jose Camacho-Collados
Abstract: In this paper, we present WiC-TSV (\textit{Target Sense Verification for Words in Context}), a new multi-domain evaluation benchmark for Word Sense Disambiguation (WSD) and Entity Linking (EL). Our benchmark is different from conventional WSD and EL benchmarks for it being independent of a general sense inventory, making it highly flexible for the evaluation of a diverse set of models and systems in different domains. WiC-TSV is split into three tasks (systems get hypernymy or definitional or both hypernymy and definitional information about the target sense). Test data is available in four domains: general (WordNet), computer science, cocktails and medical concepts. Results show that existing state-of-the-art language models such as BERT can achieve a high performance in both in-domain data and out-of-domain data, but they still have room for improvement. WiC-TSV task data is available at \url{this https URL}.
3. Imitation Attacks and Defenses for Black-box Machine Translation Systems [PDF] back to contents
Eric Wallace, Mitchell Stern, Dawn Song
Abstract: We consider an adversary looking to steal or attack a black-box machine translation (MT) system, either for financial gain or to exploit model errors. We first show that black-box MT systems can be stolen by querying them with monolingual sentences and training models to imitate their outputs. Using simulated experiments, we demonstrate that MT model stealing is possible even when imitation models have different input data or architectures than their victims. Applying these ideas, we train imitation models that reach within 0.6 BLEU of three production MT systems on both high-resource and low-resource language pairs. We then leverage the similarity of our imitation models to transfer adversarial examples to the production systems. We use gradient-based attacks that expose inputs which lead to semantically-incorrect translations, dropped content, and vulgar model outputs. To mitigate these vulnerabilities, we propose a defense that modifies translation outputs in order to misdirect the optimization of imitation models. This defense degrades imitation model BLEU and attack transfer rates at some cost in BLEU and inference speed.
4. When does data augmentation help generalization in NLP? [PDF] back to contents
Rohan Jha, Charles Lovering, Ellie Pavlick
Abstract: Neural models often exploit superficial ("weak") features to achieve good performance, rather than deriving the more general ("strong") features that we'd prefer a model to use. Overcoming this tendency is a central challenge in areas such as representation learning and ML fairness. Recent work has proposed using data augmentation--that is, generating training examples on which these weak features fail--as a means of encouraging models to prefer the stronger features. We design a series of toy learning problems to investigate the conditions under which such data augmentation is helpful. We show that augmenting with training examples on which the weak feature fails ("counterexamples") does succeed in preventing the model from relying on the weak feature, but often does not succeed in encouraging the model to use the stronger feature in general. We also find in many cases that the number of counterexamples needed to reach a given error rate is independent of the amount of training data, and that this type of data augmentation becomes less effective as the target strong feature becomes harder to learn.
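As a rough illustration of the augmentation scheme described in this abstract (a toy construction, not the authors' experimental setup), the sketch below builds a synthetic task with a perfectly predictive "strong" feature and a spurious "weak" feature, then adds counterexamples on which the weak feature fails:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, p_weak_agrees):
    """Binary task: the 'strong' feature equals the label; the 'weak' feature
    agrees with the label only with probability p_weak_agrees."""
    y = rng.integers(0, 2, size=n)
    strong = y.astype(float)
    weak = np.where(rng.random(n) < p_weak_agrees, y, 1 - y).astype(float)
    noise = rng.normal(size=(n, 8))
    return np.column_stack([strong, weak, noise]), y

X_train, y_train = make_data(2000, p_weak_agrees=1.0)   # weak feature looks predictive
X_cex, y_cex = make_data(200, p_weak_agrees=0.0)        # counterexamples: weak feature fails
X_aug = np.vstack([X_train, X_cex])
y_aug = np.concatenate([y_train, y_cex])
X_test, y_test = make_data(1000, p_weak_agrees=0.5)     # weak feature uninformative at test time

for name, (X, y) in {"original": (X_train, y_train), "augmented": (X_aug, y_aug)}.items():
    acc = LogisticRegression(max_iter=1000).fit(X, y).score(X_test, y_test)
    print(f"{name} training data -> test accuracy: {acc:.3f}")
```

Varying the number of counterexamples and how hard the strong feature is to learn reproduces, in miniature, the kind of questions the paper studies.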
5. TLDR: Extreme Summarization of Scientific Documents [PDF] back to contents
Isabel Cachola, Kyle Lo, Arman Cohan, Daniel S. Weld
Abstract: We introduce TLDR generation for scientific papers, a new automatic summarization task with high source compression requiring expert background knowledge and complex language understanding. To facilitate research on this task, we introduce SciTLDR, a dataset of 3.9K TLDRs. Furthermore, we introduce a novel annotation protocol for scalably curating additional gold summaries by rewriting peer review comments. We use this protocol to augment our test set, yielding multiple gold TLDRs for evaluation, which is unlike most recent summarization datasets that assume only one valid gold summary. We present a training strategy for adapting pretrained language models that exploits similarities between TLDR generation and the related tasks of extreme summarization and title generation, which outperforms strong extractive and abstractive summarization baselines.
6. Lexical Semantic Recognition [PDF] back to contents
Nelson F. Liu, Daniel Hershcovich, Michael Kranzlein, Nathan Schneider
Abstract: Segmentation and (segment) labeling are generally treated separately in lexical semantics, raising issues due to their close inter-dependence and necessitating joint annotation. We therefore investigate the lexical semantic recognition task of multiword expression segmentation and supersense disambiguation, unifying several previously-disparate styles of lexical semantic annotation. We evaluate a neural CRF model along all annotation axes available in version 4.3 of the STREUSLE corpus: lexical unit segmentation (multiword expressions), word-level syntactic tags, and supersense classes for noun, verb, and preposition/possessive units. As the label set generalizes that of previous tasks (DiMSUM, PARSEME), we additionally evaluate how well the model generalizes to those test sets, with encouraging results. By establishing baseline models and evaluation metrics, we pave the way for comprehensive and accurate modeling of lexical semantics.
7. Few-Shot Natural Language Generation by Rewriting Templates [PDF] back to contents
Mihir Kale, Abhinav Rastogi
Abstract: Virtual assistants such as Google Assistant, Alexa and Siri enable users to interact with a large number of services and APIs on the web using natural language. The response generation module converts the actions generated by a policy module into a natural language utterance. Traditionally, template based approaches have been used for response generation in virtual assistants. However, such approaches are not feasible for commercial assistants, which need to support a large number of services. Defining templates for a large number of slot combinations for each of the services supported by large scale assistants becomes tedious. In this work, we propose a template rewriting method for Natural Language Generation (NLG), where the number of templates scales only linearly with the number of slots. A set of simple templates is used to convert actions into utterances, which are concatenated to give a semantically correct, but possibly incoherent and ungrammatical utterance. A pre-trained language model is subsequently employed to rewrite it into coherent, natural sounding text. Through automatic metrics and human evaluation, we show that our method improves over strong baselines, while being much more sample efficient.
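To make the two-stage idea concrete, here is a minimal sketch of the first stage (the template inventory, dialogue acts, and slot names below are invented for illustration); a fine-tuned pretrained LM would then rewrite the concatenated output into fluent text:

```python
# One short template per (dialogue act, slot); coverage grows only linearly in slots.
TEMPLATES = {
    ("inform", "restaurant_name"): "The restaurant is {value}.",
    ("inform", "time"): "The reservation is at {value}.",
    ("request", "party_size"): "How many people is the reservation for?",
}

def realize(actions):
    """Concatenate per-slot templates into a semantically correct but stilted utterance."""
    parts = []
    for act, slot, value in actions:
        template = TEMPLATES[(act, slot)]
        parts.append(template.format(value=value) if value is not None else template)
    return " ".join(parts)

stilted = realize([
    ("inform", "restaurant_name", "Nopa"),
    ("inform", "time", "7 pm"),
    ("request", "party_size", None),
])
print(stilted)
# Stage two (not shown): feed `stilted` to a fine-tuned seq2seq LM that rewrites it
# into a coherent, natural-sounding response.
```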
8. Word Rotator's Distance: Decomposing Vectors Gives Better Representations [PDF] back to contents
Sho Yokoi, Ryo Takahashi, Reina Akama, Jun Suzuki, Kentaro Inui
Abstract: One key principle for assessing semantic similarity between texts is to measure the degree of semantic overlap of them by considering word-by-word alignment. However, alignment-based approaches are inferior to the generic sentence vectors in terms of performance. We hypothesize that the reason for the inferiority of alignment-based methods is due to the fact that they do not distinguish word importance and word meaning. To solve this, we propose to separate word importance and word meaning by decomposing word vectors into their norm and direction, then compute the alignment-based similarity with the help of earth mover's distance. We call the method word rotator's distance (WRD) because direction vectors are aligned by rotation on the unit hypersphere. In addition, to incorporate the advance of cutting edge additive sentence encoders, we propose to re-decompose such sentence vectors into word vectors and use them as inputs to WRD. Empirically, the proposed method outperforms current methods considering the word-by-word alignment including word mover's distance with a big difference; moreover, our method outperforms state-of-the-art additive sentence encoders on the most competitive dataset, STS-benchmark.
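The norm/direction decomposition lends itself to a compact sketch. The following is an illustrative reimplementation (not the authors' code), solving the earth mover's distance as a small linear program with SciPy; vector norms supply the word weights and unit directions define a cosine transport cost:

```python
import numpy as np
from scipy.optimize import linprog

def word_rotators_distance(X, Y):
    """X: (n, d) word vectors of one sentence; Y: (m, d) of the other."""
    wx, wy = np.linalg.norm(X, axis=1), np.linalg.norm(Y, axis=1)
    a, b = wx / wx.sum(), wy / wy.sum()          # word importance as probability mass
    U, V = X / wx[:, None], Y / wy[:, None]      # unit directions carry word meaning
    C = 1.0 - U @ V.T                            # cosine distance as transport cost
    n, m = len(a), len(b)
    A_eq = np.zeros((n + m, n * m))              # flow-conservation constraints
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0         # mass leaving word i sums to a[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                  # mass entering word j sums to b[j]
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun                               # optimal transport cost = WRD

rng = np.random.default_rng(0)
print(word_rotators_distance(rng.normal(size=(5, 50)), rng.normal(size=(7, 50))))
```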
9. On the Evaluation of Contextual Embeddings for Zero-Shot Cross-Lingual Transfer Learning [PDF] back to contents
Phillip Keung, Yichao Lu, Julian Salazar, Vikas Bhardwaj
Abstract: Pre-trained multilingual contextual embeddings have demonstrated state-of-the-art performance in zero-shot cross-lingual transfer learning, where multilingual BERT is fine-tuned on some source language (typically English) and evaluated on a different target language. However, published results for baseline mBERT zero-shot accuracy vary as much as 17 points on the MLDoc classification task across four papers. We show that the standard practice of using English dev accuracy for model selection in the zero-shot setting makes it difficult to obtain reproducible results on the MLDoc and XNLI tasks. English dev accuracy is often uncorrelated (or even anti-correlated) with target language accuracy, and zero-shot cross-lingual performance varies greatly within the same fine-tuning run and between different fine-tuning runs. We recommend providing oracle scores alongside the zero-shot results: still fine-tune using English, but choose a checkpoint with the target dev set. Reporting this upper bound makes results more consistent by avoiding the variation from bad checkpoints.
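The recommendation amounts to a change in the model-selection rule. A tiny sketch (the accuracy numbers are made up purely for illustration):

```python
# Accuracy of each finetuning checkpoint on the English dev set and on the
# target-language dev set (illustrative numbers only).
checkpoints = [
    {"step": 1000, "en_dev": 0.91, "tgt_dev": 0.74},
    {"step": 2000, "en_dev": 0.93, "tgt_dev": 0.71},
    {"step": 3000, "en_dev": 0.92, "tgt_dev": 0.78},
]

standard = max(checkpoints, key=lambda c: c["en_dev"])   # common zero-shot practice
oracle = max(checkpoints, key=lambda c: c["tgt_dev"])    # recommended oracle upper bound

print(f"English-dev selection: step {standard['step']}, target accuracy {standard['tgt_dev']:.2f}")
print(f"Oracle selection:      step {oracle['step']}, target accuracy {oracle['tgt_dev']:.2f}")
```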
10. A Matter of Framing: The Impact of Linguistic Formalism on Probing Results [PDF] back to contents
Ilia Kuznetsov, Iryna Gurevych
Abstract: Deep pre-trained contextualized encoders like BERT (Delvin et al., 2019) demonstrate remarkable performance on a range of downstream tasks. A recent line of research in probing investigates the linguistic knowledge implicitly learned by these models during pre-training. While most work in probing operates on the task level, linguistic tasks are rarely uniform and can be represented in a variety of formalisms. Any linguistics-based probing study thereby inevitably commits to the formalism used to annotate the underlying data. Can the choice of formalism affect probing results? To investigate, we conduct an in-depth cross-formalism layer probing study in role semantics. We find linguistically meaningful differences in the encoding of semantic role and proto-role information by BERT depending on the formalism and demonstrate that layer probing can detect subtle differences between the implementations of the same linguistic formalism. Our results suggest that linguistic formalism is an important dimension in probing studies, along with the commonly used cross-task and cross-lingual experimental settings.
11. SegaBERT: Pre-training of Segment-aware BERT for Language Understanding [PDF] back to contents
He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun Xiong, Wen Gao, Ming Li
Abstract: Pre-trained language models have achieved state-of-the-art results in various natural language processing tasks. Most of them are based on the Transformer architecture, which distinguishes tokens with the token position index of the input sequence. However, sentence index and paragraph index are also important to indicate the token position in a document. We hypothesize that better contextual representations can be generated from the text encoder with richer positional information. To verify this, we propose a segment-aware BERT, by replacing the token position embedding of Transformer with a combination of paragraph index, sentence index, and token index embeddings. We pre-trained the SegaBERT on the masked language modeling task in BERT but without any affiliated tasks. Experimental results show that our pre-trained model can outperform the original BERT model on various NLP tasks.
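A minimal PyTorch sketch of the segment-aware input layer described here (embedding sizes and index ranges are placeholders; the paper's exact configuration may differ):

```python
import torch
import torch.nn as nn

class SegmentAwareEmbedding(nn.Module):
    """Token embedding plus paragraph-index, sentence-index, and token-index embeddings."""
    def __init__(self, vocab=30522, dim=768, max_para=64, max_sent=128, max_tok=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.para_pos = nn.Embedding(max_para, dim)
        self.sent_pos = nn.Embedding(max_sent, dim)
        self.tok_pos = nn.Embedding(max_tok, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_ids, para_ids, sent_ids, pos_ids):
        x = (self.tok(token_ids) + self.para_pos(para_ids)
             + self.sent_pos(sent_ids) + self.tok_pos(pos_ids))
        return self.norm(x)

emb = SegmentAwareEmbedding()
token_ids = torch.randint(0, 30522, (2, 16))
para_ids = torch.zeros(2, 16, dtype=torch.long)      # all tokens in paragraph 0
sent_ids = torch.randint(0, 4, (2, 16))              # which sentence each token is in
pos_ids = torch.arange(16).expand(2, 16)             # token position index
print(emb(token_ids, para_ids, sent_ids, pos_ids).shape)  # torch.Size([2, 16, 768])
```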
12. How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking [PDF] back to contents
Nicola De Cao, Michael Schlichtkrull, Wilker Aziz, Ivan Titov
Abstract: Attribution methods assess the contribution of inputs (e.g., words) to the model prediction. One way to do so is erasure: a subset of inputs is considered irrelevant if it can be removed without affecting the model prediction. Despite its conceptual simplicity, erasure is not commonly used in practice. First, the objective is generally intractable, and approximate search or leave-one-out estimates are typically used instead; both approximations may be inaccurate and remain very expensive with modern deep (e.g., BERT-based) NLP models. Second, the method is susceptible to the hindsight bias: the fact that a token can be dropped does not mean that the model `knows' it can be dropped. The resulting pruning is over-aggressive and does not reflect how the model arrives at the prediction. To deal with these two challenges, we introduce Differentiable Masking. DiffMask relies on learning sparse stochastic gates (i.e., masks) to completely mask-out subsets of the input while maintaining end-to-end differentiability. The decision to include or disregard an input token is made with a simple linear model based on intermediate hidden layers of the analyzed model. First, this makes the approach efficient at test time because we predict rather than search. Second, as with probing classifiers, this reveals what the network `knows' at the corresponding layers. This lets us not only plot attribution heatmaps but also analyze how decisions are formed across network layers. We use DiffMask to study BERT models on sentiment classification and question answering.
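The "sparse stochastic gates" mentioned above can be illustrated with a Hard Concrete gate (a standard relaxation for differentiable masking, shown here only for illustration; DiffMask's exact parameterization and its amortized gate predictor are not reproduced):

```python
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Stochastic gates in [0, 1] that can reach exactly 0 (masked out) while
    remaining differentiable with respect to their parameters."""
    def __init__(self, n_inputs, temperature=0.5, limit_l=-0.1, limit_r=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_inputs))
        self.t, self.l, self.r = temperature, limit_l, limit_r

    def forward(self):
        u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.t)
        s = s * (self.r - self.l) + self.l           # stretch past [0, 1] ...
        return s.clamp(0.0, 1.0)                     # ... then clip, so exact 0/1 occur

    def expected_l0(self):
        """Differentiable sparsity penalty: expected number of open gates."""
        bias = self.t * torch.log(torch.tensor(-self.l / self.r))
        return torch.sigmoid(self.log_alpha - bias).sum()

gate = HardConcreteGate(n_inputs=10)                 # one gate per input token
hidden = torch.randn(10, 16)                         # toy token representations
masked = gate().unsqueeze(-1) * hidden               # mask out a subset of the input
loss = masked.sum() + 0.1 * gate.expected_l0()       # toy task term + sparsity term
loss.backward()
```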
13. Explicit Representation of the Translation Space: Automatic Paraphrasing for Machine Translation Evaluation [PDF] back to contents
Rachel Bawden, Biao Zhang, Lisa Yankovskaya, Andre Tättar, Matt Post
Abstract: Following previous work on automatic paraphrasing, we assess the feasibility of improving BLEU (Papineni et al., 2002) using state-of-the-art neural paraphrasing techniques to generate additional references. We explore the extent to which diverse paraphrases can adequately cover the space of valid translations and compare to an alternative approach of generating paraphrases constrained by MT outputs. We compare both approaches to human-produced references in terms of diversity and the improvement in BLEU's correlation with human judgements of MT quality. Our experiments on the WMT19 metrics tasks for all into-English language directions show that somewhat surprisingly, the addition of diverse paraphrases, even those produced by humans, leads to only small, inconsistent changes in BLEU's correlation with human judgments, suggesting that BLEU's ability to correctly exploit multiple references is limited
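For context on why extra references might help: in BLEU, n-gram counts are clipped against the maximum count over all references, so a paraphrased reference can only add matching n-grams. A small example with NLTK (not the paper's evaluation code; tokenization and smoothing choices are illustrative):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

hypotheses = ["the cat sat on the mat".split()]

# One original reference plus one paraphrased reference for the same segment.
single_ref = [["a cat was sitting on the mat".split()]]
multi_ref = [["a cat was sitting on the mat".split(),
              "the cat sat on a rug".split()]]

smooth = SmoothingFunction().method1
print("single reference:", corpus_bleu(single_ref, hypotheses, smoothing_function=smooth))
print("with paraphrase: ", corpus_bleu(multi_ref, hypotheses, smoothing_function=smooth))
```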
14. Control, Generate, Augment: A Scalable Framework for Multi-Attribute Text Generation [PDF] back to contents
Giuseppe Russo, Nora Hollenstein, Claudiu Musat, Ce Zhang
Abstract: In this work, we present a text generation approach with multi-attribute control for data augmentation. We introduce CGA, a Variational Autoencoder architecture, to control, generate, and augment text. CGA is able to generate natural sentences with multiple controlled attributes by combining adversarial learning with a context-aware loss. The scalability of our approach is established through a single discriminator, independently of the number of attributes. As the main application of our work, we test the potential of this new model in a data augmentation use case. In a downstream NLP task, the sentences generated by our CGA model not only show significant improvements over a strong baseline, but also a classification performance very similar to real data. Furthermore, we are able to show high quality, diversity and attribute control in the generated sentences through a series of automatic and human assessments.
15. Paraphrasing vs Coreferring: Two Sides of the Same Coin [PDF] back to contents
Yehudit Meged, Avi Caciularu, Vered Shwartz, Ido Dagan
Abstract: We study the potential synergy between two different NLP tasks, both confronting lexical variability: identifying predicate paraphrases and event coreference resolution. First, we used annotations from an event coreference dataset as distant supervision to re-score heuristically-extracted predicate paraphrases. The new scoring gained more than 18 points in average precision upon their ranking by the original scoring method. Then, we used the same re-ranking features as additional inputs to a state-of-the-art event coreference resolution model, which yielded modest but consistent improvements to the model's performance. The results suggest a promising direction to leverage data and models for each of the tasks to the benefit of the other.
16. Investigating Transferability in Pretrained Language Models [PDF] back to contents
Alex Tamkin, Trisha Singh, Davide Giovanardi, Noah Goodman
Abstract: While probing is a common technique for identifying knowledge in the representations of pretrained models, it is unclear whether this technique can explain the downstream success of models like BERT which are trained end-to-end during finetuning. To address this question, we compare probing with a different measure of transferability: the decrease in finetuning performance of a partially-reinitialized model. This technique reveals that in BERT, layers with high probing accuracy on downstream GLUE tasks are neither necessary nor sufficient for high accuracy on those tasks. In addition, dataset size impacts layer transferability: the less finetuning data one has, the more important the middle and later layers of BERT become. Furthermore, BERT does not simply find a better initializer for individual layers; instead, interactions between layers matter and reordering BERT's layers prior to finetuning significantly harms evaluation metrics. These results provide a way of understanding the transferability of parameters in pretrained language models, revealing the fluidity and complexity of transfer learning in these models.
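A generic PyTorch sketch of the partial-reinitialization measure described above (a stand-in encoder; with a pretrained BERT one would operate on its layer list instead). Only modules exposing reset_parameters are re-initialized; raw parameters without that hook are left untouched in this simplification:

```python
import torch.nn as nn

def reinitialize_from(layers, first_reset_idx):
    """Keep layers [0, first_reset_idx) pretrained; re-initialize the rest in place."""
    for idx, layer in enumerate(layers):
        if idx >= first_reset_idx:
            for module in layer.modules():
                if hasattr(module, "reset_parameters"):
                    module.reset_parameters()

# Stand-in for a pretrained encoder (for BERT this would be model.encoder.layer).
encoder = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(4)
])
reinitialize_from(encoder, first_reset_idx=2)   # layers 2 and 3 lose their weights
# Finetune as usual afterwards; the drop in dev accuracy relative to the fully
# pretrained model is the transferability measure discussed in the abstract.
```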
17. Fact or Fiction: Verifying Scientific Claims [PDF] back to contents
David Wadden, Kyle Lo, Lucy Lu Wang, Shanchuan Lin, Madeleine van Zuylen, Arman Cohan, Hannaneh Hajishirzi
Abstract: We introduce the task of scientific fact-checking. Given a corpus of scientific articles and a claim about a scientific finding, a fact-checking model must identify abstracts that support or refute the claim. In addition, it must provide rationales for its predictions in the form of evidentiary sentences from the retrieved abstracts. For this task, we introduce SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts, and annotated with labels and rationales. We present a baseline model and assess its performance on SciFact. We observe that, while fact-checking models trained on Wikipedia articles or political news have difficulty generalizing to our task, simple domain adaptation techniques represent a promising avenue for improvement. Finally, we provide initial results showing how our model can be used to verify claims relevant to COVID-19 on the CORD-19 corpus. Our dataset will be made publicly available at this https URL.
18. PlotMachines: Outline-Conditioned Generation with Dynamic Plot State Tracking [PDF] back to contents
Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, Jianfeng Gao
Abstract: We propose the task of outline-conditioned story generation: given an outline as a set of phrases that describe key characters and events to appear in a story, the task is to generate a coherent narrative that is consistent with the provided outline. This task is challenging as the input only provides a rough sketch of the plot, and thus, models need to generate a story by weaving through the key points provided in the outline. This requires the model to keep track of the dynamic states of the latent plot, conditioning on the input outline while generating the full story. We present PlotMachines, a neural narrative model that learns to transform an outline into a coherent story by tracking the dynamic plot states. In addition, we enrich PlotMachines with high-level discourse structure so that the model can learn different styles of writing corresponding to different parts of the narrative. Comprehensive experiments over three fiction and non-fiction datasets demonstrate that recently introduced large-scale language models, such as GPT-2 and Grover, despite their impressive generation performance, are not sufficient in generating coherent narratives for the given outline, and dynamic plot state tracking is important for composing narratives with tighter, more consistent plots.
19. Use of Machine Translation to Obtain Labeled Datasets for Resource-Constrained Languages [PDF] back to contents
Emrah Budur, Rıza Özçelik, Tunga Güngör, Christopher Potts
Abstract: The large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress for other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive response to this for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their quality. As examples of the new issues that these datasets help us address, we assess the value of Turkish-specific embeddings and the importance of morphological parsing for developing robust Turkish NLI models.
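The recipe itself is simple to sketch. Below, `translate` is a hypothetical stand-in for a commercial MT call (it returns its input unchanged so the snippet runs end-to-end); labels carry over unchanged, which is what makes the approach cheap:

```python
def translate(text, src="en", tgt="tr"):
    """Hypothetical placeholder for a commercial MT API call; replace with a real client."""
    return text  # identity stand-in so the sketch is runnable

def translate_nli_dataset(examples):
    """Translate premise/hypothesis pairs; the entailment label transfers as-is."""
    return [
        {
            "premise": translate(ex["premise"]),
            "hypothesis": translate(ex["hypothesis"]),
            "label": ex["label"],   # entailment / neutral / contradiction
        }
        for ex in examples
    ]

english_examples = [
    {"premise": "A man is playing a guitar.", "hypothesis": "A person makes music.", "label": "entailment"},
]
print(translate_nli_dataset(english_examples))
```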
20. Multitask Learning for Cross-Lingual Transfer of Semantic Dependencies [PDF] 返回目录
Maryam Aminian, Mohammad Sadegh Rasooli, Mona Diab
Abstract: We describe a method for developing broad-coverage semantic dependency parsers for languages for which no semantically annotated resource is available. We leverage a multitask learning framework coupled with an annotation projection method. We transfer supervised semantic dependency parse annotations from a rich-resource language to a low-resource language through parallel data, and train a semantic parser on projected data. We make use of supervised syntactic parsing as an auxiliary task in a multitask learning framework, and show that with different multitask learning settings, we consistently improve over the single-task baseline. In the setting in which English is the source, and Czech is the target language, our best multitask model improves the labeled F1 score over the single-task baseline by 1.8 in the in-domain SemEval data (Oepen et al., 2015), as well as 2.5 in the out-of-domain test set. Moreover, we observe that syntactic and semantic dependency direction match is an important factor in improving the results.
21. Natural Language Premise Selection: Finding Supporting Statements for Mathematical Text [PDF] 返回目录
Deborah Ferreira, Andre Freitas
Abstract: Mathematical text is written using a combination of words and mathematical expressions. This combination, along with a specific way of structuring sentences makes it challenging for state-of-art NLP tools to understand and reason on top of mathematical discourse. In this work, we propose a new NLP task, the natural premise selection, which is used to retrieve supporting definitions and supporting propositions that are useful for generating an informal mathematical proof for a particular statement. We also make available a dataset, NL-PS, which can be used to evaluate different approaches for the natural premise selection task. Using different baselines, we demonstrate the underlying interpretation challenges associated with the task.
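For intuition, a toy retrieval baseline for this task (illustrative only, not one of the NL-PS baselines from the paper) ranks candidate premises by TF-IDF cosine similarity to the query statement:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_premises(statement, premises):
        # fit a shared TF-IDF space over the statement and all candidate premises
        vec = TfidfVectorizer().fit(premises + [statement])
        sims = cosine_similarity(vec.transform([statement]), vec.transform(premises))[0]
        return sorted(zip(premises, sims), key=lambda pair: -pair[1])

    candidates = ["Definition of an odd integer.",
                  "Every integer greater than 1 has a prime factor.",
                  "The square root of 2 is irrational."]
    print(rank_premises("Every prime greater than 2 is odd.", candidates))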
22. A Call for More Rigor in Unsupervised Cross-lingual Learning [PDF] 返回目录
Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre
Abstract: We review motivations, definition, approaches, and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them. An existing rationale for such research is based on the lack of parallel data for many of the world's languages. However, we argue that a scenario without any parallel data and abundant monolingual data is unrealistic in practice. We also discuss different training signals that have been used in previous work, which depart from the pure unsupervised setting. We then describe common methodological issues in tuning and evaluation of unsupervised cross-lingual models and present best practices. Finally, we provide a unified outlook for different types of research in this area (i.e., cross-lingual word embeddings, deep multilingual pretraining, and unsupervised machine translation) and argue for comparable evaluation of these models.
23. Language Model Prior for Low-Resource Neural Machine Translation [PDF] 返回目录
Christos Baziotis, Barry Haddow, Alexandra Birch
Abstract: The scarcity of large parallel corpora is an important obstacle for neural machine translation. A common solution is to exploit the knowledge of language models (LM) trained on abundant monolingual data. In this work, we propose a novel approach to incorporate a LM as prior in a neural translation model (TM). Specifically, we add a regularization term, which pushes the output distributions of the TM to be probable under the LM prior, while avoiding wrong predictions when the TM "disagrees" with the LM. This objective relates to knowledge distillation, where the LM can be viewed as teaching the TM about the target language. The proposed approach does not compromise decoding speed, because the LM is used only at training time, unlike previous work that requires it during inference. We present an analysis of the effects that different methods have on the distributions of the TM. Results on two low-resource machine translation datasets show clear improvements even with limited monolingual data.
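A minimal sketch of the training objective described above, assuming the regularization term is a KL divergence pushing the translation model's output distribution toward the LM prior; the exact formulation and weighting in the paper may differ (PyTorch, illustrative only).

    import torch
    import torch.nn.functional as F

    def tm_with_lm_prior_loss(tm_logits, lm_logits, targets, lam=0.1):
        nll = F.cross_entropy(tm_logits, targets)                 # standard translation loss
        reg = F.kl_div(F.log_softmax(tm_logits, dim=-1),
                       F.softmax(lm_logits, dim=-1),
                       reduction="batchmean")                     # push TM outputs toward the LM prior
        return nll + lam * reg

    tm_logits, lm_logits = torch.randn(4, 1000), torch.randn(4, 1000)
    targets = torch.randint(0, 1000, (4,))
    print(tm_with_lm_prior_loss(tm_logits, lm_logits, targets))

Note that, as in the paper, the LM enters only through the training loss; nothing extra is needed at decoding time.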
24. Addressing Zero-Resource Domains Using Document-Level Context in Neural Machine Translation [PDF] 返回目录
Dario Stojanovski, Alexander Fraser
Abstract: Achieving satisfying performance in machine translation on domains for which there is no training data is challenging. Traditional domain adaptation is not suitable for addressing such zero-resource domains because it relies on in-domain parallel data. We show that document-level context can be used to capture domain generalities when in-domain parallel data is not available. We present two document-level Transformer models which are capable of using large context sizes and we compare these models against strong Transformer baselines. We obtain improvements for the two zero-resource domains we study. We additionally present experiments showing the usefulness of large context when modeling multiple domains at once.
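The document-level models themselves are more elaborate than can be shown here, but the basic input construction they rely on can be sketched: prepend the previous k source sentences to the current one, marked off by a break token. The <BRK> symbol and the value of k below are assumptions, not the paper's exact setup.

    def with_context(doc_sentences, i, k=3, sep=" <BRK> "):
        # prepend up to k previous sentences as document-level context
        context = doc_sentences[max(0, i - k):i]
        return sep.join(context + [doc_sentences[i]])

    doc = ["Der Motor startet nicht.",
           "Pruefen Sie die Sicherung.",
           "Ersetzen Sie sie bei Bedarf."]
    print(with_context(doc, 2))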
25. Bridging linguistic typology and multilingual machine translation with multi-view language representations [PDF] 返回目录
Arturo Oncevay, Barry Haddow, Alexandra Birch
Abstract: Sparse language vectors from linguistic typology databases and learned embeddings from tasks like multilingual machine translation have been investigated in isolation, without analysing how they could benefit from each other's language characterisation. We propose to fuse both views using singular vector canonical correlation analysis and study what kind of information is induced from each source. By inferring typological features and language phylogenies, we observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy in tasks that require information about language similarities, such as language clustering and ranking candidates for multilingual transfer. With our method, we can easily project and assess new languages without expensive retraining of massive multilingual or ranking models, which are major disadvantages of related approaches.
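As a rough illustration of the fusion step, the sketch below uses plain CCA from scikit-learn as a stand-in for the singular vector CCA used in the paper, on toy data; the dimensions and the concatenation of the two projected views are assumptions.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    typology = np.random.rand(50, 100)   # 50 languages x 100 typological features (toy)
    learned = np.random.rand(50, 64)     # 50 languages x 64 learned NMT embedding dims (toy)

    cca = CCA(n_components=16).fit(typology, learned)
    view_a, view_b = cca.transform(typology, learned)
    multi_view = np.concatenate([view_a, view_b], axis=1)   # fused multi-view language vectors
    print(multi_view.shape)                                  # (50, 32)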
26. Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! [PDF] 返回目录
Suzanna Sia, Ayush Dalmia, Sabrina J. Mielke
Abstract: Topic models are a useful analysis tool to uncover the underlying themes within document collections. Probabilistic models which assume a generative story have been the dominant approach for topic modeling. We propose an alternative approach based on clustering readily available pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words. We provide benchmarks for the combination of different word embeddings and clustering algorithms, and analyse their performance under dimensionality reduction with PCA. The best performing combination for our approach is comparable to classical models, and complexity analysis indicate that this is a practical alternative to traditional topic modeling.
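A compact illustration of the core idea (omitting the paper's weighted clustering and reranking): run k-means over pre-trained word vectors and read off the words closest to each centroid as that topic's top words. The embeddings below are random stand-ins for GloVe/BERT vectors.

    import numpy as np
    from sklearn.cluster import KMeans

    vocab = ["game", "team", "score", "election", "vote", "senate"]   # toy vocabulary
    emb = np.random.rand(len(vocab), 50)                              # stand-in word embeddings

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(emb)
    for k, center in enumerate(km.cluster_centers_):
        dists = np.linalg.norm(emb - center, axis=1)
        top = [vocab[i] for i in np.argsort(dists)[:3]]
        print(f"topic {k}: {top}")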
27. Recipes for Adapting Pre-trained Monolingual and Multilingual Models to Machine Translation [PDF] 返回目录
Asa Cooper Stickland, Xian Li, Marjan Ghazvininejad
Abstract: There has been recent success in pre-training on monolingual data and fine-tuning on Machine Translation (MT), but it remains unclear how to best leverage a pre-trained model for a given MT task. This paper investigates the benefits and drawbacks of freezing parameters, and adding new ones, when fine-tuning a pre-trained model on MT. We focus on 1) Fine-tuning a model trained only on English monolingual data, BART. 2) Fine-tuning a model trained on monolingual data from 25 languages, mBART. For BART we get the best performance by freezing most of the model parameters, and adding extra positional embeddings. For mBART we match the performance of naive fine-tuning for most language pairs, and outperform it for Nepali to English (0.5 BLEU) and Czech to English (0.6 BLEU), all with a lower memory cost at training time. When constraining ourselves to an out-of-domain training set for Vietnamese to English we outperform the fine-tuning baseline by 0.9 BLEU.
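The "freeze most parameters" recipe can be sketched in a few lines; the sketch below assumes a Hugging Face BART checkpoint and an illustrative choice of which modules stay trainable (positional embeddings and layer norms), which is not necessarily the paper's exact recipe.

    from transformers import BartForConditionalGeneration

    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
    for name, p in model.named_parameters():
        # keep only positional embeddings and layer norms trainable, freeze everything else
        p.requires_grad = any(key in name for key in ("embed_positions", "layernorm", "layer_norm"))
    print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")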
28. You are right. I am ALARMED -- But by Climate Change Counter Movement [PDF] 返回目录
Shraey Bhatia, Jey Han Lau, Timothy Baldwin
Abstract: The world is facing the challenge of climate crisis. Despite the consensus in the scientific community about anthropogenic global warming, the web is flooded with articles spreading climate misinformation. These articles are carefully constructed by climate change counter movement (cccm) organizations to influence the narrative around climate change. We revisit the literature on climate misinformation in social sciences and repackage it to introduce it to the NLP community. Despite considerable work in detection of fake news, there is no misinformation dataset available that is specific to the domain of climate change. We try to bridge this gap by scraping and releasing articles with known climate change misinformation.
29. Modelling Suspense in Short Stories as Uncertainty Reduction over Neural Representation [PDF] 返回目录
David Wilmot, Frank Keller
Abstract: Suspense is a crucial ingredient of narrative fiction, engaging readers and making stories compelling. While there is a vast theoretical literature on suspense, it is computationally not well understood. We compare two ways for modelling suspense: surprise, a backward-looking measure of how unexpected the current state is given the story so far; and uncertainty reduction, a forward-looking measure of how unexpected the continuation of the story is. Both can be computed either directly over story representations or over their probability distributions. We propose a hierarchical language model that encodes stories and computes surprise and uncertainty reduction. Evaluating against short stories annotated with human suspense judgements, we find that uncertainty reduction over representations is the best predictor, resulting in near-human accuracy. We also show that uncertainty reduction can be used to predict suspenseful events in movie synopses.
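Under one plausible reading of the two measures (a toy paraphrase, not the paper's exact definitions), surprise can be computed over consecutive story-state representations and uncertainty reduction over distributions of possible continuations:

    import numpy as np

    def surprise(h_prev, h_curr):
        # backward-looking: how far the current state moved from the previous one
        return np.linalg.norm(h_curr - h_prev)

    def uncertainty_reduction(p_before, p_after):
        # forward-looking: how much the entropy over continuations dropped
        entropy = lambda p: -np.sum(p * np.log(p + 1e-12))
        return entropy(p_before) - entropy(p_after)

    h1, h2 = np.random.rand(128), np.random.rand(128)
    p1 = np.full(10, 0.1)
    p2 = np.array([0.91] + [0.01] * 9)
    print(surprise(h1, h2), uncertainty_reduction(p1, p2))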
30. MLSUM: The Multilingual Summarization Corpus [PDF] 返回目录
Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano
Abstract: We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.
31. Analyzing the Surprising Variability in Word Embedding Stability Across Languages [PDF] 返回目录
Laura Burdick, Jonathan K. Kummerfeld, Rada Mihalcea
Abstract: Word embeddings are powerful representations that form the foundation of many natural language processing architectures and tasks, both in English and in other languages. To gain further insight into word embeddings in multiple languages, we explore their stability, defined as the overlap between the nearest neighbors of a word in different embedding spaces. We discuss linguistic properties that are related to stability, drawing out insights about how morphological and other features relate to stability. This has implications for the usage of embeddings, particularly in research that uses embeddings to study language trends.
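The stability measure is easy to state concretely: the overlap between a word's k nearest neighbors in two embedding spaces. The sketch below uses cosine neighbors on random toy matrices; the neighbor count and the choice of spaces in the paper may differ.

    import numpy as np

    def nearest_neighbors(emb, idx, k=10):
        x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sims = x @ x[idx]
        return set(np.argsort(-sims)[1:k + 1])       # skip the word itself

    def stability(emb_a, emb_b, idx, k=10):
        na, nb = nearest_neighbors(emb_a, idx, k), nearest_neighbors(emb_b, idx, k)
        return len(na & nb) / k                       # fraction of shared neighbors

    emb_a, emb_b = np.random.rand(1000, 50), np.random.rand(1000, 50)
    print(stability(emb_a, emb_b, idx=0))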
32. Multi-Domain Spoken Language Understanding Using Domain- and Task-Aware Parameterization [PDF] 返回目录
Libo Qin, Minheng Ni, Yue Zhang, Wanxiang Che, Yangming Li, Ting Liu
Abstract: Spoken language understanding has been addressed as a supervised learning problem, where a set of training data is available for each domain. However, annotating data for each domain is both financially costly and non-scalable, so we should fully utilize information across all domains. One existing approach solves the problem by conducting multi-domain learning, using shared parameters for joint training across domains. We propose to improve the parameterization of this method by using domain-specific and task-specific model parameters to improve knowledge learning and transfer. Experiments on 5 domains show that our model is more effective for multi-domain SLU and obtains the best results. In addition, we show its transferability by outperforming the prior best model by 12.4% when adapting to a new domain with little data.
33. Mind Your Inflections! Improving NLP for Non-Standard English with Base-Inflection Encoding [PDF] 返回目录
Samson Tan, Shafiq Joty, Lav R. Varshney, Min-Yen Kan
Abstract: Morphological inflection is a process of word formation where base words are modified to express different grammatical categories such as tense, case, voice, person, or number. World Englishes, such as Colloquial Singapore English (CSE) and African American Vernacular English (AAVE), differ from Standard English dialects in inflection use. Although comprehension by human readers is usually unimpaired by non-standard inflection use, NLP systems are not so robust. We introduce a new Base-Inflection Encoding of English text that is achieved by combining linguistic and statistical techniques. Fine-tuning pre-trained NLP models for downstream tasks under this novel encoding achieves robustness to non-standard inflection use while maintaining performance on Standard English examples. Models using this encoding also generalize better to non-standard dialects without explicit training. We suggest metrics to evaluate tokenizers and extensive model-independent analyses demonstrate the efficacy of the encoding when used together with data-driven subword tokenizers.
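To make the encoding concrete, here is a deliberately crude stand-in: each word is replaced by a base form plus a separate inflection symbol. The suffix-stripping rules and tag names below are toy assumptions; the paper's encoder is linguistically informed rather than rule-of-thumb.

    def bie_encode(tokens):
        out = []
        for tok in tokens:
            for suffix, tag in (("ing", "<INFL:ing>"), ("ed", "<INFL:ed>"), ("s", "<INFL:s>")):
                if tok.endswith(suffix) and len(tok) > len(suffix) + 1:
                    out += [tok[:-len(suffix)], tag]   # base form, then inflection symbol
                    break
            else:
                out.append(tok)
        return out

    print(bie_encode("she keeps going to the stores".split()))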
34. TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [PDF] 返回目录
Christoph Alt, Aleksandra Gabryszak, Leonhard Hennig
Abstract: TACRED (Zhang et al., 2017) is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE). But, even with recent advances in unsupervised pre-training and knowledge enhanced neural RE, models still show a high error rate. In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement? And how do crowd annotations, dataset, and models contribute to this error rate? To answer these questions, we first validate the most challenging 5K examples in the development and test sets using trained annotators. We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled. On the relabeled test set the average F1 score of a large baseline model set improves from 62.1 to 70.1. After validation, we analyze misclassifications on the challenging instances, categorize them into linguistically motivated error groups, and verify the resulting error hypotheses on three state-of-the-art RE models. We show that two groups of ambiguous relations are responsible for most of the remaining errors and that models may adopt shallow heuristics on the dataset when entities are not masked.
35. Enriched Pre-trained Transformers for Joint Slot Filling and Intent Detection [PDF] 返回目录
Momchil Hardalov, Ivan Koychev, Preslav Nakov
Abstract: Detecting the user's intent and finding the corresponding slots among the utterance's words are important tasks in natural language understanding. Their interconnected nature makes their joint modeling a standard part of training such models. Moreover, data scarceness and specialized vocabularies pose additional challenges. Recently, the advances in pre-trained language models, namely contextualized models such as ELMo and BERT, have revolutionized the field by tapping the potential of training very large models with just a few steps of fine-tuning on a task-specific dataset. Here, we leverage such a model, namely BERT, and we design a novel architecture on top of it. Moreover, we propose an intent pooling attention mechanism, and we reinforce the slot filling task by fusing intent distributions, word features, and token representations. The experimental results on standard datasets show that our model outperforms both the current non-BERT state of the art as well as some stronger BERT-based baselines.
36. The role of context in neural pitch accent detection in English [PDF] 返回目录
Elizabeth Nielsen, Mark Steedman, Sharon Goldwater
Abstract: Prosody is a rich information source in natural language, serving as a marker for phenomena such as contrast. In order to make this information available to downstream tasks, we need a way to detect prosodic events in speech. We propose a new model for pitch accent detection, inspired by the work of Stehwien et al. (2018), who presented a CNN-based model for this task. Our model makes greater use of context by using full utterances as input and adding an LSTM layer. We find that these innovations lead to an improvement from 87.5% to 88.7% accuracy on pitch accent detection on American English speech in the Boston University Radio News Corpus, a state-of-the-art result. We also find that a simple baseline that just predicts a pitch accent on every content word yields 82.2% accuracy, and we suggest that this is the appropriate baseline for this task. Finally, we conduct ablation tests that show pitch is the most important acoustic feature for this task and this corpus.
37. Do Neural Models Learn Systematicity of Monotonicity Inference in Natural Language? [PDF] 返回目录
Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Kentaro Inui
Abstract: Despite the success of language models using neural networks, it remains unclear to what extent neural models have the generalization ability to perform inferences. In this paper, we introduce a method for evaluating whether neural models can learn systematicity of monotonicity inference in natural language, namely, the regularity for performing arbitrary inferences with generalization on composition. We consider four aspects of monotonicity inferences and test whether the models can systematically interpret lexical and logical phenomena on different training/test splits. A series of experiments show that three neural models systematically draw inferences on unseen combinations of lexical and logical phenomena when the syntactic structures of the sentences are similar between the training and test sets. However, the performance of the models significantly decreases when the structures are slightly changed in the test set while retaining all vocabularies and constituents already appearing in the training set. This indicates that the generalization ability of neural models is limited to cases where the syntactic structures are nearly the same as those in the training set.
38. Accurate Word Alignment Induction from Neural Machine Translation [PDF] 返回目录
Yun Chen, Yang Liu, Guanhua Chen, Xin Jiang, Qun Liu
Abstract: Despite its original goal to jointly learn to align and translate, prior research suggests that the state-of-the-art neural machine translation model Transformer captures poor word alignment through its attention mechanism. In this paper, we show that attention weights do capture accurate word alignment, which could only be revealed if we choose the correct decoding step and layer to induce word alignment. We propose to induce alignment with the to-be-aligned target token as the decoder input and present two simple but effective interpretation methods for word alignment induction, either through the attention weights or the leave-one-out measures. In contrast to previous studies, we find that attention weights capture better word alignment than the leave-one-out measures under our setting. Using the proposed method with attention weights, we greatly improve over fast-align on word alignment induction. Finally, we present a multi-task learning framework to train the Transformer model and show that by incorporating GIZA++ alignments into our multi-task training, we can induce significantly better alignments than GIZA++.
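The core readout of attention-based alignment induction is short enough to sketch: given a decoder cross-attention matrix (target x source) taken from the right layer and decoding step, align each target token to its highest-weight source token. The layer/step selection and the leave-one-out variant follow the paper; the code below is only the final argmax step on toy data.

    import numpy as np

    attn = np.random.rand(5, 7)                        # toy attention: 5 target x 7 source tokens
    attn = attn / attn.sum(axis=1, keepdims=True)      # row-normalize like attention weights
    alignment = attn.argmax(axis=1)                    # source index aligned to each target token
    print(list(enumerate(alignment)))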
39. Vocabulary Adaptation for Distant Domain Adaptation in Neural Machine Translation [PDF] 返回目录
Shoetsu Sato, Jin Sakuma, Naoki Yoshinaga, Masashi Toyoda, Masaru Kitsuregawa
Abstract: Neural machine translation (NMT) models do not work well in domains different from the training data. The standard approach to this problem is to build a small parallel data in the target domain and perform domain adaptation from a source domain where massive parallel data is available. However, domain adaptation between distant domains (e.g., subtitles and research papers) does not perform effectively because of mismatches in vocabulary; it will encounter many domain-specific unknown words (e.g., `angstrom') and words whose meanings shift across domains (e.g., `conductor'). In this study, aiming to solve these vocabulary mismatches in distant domain adaptation, we propose vocabulary adaptation, a simple method for effective fine-tuning that adapts embedding layers in a given pre-trained NMT model to the target domain. Prior to fine-tuning, our method replaces word embeddings in embedding layers of the NMT model, by projecting general word embeddings induced from monolingual data in the target domain onto the source-domain embedding space. Experimental results on distant domain adaptation for English-to-Japanese translation and German-to-English translation indicate that our vocabulary adaptation improves the performance of fine-tuning by 3.6 BLEU points.
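The embedding-replacement step can be sketched as a linear projection fitted on the vocabulary shared between the two embedding spaces and then applied to domain-specific words; the least-squares map below is an assumption about how such a projection could be learned, not necessarily the paper's exact procedure.

    import numpy as np

    shared_tgt = np.random.rand(500, 300)   # target-domain vectors for words shared with the NMT vocab (toy)
    shared_src = np.random.rand(500, 300)   # the NMT model's embedding-layer vectors for the same words (toy)
    W, *_ = np.linalg.lstsq(shared_tgt, shared_src, rcond=None)   # least-squares projection matrix

    new_word_vec = np.random.rand(300)      # a domain-specific word unseen by the NMT model
    projected = new_word_vec @ W            # its vector in the NMT embedding space
    print(projected.shape)                  # (300,)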
40. Knowledge Graph Empowered Entity Description Generation [PDF] 返回目录
Liying Cheng, Yan Zhang, Dekun Wu, Zhanming Jie, Lidong Bing, Wei Lu, Luo Si
Abstract: Existing works on KG-to-text generation take as input a few RDF triples or key-value pairs conveying the knowledge of some entities to generate a natural language description. Existing datasets, such as WikiBIO, WebNLG, and E2E, basically have a good alignment between an input triple/pair set and its output text. However, in practice, the input knowledge could be more than enough, because the output description may only want to cover the most significant knowledge. In this paper, we introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text. Our dataset involves exploring large knowledge graphs (KG) to retrieve abundant knowledge of various types of main entities, which makes current graph-to-sequence models suffer severely from the problems of information loss and parameter explosion while generating the description text. We address these challenges by proposing a multi-graph structure that is able to represent the original graph information more comprehensively. Furthermore, we also incorporate aggregation methods that learn to ensemble the rich graph information. Extensive experiments demonstrate the effectiveness of our model architecture.
41. STARC: Structured Annotations for Reading Comprehension [PDF] 返回目录
Yevgeni Berzak, Jonathan Malmaud, Roger Levy
Abstract: We present STARC (Structured Annotations for Reading Comprehension), a new annotation framework for assessing reading comprehension with multiple choice questions. Our framework introduces a principled structure for the answer choices and ties them to textual span annotations. The framework is implemented in OneStopQA, a new high-quality dataset for evaluation and analysis of reading comprehension in English. We use this dataset to demonstrate that STARC can be leveraged for a key new application for the development of SAT-like reading comprehension materials: automatic annotation quality probing via span ablation experiments. We further show that it enables in-depth analyses and comparisons between machine and human reading comprehension behavior, including error distributions and guessing ability. Our experiments also reveal that the standard multiple choice dataset in NLP, RACE, is limited in its ability to measure reading comprehension. 47% of its questions can be guessed by machines without accessing the passage, and 18% are unanimously judged by humans as not having a unique correct answer. OneStopQA provides an alternative test set for reading comprehension which alleviates these shortcomings and has a substantially higher human ceiling performance.
摘要:我们提出STARC(结构化注解阅读理解),对于选择题评估阅读理解一个新的注释框架。我们的架构引入了对答案的选择和联系他们的文本跨度注释的原则性结构。该框架是在OneStopQA,一个新的高品质的数据集进行评估,并在英语阅读理解分析来实现。我们用这个数据集来证明STARC可以利用为发展的重点新应用SAT类阅读理解材料:通过跨消融实验自动标注质量探测。进一步的研究表明它能够深入分析和机器和人类阅读理解的行为,包括误差分布和猜测能力之间的比较。我们的实验还表明,在自然语言处理,RACE标准选择题数据集,在其测量阅读理解能力有限。它的问题,47%可以由机器无需访问通道被猜中,而且18%是一致的人类不具有唯一正确的答案来判断。 OneStopQA提供用于阅读理解这减轻了这些缺点,并且具有显着更高的人类天花板性能替换测试集。
42. Character-Level Translation with Self-attention [PDF] 返回目录
Yingqiang Gao, Nikola I. Nikolov, Yuhuang Hu, Richard H.R. Hahnloser
Abstract: We explore the suitability of self-attention models for character-level neural machine translation. We test the standard transformer model, as well as a novel variant in which the encoder block combines information from nearby characters using convolutions. We perform extensive experiments on WMT and UN datasets, testing both bilingual and multilingual translation to English using up to three input languages (French, Spanish, and Chinese). Our transformer variant consistently outperforms the standard transformer at the character-level and converges faster while learning more robust character-level alignments.
摘要:我们探讨的自我关注模型字符级神经机器翻译的适用性。我们测试标准变压器模型,以及其中从附近的字符编码器块联合使用信息的卷积的新颖变体。我们对WMT和联合国的数据集进行大量的实验中,使用最多三个输入语言(法语,西班牙语和中国)都双语和多语种的翻译测试,以英语。我们的变压器变体的性能一直优于在字符级和更快的收敛标准的变压器,同时学习更强大的字符级路线。
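As a rough PyTorch illustration of the encoder variant described above, the layer below lets self-attention read from a local convolution over neighbouring characters; the dimensions, kernel width, and residual placement are assumptions made for the sketch, not the paper's configuration.

    import torch.nn as nn

    class ConvAugmentedEncoderLayer(nn.Module):
        # Self-attention whose queries/keys/values first pass through a 1-D
        # convolution that mixes information from nearby characters.
        def __init__(self, d_model=256, n_heads=4, kernel_size=5):
            super().__init__()
            self.conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                    nn.Linear(4 * d_model, d_model))

        def forward(self, x):                     # x: (batch, chars, d_model)
            local = self.conv(x.transpose(1, 2)).transpose(1, 2)
            h = self.norm1(x + self.attn(local, local, local)[0])
            return self.norm2(h + self.ff(h))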
43. Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT [PDF] 返回目录
Zhiyong Wu, Yun Chen, Ben Kao, Qun Liu
Abstract: By introducing a small set of additional parameters, a probe learns to solve specific linguistic tasks (e.g., dependency parsing) in a supervised manner using feature representations (e.g., contextualized embeddings). The effectiveness of such probing tasks is taken as evidence that the pre-trained model encodes linguistic knowledge. However, this approach of evaluating a language model is undermined by the uncertainty of the amount of knowledge that is learned by the probe itself. Complementary to those works, we propose a parameter-free probing technique for analyzing pre-trained language models (e.g., BERT). Our method does not require direct supervision from the probing tasks, nor do we introduce additional parameters to the probing process. Our experiments on BERT show that syntactic trees recovered from BERT using our method are significantly better than linguistically-uninformed baselines. We further feed the empirically induced dependency structures into a downstream sentiment classification task and find its improvement compatible with or even superior to a human-designed dependency schema.
摘要:通过引入小的组附加参数,探针学会解决使用特征表示(例如,情境化的嵌入)在监督方式特定语言任务(例如,依赖解析)。这样的探测任务的有效性作为证据,证明预先训练模型编码语言知识。但是,在评估一个语言模型的这种做法是由探头本身学过的知识量的不确定性削弱。为了补充这些作品中,我们提出了分析预训练的语言模型(例如,BERT)无参数探测技术。我们的方法不需要从探测任务的直接监督下,我们也不会引入额外的参数来进行探测。我们对BERT的实验表明语法树用我们的方法从BERT回收的显著优于语言,不了解情况的基线。我们进一步喂经验引起的依赖性结构到下游的情感分类的任务,找到兼容的,甚至优于人类设计的依赖性方案的改进。
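The two-pass perturbed masking described above can be sketched as follows, assuming a HuggingFace BERT checkpoint and Euclidean distance as the impact measure (both are choices made for the sketch). The resulting token-by-token impact matrix is what a spanning-tree algorithm would then turn into a syntactic tree.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased").eval()

    def impact_matrix(sentence):
        # impact[i, j]: how much additionally masking token j changes BERT's
        # representation of the (already masked) token i. Needs O(n^2) forward
        # passes, so it is only practical for short sentences.
        ids = tok(sentence, return_tensors="pt")["input_ids"][0]
        n = ids.size(0)
        impact = torch.zeros(n, n)
        for i in range(1, n - 1):                 # skip [CLS] and [SEP]
            for j in range(1, n - 1):
                if i == j:
                    continue
                one = ids.clone(); one[i] = tok.mask_token_id
                two = one.clone(); two[j] = tok.mask_token_id
                with torch.no_grad():
                    h_one = model(one.unsqueeze(0)).last_hidden_state[0, i]
                    h_two = model(two.unsqueeze(0)).last_hidden_state[0, i]
                impact[i, j] = torch.dist(h_one, h_two)
        return impact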
44. Semantic Triple Encoder for Fast Open-Set Link Prediction [PDF] 返回目录
Bo Wang, Tao Shen, Guodong Long, Tianyi Zhou, Yi Chang
Abstract: We improve both the open-set generalization and efficiency of link prediction on knowledge graphs by leveraging the contexts of entities and relations in a novel semantic triple encoder. Most previous methods, e.g., translation-based and GCN-based embedding approaches, were built upon graph embedding models. They simply treat the entities/relations as a closed set of graph nodes regardless of their context semantics, which however cannot provide critical information for the generalization to unseen entities/relations. In this paper, we partition each graph triple and develop a novel context-based encoder that separately maps each part and its context into a latent semantic space. We train this semantic triple encoder by optimizing two objectives specifically designed for link prediction. In particular, (1) We split each triple into two parts, i.e., i) head entity plus relation and ii) tail entity, process both contexts separately by a Transformer encoder, and combine the encoding outputs to derive the prediction. This Siamese-like architecture avoids the combinatorial explosion of candidate triples and significantly improves the efficiency, especially during inference; (2) We cover the contextualized semantics of the triples in the encoder so it can handle unseen entities during inference, which promisingly improves the generalization ability; (3) We train the model by optimizing two complementary objectives defined on the triple, i.e., classification and contrastive losses, for natural and reliable ranking scores during inference. In experiments, we achieve the state-of-the-art or competitive performance on three popular link prediction benchmarks. In addition, we empirically reduce the inference costs by one or two orders of magnitude compared to a recent context-based encoding approach and meanwhile keep a superior quality of prediction.
摘要:通过利用实体和关系的语境中一种新的语义三重编码器提高了开放式集合泛化和知识图表链接预测的效率两者。大多数以前的方法,例如,平移和基于GCN嵌入的方法,在图嵌入模型建造。他们只是简单地将实体/关系作为一个封闭的图形节点,无论它们的上下文的语义,但是它不能提供泛化到看不见的实体/关系的关键信息。在本文中,我们划分每个图三重和开发一种新型的基于上下文的编码器,其每一个部分和它的历境进入一个潜在语义空间分别映射。我们通过优化专为连接设计预测的两个目标培养这种语义三重编码器。特别是,(1)我们每个三元组分成两个部分,即,i)的头部实体加关系以及ii)尾实体,由变压器编码器分别处理两种情况下,和组合的编码输出到导出预测。这种连体式的建筑风格避免了候选人的三元组合爆炸和显著提高了工作效率,尤其是在推论; (2)覆盖在编码器中的三元组的情境语义所以它可以处理推论,这很有希望改善泛化能力中看不见的实体; (3)我们培养通过优化对三联,即,分类和对比损失,对于推理过程中的天然和可靠的排名分值定义的两个互补的目标模型。在实验中,我们实现了三个流行的链接预测基准的国家的最先进的或有竞争力的表现。此外,我们凭经验一个数量或两个数量级相比,最近的基于上下文的编码方法降低成本推断,同时保持预测的卓越品质。
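A minimal sketch of the two-part, Siamese-style encoding described above, assuming a HuggingFace BERT with mean pooling and a dot-product score; the paper's pooling, training objectives, and scoring are richer, so everything here is illustrative.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = AutoModel.from_pretrained("bert-base-uncased").eval()

    def embed(text):
        # Mean-pool the last hidden states as a simple text representation.
        batch = tok(text, return_tensors="pt")
        with torch.no_grad():
            hidden = enc(**batch).last_hidden_state
        return hidden.mean(dim=1).squeeze(0)

    def score(head, relation, tail):
        # Encode (head + relation) and tail separately, then combine.
        query = embed(f"{head} {tok.sep_token} {relation}")
        candidate = embed(tail)
        return torch.dot(query, candidate).item()

Because the tail side is encoded independently of the query side, candidate-entity embeddings can be computed once and reused, which is what avoids the combinatorial blow-up of scoring every full triple with a joint encoder.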
45. Conditional Augmentation for Aspect Term Extraction via Masked Sequence-to-Sequence Generation [PDF] 返回目录
Kun Li, Chengbo Chen, Xiaojun Quan, Qing Ling, Yan Song
Abstract: Aspect term extraction aims to extract aspect terms from review texts as opinion targets for sentiment analysis. One of the big challenges with this task is the lack of sufficient annotated data. While data augmentation is potentially an effective technique to address the above issue, it is uncontrollable as it may change aspect words and aspect labels unexpectedly. In this paper, we formulate the data augmentation as a conditional generation task: generating a new sentence while preserving the original opinion targets and labels. We propose a masked sequence-to-sequence method for conditional augmentation of aspect term extraction. Unlike existing augmentation approaches, ours is controllable and allows us to generate more diversified sentences. Experimental results confirm that our method alleviates the data scarcity problem significantly. It also effectively boosts the performances of several current models for aspect term extraction.
摘要:看点术语提取目的是从评论文章中提取方面而言,作为情感分析意见的目标。一个与这个任务的最大挑战是缺乏足够的注解数据。虽然数据增强是可能解决上述问题的有效的技术,它是不可控制的,因为它可能改变方面的单词和标签方面出乎意料。在本文中,我们制定了增强的数据作为条件生成任务:生成一个新的句子,同时保留原有的观点目标和标签。我们提出方面术语提取的条件扩增一个掩蔽序列到序列的方法。与现有的增强方法,我们是可控的,使我们能够产生更多样化的句子。实验结果证实了我们的方法显著减轻了数据匮乏的问题。它也有效地提升现有的几个型号的方面术语提取的性能。
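The label-preserving masking at the heart of the augmentation can be illustrated in a few lines: tokens inside aspect spans are always kept, while other tokens are masked at some rate for a seq2seq model to rewrite. The masking rate and mask token are assumptions made for the sketch.

    import random

    def mask_for_augmentation(tokens, aspect_spans, mask_token="[MASK]", p=0.5):
        # aspect_spans: list of (start, end) token indices, end exclusive.
        keep = set()
        for start, end in aspect_spans:
            keep.update(range(start, end))
        return [tok if i in keep or random.random() > p else mask_token
                for i, tok in enumerate(tokens)]

    # e.g. mask_for_augmentation("the pizza was really great".split(), [(1, 2)])
    # keeps "pizza" and masks roughly half of the remaining context words.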
46. Self-Supervised and Controlled Multi-Document Opinion Summarization [PDF] 返回目录
Hady Elsahar, Maximin Coavoux, Matthias Gallé, Jos Rozen
Abstract: We address the problem of unsupervised abstractive summarization of collections of user-generated reviews with self-supervision and control. We propose a self-supervised setup that considers an individual document as a target summary for a set of similar documents. This setting makes training simpler than previous approaches by relying only on standard log-likelihood loss. We address the problem of hallucinations through the use of control codes, to steer the generation towards more coherent and relevant summaries. Finally, we extend the Transformer architecture to allow for multiple reviews as input. Our benchmarks on two datasets against graph-based and recent neural abstractive unsupervised models show that our proposed method generates summaries with superior quality and relevance. This is confirmed in our human evaluation, which focuses explicitly on the faithfulness of generated summaries. We also provide an ablation study, which shows the importance of the control setup in controlling hallucinations and achieving high sentiment and topic alignment of the summaries with the input reviews.
摘要:解决与自我监督和管理用户生成的评论收藏监督的抽象总结的问题。我们提出了一个自我监督的设置是考虑单个文档作为一组类似的文件的目标概要。此设置使仅依赖于标准的数似然损失训练比以前的方法更简单。我们通过使用控制代码解决幻觉的问题,要引向更加连贯和相关summaries.Finally的一代,我们延长了变压器的架构,允许多个评论作为输入。我们在对两个数据集的基准曲线为基础,最近的神经抽象监督的模型表明,该方法具有卓越的品质产生总结和relevance.This在我们其中明确侧重于生成摘要的信实我们还提供消融人工评估确认研究中,这说明控制设置在控制幻觉,实现高情绪并与输入的评论摘要的话题对准的重要性。
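A small sketch of the self-supervised pairing described above: one review is held out as the pseudo-summary and the remaining reviews form the input, prefixed with a control code. The control-code string and the separator are illustrative placeholders, not the paper's formatting.

    import random

    def make_training_example(reviews, control_code="<faithful>"):
        # Self-supervision: the held-out review plays the role of the summary
        # of the other reviews of the same product.
        target = random.choice(reviews)
        sources = [r for r in reviews if r is not target]
        return {"input": control_code + " " + " </s> ".join(sources),
                "target": target}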
47. Named Entity Recognition without Labelled Data: A Weak Supervision Approach [PDF] 返回目录
Pierre Lison, Aliaksandr Hubin, Jeremy Barnes, Samia Touileb
Abstract: Named Entity Recognition (NER) performance often degrades rapidly when applied to target domains that differ from the texts observed during training. When in-domain labelled data is available, transfer learning techniques can be used to adapt existing NER models to the target domain. But what should one do when there is no hand-labelled data for the target domain? This paper presents a simple but powerful approach to learn NER models in the absence of labelled data through weak supervision. The approach relies on a broad spectrum of labelling functions to automatically annotate texts from the target domain. These annotations are then merged together using a hidden Markov model which captures the varying accuracies and confusions of the labelling functions. A sequence labelling model can finally be trained on the basis of this unified annotation. We evaluate the approach on two English datasets (CoNLL 2003 and news articles from Reuters and Bloomberg) and demonstrate an improvement of about 7 percentage points in entity-level $F_1$ scores compared to an out-of-domain neural NER model.
摘要:当施加到从训练期间观察到的文本不同目标域命名实体识别(NER)性能往往迅速降解。当域标记数据是可用的,传递学习技术可用于现有NER模型适应目标域。但是,应该在有目标域没有手标记的数据怎么样呢?本文提出了一种简单但功能强大的方法来学习NER模型在通过监管不力的情况下的标签数据。该方法依赖于标记的功能,以从目标域中自动注释文本的广谱。这些注释然后合并在一起使用,其捕获的标记功能的不同的精度和混乱的隐马尔可夫模型。序列标签模型能够最终这个统一标注的基础上训练。我们评估两个数据集英语(CoNLL 2003年从路透社和彭博新闻文章)的方法,并展示约7个百分点,在实体层面$ F_1 $得分的改善相比,域名外的一个神经NER模型。
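The pipeline above can be sketched with two toy labelling functions and an aggregation step. For brevity the aggregation is a majority vote; the paper instead fits a hidden Markov model that learns each labelling function's accuracies and confusions, so the vote is only a stand-in for that step.

    from collections import Counter

    ORG_GAZETTEER = {"Reuters": "ORG", "Bloomberg": "ORG"}   # toy gazetteer

    def lf_gazetteer(tokens):
        return [ORG_GAZETTEER.get(t, "O") for t in tokens]

    def lf_capitalised(tokens):
        # Crude heuristic: non-initial capitalised tokens are marked as entities.
        return ["MISC" if i > 0 and t[:1].isupper() else "O"
                for i, t in enumerate(tokens)]

    def aggregate(tokens, labelling_functions):
        votes = [lf(tokens) for lf in labelling_functions]
        merged = []
        for i in range(len(tokens)):
            counts = Counter(v[i] for v in votes if v[i] != "O")
            merged.append(counts.most_common(1)[0][0] if counts else "O")
        return merged

    # aggregate("He writes for Reuters .".split(), [lf_gazetteer, lf_capitalised])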
48. Towards Unsupervised Language Understanding and Generation by Joint Dual Learning [PDF] 返回目录
Shang-Yu Su, Chao-Wei Huang, Yun-Nung Chen
Abstract: In modular dialogue systems, natural language understanding (NLU) and natural language generation (NLG) are two critical components, where NLU extracts the semantics from the given texts and NLG is to construct corresponding natural language sentences based on the input semantic representations. However, the dual property between understanding and generation has been rarely explored. The prior work is the first attempt that utilized the duality between NLU and NLG to improve the performance via a dual supervised learning framework. However, the prior work still learned both components in a supervised manner, instead, this paper introduces a general learning framework to effectively exploit such duality, providing flexibility of incorporating both supervised and unsupervised learning algorithms to train language understanding and generation models in a joint fashion. The benchmark experiments demonstrate that the proposed approach is capable of boosting the performance of both NLU and NLG.
摘要:在模块化的对话系统,自然语言理解(NLU)和自然语言生成(NLG)是两个重要的组成部分,其中NLU从给定文本中提取语义和NLG是构建基于输入的语义表示相应的自然语言中的句子。然而,理解和生成之间的双重属性已很少探讨。现有的工作是利用NLU和NLG之间的二元性来改善通过双监督学习框架的性能在第一次尝试。然而,在现有的工作仍处于监督的方式学会了这两个组件,相反,本文介绍了一种通用的学习框架,以有效地利用这种双重性,在合资的方式,提供包含两种监督和无监督的学习算法来训练语言理解和生成模式的灵活性。基准实验表明,该方法能够同时提升NLU和NLG的性能。
49. A Span-based Linearization for Constituent Trees [PDF] 返回目录
Yang Wei, Yuanbin Wu, Man Lan
Abstract: We propose a novel linearization of a constituent tree, together with a new locally normalized model. For each split point in a sentence, our model computes the normalizer on all spans ending with that split point, and then predicts a tree span from them. Compared with global models, our model is fast and parallelizable. Different from previous local models, our linearization method is tied on the spans directly and considers more local features when performing span prediction, which is more interpretable and effective. Experiments on PTB (95.8 F1) and CTB (92.4 F1) show that our model significantly outperforms existing local models and efficiently achieves competitive results with global models.
摘要:我们提出一个组成部分树的新型线性化,再加上新的本地标准化模式。对于在句子中的每个分割点,我们的模型计算上与分割点结束所有span正规化,然后从他们预测树跨度。全球车型相比,我们的模式是快速和并行。从以前的局部模型不同的是,我们的线性化方法绑在直接的跨度和执行寿命预测,哪个更可解释的和有效的,当考虑更多的地方特色。在PTB(95.8 F1)和CTB实验(92.4 F1)表明我们的模型显著优于现有的局部模型,有效地实现了与全球模型竞赛成绩。
50. Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders [PDF] 返回目录
Yanbin Zhao, Lu Chen, Zhi Chen, Kai Yu
Abstract: Text simplification (TS) rephrases long sentences into simplified variants while preserving inherent semantics. Traditional sequence-to-sequence models heavily rely on the quantity and quality of parallel sentences, which limits their applicability in different languages and domains. This work investigates how to leverage large amounts of unpaired corpora in TS task. We adopt the back-translation architecture in unsupervised machine translation (NMT), including denoising autoencoders for language modeling and automatic generation of parallel data by iterative back-translation. However, it is non-trivial to generate appropriate complex-simple pair if we directly treat the set of simple and complex corpora as two different languages, since the two types of sentences are quite similar and it is hard for the model to capture the characteristics in different types of sentences. To tackle this problem, we propose asymmetric denoising methods for sentences with separate complexity. When modeling simple and complex sentences with autoencoders, we introduce different types of noise into the training process. Such a method can significantly improve the simplification performance. Our model can be trained in both unsupervised and semi-supervised manner. Automatic and human evaluations show that our unsupervised model outperforms the previous systems, and with limited supervision, our model can perform competitively with multiple state-of-the-art simplification systems.
摘要:文本简化(TS)rephrases长句子转换为简化的变体,同时保持固有的语义。传统的顺序对序列模型在很大程度上依赖于并行语句的数量和质量,这限制了其应用在不同的语言和领域。这项工作探讨如何利用大量的TS任务不成对的语料库。我们采用无监督机器翻译(NMT)的回译的架构,包括语言模型和自动生成的迭代回译并行数据的降噪自动编码。然而,这是不平凡的,生成适当的复杂,简单的对,如果我们直接治疗组简单和复杂的语料库作为两个不同的语言,因为这两种类型的语句很相似,这是很难的模型捕捉特性在不同类型的语句。为了解决这个问题,我们提出了具有独立句子的复杂性不对称去噪方法。当造型与自动编码简单和复杂的句子,我们推出不同类型的噪声进入训练过程。这种方法可以提高显著简化性能。我们的模型可以同时在无监督和半监督的方式进行培训。自动和人的评估表明,我们的无监督模型优于以前的系统,并与有限的监督,我们的模型可以与国家的最先进的多重简化系统进行竞争。
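The asymmetric-noise idea can be illustrated with two corruption functions, one per complexity side; the specific noise types (dropout plus local shuffling versus substitution) and their rates are assumptions made for the sketch, not the paper's settings.

    import random

    def noise_simple(tokens, drop_p=0.1, window=3):
        # Noise for simple sentences: light word dropout plus local reordering.
        kept = [t for t in tokens if random.random() > drop_p]
        out = list(kept)
        for i in range(len(out)):
            j = min(len(out) - 1, i + random.randint(0, window - 1))
            out[i], out[j] = out[j], out[i]
        return out

    def noise_complex(tokens, sub_p=0.1, placeholder="<unk>"):
        # Noise for complex sentences: substitute tokens rather than drop them.
        return [placeholder if random.random() < sub_p else t for t in tokens]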
51. AMPERSAND: Argument Mining for PERSuAsive oNline Discussions [PDF] 返回目录
Tuhin Chakrabarty, Christopher Hidey, Smaranda Muresan, Kathy Mckeown, Alyssa Hwang
Abstract: Argumentation is a type of discourse where speakers try to persuade their audience about the reasonableness of a claim by presenting supportive arguments. Most work in argument mining has focused on modeling arguments in monologues. We propose a computational model for argument mining in online persuasive discussion forums that brings together the micro-level (argument as product) and macro-level (argument as process) models of argumentation. Fundamentally, this approach relies on identifying relations between components of arguments in a discussion thread. Our approach for relation prediction uses contextual information in terms of fine-tuning a pre-trained language model and leveraging discourse relations based on Rhetorical Structure Theory. We additionally propose a candidate selection method to automatically predict what parts of one's argument will be targeted by other participants in the discussion. Our models obtain significant improvements compared to recent state-of-the-art approaches using pointer networks and a pre-trained language model.
摘要:论证是一种话语,其中扬声器试图说服他们的观众有关要求由提出支持参数的合理性。在争论中挖掘多数工作都集中在独白模型参数。我们提出了论点开采网上有说服力的论坛,汇集了微观层面(参数作为产品)和宏观层面(参数作为工艺)论证模型的计算模型。从根本上说,这种方法依赖于识别一个话题的争论组件之间的关系。我们的关系预测方法使用在微调预训练的语言模型,并利用基于修辞结构理论的话语关系方面的上下文信息。我们还提出一名候选人,选择方法自动预测什么人的说法的部分将被讨论的其他参与者为目标。相比于最近的国家的最先进的办法使用指针网络和预先训练的语言模型我们的模型获得显著的改善。
52. End-to-End Neural Word Alignment Outperforms GIZA++ [PDF] 返回目录
Thomas Zenkel, Joern Wuebker, John DeNero
Abstract: Word alignment was once a core unsupervised learning task in natural language processing because of its essential role in training statistical machine translation (MT) models. Although unnecessary for training neural MT models, word alignment still plays an important role in interactive applications of neural machine translation, such as annotation transfer and lexicon injection. While statistical MT methods have been replaced by neural approaches with superior performance, the twenty-year-old GIZA++ toolkit remains a key component of state-of-the-art word alignment systems. Prior work on neural word alignment has only been able to outperform GIZA++ by using its output during training. We present the first end-to-end neural word alignment method that consistently outperforms GIZA++ on three data sets. Our approach repurposes a Transformer model trained for supervised translation to also serve as an unsupervised word alignment model in a manner that is tightly integrated and does not affect translation quality.
摘要:字对齐是因为在训练统计机器翻译(MT)车型的重要作用,一旦在自然语言处理的核心无监督的学习任务。虽然不需要训练神经MT车型,字对齐仍然起着神经机器翻译的互动应用,如注释转移和词汇注入了重要的作用。虽然统计方法MT已被替换性能优越的神经途径,二十岁GIZA ++工具包仍的国家的最先进的字对齐系统的重要组成部分。神经字对齐以前的工作了只能通过训练期间使用其输出跑赢GIZA ++。我们目前的第一端至端神经字对齐方法,始终优于GIZA ++在三个数据集。我们的方法repurposes训练监督翻译也充当一个方法的一个无人监管的词对齐模型是紧密集成,并且不影响翻译质量Transformer模型。
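For intuition only, the simplest attention-based alignment extraction links each target token to its highest-weighted source token in a cross-attention matrix. The paper's end-to-end model goes well beyond this argmax baseline, so the sketch just illustrates the underlying idea.

    import numpy as np

    def align_from_attention(attn, src_tokens, tgt_tokens):
        # attn: cross-attention weights of shape (len(tgt_tokens), len(src_tokens)).
        links = []
        for t, row in enumerate(attn):
            s = int(np.argmax(row))
            links.append((src_tokens[s], tgt_tokens[t]))
        return links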
53. NUBIA: NeUral Based Interchangeability Assessor for Text Generation [PDF] 返回目录
Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, Mohamed Coulibali
Abstract: We present NUBIA, a methodology to build automatic evaluation metrics for text generation using only machine learning models as core components. A typical NUBIA model is composed of three modules: a neural feature extractor, an aggregator and a calibrator. We demonstrate an implementation of NUBIA which outperforms metrics currently used to evaluate machine translation, summaries and slightly exceeds/matches state of the art metrics on correlation with human judgement on the WMT segment-level Direct Assessment task, sentence-level ranking and image captioning evaluation. The model implemented is modular, explainable and set to continuously improve over time.
摘要:我们目前NUBIA,打造只使用机器学习模型作为核心部件的文本生成自动评估指标的方法。典型NUBIA模型由三个模块:一个神经特征提取器,聚合器和一个校准器。我们证明NUBIA的实现,其性能优于目前用来评估机器翻译,汇总指标,并略微超过/与上WMT段级直接评价任务人的判断相关性匹配技术指标的状态,语句级的排名和图像字幕评估。实施该模型是模块化的,可以解释并设置为随时间连续地提高。
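A toy version of the three-module design described above, with assumed feature names: pretrained models supply per-example features (say, an entailment probability and a language-model perplexity), an aggregator regresses human scores on them, and a calibrator squashes the aggregator's output into [0, 1].

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def train_aggregator(features, human_scores):
        # features: one row of neural-extractor outputs per (candidate, reference) pair.
        return LinearRegression().fit(np.asarray(features), np.asarray(human_scores))

    def calibrate(raw_scores):
        # Map unbounded regression outputs to a bounded interchangeability score.
        return 1.0 / (1.0 + np.exp(-np.asarray(raw_scores)))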
54. Capsule-Transformer for Neural Machine Translation [PDF] 返回目录
Sufeng Duan, Juncheng Cao, Hai Zhao
Abstract: The Transformer benefits hugely from its key design, the multi-head self-attention network (SAN), which extracts information from various perspectives by transforming the given input into different subspaces. However, its simple linear-transformation aggregation strategy may still fail to fully capture deeper contextualized information. In this paper, we thus propose the capsule-Transformer, which extends the linear transformation into a more general capsule routing algorithm by treating the SAN as a special case of a capsule network. The resulting capsule-Transformer is thus capable of obtaining a better attention-distribution representation of the input sequence via information aggregation among different heads and words. Specifically, we view groups of attention weights in the SAN as low-layer capsules. By applying an iterative capsule routing algorithm, they can be further aggregated into high-layer capsules that contain deeper contextualized information. Experimental results on widely used machine translation datasets show that our proposed capsule-Transformer significantly outperforms a strong Transformer baseline.
摘要:变压器从巨大的多头自重视网络的关键设计(SAN),它通过转换给定的输入到不同的子空间中提取从不同的角度的信息中受益。然而,其简单的线性变换聚合策略可能仍可能无法完全捕捉更深层次的情境信息。在本文中,我们提出这样的胶囊型变压器,它扩展了线性变换成一个更一般的胶囊通过取SAN作为胶囊网络的一个特例路由算法。这样所得到的胶囊变压器能够通过不同的头和文字之间的信息聚合获得输入序列的一个更好的注意分布的代表性。具体而言,我们看到关注体重的组在SAN作为低层胶囊。通过应用迭代胶囊路由算法可以将它们进一步聚集成含有更深情境信息高层胶囊。在广泛使用的机器翻译数据集实验结果表明我们提出的胶囊变压器显著优于强变压器基线。
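To make the capsule view of the SAN concrete, here is a generic dynamic routing-by-agreement step over per-head outputs; this is standard capsule routing rather than necessarily the paper's exact scheme, and the shapes and iteration count are assumptions.

    import torch
    import torch.nn.functional as F

    def squash(s, eps=1e-8):
        norm2 = (s * s).sum(-1, keepdim=True)
        return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

    def route_heads(head_outputs, n_iters=3):
        # head_outputs: (n_heads, d_model) low-layer capsules for one position;
        # returns a single high-layer capsule aggregating the heads.
        b = torch.zeros(head_outputs.size(0))
        for _ in range(n_iters):
            c = F.softmax(b, dim=0)                          # coupling coefficients
            v = squash((c.unsqueeze(1) * head_outputs).sum(0))
            b = b + head_outputs @ v                         # agreement update
        return v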
55. Robust Question Answering Through Sub-part Alignment [PDF] 返回目录
Jifan Chen, Greg Durrett
Abstract: Current textual question answering models achieve strong performance on in-domain test sets, but often do so by fitting surface-level patterns in the data, so they fail to generalize to out-of-distribution and adversarial settings. To make a more robust and understandable QA system, we model question answering as an alignment problem. We decompose both the question and context into smaller units based on off-the-shelf semantic representations (here, semantic roles), and solve a subgraph alignment problem to find a part of the context which matches the question. Our model uses BERT to compute alignment scores, and by using structured SVM, we can train end-to-end despite complex inference. Our explicit use of alignments allows us to explore a set of constraints with which we can prohibit certain types of bad behaviors which arise in cross-domain settings. Furthermore, by investigating differences in scores across different potential answers, we can seek to understand what particular aspects of the input led the model to choose the answer it did without relying on "local" post-hoc explanation techniques. We train our model on SQuAD v1.1 and test it in several adversarial and out-of-domain datasets. The results show that our model is more robust cross-domain than the standard BERT QA model, and constraints derived from alignment scores allow us to effectively trade off coverage and accuracy.
摘要:当前文本答疑模型实现对域测试集强劲的性能,但往往通过在数据拟合表面层次的模式做到这一点,所以他们不能推广到外的分布和对抗性的设置。为了让一个更强大和易于理解的质量保证体系,我们的模型答疑作为对齐问题。我们既分解的问题和背景到基于现成的现成的语义表示(在这里,语义角色)更小的单位,并解决子对齐问题找到该问题匹配的上下文的一部分。我们的模型使用BERT来计算对准得分,并利用结构化的SVM,我们可以尽管复杂的推理训练结束到终端。我们明确的使用路线使我们能够探索出一套,使我们可以禁止某些类型中出现的跨域设置不良行为的约束。此外,通过调查在不同可能的答案分数的差异,我们可以设法了解什么输入特定方面导致模型来选择它确实不依靠“本地”事后解释技术问题的答案。我们培训我们的阵容v1.1和测试模型在几个对抗性和外的域数据集。结果表明,我们的模型是更强大的跨域比标准BERT QA模型,并从排列得分得出的约束使我们能够有效地权衡覆盖率和准确性。
56. CohEval: Benchmarking Coherence Models [PDF] 返回目录
Tasnim Mohiuddin, Prathyusha Jwalapuram, Xiang Lin, Shafiq Joty
Abstract: Although coherence modeling has come a long way in developing novel models, their evaluation on downstream applications has largely been neglected. With the advancements made by neural approaches in applications such as machine translation, text summarization and dialogue systems, the need for standard coherence evaluation is now more crucial than ever. In this paper, we propose to benchmark coherence models on a number of synthetic and downstream tasks. In particular, we evaluate well-known traditional and neural coherence models on sentence ordering tasks, and also on three downstream applications including coherence evaluation for machine translation, summarization and next utterance prediction. We also show model produced rankings for pre-trained language model outputs as another use-case. Our results demonstrate a weak correlation between the model performances in the synthetic tasks and the downstream applications, motivating alternate evaluation methods for coherence models. This work has led us to create a leaderboard to foster further research in coherence modeling.
摘要:虽然一致性模型已经在开发新车型很长的路要走,其对下游应用的评价已经在很大程度上被忽视了。通过在应用程序,如机器翻译,文本摘要和对话系统神经的方法取得的进展,需要标准的一致性评价是现在比以往任何时候都更加重要。在本文中,我们提出基准一致性模型的若干合成和下游任务。特别是,我们评估对句子排序任务著名的传统和神经的一致性模型,并在三个下游应用,包括机器翻译,总结和一个发声预测的一致性评价。我们还显示预先训练语言模型输出作为另一用例模型产生的排名。我们的研究结果表明在合成任务模型表演和下游应用,激发了一致性模型的替代评价方法之间的相关性较弱。这项工作使我们创造一个排行榜,以促进在一致性模型的进一步研究。
57. Modular Representation Underlies Systematic Generalization in Neural Natural Language Inference Models [PDF] 返回目录
Atticus Geiger, Kyle Richardson, Christopher Potts
Abstract: In adversarial (challenge) testing, we pose hard generalization tasks in order to gain insights into the solutions found by our models. What properties must a system have in order to succeed at these hard tasks? In this paper, we argue that an essential factor is the ability to form modular representations. Our central contribution is a definition of what it means for a representation to be modular and an experimental method for assessing the extent to which a system's solution is modular in this general sense. Our work is grounded empirically in a new challenge Natural Language Inference dataset designed to assess systems on their ability to reason about entailment and negation. We find that a BERT model with fine-tuning is strikingly successful at the hard generalization tasks we pose using this dataset, and our active manipulations help us to understand why: despite the densely interconnected nature of the BERT architecture, the learned model embeds modular, general theories of lexical entailment relations.
摘要:在对抗(挑战)测试中,我们提出,为了深入了解该解决方案的发现我们的模型很难推广任务。什么样的属性必须在系统中有为了在这些硬任务成功吗?在本文中,我们认为,一个重要因素是形成模块化交涉的能力。我们的中央贡献是意味着什么的表示是模块化的,并用于评估的程度的实验方法,以一个系统的解决方案是在此一般的意义上模块化的定义。我们的工作是一个新的挑战自然语言推理数据集设计,以评估他们的能力,推理蕴涵和否定制度经验为基础的。我们发现,与微调一个BERT模式是我们提出使用该数据集的硬推广任务惊人的成功,以及我们活跃的操作帮助我们理解为什么:尽管BERT建筑密集的相互关联性,在学习模型嵌入模块,词汇蕴涵关系的一般理论。
58. Universal Dependencies according to BERT: both more specific and more general [PDF] 返回目录
Tomasz Limisiewicz, Rudolf Rosa, David Mareček
Abstract: This work focuses on analyzing the form and extent of syntactic abstraction captured by BERT by extracting labeled dependency trees from self-attentions. Previous work showed that individual BERT heads tend to encode particular dependency relation types. We extend these findings by explicitly comparing BERT relations to Universal Dependencies (UD) annotations, showing that they often do not match one-to-one. We suggest a method for relation identification and syntactic tree construction. Our approach produces significantly more consistent dependency trees than previous work, showing that it better explains the syntactic abstractions in BERT. At the same time, it can be successfully applied with only a minimal amount of supervision and generalizes well across languages.
摘要:今年工作重点放在通过自解压的关注标记的依赖树分析的形式,并通过BERT捕捉语法抽象的程度。以前的工作表明,个别BERT头倾向于特定的编码依赖关系类型。我们通过明确BERT关系比较通用的依赖关系(UD)的注释,表明他们往往不匹配一到一个扩展了这些发现。我们建议对相关标识和语法树构造的方法。我们的方法比以前的工作显著更一致的依赖树,显示其更好地解释了BERT句法抽象。同时,它可以成功地应用,只有监督的最小量和跨语言概括很好。
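As a simplified illustration of turning self-attention into a tree, the sketch below symmetrizes an averaged attention matrix and extracts an undirected maximum spanning tree; recovering directed, labeled Universal Dependencies relations would require the small amount of supervision discussed above, so this is only the unsupervised baseline idea.

    import networkx as nx

    def tree_from_attention(attn, tokens):
        # attn: (n, n) attention weights averaged over selected heads/layers,
        # aligned with `tokens`; returns unlabeled, undirected tree edges.
        sym = (attn + attn.T) / 2.0
        g = nx.Graph()
        n = len(tokens)
        for i in range(n):
            for j in range(i + 1, n):
                g.add_edge(i, j, weight=float(sym[i, j]))
        mst = nx.maximum_spanning_tree(g)
        return [(tokens[i], tokens[j]) for i, j in mst.edges()]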
59. Unsupervised Injection of Knowledge into Dialogue Generation via Language Models [PDF] 返回目录
Yi-Lin Tuan, Wei Wei, William Yang Wang
Abstract: Neural conversation models have shown the power to produce more meaningful and engaging responses given external knowledge. Specifically, the knowledge we experiment on is in textual form, for example, a personality description. Despite the success of training and testing with external knowledge, in reality, we do not always have sufficient background knowledge about the discussed topic. Therefore, it is also crucial to have the models generate captivating responses without external knowledge. To achieve this, we propose a unified training method, Decoupling, which induces a knowledge-related sentence and couples it with the dialogue history to generate a response in an unsupervised fashion. Its effect is further analyzed by testing the models with no knowledge, partial and full text of the knowledge. Empirically, we observed that the variance of the performance given different amounts of knowledge is significant. Also, our method performs more closely to the supervised method (the upper bound) than the baselines.
摘要:神经会话模型表明电力生产给定的外部知识更有意义和吸引力回应。具体来说,我们尝试对知识以文本形式,例如,有个性的描述。尽管培训和外部知识测试,在现实中取得成功,我们并不总是有关于讨论的话题足够的背景知识。因此,它也是关键是有模型生成,无需外部知识迷人的响应。为了实现这一目标,我们提出了一个统一的训练方法,去耦,这导致一个知识相关的句子和夫妇将其与历史对话,以产生一个无监督形式的响应。它的作用是通过与知识没有知识,局部和全文测试模型进一步分析。根据经验,我们认为,给予不同数额的知识表现的差异是显著。此外,我们的方法进行更紧密的监督方法(上限),比基线。
60. Can Your Context-Aware MT System Pass the DiP Benchmark Tests? : Evaluation Benchmarks for Discourse Phenomena in Machine Translation [PDF] 返回目录
Prathyusha Jwalapuram, Barbara Rychalska, Shafiq Joty, Dominika Basaj
Abstract: Despite increasing instances of machine translation (MT) systems including contextual information, the evidence for translation quality improvement is sparse, especially for discourse phenomena. Popular metrics like BLEU are not expressive or sensitive enough to capture quality improvements or drops that are minor in size but significant in perception. We introduce the first of their kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena: anaphora, lexical consistency, coherence and readability, and discourse connective translation. We also introduce evaluation methods for these tasks, and evaluate several baseline MT systems on the curated datasets. Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
61. Look at the First Sentence: Position Bias in Question Answering [PDF] 返回目录
Miyoung Ko, Jinhyuk Lee, Hyunjae Kim, Gangwoo Kim, Jaewoo Kang
Abstract: Many extractive question answering models are trained to predict start and end positions of answers. The choice of predicting answers as positions is mainly due to its simplicity and effectiveness. In this study, we hypothesize that when the distribution of the answer positions is highly skewed in the training set (e.g., answers lie only in the k-th sentence of each passage), QA models predicting answers as positions learn spurious positional cues and fail to give answers in different positions. We first illustrate this position bias in popular extractive QA models such as BiDAF and BERT and thoroughly examine how position bias propagates through each layer of BERT. To safely deliver position information without position bias, we train models with various de-biasing methods including entropy regularization and bias ensembling. Among them, we found that using the prior distribution of answer positions as a bias model is very effective at reducing position bias, recovering the performance of BERT from 35.24% to 81.17% when trained on a biased SQuAD dataset.
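The bias-ensembling idea above is easy to illustrate: combine the QA model's position logits with a log prior over answer positions during training, so the main model gets no credit for what a position-only bias model already predicts. The sketch below is only an illustration of a bias-product ensemble under assumed tensor shapes; `position_prior` (taken here as the empirical distribution of answer starts in the training set) and the model interface are assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def bias_product_start_loss(start_logits, gold_start, position_prior, eps=1e-12):
    """Bias-product ensembling for the answer-start distribution.

    start_logits:   (batch, seq_len) start logits from the main QA model
    gold_start:     (batch,) gold answer start indices
    position_prior: (seq_len,) empirical prior over answer start positions,
                    estimated from the (biased) training set; this is the bias model
    """
    log_prior = torch.log(position_prior + eps)                  # (seq_len,)
    combined = F.log_softmax(start_logits, dim=-1) + log_prior   # broadcasts over batch
    # Training on the combined scores means the main model is not rewarded for
    # re-learning what the position-only bias model already explains.
    return F.nll_loss(F.log_softmax(combined, dim=-1), gold_start)

# At test time only `start_logits` from the main model are used.
```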
62. Pretraining on Non-linguistic Structure as a Tool for Analyzing Learning Bias in Language Models [PDF] 返回目录
Isabel Papadimitriou, Dan Jurafsky
Abstract: We propose a novel methodology for analyzing the encoding of grammatical structure in neural language models through transfer learning. We test how a language model can leverage its internal representations to transfer knowledge across languages and symbol systems. We train LSTMs on non-linguistic, structured data and test their performance on human language to assess which kinds of data induce generalizable encodings that LSTMs can use for natural language. We find that models trained on structured data such as music and Java code have internal representations that help in modelling human language, and that, surprisingly, adding minimal amounts of structure to the training data makes a large difference in transfer to natural language. Further experiments on transfer between human languages show that zero-shot performance on a test language is highly correlated with syntactic similarity to the training language, even after removing any vocabulary overlap. This suggests that the internal representations induced from natural languages are typologically coherent: they encode the features and differences outlined in typological studies. Our results provide insights into how neural networks represent linguistic structure, and also about the kinds of structural biases that give learners the ability to model language.
63. EnsembleGAN: Adversarial Learning for Retrieval-Generation Ensemble Model on Short-Text Conversation [PDF] 返回目录
Jiayi Zhang, Chongyang Tao, Zhenjing Xu, Qiaojing Xie, Wei Chen, Rui Yan
Abstract: Generating qualitative responses has always been a challenge for human-computer dialogue systems. Existing dialogue systems generally derive from either retrieval-based or generative-based approaches, both of which have their own pros and cons. Despite the natural idea of an ensemble model of the two, existing ensemble methods only focused on leveraging one approach to enhance another, we argue however that they can be further mutually enhanced with a proper training strategy. In this paper, we propose ensembleGAN, an adversarial learning framework for enhancing a retrieval-generation ensemble model in open-domain conversation scenario. It consists of a language-model-like generator, a ranker generator, and one ranker discriminator. Aiming at generating responses that approximate the ground-truth and receive high ranking scores from the discriminator, the two generators learn to generate improved highly relevant responses and competitive unobserved candidates respectively, while the discriminative ranker is trained to identify true responses from adversarial ones, thus featuring the merits of both generator counterparts. The experimental results on a large short-text conversation data demonstrate the effectiveness of the ensembleGAN by the amelioration on both human and automatic evaluation metrics.
64. Improved Natural Language Generation via Loss Truncation [PDF] 返回目录
Daniel Kang, Tatsunori Hashimoto
Abstract: Neural language models are usually trained to match the distributional properties of a large-scale corpus by minimizing the log loss. While straightforward to optimize, this approach forces the model to reproduce all variations in the dataset, including noisy and invalid references (e.g., misannotation and hallucinated facts). Worse, the commonly used log loss is overly sensitive to such phenomena and even a small fraction of noisy data can degrade performance. In this work, we show that the distinguishability of the models and reference serves as a principled and robust alternative for handling invalid references. To optimize distinguishability, we propose loss truncation, which adaptively removes high loss examples during training. We show this is as easy to optimize as log loss and tightly bounds distinguishability under noise. Empirically, we demonstrate that loss truncation outperforms existing baselines on distinguishability on a summarization task, and show that samples generated by the loss truncation model have factual accuracy ratings that exceed those of baselines and match human references.
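Loss truncation reduces to a few lines: compute a per-example loss, estimate a quantile threshold, and drop the highest-loss examples in each batch. The sketch below is a static, per-batch version of the idea, assuming a token-level language-modeling loss averaged per example and a fixed drop fraction; it is an illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def truncated_loss(logits, targets, drop_frac=0.1, ignore_index=-100):
    """Loss truncation: ignore the highest-loss examples in the batch.

    logits:  (batch, seq_len, vocab)
    targets: (batch, seq_len)
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets,
        ignore_index=ignore_index, reduction="none")          # (batch, seq_len)
    mask = (targets != ignore_index).float()
    per_example = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    # Keep the (1 - drop_frac) fraction of examples with the lowest loss.
    cutoff = torch.quantile(per_example, 1.0 - drop_frac)
    keep = (per_example <= cutoff).float()
    return (per_example * keep).sum() / keep.sum().clamp(min=1)
```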
65. Logic2Text: High-Fidelity Natural Language Generation from Logical Forms [PDF] 返回目录
Zhiyu Chen, Wenhu Chen, Hanwen Zha, Xiyou Zhou, Yunkai Zhang, Sairam Sundaresan, William Yang Wang
Abstract: Previous works on Natural Language Generation (NLG) from structured data have primarily focused on surface-level descriptions of record sequences. However, for complex structured data, e.g., multi-row tables, it is often desirable for an NLG system to describe interesting facts from logical inferences across records. If only provided with the table, it is hard for existing models to produce controllable and high-fidelity logical generations. In this work, we formulate logical level NLG as generation from logical forms in order to obtain controllable, high-fidelity, and faithful generations. We present a new large-scale dataset, \textsc{Logic2Text}, with 10,753 descriptions involving common logic types paired with the underlying logical forms. The logical forms show diversified graph structure of free schema, which poses great challenges on the model's ability to understand the semantics. We experiment on (1) Fully-supervised training with the full datasets, and (2) Few-shot setting, provided with hundreds of paired examples; We compare several popular generation models and analyze their performances. We hope our dataset can encourage research towards building an advanced NLG system capable of natural, faithful, and human-like generation. The dataset and code are available at \url{this https URL}.
66. Exploring Contextualized Neural Language Models for Temporal Dependency Parsing [PDF] 返回目录
Hayley Ross, Jonathan Cai, Bonan Min
Abstract: Extracting temporal relations between events and time expressions has many applications such as constructing event timelines and time-related question answering. It is a challenging problem that requires syntactic and semantic information at sentence or discourse levels, which may be captured by deep language models such as BERT (Devlin et al., 2019). In this paper, we develop several variants of a BERT-based temporal dependency parser and show that BERT significantly improves temporal dependency parsing (Zhang and Xue, 2018a). Source code and trained models will be made available at this http URL.
67. memeBot: Towards Automatic Image Meme Generation [PDF] 返回目录
Aadhavan Sadasivam, Kausic Gunasekar, Hasan Davulcu, Yezhou Yang
Abstract: Image memes have become a widespread tool used by people for interacting and exchanging ideas over social media, blogs, and open messengers. This work proposes to treat automatic image meme generation as a translation process, and further present an end to end neural and probabilistic approach to generate an image-based meme for any given sentence using an encoder-decoder architecture. For a given input sentence, an image meme is generated by combining a meme template image and a text caption where the meme template image is selected from a set of popular candidates using a selection module, and the meme caption is generated by an encoder-decoder model. An encoder is used to map the selected meme template and the input sentence into a meme embedding and a decoder is used to decode the meme caption from the meme embedding. The generated natural language meme caption is conditioned on the input sentence and the selected meme template. The model learns the dependencies between the meme captions and the meme template images and generates new memes using the learned dependencies. The quality of the generated captions and the generated memes is evaluated through both automated and human evaluation. An experiment is designed to score how well the generated memes can represent the tweets from Twitter conversations. Experiments on Twitter data show the efficacy of the model in generating memes for sentences in online social interaction.
68. Boosting Naturalness of Language in Task-oriented Dialogues via Adversarial Training [PDF] 返回目录
Chenguang Zhu
Abstract: The natural language generation (NLG) module in a task-oriented dialogue system produces user-facing utterances conveying required information. Thus, it is critical for the generated response to be natural and fluent. We propose to integrate adversarial training to produce more human-like responses. The model uses Straight-Through Gumbel-Softmax estimator for gradient computation. We also propose a two-stage training scheme to boost performance. Empirical results show that the adversarial training can effectively improve the quality of language generation in both automatic and human evaluations. For example, in the RNN-LG Restaurant dataset, our model AdvNLG outperforms the previous state-of-the-art result by 3.6% in BLEU.
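The Straight-Through Gumbel-Softmax estimator mentioned here is what lets discriminator gradients reach a discrete text generator: the forward pass uses near one-hot samples while the backward pass uses the differentiable soft relaxation. A minimal sketch, in which the shared embedding matrix and the discriminator interface are assumptions of the illustration rather than details from the paper:

```python
import torch.nn.functional as F

def soft_embed(gen_logits, embedding_weight, tau=1.0):
    """Straight-Through Gumbel-Softmax over the generator's vocabulary logits.

    gen_logits:       (batch, seq_len, vocab) generator output logits
    embedding_weight: (vocab, emb_dim) embedding matrix assumed to be shared
                      with the discriminator in this sketch
    """
    # hard=True: one-hot samples in the forward pass, soft gradients backward.
    y = F.gumbel_softmax(gen_logits, tau=tau, hard=True)   # (batch, seq, vocab)
    return y @ embedding_weight                            # (batch, seq, emb_dim)

# Generator update (sketch): push the discriminator's score up on generated text.
# d_score = discriminator(soft_embed(gen_logits, embedding.weight))  # hypothetical
# g_loss = -d_score.mean()
```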
69. Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing [PDF] 返回目录
Brian Thompson, Matt Post
Abstract: We propose the use of a sequence-to-sequence paraphraser for automatic machine translation evaluation. The paraphraser takes a human reference as input and then force-decodes and scores an MT system output. We propose training the aforementioned paraphraser as a multilingual NMT system, treating paraphrasing as a zero-shot "language pair" (e.g., Russian to Russian). We denote our paraphraser "unbiased" because the mode of our model's output probability is centered around a copy of the input sequence, which in our case represents the best-case scenario where the MT system output matches a human reference. Our method is simple and intuitive, and our single model (trained in 39 languages) outperforms or statistically ties with all prior metrics on the WMT19 segment-level shared metrics task in all languages, excluding Gujarati, where the model had no training data. We also explore conditioning our model on the source instead of the reference, and find that it outperforms every "quality estimation as a metric" system from the WMT19 shared task on quality estimation by a statistically significant margin in every language pair.
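Force-decoding means the paraphraser never generates freely: it is fed the human reference on the encoder side and the MT output on the decoder side, and the metric is the log-probability it assigns to the MT output token by token. A sketch of that scoring loop; the `paraphraser(encoder_ids, decoder_input_ids)` interface returning next-token logits is a hypothetical stand-in for a seq2seq model, not an API from the paper.

```python
import torch.nn.functional as F

def force_decode_score(paraphraser, reference_ids, hypothesis_ids):
    """Average log-probability of the MT hypothesis as a paraphrase of the reference.

    reference_ids:  (1, ref_len)  human reference, fed to the encoder
    hypothesis_ids: (1, hyp_len)  MT system output, force-decoded
    """
    decoder_input = hypothesis_ids[:, :-1]   # teacher forcing: predict token t
    targets = hypothesis_ids[:, 1:]          # from all tokens before it
    logits = paraphraser(reference_ids, decoder_input)    # (1, hyp_len - 1, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    token_scores = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_scores.mean().item()
```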
70. RikiNet: Reading Wikipedia Pages for Natural Question Answering [PDF] 返回目录
Dayiheng Liu, Yeyun Gong, Jie Fu, Yu Yan, Jiusheng Chen, Daxin Jiang, Jiancheng Lv, Nan Duan
Abstract: Reading long documents to answer open-domain questions remains challenging in natural language understanding. In this paper, we introduce a new model, called RikiNet, which reads Wikipedia pages for natural question answering. RikiNet contains a dynamic paragraph dual-attention reader and a multi-level cascaded answer predictor. The reader dynamically represents the document and question by utilizing a set of complementary attention mechanisms. The representations are then fed into the predictor to obtain the span of the short answer, the paragraph of the long answer, and the answer type in a cascaded manner. On the Natural Questions (NQ) dataset, a single RikiNet achieves 74.3 F1 and 57.9 F1 on long-answer and short-answer tasks. To our best knowledge, it is the first single model that outperforms the single human performance. Furthermore, an ensemble RikiNet obtains 76.1 F1 and 61.3 F1 on long-answer and short-answer tasks, achieving the best performance on the official NQ leaderboard
71. User-Guided Aspect Classification for Domain-Specific Texts [PDF] 返回目录
Peiran Li, Fang Guo, Jingbo Shang
Abstract: Aspect classification, identifying aspects of text segments, facilitates numerous applications, such as sentiment analysis and review summarization. To alleviate the human effort on annotating massive texts, in this paper, we study the problem of classifying aspects based on only a few user-provided seed words for pre-defined aspects. The major challenge lies in how to handle the noisy misc aspect, which is designed for texts without any pre-defined aspects. Even domain experts have difficulties to nominate seed words for the misc aspect, making existing seed-driven text classification methods not applicable. We propose a novel framework, ARYA, which enables mutual enhancements between pre-defined aspects and the misc aspect via iterative classifier training and seed updating. Specifically, it trains a classifier for pre-defined aspects and then leverages it to induce the supervision for the misc aspect. The prediction results of the misc aspect are later utilized to filter out noisy seed words for pre-defined aspects. Experiments in two domains demonstrate the superior performance of our proposed framework, as well as the necessity and importance of properly modeling the misc aspect.
72. Indirect Identification of Psychosocial Risks from Natural Language [PDF] 返回目录
Kristen C. Allen, Alex Davis, Tamar Krishnamurti
Abstract: During the perinatal period, psychosocial health risks, including depression and intimate partner violence, are associated with serious adverse health outcomes for parents and children. To appropriately intervene, healthcare professionals must first identify those at risk, yet stigma often prevents people from directly disclosing the information needed to prompt an assessment. We examine indirect methods of eliciting and analyzing information that could indicate psychosocial risks. Short diary entries by peripartum women exhibit thematic patterns, extracted by topic modeling, and emotional perspective, drawn from dictionary-informed sentiment features. Using these features, we use regularized regression to predict screening measures of depression and psychological aggression by an intimate partner. Journal text entries quantified through topic models and sentiment features show promise for depression prediction, with performance almost as good as closed-form questions. Text-based features were less useful for prediction of intimate partner violence, but moderately indirect multiple-choice questioning allowed for detection without explicit disclosure. Both methods may serve as an initial or complementary screening approach to detecting stigmatized risks.
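The modelling pipeline described here, topic proportions plus dictionary-informed sentiment features fed into a regularized regression, can be sketched with standard tools. Everything below is illustrative: the tiny negative lexicon, the LDA topic count, and the choice of L2-regularized logistic regression are stand-ins for the paper's actual features and estimator.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def fit_risk_screen(diary_texts, screen_labels, n_topics=10):
    """Predict a binary screening label from topic + simple lexicon features."""
    counts = CountVectorizer(min_df=2).fit_transform(diary_texts)
    topics = LatentDirichletAllocation(
        n_components=n_topics, random_state=0).fit_transform(counts)

    # Placeholder "sentiment" feature: fraction of words from a tiny negative
    # lexicon (purely illustrative; the paper uses dictionary-informed features).
    neg_lexicon = {"sad", "alone", "afraid", "tired", "hurt"}
    neg_frac = np.array([[sum(w in neg_lexicon for w in t.lower().split())
                          / max(len(t.split()), 1)] for t in diary_texts])

    features = np.hstack([topics, neg_frac])
    # L2-regularized logistic regression stands in for "regularized regression".
    return LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(
        features, screen_labels)
```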
73. Filtering before Iteratively Referring for Knowledge-Grounded Response Selection in Retrieval-Based Chatbots [PDF] 返回目录
Jia-Chen Gu, Zhen-Hua Ling, Quan Liu, Si Wei, Xiaodan Zhu
Abstract: The challenges of building knowledge-grounded retrieval-based chatbots lie in how to ground a conversation on the background knowledge and how to perform their matching with the response. This paper proposes a method named Filtering before Iteratively REferring (FIRE) for presenting the background knowledge of dialogue agents in retrieval-based chatbots. We first propose a pre-filter, which is composed of a context filter and a knowledge filter. This pre-filter grounds the conversation on the knowledge and comprehends the knowledge according to the conversation by collecting the matching information between them bidirectionally, and then recognizing the important information in them accordingly. After that, iteratively referring is performed between the context and the response, as well as between the knowledge and the response, in order to collect the deep and wide matching information. Experimental results show that the FIRE model outperforms previous methods by margins larger than 2.8% on original personas and 4.1% on revised personas on the PERSONA-CHAT dataset, as well as 3.1% on the CMU_DoG dataset in terms of top-1 accuracy.
74. WT5?! Training Text-to-Text Models to Explain their Predictions [PDF] 返回目录
Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, Karishma Malkan
Abstract: Neural networks have recently achieved human-level performance on various challenging natural language processing (NLP) tasks, but it is notoriously difficult to understand why a neural network produced a particular prediction. In this paper, we leverage the text-to-text framework proposed by Raffel et al. (2019) to train language models to output a natural text explanation alongside their prediction. Crucially, this requires no modifications to the loss function or to the training and decoding procedures -- we simply train the model to output the explanation after generating the (natural text) prediction. We show that this approach not only obtains state-of-the-art results on explainability benchmarks, but also permits learning from a limited set of labeled explanations and transferring rationalization abilities across datasets. To facilitate reproducibility and future work, we release the code used to train the models.
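Because no losses or decoding procedures change, the method comes down to input/target formatting in the text-to-text framework: add an "explain" tag to the task prefix and append the explanation to the target. The exact prefixes below are illustrative rather than the authors' verbatim format.

```python
def make_example(review, label, explanation=None):
    """Format a sentiment example for a text-to-text model (WT5-style sketch).

    When an explanation is available, the target asks the model to produce it
    after the label; otherwise only the label is predicted.
    """
    if explanation is not None:
        source = "explain sentiment: " + review
        target = f"{label} explanation: {explanation}"
    else:
        source = "sentiment: " + review
        target = label
    return source, target

# Usage:
# make_example("The pasta was superb.", "positive", "superb indicates strong praise")
# -> ("explain sentiment: The pasta was superb.",
#     "positive explanation: superb indicates strong praise")
```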
75. TextAT: Adversarial Training for Natural Language Understanding with Token-Level Perturbation [PDF] 返回目录
Linyang Li, Xipeng Qiu
Abstract: Adversarial training is effective in improving the robustness of neural networks. In NLP, languages are discrete in nature and separate tokens possess discrete semantics. Therefore, to incorporate adversarial training into sequence-level tasks, we introduce a novel training strategy: Text Adversarial Training with token-level perturbation. We first craft perturbations that are initialized using fine-grained, token-level accumulated perturbations. We then constrain these perturbations considering that inputs are separate tokens, rather than constraining them under a naive normalization ball. We validate the effectiveness of this normalization method using large-scale Transformer-based language models. Experiments on the GLUE benchmark and an NER task show that our adversarial training strategy improves performance on various tasks including text classification and sequence labeling.
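The core operation of token-level adversarial training can be sketched in a few lines of PyTorch: take the gradient of the loss with respect to the input embeddings, normalize it per token rather than over the whole sequence, and train on the perturbed embeddings as well. The `model(inputs_embeds, labels)` interface returning a scalar loss is an assumption of this sketch, and the accumulation of perturbations across steps described in the abstract is omitted.

```python
import torch

def token_level_adversarial_loss(model, inputs_embeds, labels, epsilon=1e-2):
    """One adversarial training step with a per-token normalized perturbation.

    `model(inputs_embeds, labels)` returning a scalar loss is an assumed
    interface for this sketch.
    """
    inputs_embeds = inputs_embeds.detach().requires_grad_(True)
    clean_loss = model(inputs_embeds, labels)
    grad, = torch.autograd.grad(clean_loss, inputs_embeds, retain_graph=True)

    # Token-level perturbation: normalize the gradient of each token's
    # embedding separately instead of over the whole sequence.
    norm = grad.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    delta = epsilon * grad / norm

    adv_loss = model(inputs_embeds + delta, labels)
    return clean_loss + adv_loss
```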
76. Text Segmentation by Cross Segment Attention [PDF] 返回目录
Michal Lukasik, Boris Dadachev, Gonçalo Simões, Kishore Papineni
Abstract: Document and discourse segmentation are two fundamental NLP tasks pertaining to breaking up text into constituents, which are commonly used to help downstream tasks such as information retrieval or text summarization. In this work, we propose three transformer-based architectures and provide comprehensive comparisons with previously proposed approaches on three standard datasets. We establish a new state-of-the-art, reducing in particular the error rates by a large margin in all cases. We further analyze model sizes and find that we can build models with many fewer parameters while keeping good performance, thus facilitating real-world applications.
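A cross-segment formulation turns segmentation into a binary decision at every candidate break: encode the local context on each side of the break as a sentence pair and classify whether a boundary occurs there. The sketch below uses a generic BERT pair classifier as a stand-in for the transformer architectures compared in the paper; the classification head would still need fine-tuning on labeled boundaries.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # label 1 = segment boundary here

def boundary_prob(left_context, right_context):
    """Probability of a segment boundary between the two context windows.

    The sequence-pair encoding places the candidate break at the [SEP] token;
    the classification head must be fine-tuned on labeled boundaries first.
    """
    inputs = tokenizer(left_context, right_context,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, 2)
    return torch.softmax(logits, dim=-1)[0, 1].item()
```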
77. Hierarchical Encoders for Modeling and Interpreting Screenplays [PDF] 返回目录
Gayatri Bhat, Avneesh Saluja, Melody Dye, Jan Florjanczyk
Abstract: While natural language understanding of long-form documents is still an open challenge, such documents often contain structural information that can inform the design of models for encoding them. Movie scripts are an example of such richly structured text - scripts are segmented into scenes, which are further decomposed into dialogue and descriptive components. In this work, we propose a neural architecture for encoding this structure, which performs robustly on a pair of multi-label tag classification datasets, without the need for handcrafted features. We add a layer of insight by augmenting an unsupervised "interpretability" module to the encoder, allowing for the extraction and visualization of narrative trajectories. Though this work specifically tackles screenplays, we discuss how the underlying approach can be generalized to a range of structured documents.
78. Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations [PDF] 返回目录
Peng Qi, Yuhao Zhang, Christopher D. Manning
Abstract: We investigate the problem of generating informative questions in information-asymmetric conversations. Unlike previous work on question generation which largely assumes knowledge of what the answer might be, we are interested in the scenario where the questioner is not given the context from which answers are drawn, but must reason pragmatically about how to acquire new information, given the shared conversation history. We identify two core challenges: (1) formally defining the informativeness of potential questions, and (2) exploring the prohibitively large space of potential questions to find the good candidates. To generate pragmatic questions, we use reinforcement learning to optimize an informativeness metric we propose, combined with a reward function designed to promote more specific questions. We demonstrate that the resulting pragmatic questioner substantially improves the informativeness and specificity of questions generated over a baseline model, as evaluated by our metrics as well as humans.
79. Simulated Multiple Reference Training Improves Low-Resource Machine Translation [PDF] 返回目录
Huda Khayrallah, Brian Thompson, Matt Post, Philipp Koehn
Abstract: Many valid translations exist for a given sentence, and yet machine translation (MT) is trained with a single reference translation, exacerbating data sparsity in low-resource settings. We introduce a novel MT training method that approximates the full space of possible translations by: sampling a paraphrase of the reference sentence from a paraphraser and training the MT model to predict the paraphraser's distribution over possible tokens. With an English paraphraser, we demonstrate the effectiveness of our method in low-resource settings, with gains of 1.2 to 7 BLEU.
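The training objective sketched in this abstract replaces the usual one-hot target with the paraphraser's token distribution, computed on a sampled paraphrase of the reference. A minimal version of that per-token loss, assuming both models produce logits over the same vocabulary; this is an illustration of the idea, not the authors' code.

```python
import torch.nn.functional as F

def distribution_match_loss(mt_logits, paraphraser_logits):
    """Train the MT model to match the paraphraser's token distribution.

    Both tensors have shape (batch, seq_len, vocab) and are computed with
    teacher forcing on the same sampled paraphrase of the reference.
    """
    log_p_mt = F.log_softmax(mt_logits, dim=-1)
    p_para = F.softmax(paraphraser_logits, dim=-1).detach()  # soft targets
    # Cross-entropy of the MT model against the paraphraser's soft targets.
    return -(p_para * log_p_mt).sum(dim=-1).mean()
```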
80. Exploiting Sentence Order in Document Alignment [PDF] 返回目录
Brian Thompson, Philipp Koehn
Abstract: In this work, we exploit the simple idea that a document and its translation should contain approximately the same information, in approximately the same order. We propose methods for both document pair candidate generation and candidate re-scoring which incorporate high-level order information. Our method results in 61% relative reduction in error versus the best previously published result on the WMT16 document alignment shared task. We also apply our method to web-scraped Sinhala-English documents from ParaCrawl and find that our method improves MT performance by 1.2 BLEU over the current ParaCrawl document alignment method.
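To make the idea concrete, the sketch below scores a candidate document pair by combining sentence-level similarity with an order-consistency term (Kendall's tau over greedily matched sentence indices). The embeddings, the greedy matching, and the mixing weight are my own simplifying assumptions, not the paper's exact re-scoring model.

    import numpy as np
    from scipy.stats import kendalltau

    def order_aware_score(src_emb, tgt_emb, beta=0.5):
        """Score a candidate document pair.

        src_emb, tgt_emb: (n_src, d) and (n_tgt, d) L2-normalized sentence embeddings
        beta            : weight between content similarity and order consistency (assumed)
        """
        sim = src_emb @ tgt_emb.T                 # cosine similarities
        match = sim.argmax(axis=1)                # greedy best target sentence per source sentence
        content = sim.max(axis=1).mean()          # average similarity of the matches
        tau, _ = kendalltau(np.arange(len(match)), match)
        order = 0.0 if np.isnan(tau) else tau     # +1 if matches are monotone, -1 if reversed
        return beta * content + (1.0 - beta) * order

    rng = np.random.default_rng(0)
    src = rng.normal(size=(6, 16)); src /= np.linalg.norm(src, axis=1, keepdims=True)
    tgt = src + 0.01 * rng.normal(size=src.shape); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    print(order_aware_score(src, tgt))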
81. A Focused Study to Compare Arabic Pre-training Models on Newswire IE Tasks [PDF] 返回目录
Wuwei Lan, Yang Chen, Wei Xu, Alan Ritter
Abstract: The Arabic language is a morphological rich language, posing many challenges for information extraction (IE) tasks, including Named Entity Recognition (NER), Part-of-Speech tagging (POS), Argument Role Labeling (ARL) and Relation Extraction (RE). A few multilingual pre-trained models have been proposed and show good performance for Arabic, however, most experiment results are reported on language understanding tasks, such as natural language inference, question answering and sentiment analysis. Their performance on the IE tasks is less known, in particular, the cross-lingual transfer capability from English to Arabic. In this work, we pre-train a Gigaword-based bilingual language model (GigaBERT) to study these two distant languages as well as zero-short transfer learning on the information extraction tasks. Our GigaBERT model can outperform mBERT and XLM-R-base on NER, POS and ARL tasks, with regarding to the per-language and/or zero-transfer performance. We make our pre-trained models publicly available at this https URL to facilitate the research of this field.
82. Bilingual Text Extraction as Reading Comprehension [PDF] 返回目录
Katsuki Chousa, Masaaki Nagata, Masaaki Nishino
Abstract: In this paper, we propose a method to extract bilingual texts automatically from noisy parallel corpora by framing the problem as a token-level span prediction, such as SQuAD-style Reading Comprehension. To extract a span of the target document that is a translation of a given source sentence (span), we use either QANet or multilingual BERT. QANet can be trained for a specific parallel corpus from scratch, while multilingual BERT can utilize pre-trained multilingual representations. For the span prediction method using QANet, we introduce a total optimization method using integer linear programming to achieve consistency in the predicted parallel spans. We conduct a parallel sentence extraction experiment using simulated noisy parallel corpora with two language pairs (En-Fr and En-Ja) and find that the proposed method using QANet achieves significantly better accuracy than a baseline method using two bi-directional RNN encoders, particularly for distant language pairs (En-Ja). We also conduct a sentence alignment experiment using En-Ja newspaper articles and find that the proposed method using multilingual BERT achieves significantly better accuracy than a baseline method using a bilingual dictionary and dynamic programming.
83. A Supervised Word Alignment Method based on Cross-Language Span Prediction using Multilingual BERT [PDF] 返回目录
Masaaki Nagata, Chousa Katsuki, Masaaki Nishino
Abstract: We present a novel supervised word alignment method based on cross-language span prediction. We first formalize a word alignment problem as a collection of independent predictions from a token in the source sentence to a span in the target sentence. As this is equivalent to a SQuAD v2.0 style question answering task, we then solve this problem by using multilingual BERT, which is fine-tuned on a manually created gold word alignment data. We greatly improved the word alignment accuracy by adding the context of the token to the question. In the experiments using five word alignment datasets among Chinese, Japanese, German, Romanian, French, and English, we show that the proposed method significantly outperformed previous supervised and unsupervised word alignment methods without using any bitexts for pretraining. For example, we achieved an F1 score of 86.7 for the Chinese-English data, which is 13.3 points higher than the previous state-of-the-art supervised methods.
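The key input trick is to hand the QA model the whole source sentence as the "question", with the word to be aligned marked in context, and the target sentence as the "context" whose answer span gives the aligned words. The helper below sketches that formatting plus a simple intersection-based symmetrization of the two prediction directions; the marker token and the predict_span callback are hypothetical placeholders, not the paper's exact interface.

    def build_question(src_tokens, i, marker="¶"):
        """Mark source token i inside its full sentence, as the SQuAD-style question."""
        marked = src_tokens[:i] + [marker, src_tokens[i], marker] + src_tokens[i + 1:]
        return " ".join(marked)

    def align(src_tokens, tgt_tokens, predict_span):
        """Collect token-level alignments from span predictions in one direction.

        predict_span(question, context_tokens) -> (start, end) target token indices,
        or None for unalignable tokens; this callback stands in for the fine-tuned QA model.
        """
        links = set()
        for i in range(len(src_tokens)):
            span = predict_span(build_question(src_tokens, i), tgt_tokens)
            if span is not None:
                start, end = span
                links.update((i, j) for j in range(start, end + 1))
        return links

    def symmetrize(src_to_tgt, tgt_to_src):
        """Intersect the two directions (a common, conservative symmetrization heuristic)."""
        return src_to_tgt & {(i, j) for (j, i) in tgt_to_src}

    # Toy run with a dummy predictor that aligns every source token to target position 0.
    print(align(["a", "b"], ["x", "y"], lambda q, c: (0, 0)))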
84. Instance-Based Learning of Span Representations: A Case Study through Named Entity Recognition [PDF] 返回目录
Hiroki Ouchi, Jun Suzuki, Sosuke Kobayashi, Sho Yokoi, Tatsuki Kuribayashi, Ryuto Konno, Kentaro Inui
Abstract: Interpretable rationales for model predictions play a critical role in practical applications. In this study, we develop models possessing interpretable inference process for structured prediction. Specifically, we present a method of instance-based learning that learns similarities between spans. At inference time, each span is assigned a class label based on its similar spans in the training set, where it is easy to understand how much each training instance contributes to the predictions. Through empirical analysis on named entity recognition, we demonstrate that our method enables to build models that have high interpretability without sacrificing performance.
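At inference, each span's label comes from its most similar training spans, which is essentially k-nearest-neighbour classification over span vectors. Below is a minimal numpy sketch under that reading; the cosine similarity and the value of k are assumptions.

    import numpy as np
    from collections import Counter

    def knn_span_label(query, train_vecs, train_labels, k=5):
        """Label a span by majority vote over its k most similar training spans.

        query       : (d,) span representation to classify
        train_vecs  : (n, d) span representations from the training set
        train_labels: length-n list of gold labels (e.g. entity types or "O")
        """
        sims = train_vecs @ query / (
            np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query) + 1e-9)
        neighbours = np.argsort(-sims)[:k]        # indices of the most similar training spans
        votes = Counter(train_labels[i] for i in neighbours)
        # The neighbour indices also tell us which training instances drove the decision,
        # which is what makes the prediction easy to inspect.
        return votes.most_common(1)[0][0], neighbours

    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(20, 8))
    labels = ["PER"] * 10 + ["ORG"] * 10
    print(knn_span_label(vecs[3] + 0.01, vecs, labels))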
85. Asking without Telling: Exploring Latent Ontologies in Contextual Representations [PDF] 返回目录
Julian Michael, Jan A. Botha, Ian Tenney
Abstract: The success of pretrained contextual encoders, such as ELMo and BERT, has brought a great deal of interest in what these models learn: do they, without explicit supervision, learn to encode meaningful notions of linguistic structure? If so, how is this structure encoded? To investigate this, we introduce latent subclass learning (LSL): a modification to existing classifier-based probing methods that induces a latent categorization (or ontology) of the probe's inputs. Without access to fine-grained gold labels, LSL extracts emergent structure from input representations in an interpretable and quantifiable form. In experiments, we find strong evidence of familiar categories, such as a notion of personhood in ELMo, as well as novel ontological distinctions, such as a preference for fine-grained semantic roles on core arguments. Our results provide unique new evidence of emergent structure in pretrained encoders, including departures from existing annotations which are inaccessible to earlier methods.
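One way to read the LSL probe is as a linear classifier whose gold classes are each split into K latent subclasses, with the subclass dimension marginalized out in the training loss; the induced subclasses then serve as the latent ontology. The PyTorch sketch below implements that marginalization under this assumed reading; the layer sizes and K are arbitrary choices.

    import torch
    import torch.nn as nn

    class LatentSubclassProbe(nn.Module):
        """Linear probe with K latent subclasses per gold class (a sketch of LSL-style probing)."""

        def __init__(self, dim, n_classes, k=4):
            super().__init__()
            self.k, self.n_classes = k, n_classes
            self.scorer = nn.Linear(dim, n_classes * k)

        def class_log_probs(self, x):
            logits = self.scorer(x).view(-1, self.n_classes, self.k)
            log_p = torch.log_softmax(logits.flatten(1), dim=-1).view_as(logits)
            # Marginalize over latent subclasses to get per-class log-probabilities.
            return torch.logsumexp(log_p, dim=-1)

        def subclass_assignment(self, x):
            # The argmax cell is the induced (latent) category of the input.
            return self.scorer(x).view(-1, self.n_classes * self.k).argmax(dim=-1)

    probe = LatentSubclassProbe(dim=16, n_classes=3)
    x, y = torch.randn(8, 16), torch.randint(0, 3, (8,))
    loss = -probe.class_log_probs(x).gather(1, y.unsqueeze(1)).mean()
    loss.backward()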
86. Posterior Calibrated Training on Sentence Classification Tasks [PDF] 返回目录
Taehee Jung, Dongyeop Kang, Hua Cheng, Lucas Mentch, Thomas Schaaf
Abstract: Most classification models work by first predicting a posterior probability distribution over all classes and then selecting that class with the largest estimated probability. In many settings however, the quality of posterior probability itself (e.g., 65% chance having diabetes), gives more reliable information than the final predicted class alone. When these methods are shown to be poorly calibrated, most fixes to date have relied on posterior calibration, which rescales the predicted probabilities but often has little impact on final classifications. Here we propose an end-to-end training procedure called posterior calibrated (PosCal) training that directly optimizes the objective while minimizing the difference between the predicted and empirical posterior probabilities.We show that PosCal not only helps reduce the calibration error but also improve task performance by penalizing drops in performance of both objectives. Our PosCal achieves about 2.5% of task performance gain and 16.1% of calibration error reduction on GLUE (Wang et al., 2018) compared to the baseline. We achieved the comparable task performance with 13.2% calibration error reduction on xSLUE (Kang and Hovy, 2019), but not outperforming the two-stage calibration baseline. PosCal training can be easily extendable to any types of classification tasks as a form of regularization term. Also, PosCal has the advantage that it incrementally tracks needed statistics for the calibration objective during the training process, making efficient use of large training sets.
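The objective pairs the usual task loss with a penalty on the gap between predicted and empirical posteriors. As a rough illustration of what that gap measures, the function below computes a binned calibration error from predicted probabilities; PosCal itself tracks the needed statistics incrementally and differentiably during training, which this toy evaluation-time version does not.

    import numpy as np

    def expected_calibration_error(probs, labels, n_bins=10):
        """Binned gap between predicted confidence and empirical accuracy.

        probs : (n, n_classes) predicted posterior probabilities
        labels: (n,) gold class indices
        """
        conf = probs.max(axis=1)                  # confidence of the predicted class
        correct = probs.argmax(axis=1) == labels
        ece, edges = 0.0, np.linspace(0.0, 1.0, n_bins + 1)
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (conf > lo) & (conf <= hi)
            if mask.any():
                # |empirical accuracy - mean confidence|, weighted by bin size.
                ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
        return ece

    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(3), size=200)
    y = rng.integers(0, 3, size=200)
    print(expected_calibration_error(p, y))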
87. "The Boating Store Had Its Best Sail Ever": Pronunciation-attentive Contextualized Pun Recognition [PDF] 返回目录
Yichao Zhou, Jyun-Yu Jiang, Jieyu Zhao, Kai-Wei Chang, Wei Wang
Abstract: Humor plays an important role in human languages and it is essential to model humor when building intelligence systems. Among different forms of humor, puns perform wordplay for humorous effects by employing words with double entendre and high phonetic similarity. However, identifying and modeling puns are challenging as puns usually involved implicit semantic or phonological tricks. In this paper, we propose Pronunciation-attentive Contextualized Pun Recognition (PCPR) to perceive human humor, detect if a sentence contains puns and locate them in the sentence. PCPR derives contextualized representation for each word in a sentence by capturing the association between the surrounding context and its corresponding phonetic symbols. Extensive experiments are conducted on two benchmark datasets. Results demonstrate that the proposed approach significantly outperforms the state-of-the-art methods in pun detection and location tasks. In-depth analyses verify the effectiveness and robustness of PCPR.
88. A Large-Scale Semi-Supervised Dataset for Offensive Language Identification [PDF] 返回目录
Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Marcos Zampieri, Preslav Nakov
Abstract: The use of offensive language is a major problem in social media which has led to an abundance of research in detecting content such as hate speech, cyberbulling, and cyber-aggression. There have been several attempts to consolidate and categorize these efforts. Recently, the OLID dataset used at SemEval-2019 proposed a hierarchical three-level annotation taxonomy which addresses different types of offensive language as well as important information such as the target of such content. The categorization provides meaningful and important information for understanding offensive language. However, the OLID dataset is limited in size, especially for some of the low-level categories, which included only a few hundred instances, thus making it challenging to train robust deep learning models. Here, we address this limitation by creating the largest available dataset for this task, SOLID. SOLID contains over nine million English tweets labeled in a semi-supervised manner. We further demonstrate experimentally that using SOLID along with OLID yields improved performance on the OLID test set for two different models, especially for the lower levels of the taxonomy. Finally, we perform analysis of the models' performance on easy and hard examples of offensive language using data annotated in a semi-supervised way.
89. Pragmatic Issue-Sensitive Image Captioning [PDF] 返回目录
Allen Nie, Reuben Cohn-Gordon, Christopher Potts
Abstract: Image captioning systems have recently improved dramatically, but they still tend to produce captions that are insensitive to the communicative goals that captions should meet. To address this, we propose Issue-Sensitive Image Captioning (ISIC). In ISIC, a captioning system is given a target image and an \emph{issue}, which is a set of images partitioned in a way that specifies what information is relevant. The goal of the captioner is to produce a caption that resolves this issue. To model this task, we use an extension of the Rational Speech Acts model of pragmatic language use. Our extension is built on top of state-of-the-art pretrained neural image captioners and explicitly reasons about issues in our sense. We establish experimentally that these models generate captions that are both highly descriptive and issue-sensitive, and we show how ISIC can complement and enrich the related task of Visual Question Answering.
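The pragmatic reranking builds on the Rational Speech Acts recursion: a literal listener distribution over the images in the issue, and a pragmatic speaker that prefers captions which single out the target. The numpy sketch below shows that recursion on a toy caption-by-image score matrix; the scores and the rationality parameter are made-up inputs, not the paper's trained captioner.

    import numpy as np

    def rsa_speaker(caption_image_scores, rationality=3.0):
        """One step of Rational Speech Acts reranking.

        caption_image_scores: (n_captions, n_images) literal speaker scores S0(caption | image),
                              e.g. log-probabilities from a base captioner.
        Returns S1(caption | image): captions re-weighted by how well they pick out each image.
        """
        s0 = np.exp(caption_image_scores - caption_image_scores.max())
        l0 = s0 / s0.sum(axis=1, keepdims=True)           # literal listener: P(image | caption)
        util = rationality * np.log(l0 + 1e-12)           # speaker utility = informativity
        s1 = np.exp(util - util.max(axis=0, keepdims=True))
        return s1 / s1.sum(axis=0, keepdims=True)         # pragmatic speaker: P(caption | image)

    scores = np.log(np.array([[0.8, 0.7],    # caption 0 describes both images well
                              [0.6, 0.1]]))  # caption 1 mostly fits image 0
    print(rsa_speaker(scores)[:, 0])         # caption 1 gains weight for image 0: it is more discriminative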
90. What Happens To BERT Embeddings During Fine-tuning? [PDF] 返回目录
Amil Merchant, Elahe Rahimtoroghi, Ellie Pavlick, Ian Tenney
Abstract: While there has been much recent work studying how linguistic information is encoded in pre-trained sentence representations, comparatively little is understood about how these models change when adapted to solve downstream tasks. Using a suite of analysis techniques (probing classifiers, Representational Similarity Analysis, and model ablations), we investigate how fine-tuning affects the representations of the BERT model. We find that while fine-tuning necessarily makes significant changes, it does not lead to catastrophic forgetting of linguistic phenomena. We instead find that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks. In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing. Finally, we also find that fine-tuning has a weaker effect on representations of out-of-domain sentences, suggesting room for improvement in model generalization.
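Representational Similarity Analysis, one of the analysis techniques used here, compares two models (or two checkpoints) by correlating their pairwise-similarity structures over the same inputs. Below is a minimal numpy/scipy version, with cosine similarity and Spearman correlation as assumed choices; it is a sketch of the general technique, not the paper's exact setup.

    import numpy as np
    from scipy.stats import spearmanr

    def representational_similarity(reps_a, reps_b):
        """Correlate the pairwise-similarity structure of two representation sets.

        reps_a, reps_b: (n_examples, d_a) and (n_examples, d_b) embeddings of the
        same n_examples inputs (e.g. before and after fine-tuning).
        """
        def sim_matrix(x):
            x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-9)
            return x @ x.T                                 # cosine similarity matrix

        iu = np.triu_indices(len(reps_a), k=1)             # compare only distinct input pairs
        rho, _ = spearmanr(sim_matrix(reps_a)[iu], sim_matrix(reps_b)[iu])
        return rho

    rng = np.random.default_rng(0)
    base = rng.normal(size=(50, 32))
    finetuned = base + 0.1 * rng.normal(size=base.shape)   # a mildly perturbed copy
    print(representational_similarity(base, finetuned))    # close to 1.0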
91. Distantly-Supervised Neural Relation Extraction with Side Information using BERT [PDF] 返回目录
Johny Moreira, Chaina Oliveira, David Macedo, Cleber Zanchettin, Luciano Barbosa
Abstract: Relation extraction (RE) consists in categorizing the relationship between entities in a sentence. A recent paradigm to develop relation extractors is Distant Supervision (DS), which allows the automatic creation of new datasets by taking an alignment between a text corpus and a Knowledge Base (KB). KBs can sometimes also provide additional information to the RE task. One of the methods that adopt this strategy is the RESIDE model, which proposes a distantly-supervised neural relation extraction using side information from KBs. Considering that this method outperformed state-of-the-art baselines, in this paper, we propose a related approach to RESIDE also using additional side information, but simplifying the sentence encoding with BERT embeddings. Through experiments, we show the effectiveness of the proposed method in Google Distant Supervision and Riedel datasets concerning the BGWA and RESIDE baseline methods. Although Area Under the Curve is decreased because of unbalanced datasets, P@N results have shown that the use of BERT as sentence encoding allows superior performance to baseline methods.
92. A Benchmark Dataset of Check-worthy Factual Claims [PDF] 返回目录
Fatma Arslan, Naeemul Hassan, Chengkai Li, Mark Tremayne
Abstract: In this paper we present the ClaimBuster dataset of 23,533 statements extracted from all U.S. general election presidential debates and annotated by human coders. The ClaimBuster dataset can be leveraged in building computational methods to identify claims that are worth fact-checking from the myriad of sources of digital or traditional media. The ClaimBuster dataset is publicly available to the research community, and it can be found at this http URL.
93. Improving Vision-and-Language Navigation with Image-Text Pairs from the Web [PDF] 返回目录
Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra
Abstract: Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e.g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs'). We ask the following question -- can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions) to learn visual groundings (what do 'stairs' look like?) that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a sequence of panoramic RGB images captured by the agent. We demonstrate that pretraining VLN-BERT on image-text pairs from the web before fine-tuning on embodied path-instruction data significantly improves performance on VLN -- outperforming the prior state-of-the-art in the fully-observed setting by 4 absolute percentage points on success rate. Ablations of our pretraining curriculum show each stage to be impactful -- with their combination resulting in further positive synergistic effects.
94. Learning to Ask Screening Questions for Job Postings [PDF] 返回目录
Baoxu Shi, Shan Li, Jaewon Yang, Mustafa Emre Kazdagli, Qi He
Abstract: At LinkedIn, we want to create economic opportunity for everyone in the global workforce. A critical aspect of this goal is matching jobs with qualified applicants. To improve hiring efficiency and reduce the need to manually screening each applicant, we develop a new product where recruiters can ask screening questions online so that they can filter qualified candidates easily. To add screening questions to all $20$M active jobs at LinkedIn, we propose a new task that aims to automatically generate screening questions for a given job posting. To solve the task of generating screening questions, we develop a two-stage deep learning model called Job2Questions, where we apply a deep learning model to detect intent from the text description, and then rank the detected intents by their importance based on other contextual features. Since this is a new product with no historical data, we employ deep transfer learning to train complex models with limited training data. We launched the screening question product and our AI models to LinkedIn users and observed significant impact in the job marketplace. During our online A/B test, we observed $+53.10\%$ screening question suggestion acceptance rate, $+22.17\%$ job coverage, $+190\%$ recruiter-applicant interaction, and $+11$ Net Promoter Score. In sum, the deployed Job2Questions model helps recruiters to find qualified applicants and job seekers to find jobs they are qualified for.
95. Few-Shot Learning for Abstractive Multi-Document Opinion Summarization [PDF] 返回目录
Arthur Bražinskas, Mirella Lapata, Ivan Titov
Abstract: Opinion summarization is an automatic creation of text reflecting subjective information expressed in multiple documents, such as user reviews of a product. The task is practically important and has attracted a lot of attention. However, due to a high cost of summary production, datasets large enough for training supervised models are lacking. Instead, the task has been traditionally approached with extractive methods that learn to select text fragments in an unsupervised or weakly-supervised way. Recently, it has been shown that abstractive summaries, potentially more fluent and better at reflecting conflicting information, can also be produced in an unsupervised fashion. However, these models, not being exposed to the actual summaries, fail to capture their essential properties. In this work, we show that even a handful of summaries is sufficient to bootstrap generation of the summary text with all expected properties, such as writing style, informativeness, fluency, and sentiment preservation. We start by training a language model to generate a new product review given available reviews of the product. The model is aware of the properties: it proceeds with first generating property values and then producing a review conditioned on them. We do not use any summaries in this stage and the property values are derived from reviews with no manual effort. In the second stage, we fine-tune the module predicting the property values on a few available summaries. This lets us switch the generator to the summarization mode. Our approach substantially outperforms previous extractive and abstractive methods in automatic and human evaluation.
96. Progressive Transformers for End-to-End Sign Language Production [PDF] 返回目录
Ben Saunders, Necati Cihan Camgoz, Richard Bowden
Abstract: The goal of automatic Sign Language Production (SLP) is to translate spoken language to a continuous stream of sign language video at a level comparable to a human translator. If this was achievable, then it would revolutionise Deaf hearing communications. Previous work on predominantly isolated SLP has shown the need for architectures that are better suited to the continuous domain of full sign sequences. In this paper, we propose Progressive Transformers, a novel architecture that can translate from discrete spoken language sentences to continuous 3D skeleton pose outputs representing sign language. We present two model configurations, an end-to-end network that produces sign direct from text and a stacked network that utilises a gloss intermediary. Our transformer network architecture introduces a counter that enables continuous sequence generation at training and inference. We also provide several data augmentation processes to overcome the problem of drift and improve the performance of SLP models. We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging RWTH-PHOENIX-Weather-2014T(PHOENIX14T) dataset and setting baselines for future research.
97. MuSe 2020 -- The First International Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop [PDF] 返回目录
Lukas Stappen, Alice Baird, Georgios Rizos, Panagiotis Tzirakis, Xinchen Du, Felix Hafner, Lea Schumann, Adria Mallol-Ragolta, Björn W. Schuller, Iulia Lefter, Erik Cambria, Ioannis Kompatsiaris
Abstract: Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based Workshop focusing on the tasks of sentiment recognition, as well as emotion-target engagement and trustworthiness detection by means of more comprehensively integrating the audio-visual and language modalities. The purpose of MuSe 2020 is to bring together communities from different disciplines; mainly, the audio-visual emotion recognition community (signal-based), and the sentiment analysis community (symbol-based). We present three distinct sub-challenges: MuSe-Wild, which focuses on continuous emotion (arousal and valence) prediction; MuSe-Topic, in which participants recognise domain-specific topics as the target of 3-class (low, medium, high) emotions; and MuSe-Trust, in which the novel aspect of trustworthiness is to be predicted. In this paper, we provide detailed information on MuSe-CaR, the first of its kind in-the-wild database, which is utilised for the challenge, as well as the state-of-the-art features and modelling approaches applied. For each sub-challenge, a competitive baseline for participants is set; namely, on test we report for MuSe-Wild a combined (valence and arousal) CCC of .2568, for MuSe-Topic a score (computed as 0.34$\cdot$ UAR + 0.66$\cdot$F1) of 76.78 % on the 10-class topic and 40.64 % on the 3-class emotion prediction, and for MuSe-Trust a CCC of .4359.
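Two evaluation details are worth making explicit: the Concordance Correlation Coefficient used for MuSe-Wild and MuSe-Trust, and the weighted 0.34*UAR + 0.66*F1 score used for MuSe-Topic. A small numpy sketch of both follows; the score weights are taken directly from the abstract, while the toy inputs are invented.

    import numpy as np

    def ccc(y_true, y_pred):
        """Concordance Correlation Coefficient between two continuous sequences."""
        mu_t, mu_p = y_true.mean(), y_pred.mean()
        var_t, var_p = y_true.var(), y_pred.var()
        cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
        return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

    def muse_topic_score(uar, f1):
        """Combined MuSe-Topic score, 0.34 * UAR + 0.66 * F1 (weights from the abstract)."""
        return 0.34 * uar + 0.66 * f1

    t = np.linspace(-1, 1, 100)
    print(ccc(np.sin(t), np.sin(t) + 0.05))    # near-perfect agreement up to a small offset
    print(muse_topic_score(0.41, 0.77))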
98. Knowledge Graph Embeddings and Explainable AI [PDF] 返回目录
Federico Bianchi, Gaetano Rossiello, Luca Costabello, Matteo Palmonari, Pasquale Minervini
Abstract: Knowledge graph embeddings are now a widely adopted approach to knowledge representation in which entities and relationships are embedded in vector spaces. In this chapter, we introduce the reader to the concept of knowledge graph embeddings by explaining what they are, how they can be generated and how they can be evaluated. We summarize the state-of-the-art in this field by describing the approaches that have been introduced to represent knowledge in the vector space. In relation to knowledge representation, we consider the problem of explainability, and discuss models and methods for explaining predictions obtained via knowledge graph embeddings.
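As a concrete illustration of what embedding entities and relations in a vector space means, here is a sketch of TransE, one of the classic scoring models covered by such surveys: a triple (h, r, t) is plausible when h + r is close to t. The margin loss and negative-sampling scheme below follow the standard textbook setup and are kept deliberately minimal; they are not drawn from this particular chapter.

    import torch
    import torch.nn as nn

    class TransE(nn.Module):
        """Minimal TransE: score(h, r, t) = -||h + r - t||."""

        def __init__(self, n_entities, n_relations, dim=50):
            super().__init__()
            self.ent = nn.Embedding(n_entities, dim)
            self.rel = nn.Embedding(n_relations, dim)

        def score(self, h, r, t):
            return -(self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)

    model = TransE(n_entities=100, n_relations=10)
    h, r, t = torch.randint(0, 100, (32,)), torch.randint(0, 10, (32,)), torch.randint(0, 100, (32,))
    t_neg = torch.randint(0, 100, (32,))                 # corrupt the tail to build negatives
    # Margin ranking loss: true triples should outscore corrupted ones by a margin of 1.
    loss = torch.relu(1.0 + model.score(h, r, t_neg) - model.score(h, r, t)).mean()
    loss.backward()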
99. Preventing Posterior Collapse with Levenshtein Variational Autoencoder [PDF] 返回目录
Serhii Havrylov, Ivan Titov
Abstract: Variational autoencoders (VAEs) are a standard framework for inducing latent variable models that have been shown effective in learning text representations as well as in text generation. The key challenge with using VAEs is the {\it posterior collapse} problem: learning tends to converge to trivial solutions where the generators ignore latent variables. In our Levenshtein VAE, we propose to replace the evidence lower bound (ELBO) with a new objective which is simple to optimize and prevents posterior collapse. Intuitively, it corresponds to generating a sequence from the autoencoder and encouraging the model to predict an optimal continuation according to the Levenshtein distance (LD) with the reference sentence at each time step in the generated sequence. We motivate the method from the probabilistic perspective by showing that it is closely related to optimizing a bound on the intractable Kullback-Leibler divergence of an LD-based kernel density estimator from the model distribution. With this objective, any generator disregarding latent variables will incur large penalties and hence posterior collapse does not happen. We relate our approach to policy distillation \cite{RossGB11} and dynamic oracles \cite{GoldbergN12}. By considering Yelp and SNLI benchmarks, we show that Levenshtein VAE produces more informative latent representations than alternative approaches to preventing posterior collapse.
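The objective is built around the Levenshtein distance between a generated continuation and the reference sentence. The paper's full training objective is not reproduced here; the sketch below only shows the standard edit-distance recurrence over token sequences that it relies on.

```python
def levenshtein(a, b):
    """Standard edit distance (insertions, deletions, substitutions)
    between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion of x
                           cur[j - 1] + 1,           # insertion of y
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

assert levenshtein(list("kitten"), list("sitting")) == 3
```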
100. Counterfactual Off-Policy Training for Neural Response Generation [PDF] 返回目录
Qingfu Zhu, Weinan Zhang, Ting Liu, William Yang Wang
Abstract: Learning a neural response generation model on data synthesized under the adversarial training framework helps to explore more possible responses. However, most of the data synthesized de novo are of low quality due to the vast size of the response space. In this paper, we propose a counterfactual off-policy method to learn on a better synthesis of data. It takes advantage of a real response to infer an alternative that was not taken using a structural causal model. Learning on the counterfactual responses helps to explore the high-reward area of the response space. An empirical study on the DailyDialog dataset shows that our approach significantly outperforms the HRED model as well as the conventional adversarial training approaches.
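The abstract does not detail how the structural causal model infers the alternative that was not taken. A common way to realise counterfactuals for discrete choices is the Gumbel-max construction: abduct the noise consistent with the observed choice, then replay that noise under an intervened policy. The sketch below illustrates that construction as an assumption; it is not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel(loc):
    # One Gumbel(loc) sample.
    return loc - np.log(-np.log(rng.uniform()))

def abduct_noise(logits, observed):
    """Sample Gumbel noise consistent with `observed` being the argmax of
    logits + noise (top-down construction with truncated Gumbels)."""
    logits = np.asarray(logits, float)
    perturbed = np.empty_like(logits)
    # The maximum of all perturbed logits is Gumbel(logsumexp(logits)).
    top = gumbel(np.logaddexp.reduce(logits))
    perturbed[observed] = top
    for j in range(len(logits)):
        if j != observed:
            # Gumbel(logits[j]) truncated to lie below `top`.
            perturbed[j] = -np.log(np.exp(-top) + np.exp(-gumbel(logits[j])))
    return perturbed - logits

def counterfactual_choice(factual_logits, new_logits, observed):
    """Replay the abducted noise under an intervened policy."""
    noise = abduct_noise(factual_logits, observed)
    return int(np.argmax(np.asarray(new_logits, float) + noise))
```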
101. Zero-shot Neural Retrieval via Domain-targeted Synthetic Query Generation [PDF] 返回目录
Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, Ryan McDonald
Abstract: Deep neural scoring models have recently been shown to improve ranking quality on a number of benchmarks (Guo et al., 2016; Dai et al., 2018; MacAvaney et al., 2019; Yang et al., 2019a). However, these methods rely on underlying ad-hoc retrieval systems to generate candidates for scoring, which are rarely neural themselves (Zamani et al., 2018). Recent work has shown that the performance of ad-hoc neural retrieval systems can be competitive with a number of baselines (Zamani et al., 2018), potentially leading the way to full end-to-end neural retrieval. A major roadblock to the adoption of ad-hoc retrieval models is that they require large supervised training sets to surpass classic term-based techniques, which can be developed from raw corpora. Previous work shows weakly supervised data can yield competitive results, e.g., click data (Dehghani et al., 2017; Borisov et al., 2016). Unfortunately for many domains, even weakly supervised data can be scarce. In this paper, we propose an approach to zero-shot learning (Xian et al., 2018) for ad-hoc retrieval models that relies on synthetic query generation. Crucially, the query generation system is trained on general domain data, but is applied to documents in the targeted domain. This allows us to create arbitrarily large, yet noisy, query-document relevance pairs that are domain targeted. On a number of benchmarks, we show that this is an effective strategy for building neural retrieval models for specialised domains.
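The data-construction step of this recipe can be made concrete: run a query generator (trained on general-domain pairs) over unlabeled target-domain documents and keep the resulting noisy (query, document) pairs, plus sampled negatives, as training data for the retriever. In the sketch below the generator is a deliberately naive placeholder (the document's lead sentence), not the paper's model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    query: str
    document: str
    label: int  # 1 = synthetic relevant pair, 0 = sampled in-domain negative

def lead_sentence_query(doc: str) -> str:
    # Naive stand-in for a learned general-domain query generator.
    return doc.split(".")[0]

def build_synthetic_training_set(
    target_docs: List[str],
    generate_query: Callable[[str], str] = lead_sentence_query,
) -> List[Example]:
    """Build noisy, domain-targeted (query, document) relevance pairs from
    unlabeled target-domain documents: one generated positive per document
    plus one random in-domain negative."""
    examples: List[Example] = []
    for doc in target_docs:
        query = generate_query(doc)
        examples.append(Example(query, doc, 1))
        candidates = [d for d in target_docs if d is not doc]
        if candidates:
            examples.append(Example(query, random.choice(candidates), 0))
    return examples

# The resulting list is what a neural scoring model would then be trained on.
```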
102. The Effect of Natural Distribution Shift on Question Answering Models [PDF] 返回目录
John Miller, Karl Krauth, Benjamin Recht, Ludwig Schmidt
Abstract: We build four new test sets for the Stanford Question Answering Dataset (SQuAD) and evaluate the ability of question-answering systems to generalize to new data. Our first test set is from the original Wikipedia domain and measures the extent to which existing systems overfit the original test set. Despite several years of heavy test set re-use, we find no evidence of adaptive overfitting. The remaining three test sets are constructed from New York Times articles, Reddit posts, and Amazon product reviews and measure robustness to natural distribution shifts. Across a broad range of models, we observe average performance drops of 3.8, 14.0, and 17.4 F1 points, respectively. In contrast, a strong human baseline matches or exceeds the performance of SQuAD models on the original domain and exhibits little to no drop in new domains. Taken together, our results confirm the surprising resilience of the holdout method and emphasize the need to move towards evaluation metrics that incorporate robustness to natural distribution shifts.
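The reported drops are in SQuAD-style token F1. A minimal version of that metric, without SQuAD's answer normalisation and multi-reference maximum, is sketched below for reference.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted and a gold answer span."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# The reported drops (3.8, 14.0 and 17.4 F1 points) are differences in this
# kind of score, averaged over each new test set, relative to the original domain.
```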
103. Neural Speech Separation Using Spatially Distributed Microphones [PDF] 返回目录
Dongmei Wang, Zhuo Chen, Takuya Yoshioka
Abstract: This paper proposes a neural network based speech separation method using spatially distributed microphones. Unlike with traditional microphone array settings, neither the number of microphones nor their spatial arrangement is known in advance, which hinders the use of conventional multi-channel speech separation neural networks based on fixed size input. To overcome this, a novel network architecture is proposed that interleaves inter-channel processing layers and temporal processing layers. The inter-channel processing layers apply a self-attention mechanism along the channel dimension to exploit the information obtained with a varying number of microphones. The temporal processing layers are based on a bidirectional long short term memory (BLSTM) model and applied to each channel independently. The proposed network leverages information across time and space by stacking these two kinds of layers alternately. Our network estimates time-frequency (TF) masks for each speaker, which are then used to generate enhanced speech signals either with TF masking or beamforming. Speech recognition experimental results show that the proposed method significantly outperforms baseline multi-channel speech separation systems.
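At a high level the model alternates self-attention across the channel (microphone) dimension with a BLSTM along time applied to each channel independently. The PyTorch sketch below mirrors that interleaving; the layer sizes, head count and single-block depth are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    """One inter-channel self-attention layer followed by one temporal
    BLSTM layer, operating on a (channels, time, features) tensor."""
    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.channel_attn = nn.MultiheadAttention(feat_dim, num_heads,
                                                  batch_first=True)
        self.temporal_blstm = nn.LSTM(feat_dim, feat_dim // 2,
                                      batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (C, T, F). Attend across channels at each time step.
        c = x.transpose(0, 1)                       # (T, C, F): time as batch
        c, _ = self.channel_attn(c, c, c)           # self-attention over channels
        x = c.transpose(0, 1).contiguous()          # back to (C, T, F)
        # BLSTM along time, each channel treated as a batch element.
        x, _ = self.temporal_blstm(x)               # (C, T, F)
        return x

# e.g. features from 3 microphones, 200 frames, 64 dims per frame;
# a TF-mask estimation head would follow in a full model.
block = InterleavedBlock(feat_dim=64)
out = block(torch.randn(3, 200, 64))
```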