Contents
1. Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models [PDF]
2. Joint Embedding in Named Entity Linking on Sentence Level [PDF]
3. Utilizing BERT Intermediate Layers for Aspect Based Sentiment Analysis and Natural Language Inference [PDF]
4. ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems [PDF]
5. Two Huge Title and Keyword Generation Corpora of Research Articles [PDF]
6. Adjusting Image Attributes of Localized Regions with Low-level Dialogue [PDF]
7. Constructing a Highlight Classifier with an Attention Based LSTM Neural Network [PDF]
8. Attentional Speech Recognition Models Misbehave on Out-of-domain Utterances [PDF]
9. DeepMutation: A Neural Mutation Tool [PDF]
10. On Layer Normalization in the Transformer Architecture [PDF]
11. Superbloom: Bloom filter meets Transformer [PDF]
12. A Non-Intrusive Correction Algorithm for Classification Problems with Corrupted Data [PDF]
Abstracts
1. Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models [PDF]
Wangchunshu Zhou, Ke Xu
Abstract: Automated evaluation of open domain natural language generation (NLG) models remains a challenge, and widely used metrics such as BLEU and perplexity can be misleading in some cases. In this paper, we propose to evaluate natural language generation models by learning to compare a pair of generated sentences with a fine-tuned BERT, which has been shown to have good natural language understanding ability. We also propose to evaluate the model-level quality of NLG models from sample-level comparison results using a skill rating system. While it can be trained in a fully self-supervised fashion, our model can be further fine-tuned with a small amount of human preference annotations to better imitate human judgment. In addition to evaluating trained models, we propose to apply our model as a performance indicator during training for better hyperparameter tuning and early stopping. We evaluate our approach on both story generation and chit-chat dialogue response generation. Experimental results show that our model correlates better with human preference than previous automated evaluation approaches. Training with the proposed metric yields better performance in human evaluation, which further demonstrates the effectiveness of the proposed model.
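A minimal sketch of the pairwise-comparison idea, assuming a HuggingFace-style BERT classifier over sentence pairs; the label set, training procedure, and the skill-rating aggregation are simplifications, not the paper's exact setup:

```python
# Hypothetical sketch: judge which of two generated continuations reads better
# by fine-tuning BERT on sentence pairs. The three-way label set is an assumption.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
comparator = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # 0: A wins, 1: B wins, 2: tie

def compare(candidate_a: str, candidate_b: str) -> int:
    """Return the predicted preference label for the pair (A, B)."""
    inputs = tokenizer(candidate_a, candidate_b,
                       truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = comparator(**inputs).logits
    return int(logits.argmax(dim=-1))

# Sample-level wins and losses from compare() could then feed an Elo-style
# skill-rating system to rank whole NLG models.
```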
2. Joint Embedding in Named Entity Linking on Sentence Level [PDF]
Wei Shi, Siyuan Zhang, Zhiwei Zhang, Hong Cheng, Jeffrey Xu Yu
Abstract: Named entity linking maps an ambiguous mention in a document to an entity in a knowledge base. The task is challenging because a mention in a document has multiple candidate entities. Linking a mention that appears multiple times in a document is difficult, since the contexts around its different appearances may conflict, and training datasets are small because mentions must be linked to their entities manually. Among the many studies reported in the literature, recent embedding methods learn entity vectors from the training dataset at the document level. To address these issues, we focus on linking entities for mentions at the sentence level, which reduces the noise introduced by different appearances of the same mention in a document, at the expense of having less information available. We propose a new unified embedding method that maximizes the relationships learned from knowledge graphs. We confirm the effectiveness of our method in experimental studies.
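For illustration only, a toy candidate-ranking loop in the spirit of embedding-based linking: a sentence-level mention embedding is compared against entity embeddings by inner product. The encoder and entity names are hypothetical placeholders, not the paper's joint training objective:

```python
# Toy sketch of embedding-based candidate ranking for entity linking.
import numpy as np
from typing import List

rng = np.random.default_rng(0)
dim = 128
# Hypothetical entity embeddings; in practice these would be learned jointly
# with mention/context embeddings from the training corpus and knowledge graph.
entity_emb = {
    "Michael_Jordan_(basketball)": rng.normal(size=dim),
    "Michael_I._Jordan_(scientist)": rng.normal(size=dim),
}

def mention_embedding(sentence: str, mention: str) -> np.ndarray:
    # Placeholder sentence-level encoder for the mention in context.
    return rng.normal(size=dim)

def link(sentence: str, mention: str, candidates: List[str]) -> str:
    m = mention_embedding(sentence, mention)
    return max(candidates, key=lambda e: float(m @ entity_emb[e]))

print(link("Jordan published a new paper on variational inference.",
           "Jordan", list(entity_emb)))
```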
3. Utilizing BERT Intermediate Layers for Aspect Based Sentiment Analysis and Natural Language Inference [PDF]
Youwei Song, Jiahai Wang, Zhiwei Liang, Zhiyue Liu, Tao Jiang
Abstract: Aspect based sentiment analysis aims to identify the sentiment expressed towards a given aspect in text. Fine-tuning pretrained BERT performs well on this task and achieves state-of-the-art performance. Existing BERT-based works utilize only the last output layer of BERT and ignore the semantic knowledge in the intermediate layers. This paper explores the potential of utilizing BERT intermediate layers to enhance the performance of fine-tuning BERT. To the best of our knowledge, no existing work has explored this direction. To show generality, we also apply the approach to a natural language inference task. Experimental results demonstrate the effectiveness and generality of the proposed approach.
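A rough sketch of the underlying idea, using HuggingFace's `output_hidden_states=True` to expose BERT's intermediate layers and pooling their [CLS] vectors; mean pooling here stands in for the LSTM- or attention-based pooling the paper studies:

```python
# Sketch: pool [CLS] vectors from every BERT layer instead of only the last one.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
classifier = torch.nn.Linear(bert.config.hidden_size, 3)  # e.g. neg/neu/pos

def predict_sentiment(sentence: str, aspect: str) -> int:
    # Sentence-aspect pair, as in aspect-based sentiment analysis.
    inputs = tokenizer(sentence, aspect, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    # hidden_states: embeddings output plus one tensor per Transformer layer.
    cls_per_layer = torch.stack([h[:, 0] for h in outputs.hidden_states[1:]])
    pooled = cls_per_layer.mean(dim=0)   # (batch, hidden); simplification
    return int(classifier(pooled).argmax(dim=-1))

print(predict_sentiment("The battery life is great but the screen is dim.",
                        "battery life"))
```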
4. ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems [PDF]
Qi Zhu, Zheng Zhang, Yan Fang, Xiang Li, Ryuichi Takanobu, Jinchao Li, Baolin Peng, Jianfeng Gao, Xiaoyan Zhu, Minlie Huang
Abstract: We present ConvLab-2, an open-source toolkit that enables researchers to build task-oriented dialogue systems with state-of-the-art models, perform an end-to-end evaluation, and diagnose the weakness of systems. As the successor of ConvLab (Lee et al., 2019b), ConvLab-2 inherits ConvLab's framework but integrates more powerful dialogue models and supports more datasets. Besides, we have developed an analysis tool and an interactive tool to assist researchers in diagnosing dialogue systems. The analysis tool presents rich statistics and summarizes common mistakes from simulated dialogues, which facilitates error analysis and system improvement. The interactive tool provides a user interface that allows developers to diagnose an assembled dialogue system by interacting with the system and modifying the output of each system component.
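A hedged usage sketch of assembling a pipeline system with the toolkit, paraphrasing ConvLab-2's published tutorial as best recalled; the import paths and constructor arguments are assumptions and may differ between releases:

```python
# Assumed ConvLab-2 usage (module paths recalled from the project's tutorial;
# verify against the installed version before relying on them).
from convlab2.nlu.jointBERT.multiwoz import BERTNLU
from convlab2.dst.rule.multiwoz import RuleDST
from convlab2.policy.rule.multiwoz import RulePolicy
from convlab2.nlg.template.multiwoz import TemplateNLG
from convlab2.dialog_agent import PipelineAgent

sys_nlu = BERTNLU()                    # natural language understanding
sys_dst = RuleDST()                    # dialogue state tracking
sys_policy = RulePolicy()              # dialogue policy
sys_nlg = TemplateNLG(is_user=False)   # natural language generation
sys_agent = PipelineAgent(sys_nlu, sys_dst, sys_policy, sys_nlg, name="sys")

print(sys_agent.response("I am looking for a cheap hotel in the centre."))
```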
5. Two Huge Title and Keyword Generation Corpora of Research Articles [PDF]
Erion Çano, Ondřej Bojar
Abstract: Recent developments in sequence-to-sequence learning with neural networks have considerably improved the quality of automatically generated text summaries and document keywords, stipulating the need for even bigger training corpora. Metadata of research articles are usually easy to find online and can be used to perform research on various tasks. In this paper, we introduce two huge datasets for text summarization (OAGSX) and keyword generation (OAGKX) research, containing 34 million and 23 million records, respectively. The data were retrieved from the Open Academic Graph, which is a network of research profiles and publications. We carefully processed each record and also tried several extractive and abstractive methods for both tasks to create performance baselines for other researchers. We further illustrate the performance of those methods by previewing their outputs. In the near future, we would like to apply topic modeling on the two sets to derive subsets of research articles from more specific disciplines.
6. Adjusting Image Attributes of Localized Regions with Low-level Dialogue [PDF]
Tzu-Hsiang Lin, Alexander Rudnicky, Trung Bui, Doo Soon Kim, Jean Oh
Abstract: Natural Language Image Editing (NLIE) aims to use natural language instructions to edit images. Since novices are inexperienced with image editing techniques, their instructions are often ambiguous and contain high-level abstractions that tend to correspond to complex editing steps. Motivated by this, we aim to smooth the learning curve by teaching novices to edit images using low-level commanding terminology. Towards this end, we develop a task-oriented dialogue system to investigate low-level instructions for NLIE. Our system grounds language at the level of edit operations and suggests options for a user to choose from. Though compelled to express themselves in low-level terms, 25% of users in our evaluation found the system easy to use, resonating with our motivation. An analysis shows that users generally adapt to the proposed low-level language interface. In this study, we identify object segmentation as the key factor in user satisfaction. Our work demonstrates the advantages of a low-level, direct language-action mapping approach that can be applied to problem domains beyond image editing, such as audio editing or industrial design.
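A purely hypothetical illustration of grounding low-level commands in edit operations over segmented regions; the command grammar and attribute names are invented for this sketch and are not taken from the paper:

```python
# Hypothetical grounding of a low-level command into an edit operation
# on a segmented region.
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class EditOp:
    attribute: str   # e.g. "brightness", "saturation"
    region: str      # label of a segmented object, e.g. "dog"
    delta: int       # signed adjustment

def parse_command(utterance: str) -> Optional[EditOp]:
    m = re.match(r"(increase|decrease) (\w+) of the (\w+) by (\d+)", utterance)
    if m is None:
        return None  # a real system would instead suggest valid options
    sign = 1 if m.group(1) == "increase" else -1
    return EditOp(m.group(2), m.group(3), sign * int(m.group(4)))

print(parse_command("increase brightness of the dog by 20"))
```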
7. Constructing a Highlight Classifier with an Attention Based LSTM Neural Network [PDF]
Michael Kuehne, Marius Radu
Abstract: Data is being produced in larger quantities than ever before in human history. It is only natural to expect a rise in demand for technology that aids humans in sifting through and analyzing this inexhaustible supply of information. This need exists in the market research industry, where large amounts of consumer research data are collected through video recordings. At present, the standard method for analyzing video data is human labor. Market researchers manually review the vast majority of consumer research video in order to identify relevant portions - highlights. The industry state-of-the-art turnaround ratio is 2.2 - for every hour of video content, 2.2 hours of manpower are required. In this study we present a novel approach for NLP-based highlight identification and extraction based on a supervised learning model that aids market researchers in sifting through their data. Our approach hinges on manually curated, user-generated highlight clips constructed from long and short-form video data. The problem is well suited to an NLP approach due to the availability of video transcription. We evaluate multiple classes of models, from gradient boosting to recurrent neural networks, comparing their performance in extraction and identification of highlights. The best performing models are then evaluated using four sampling methods designed to analyze documents much larger than the maximum input length of the classifiers. We report very high performance for the standalone classifiers, with ROC AUC scores in the range 0.93-0.94, but observe a significant drop in effectiveness when evaluated on large documents. Based on our results we suggest combinations of models and sampling algorithms for various use cases.
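A generic attention-over-LSTM classifier sketch for the highlight / non-highlight decision, written in PyTorch; the dimensions and the attention form are assumptions rather than the paper's exact architecture:

```python
# Generic attention-based BiLSTM text classifier (illustrative dimensions).
import torch
import torch.nn as nn

class AttentionLSTMClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scalar attention score per token
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(token_ids))           # (B, T, 2H)
        weights = torch.softmax(self.attn(h), dim=1)      # (B, T, 1)
        context = (weights * h).sum(dim=1)                # (B, 2H)
        return torch.sigmoid(self.out(context)).squeeze(-1)  # P(highlight)

model = AttentionLSTMClassifier(vocab_size=30_000)
probs = model(torch.randint(0, 30_000, (4, 120)))  # 4 transcript snippets
print(probs.shape)
```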
8. Attentional Speech Recognition Models Misbehave on Out-of-domain Utterances [PDF]
Phillip Keung, Wei Niu, Yichao Lu, Julian Salazar, Vikas Bhardwaj
Abstract: We discuss the problem of echographic transcription in autoregressive sequence-to-sequence attentional architectures for automatic speech recognition, where a model produces very long sequences of repetitive outputs when presented with out-of-domain utterances. We decode audio from the British National Corpus with an attentional encoder-decoder model trained solely on the LibriSpeech corpus. We observe that there are many 5-second recordings that produce more than 500 characters of decoding output (i.e. more than 100 characters per second). A frame-synchronous hybrid (DNN-HMM) model trained on the same data does not produce these unusually long transcripts. These decoding issues are reproducible in a speech transformer model from ESPnet, and to a lesser extent in a self-attention CTC model, suggesting that these issues are intrinsic to the use of the attention mechanism. We create a separate length prediction model to predict the correct number of wordpieces in the output, which allows us to identify and truncate problematic decoding results without increasing word error rates on the LibriSpeech task.
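A minimal sketch of the truncation step under the assumption that a separate model predicts the expected wordpiece count; the stub length predictor below is illustrative only, not the paper's model:

```python
# Sketch: cap an attentional decoder's output at a predicted wordpiece count.
from typing import List

def predict_wordpiece_count(audio_seconds: float) -> int:
    # Stub predictor: assume roughly 3 wordpieces per second of speech.
    return max(1, round(3 * audio_seconds))

def truncate_hypothesis(wordpieces: List[str], audio_seconds: float) -> List[str]:
    limit = predict_wordpiece_count(audio_seconds)
    return wordpieces if len(wordpieces) <= limit else wordpieces[:limit]

# A runaway repetitive decode on a 5-second out-of-domain recording...
looping = ["▁the", "▁the", "▁the"] * 200
print(len(truncate_hypothesis(looping, audio_seconds=5.0)))  # 15
```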
9. DeepMutation: A Neural Mutation Tool [PDF]
Michele Tufano, Jason Kimko, Shiya Wang, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Denys Poshyvanyk
Abstract: Mutation testing can be used to assess the fault-detection capabilities of a given test suite. To this aim, two characteristics of mutation testing frameworks are of paramount importance: (i) they should generate mutants that are representative of real faults; and (ii) they should provide a complete tool chain able to automatically generate, inject, and test the mutants. To address the first point, we recently proposed an approach using a Recurrent Neural Network Encoder-Decoder architecture to learn mutants from ~787k faults mined from real programs. The empirical evaluation of this approach confirmed its ability to generate mutants representative of real faults. In this paper, we address the second point, presenting DeepMutation, a tool wrapping our deep learning model into a fully automated tool chain able to generate, inject, and test mutants learned from real faults. Video: this https URL
10. On Layer Normalization in the Transformer Architecture [PDF]
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu
Abstract: The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.
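A small PyTorch sketch contrasting the two placements of layer normalization for a single self-attention sub-layer (the feed-forward sub-layer is analogous); dimensions are illustrative:

```python
# Post-LN vs. Pre-LN residual blocks for one self-attention sub-layer.
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

def post_ln_block(x: torch.Tensor) -> torch.Tensor:
    # Post-LN (original Transformer): LayerNorm after the residual addition.
    return norm(x + attn(x, x, x)[0])

def pre_ln_block(x: torch.Tensor) -> torch.Tensor:
    # Pre-LN: LayerNorm inside the residual branch, before attention; per the
    # paper, gradients near the output are well-behaved at initialization,
    # so the learning-rate warm-up stage can be removed.
    y = norm(x)
    return x + attn(y, y, y)[0]

x = torch.randn(2, 16, d_model)
print(post_ln_block(x).shape, pre_ln_block(x).shape)
```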
11. Superbloom: Bloom filter meets Transformer [PDF]
John Anderson, Qingqing Huang, Walid Krichene, Steffen Rendle, Li Zhang
Abstract: We extend the idea of word pieces in natural language models to machine learning tasks on opaque ids. This is achieved by applying hash functions to map each id to multiple hash tokens in a much smaller space, similarly to a Bloom filter. We show that by applying a multi-layer Transformer to these Bloom filter digests, we are able to obtain models with high accuracy. They outperform models of a similar size without hashing and, to a large degree, models of a much larger size trained using sampled softmax with the same computational budget. Our key observation is that it is important to use a multi-layer Transformer for Bloom filter digests to remove ambiguity in the hashed input. We believe this provides an alternative method to solving problems with large vocabulary size.
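A short sketch of the Bloom-filter-style hashing: each opaque id is mapped to several tokens in a much smaller hashed vocabulary, which a multi-layer Transformer then consumes and disambiguates; the hash function and the number of hashes are assumptions:

```python
# Map each opaque id to k tokens in a small hashed vocabulary (Bloom-filter style).
import hashlib

HASH_VOCAB = 50_000   # much smaller than the original id space
NUM_HASHES = 2        # hash tokens per id (assumed value)

def hash_tokens(item_id: str, k: int = NUM_HASHES, vocab: int = HASH_VOCAB):
    tokens = []
    for seed in range(k):
        digest = hashlib.md5(f"{seed}:{item_id}".encode()).hexdigest()
        tokens.append(int(digest, 16) % vocab)
    return tokens

# Two distinct ids rarely collide on *all* of their hash tokens, which is why
# a multi-layer Transformer over the token sequence can resolve the ambiguity.
print(hash_tokens("user_98761234"), hash_tokens("user_98761235"))
```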
12. A Non-Intrusive Correction Algorithm for Classification Problems with Corrupted Data [PDF]
Jun Hou, Tong Qin, Kailiang Wu, Dongbin Xiu
Abstract: A novel correction algorithm is proposed for multi-class classification problems with corrupted training data. The algorithm is non-intrusive, in the sense that it post-processes a trained classification model by adding a correction procedure to the model prediction. The correction procedure can be coupled with any approximators, such as logistic regression, neural networks of various architectures, etc. When training dataset is sufficiently large, we prove that the corrected models deliver correct classification results as if there is no corruption in the training data. For datasets of finite size, the corrected models produce significantly better recovery results, compared to the models without the correction algorithm. All of the theoretical findings in the paper are verified by our numerical examples.