Contents
1. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning [PDF] Abstract
2. Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation [PDF] Abstract
3. CodeBERT: A Pre-Trained Model for Programming and Natural Languages [PDF] Abstract
4. Hierarchical models vs. transfer learning for document-level sentiment classification [PDF] Abstract
5. Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [PDF] Abstract
6. LAMBERT: Layout-Aware language Modeling using BERT for information extraction [PDF] Abstract
7. Toward Making the Most of Context in Neural Machine Translation [PDF] Abstract
8. The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding [PDF] Abstract
9. Studying the Effects of Cognitive Biases in Evaluation of Conversational Agents [PDF] Abstract
10. Transfer Learning for Abstractive Summarization at Controllable Budgets [PDF] Abstract
11. VQA-LOL: Visual Question Answering under the Lens of Logic [PDF] Abstract
12. A Differential-form Pullback Programming Language for Higher-order Reverse-mode Automatic Differentiation [PDF] Abstract
13. Tree-structured Attention with Hierarchical Accumulation [PDF] Abstract
14. Non-Autoregressive Dialog State Tracking [PDF] Abstract

Abstracts
1. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning [PDF] Back to contents
Mitchell A. Gordon, Kevin Duh, Nicholas Andrews
Abstract: Universal feature extractors, such as BERT for natural language processing and VGG for computer vision, have become effective methods for improving deep learning models without requiring more labeled data. A common paradigm is to pre-train a feature extractor on large amounts of data then fine-tune it as part of a deep learning model on some downstream task (i.e. transfer learning). While effective, feature extractors like BERT may be prohibitively large for some deployment scenarios. We explore weight pruning for BERT and ask: how does compression during pre-training affect transfer learning? We find that pruning affects transfer learning in three broad regimes. Low levels of pruning (30-40%) do not affect pre-training loss or transfer to downstream tasks at all. Medium levels of pruning increase the pre-training loss and prevent useful pre-training information from being transferred to downstream tasks. High levels of pruning additionally prevent models from fitting downstream datasets, leading to further degradation. Finally, we observe that fine-tuning BERT on a specific task does not improve its prunability. We conclude that BERT can be pruned once during pre-training rather than separately for each task without affecting performance.
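As a concrete illustration of the kind of weight pruning studied here, the sketch below applies unstructured L1 (magnitude) pruning to the Linear layers of a pre-trained BERT encoder using PyTorch's pruning utilities. The 40% sparsity level and the restriction to Linear layers are illustrative assumptions, not necessarily the authors' exact protocol.

```python
# Minimal magnitude-pruning sketch (illustrative; not the paper's exact setup).
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Prune 40% of the smallest-magnitude weights in every Linear layer of the encoder.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.4)
        prune.remove(module, "weight")  # make the zeroed weights permanent

sparsity = sum((p == 0).sum().item() for p in model.parameters()) / \
           sum(p.numel() for p in model.parameters())
print(f"overall sparsity: {sparsity:.2%}")
```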
2. Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation [PDF] Back to contents
Aman Shenoy, Ashish Sardana
Abstract: Sentiment Analysis and Emotion Detection in conversation is key in a number of real-world applications, with different applications leveraging different kinds of data to be able to achieve reasonably accurate predictions. Multimodal Emotion Detection and Sentiment Analysis can be particularly useful as applications will be able to use specific subsets of the available modalities, as per their available data, to be able to produce relevant predictions. Current systems dealing with Multimodal functionality fail to leverage and capture the context of the conversation through all modalities, the current speaker and listener(s) in the conversation, and the relevance and relationship between the available modalities through an adequate fusion mechanism. In this paper, we propose a recurrent neural network architecture that attempts to take into account all the mentioned drawbacks, and keeps track of the context of the conversation, interlocutor states, and the emotions conveyed by the speakers in the conversation. Our proposed model outperforms the state of the art on two benchmark datasets on a variety of accuracy and regression metrics. Our model implementation is public and can be found at this http URL
3. CodeBERT: A Pre-Trained Model for Programming and Natural Languages [PDF] Back to contents
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou
Abstract: We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.
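The pre-training signal highlighted in the abstract is replaced token detection. The following self-contained sketch shows the shape of that objective: a small generator proposes plausible substitutes at masked positions and a discriminator labels every token as original or replaced. The tiny modules, vocabulary size, and 15% masking rate are hypothetical stand-ins, not CodeBERT's actual architecture.

```python
# Sketch of a replaced-token-detection (RTD) objective; all sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 50000, 256
generator = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
discriminator_backbone = nn.Sequential(nn.Embedding(vocab_size, hidden),
                                        nn.Linear(hidden, hidden), nn.GELU())
replaced_head = nn.Linear(hidden, 1)   # per-token "original vs. replaced" classifier

def rtd_loss(input_ids, mask_positions):
    # 1) The generator proposes plausible alternatives at the masked positions.
    with torch.no_grad():
        gen_logits = generator(input_ids)                                  # (B, T, V)
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask_positions, sampled, input_ids)
    # 2) The discriminator labels every token as original (0) or replaced (1).
    labels = (corrupted != input_ids).float()
    logits = replaced_head(discriminator_backbone(corrupted)).squeeze(-1)  # (B, T)
    return F.binary_cross_entropy_with_logits(logits, labels)

ids = torch.randint(0, vocab_size, (2, 16))
mask = torch.rand(2, 16) < 0.15
print(rtd_loss(ids, mask))
```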
4. Hierarchical models vs. transfer learning for document-level sentiment classification [PDF] Back to contents
Jeremy Barnes, Vinit Ravishankar, Lilja Øvrelid, Erik Velldal
Abstract: Documents are composed of smaller pieces - paragraphs, sentences, and tokens - that have complex relationships between one another. Sentiment classification models that take into account the structure inherent in these documents have a theoretical advantage over those that do not. At the same time, transfer learning models based on language model pretraining have shown promise for document classification. However, these two paradigms have not been systematically compared and it is not clear under which circumstances one approach is better than the other. In this work we empirically compare hierarchical models and transfer learning for document-level sentiment classification. We show that non-trivial hierarchical models outperform previous baselines and transfer learning on document-level sentiment classification in five languages.
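For readers unfamiliar with the hierarchical side of this comparison, the sketch below shows a minimal hierarchical document classifier: a sentence-level encoder whose outputs are fed to a document-level encoder. All dimensions and the choice of GRUs are illustrative; the paper's models are more elaborate.

```python
# Minimal hierarchical document classifier (illustrative dimensions).
import torch
import torch.nn as nn

class HierarchicalSentiment(nn.Module):
    def __init__(self, vocab=30000, emb=128, hidden=128, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb, padding_idx=0)
        self.sent_rnn = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)
        self.doc_rnn = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, docs):                        # docs: (B, n_sents, n_tokens)
        B, S, T = docs.shape
        tokens = self.embed(docs.view(B * S, T))    # encode each sentence independently
        _, h = self.sent_rnn(tokens)                # h: (2, B*S, hidden)
        sent_vecs = torch.cat([h[0], h[1]], dim=-1).view(B, S, -1)
        doc_states, _ = self.doc_rnn(sent_vecs)     # contextualize sentences in the document
        return self.classifier(doc_states.mean(dim=1))

logits = HierarchicalSentiment()(torch.randint(1, 30000, (2, 5, 20)))
print(logits.shape)                                 # torch.Size([2, 3])
```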
5. Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [PDF] Back to contents
Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Jianhua Tao, Ye Bai
Abstract: Recently, language identity information has been utilized to improve the performance of end-to-end code-switching (CS) speech recognition. However, previous works use an additional language identification (LID) model as an auxiliary module, which makes the system more complex. In this work, we propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the problem. We use the language identities to bias the model to predict the CS points. This promotes the model to learn the language identity information directly from transcription, and no additional LID model is needed. We evaluate the approach on a Mandarin-English CS corpus SEAME. Compared to our RNN-T baseline, the proposed method can achieve 16.2% and 12.9% relative error reduction on two test sets, respectively.
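One way to realise the "language bias" idea is to insert language-identity tags directly into the training transcripts, so that the model learns to predict code-switch points as part of transcription. The helper below is only an illustration under assumed tag names (<zh>, <en>) and a crude character-range language test; it is not the authors' exact preprocessing.

```python
# Illustration: bias transcripts with language-identity tags at switch points.
def tag_language_switches(tokens):
    def lang(tok):
        # crude test: any CJK character -> Mandarin, otherwise English (assumption)
        return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in tok) else "en"
    tagged, prev = [], None
    for tok in tokens:
        cur = lang(tok)
        if cur != prev:                 # emit a language tag at every switch point
            tagged.append(f"<{cur}>")
            prev = cur
        tagged.append(tok)
    return tagged

print(tag_language_switches(["我", "想", "用", "Google", "Maps", "导航"]))
# ['<zh>', '我', '想', '用', '<en>', 'Google', 'Maps', '<zh>', '导航']
```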
6. LAMBERT: Layout-Aware language Modeling using BERT for information extraction [PDF] Back to contents
Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Filip Graliński
Abstract: In this paper we introduce a novel approach to the problem of understanding documents where the local semantics is influenced by non-trivial layout. Namely, we modify the Transformer architecture in a way that allows it to use the graphical features defined by the layout, without the need to re-learn the language semantics from scratch, thanks to starting the training process from a model pretrained on classical language modeling tasks.
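A minimal sketch of the general idea, injecting layout (bounding-box) features into the input embeddings of a pretrained language model, is given below. The linear projection of normalized boxes and the simple additive combination are assumptions for illustration, not LAMBERT's precise formulation.

```python
# Sketch: add projected bounding-box features to the token embeddings of a
# pretrained model (illustrative; not LAMBERT's exact architecture).
import torch
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
layout_proj = nn.Linear(4, bert.config.hidden_size)      # (x0, y0, x1, y1) -> hidden

def forward_with_layout(input_ids, bboxes, attention_mask=None):
    # bboxes: (B, T, 4) box coordinates normalized to [0, 1]
    token_emb = bert.embeddings.word_embeddings(input_ids)
    inputs_embeds = token_emb + layout_proj(bboxes)       # graphical features from the layout
    return bert(inputs_embeds=inputs_embeds, attention_mask=attention_mask)

ids = torch.randint(0, bert.config.vocab_size, (1, 8))
boxes = torch.rand(1, 8, 4)
out = forward_with_layout(ids, boxes)
print(out.last_hidden_state.shape)                        # torch.Size([1, 8, 768])
```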
7. Toward Making the Most of Context in Neural Machine Translation [PDF] Back to contents
Zaixiang Zheng, Xiang Yue, Shujian Huang, Jiajun Chen, Alexandra Birch
Abstract: Document-level machine translation manages to outperform sentence-level models by a small margin, but it has failed to be widely adopted. We argue that previous research did not make a clear use of the global context, and propose a new document-level NMT framework that deliberately models the local context of each sentence with the awareness of the global context of the document in both source and target languages. We specifically design the model to be able to deal with documents containing any number of sentences, including single sentences. This unified approach allows our model to be trained elegantly on standard datasets without needing to train on sentence and document level data separately. Experimental results demonstrate that our model outperforms Transformer baselines and previous document-level NMT models by substantial margins of up to 2.1 BLEU over state-of-the-art baselines. We also provide analyses which show the benefit of context far beyond the neighboring two or three sentences, which previous studies have typically incorporated.
8. The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding [PDF] Back to contents
Xiaodong Liu, Yu Wang, Jianshu Ji, Hao Cheng, Xueyun Zhu, Emmanuel Awa, Pengcheng He, Weizhu Chen, Hoifung Poon, Guihong Cao, Jianfeng Gao
Abstract: We present MT-DNN, an open-source natural language understanding (NLU) toolkit that makes it easy for researchers and developers to train customized deep learning models. Built upon PyTorch and Transformers, MT-DNN is designed to facilitate rapid customization for a broad spectrum of NLU tasks, using a variety of objectives (classification, regression, structured prediction) and text encoders (e.g., RNNs, BERT, RoBERTa, UniLM). A unique feature of MT-DNN is its built-in support for robust and transferable learning using the adversarial multi-task learning paradigm. To enable efficient production deployment, MT-DNN supports multi-task knowledge distillation, which can substantially compress a deep neural model without significant performance drop. We demonstrate the effectiveness of MT-DNN on a wide range of NLU applications across general and biomedical domains. The software and pre-trained models will be publicly available at this https URL.
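The core architectural pattern behind such a toolkit, one shared text encoder with task-specific heads, can be sketched in a few lines of plain PyTorch/transformers. This is explicitly not the MT-DNN API, only an illustration of the multi-task setup it supports.

```python
# Generic multi-task pattern: shared encoder + per-task heads (not the MT-DNN API).
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiTaskModel(nn.Module):
    def __init__(self, task_specs, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)   # shared across tasks
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, n) for t, n in task_specs.items()})

    def forward(self, task, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]         # [CLS] representation
        return self.heads[task](cls)              # classification or regression head

model = MultiTaskModel({"mnli": 3, "sst2": 2, "stsb": 1})
ids = torch.randint(0, model.encoder.config.vocab_size, (2, 16))
print(model("mnli", ids).shape)                   # torch.Size([2, 3])
```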
9. Studying the Effects of Cognitive Biases in Evaluation of Conversational Agents [PDF] Back to contents
Sashank Santhanam, Alireza Karduni, Samira Shaikh
Abstract: Humans quite frequently interact with conversational agents. The rapid advancement in generative language modeling through neural networks has helped advance the creation of intelligent conversational agents. Researchers typically evaluate the output of their models through crowdsourced judgments, but there are no established best practices for conducting such studies. Moreover, it is unclear if cognitive biases in decision-making are affecting crowdsourced workers' judgments when they undertake these tasks. To investigate, we conducted a between-subjects study with 77 crowdsourced workers to understand the role of cognitive biases, specifically anchoring bias, when humans are asked to evaluate the output of conversational agents. Our results provide insight into how best to evaluate conversational agents. We find increased consistency in ratings across two experimental conditions may be a result of anchoring bias. We also determine that external factors such as time and prior experience in similar tasks have effects on inter-rater consistency.
10. Transfer Learning for Abstractive Summarization at Controllable Budgets [PDF] Back to contents
Ritesh Sarkhel, Moniba Keymanesh, Arnab Nandi, Srinivasan Parthasarathy
Abstract: Summarizing a document within an allocated budget while maintaining its major concepts is a challenging task. If the budget can take any arbitrary value and not known beforehand, it becomes even more difficult. Most of the existing methods for abstractive summarization, including state-of-the-art neural networks are data intensive. If the number of available training samples becomes limited, they fail to construct high-quality summaries. We propose MLS, an end-to-end framework to generate abstractive summaries with limited training data at arbitrary compression budgets. MLS employs a pair of supervised sequence-to-sequence networks. The first network called the MFS-Net constructs a minimal feasible summary by identifying the key concepts of the input document. The second network called the Pointer-Magnifier then generates the final summary from the minimal feasible summary by leveraging an interpretable multi-headed attention model. Experiments on two cross-domain datasets show that MLS outperforms baseline methods over a range of success metrics including ROUGE and METEOR. We observed an improvement of approximately 4% in both metrics over the state-of-the-art convolutional network at lower budgets. Results from a human evaluation study also establish the effectiveness of MLS in generating complete coherent summaries at arbitrary compression budgets.
11. VQA-LOL: Visual Question Answering under the Lens of Logic [PDF] Back to contents
Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang
Abstract: Logical connectives and their implications on the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate visual question answering (VQA) through the lens of logical transformation and posit that systems that seek to answer questions about images must be robust to these transformations of the question. If a VQA system is able to answer a question, it should also be able to answer the logical composition of questions. We analyze the performance of state-of-the-art models on the VQA task under these logical operations and show that they have difficulty in correctly answering such questions. We then construct an augmentation of the VQA dataset with questions containing logical operations and retrain the same models to establish a baseline. We further propose a novel methodology to train models to learn negation, conjunction, and disjunction and show improvement in learning logical composition and retaining performance on VQA. We suggest this work as a move towards embedding logical connectives in visual understanding, along with the benefits of robustness and generalizability. Our code and dataset are available online at this https URL
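To make the idea of logically composed questions concrete, here is a toy augmentation sketch that combines two yes/no questions with negation, conjunction, and disjunction and derives the composed answer by the corresponding boolean operation. The string templates are illustrative, not the paper's construction.

```python
# Toy augmentation: compose yes/no VQA questions with logical connectives.
def negate(question, answer):
    return f"Is it false that {question[0].lower()}{question[1:-1]}?", not answer

def conjoin(q1, a1, q2, a2):
    return f"{q1[:-1]} and {q2[0].lower()}{q2[1:-1]}?", a1 and a2

def disjoin(q1, a1, q2, a2):
    return f"{q1[:-1]} or {q2[0].lower()}{q2[1:-1]}?", a1 or a2

q1, a1 = "Is there a dog in the image?", True
q2, a2 = "Is the dog wearing a collar?", False
print(negate(q1, a1))           # ('Is it false that is there a dog in the image?', False)
print(conjoin(q1, a1, q2, a2))  # ('Is there a dog in the image and is the dog wearing a collar?', False)
print(disjoin(q1, a1, q2, a2))  # ('Is there a dog in the image or is the dog wearing a collar?', True)
```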
12. A Differential-form Pullback Programming Language for Higher-order Reverse-mode Automatic Differentiation [PDF] Back to contents
Carol Mak, Luke Ong
Abstract: Building on the observation that reverse-mode automatic differentiation (AD) -- a generalisation of backpropagation -- can naturally be expressed as pullbacks of differential 1-forms, we design a simple higher-order programming language with a first-class differential operator, and present a reduction strategy which exactly simulates reverse-mode AD. We justify our reduction strategy by interpreting our language in any differential $\lambda$-category that satisfies the Hahn-Banach Separation Theorem, and show that the reduction strategy precisely captures reverse-mode AD in a truly higher-order setting.
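For context, the identity the abstract builds on can be stated in one line; this is standard material rather than anything specific to the paper:

```latex
% Reverse-mode AD as a pullback of 1-forms (standard identity, stated for context).
% For smooth $f\colon \mathbb{R}^n \to \mathbb{R}^m$ with Jacobian $J_f(x)$ and a covector
% $w \in (\mathbb{R}^m)^*$, the pullback of the 1-form $\sum_i w_i\, dy_i$ along $f$ is
\[
  f^{*}\Big(\sum_{i} w_i\, dy_i\Big)\Big|_{x}
  \;=\; \sum_{j}\big(w\, J_f(x)\big)_{j}\, dx_j ,
\]
% i.e. the vector-Jacobian product $\bar{x} = J_f(x)^{\top}\bar{y}$ computed by
% reverse-mode AD / backpropagation.
```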
13. Tree-structured Attention with Hierarchical Accumulation [PDF] Back to contents
Xuan-Phi Nguyen, Shafiq Joty, Steven C.H. Hoi, Richard Socher
Abstract: Incorporating hierarchical structures like constituency trees has been shown to be effective for various natural language processing (NLP) tasks. However, it is evident that state-of-the-art (SOTA) sequence-based models like the Transformer struggle to encode such structures inherently. On the other hand, dedicated models like the Tree-LSTM, while explicitly modeling hierarchical structures, do not perform as efficiently as the Transformer. In this paper, we attempt to bridge this gap with "Hierarchical Accumulation" to encode parse tree structures into self-attention at constant time complexity. Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task. It also yields improvements over Transformer and Tree-LSTM on three text classification tasks. We further demonstrate that using hierarchical priors can compensate for data shortage, and that our model prefers phrase-level attentions over token-level attentions.
14. Non-Autoregressive Dialog State Tracking [PDF] Back to contents
Hung Le, Richard Socher, Steven C.H. Hoi
Abstract: Recent efforts in Dialogue State Tracking (DST) for task-oriented dialogues have progressed toward open-vocabulary or generation-based approaches where the models can generate slot value candidates from the dialogue history itself. These approaches have shown good performance gain, especially in complicated dialogue domains with dynamic slot values. However, they fall short in two aspects: (1) they do not allow models to explicitly learn signals across domains and slots to detect potential dependencies among (domain, slot) pairs; and (2) existing models follow auto-regressive approaches which incur high time cost when the dialogue evolves over multiple domains and multiple turns. In this paper, we propose a novel framework of Non-Autoregressive Dialog State Tracking (NADST) which can factor in potential dependencies among domains and slots to optimize the models towards better prediction of dialogue states as a complete set rather than separate slots. In particular, the non-autoregressive nature of our method not only enables decoding in parallel to significantly reduce the latency of DST for real-time dialogue response generation, but also detect dependencies among slots at token level in addition to slot and domain level. Our empirical results show that our model achieves the state-of-the-art joint accuracy across all domains on the MultiWOZ 2.1 corpus, and the latency of our model is an order of magnitude lower than the previous state of the art as the dialogue history extends over time.
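To illustrate what parallel (non-autoregressive) decoding buys over token-by-token generation, the deliberately simplified decoder below predicts every (domain, slot) value in a single forward pass. It illustrates the general idea only; it is not the fertility-based NADST decoder.

```python
# Simplified non-autoregressive (parallel) dialogue-state decoder; sizes are illustrative.
import torch
import torch.nn as nn

class ParallelStateDecoder(nn.Module):
    def __init__(self, hidden=256, vocab=1000, num_slots=30, max_value_len=4):
        super().__init__()
        self.slot_emb = nn.Embedding(num_slots, hidden)        # one query per (domain, slot)
        self.pos_emb = nn.Embedding(max_value_len, hidden)     # positions inside a value
        self.attn = nn.MultiheadAttention(hidden, 4, batch_first=True)
        self.out = nn.Linear(hidden, vocab)
        self.num_slots, self.max_value_len = num_slots, max_value_len

    def forward(self, dialogue_enc):                            # (B, T, H) encoded history
        B = dialogue_enc.size(0)
        slots = self.slot_emb.weight.unsqueeze(0).expand(B, -1, -1)          # (B, S, H)
        pos = self.pos_emb.weight.unsqueeze(0).unsqueeze(0)                  # (1, 1, L, H)
        queries = (slots.unsqueeze(2) + pos).reshape(B, -1, slots.size(-1))  # (B, S*L, H)
        ctx, _ = self.attn(queries, dialogue_enc, dialogue_enc)              # attend over history
        logits = self.out(ctx)                                               # (B, S*L, vocab)
        return logits.view(B, self.num_slots, self.max_value_len, -1)        # all values at once

enc = torch.randn(2, 50, 256)               # encoded dialogue history
print(ParallelStateDecoder()(enc).shape)    # torch.Size([2, 30, 4, 1000])
```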