
[arXiv Papers] Computation and Language 2020-03-19

Contents

1. X-Stance: A Multilingual Multi-Target Dataset for Stance Detection [PDF] Abstract
2. TTTTTackling WinoGrande Schemas [PDF] Abstract
3. Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá [PDF] Abstract
4. Unsupervised Pidgin Text Generation By Pivoting English Data and Self-Training [PDF] Abstract
5. Pre-trained Models for Natural Language Processing: A Survey [PDF] Abstract
6. Gender Representation in Open Source Speech Resources [PDF] Abstract
7. Calibration of Pre-trained Transformers [PDF] Abstract
8. Anchor & Transform: Learning Sparse Representations of Discrete Objects [PDF] Abstract
9. Deliberation Model Based Two-Pass End-to-End Speech Recognition [PDF] Abstract

Abstracts

1. X-Stance: A Multilingual Multi-Target Dataset for Stance Detection [PDF] Back to Contents
  Jannis Vamvas, Rico Sennrich
Abstract: We extract a large-scale stance detection dataset from comments written by candidates of elections in Switzerland. The dataset consists of German, French and Italian text, allowing for a cross-lingual evaluation of stance detection. It contains 67 000 comments on more than 150 political issues (targets). Unlike stance detection models that have specific target issues, we use the dataset to train a single model on all the issues. To make learning across targets possible, we prepend to each instance a natural question that represents the target (e.g. "Do you support X?"). Baseline results from multilingual BERT show that zero-shot cross-lingual and cross-target transfer of stance detection is moderately successful with this approach.
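
The following is a minimal sketch of how such a question-prefixed instance could be fed to multilingual BERT as a sentence pair for binary stance classification. The checkpoint name, label mapping, and German example are illustrative assumptions, not the authors' released setup, and the classification head here is untrained.

```python
# Illustrative sketch (not the authors' code): prepend the target question to the
# candidate's comment and classify FAVOR/AGAINST with multilingual BERT.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # 0 = AGAINST, 1 = FAVOR (assumed mapping)
)  # note: the classification head is randomly initialized here

question = "Befürworten Sie eine Erhöhung des Rentenalters?"        # target phrased as a question
comment = "Die Lebenserwartung steigt, also ist das vertretbar."    # candidate's comment (invented)

# Question and comment go in as a sentence pair, so one model can be trained
# across all targets instead of one model per target.
inputs = tokenizer(question, comment, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("P(FAVOR) =", torch.softmax(logits, dim=-1)[0, 1].item())
```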

2. TTTTTackling WinoGrande Schemas [PDF] Back to Contents
  Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin
Abstract: We applied the T5 sequence-to-sequence model to tackle the AI2 WinoGrande Challenge by decomposing each example into two input text strings, each containing a hypothesis, and using the probabilities assigned to the "entailment" token as a score of the hypothesis. Our first (and only) submission to the official leaderboard yielded 0.7673 AUC on March 13, 2020, which is the best known result at this time and beats the previous state of the art by over five points.
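
A rough sketch of the scoring idea as described: each WinoGrande option is turned into a hypothesis string, and the likelihood T5 assigns to the target word "entailment" serves as that option's score. The prompt format, checkpoint size, and example are assumptions; the submitted system presumably relies on a much larger fine-tuned T5.

```python
# Hedged illustration of entailment-based scoring with T5 (not the authors' code).
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.eval()

def entailment_score(premise: str, hypothesis: str) -> float:
    """Log-probability of generating the target string 'entailment' for this pair."""
    enc = tokenizer(f"mnli premise: {premise} hypothesis: {hypothesis}",  # assumed prompt format
                    return_tensors="pt")
    labels = tokenizer("entailment", return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels supplied, the returned loss is the mean per-token
        # negative log-likelihood of the target sequence.
        loss = model(**enc, labels=labels).loss
    return -loss.item()

sentence = "The trophy doesn't fit into the suitcase because _ is too small."
options = ["the trophy", "the suitcase"]
scores = [entailment_score(sentence, sentence.replace("_", o)) for o in options]
print(options[int(torch.tensor(scores).argmax())])  # pick the higher-scoring option
```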

3. Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá [PDF] Back to Contents
  David Ifeoluwa Adelani, Michael A. Hedderich, Dawei Zhu, Esther van den Berg, Dietrich Klakow
Abstract: The lack of labeled training data has limited the development of natural language processing tools, such as named entity recognition, for many languages spoken in developing countries. Techniques such as distant and weak supervision can be used to create labeled data in a (semi-) automatic way. Additionally, to alleviate some of the negative effects of the errors in automatic annotation, noise-handling methods can be integrated. Pretrained word embeddings are another key component of most neural named entity classifiers. With the advent of more complex contextual word embeddings, an interesting trade-off between model size and performance arises. While these techniques have been shown to work well in high-resource settings, we want to study how they perform in low-resource scenarios. In this work, we perform named entity recognition for Hausa and Yorùbá, two languages that are widely spoken in several developing countries. We evaluate different embedding approaches and show that distant supervision can be successfully leveraged in a realistic low-resource scenario where it can more than double a classifier's performance.
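
To make the distant-supervision step concrete, here is a minimal gazetteer-matching sketch that produces noisy NER labels automatically; the tiny entity lists and the example sentence are invented for illustration and are not the authors' pipeline.

```python
# Gazetteer-based distant supervision for NER: tokens that match an entity list
# get a noisy label automatically; everything else is tagged O.
PER = {"Buhari", "Adesina"}          # person names (assumed gazetteer)
LOC = {"Lagos", "Kano", "Abuja"}     # locations (assumed gazetteer)

def distant_labels(tokens):
    """Assign noisy BIO-style tags by dictionary lookup."""
    tags = []
    for tok in tokens:
        if tok in PER:
            tags.append("B-PER")
        elif tok in LOC:
            tags.append("B-LOC")
        else:
            tags.append("O")   # unmatched tokens are silently (and noisily) tagged O
    return tags

sentence = "Buhari ya ziyarci Lagos jiya".split()   # Hausa-like example (invented)
print(list(zip(sentence, distant_labels(sentence))))
# A noise-handling layer (e.g. a learned confusion matrix) can then model the gap
# between these noisy tags and the true labels during training.
```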

4. Unsupervised Pidgin Text Generation By Pivoting English Data and Self-Training [PDF] Back to Contents
  Ernie Chang, David Ifeoluwa Adelani, Xiaoyu Shen, Vera Demberg
Abstract: West African Pidgin English is a language that is significantly spoken in West Africa, consisting of at least 75 million speakers. Nevertheless, proper machine translation systems and relevant NLP datasets for pidgin English are virtually absent. In this work, we develop techniques targeted at bridging the gap between Pidgin English and English in the context of natural language generation. As a proof of concept, we explore the proposed techniques in the area of data-to-text generation. By building upon the previously released monolingual Pidgin English text and parallel English data-to-text corpus, we hope to build a system that can automatically generate Pidgin English descriptions from structured data. We first train a data-to-English text generation system, before employing techniques in unsupervised neural machine translation and self-training to establish the Pidgin-to-English cross-lingual alignment. The human evaluation performed on the generated Pidgin texts shows that, though still far from being practically usable, the pivoting + self-training technique improves both Pidgin text fluency and relevance.
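
A schematic sketch of the pivoting-plus-self-training loop under the setup described above: generate English from structured data, translate it roughly into Pidgin, keep only confident pairs, and retrain on them. All functions here are hypothetical placeholders standing in for trained models; only the control flow is meaningful.

```python
# Self-training over a pseudo-parallel English-Pidgin corpus (schematic sketch).
def self_train(structured_data, generate_english, translate_to_pidgin,
               retrain_translator, score, rounds=3, threshold=0.5):
    """Iteratively grow a pseudo-parallel English-Pidgin corpus."""
    english = [generate_english(d) for d in structured_data]        # pivot via English
    pseudo_parallel = []
    for _ in range(rounds):
        candidates = [(en, translate_to_pidgin(en)) for en in english]
        # Keep only confident translations to limit noise amplification.
        kept = [pair for pair in candidates if score(pair) >= threshold]
        pseudo_parallel = kept
        translate_to_pidgin = retrain_translator(kept)
    return pseudo_parallel

# Toy usage with trivial stand-ins, just to show the control flow:
corpus = self_train(
    structured_data=[{"name": "Eko Hotel", "area": "Lagos"}],
    generate_english=lambda d: f"{d['name']} is located in {d['area']}.",
    translate_to_pidgin=lambda en: en,                   # placeholder "translator"
    retrain_translator=lambda pairs: (lambda en: en),    # placeholder "retraining"
    score=lambda pair: 1.0,                              # placeholder confidence
)
print(corpus)
```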

5. Pre-trained Models for Natural Language Processing: A Survey [PDF] Back to Contents
  Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, Xuanjing Huang
Abstract: Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy with four perspectives. Next, we describe how to adapt the knowledge of PTMs to the downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.

6. Gender Representation in Open Source Speech Resources [PDF] Back to Contents
  Mahault Garnerin, Solange Rossato, Laurent Besacier
Abstract: With the rise of artificial intelligence (AI) and the growing use of deep-learning architectures, the question of ethics, transparency and fairness of AI systems has become a central concern within the research community. We address transparency and fairness in spoken language systems by proposing a study about gender representation in speech resources available through the Open Speech and Language Resource platform. We show that finding gender information in open source corpora is not straightforward and that gender balance depends on other corpus characteristics (elicited/non elicited speech, low/high resource language, speech task targeted). The paper ends with recommendations about metadata and gender information for researchers in order to assure better transparency of the speech systems built using such corpora.
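
As an illustration of the kind of audit the study performs, a short sketch that computes gender balance from speaker metadata, which is often incomplete or missing. The records below are invented; real OpenSLR corpora expose this information inconsistently.

```python
# Compute the share of speech attributed to female speakers from corpus metadata.
records = [
    {"speaker": "spk01", "gender": "F", "minutes": 42.0},
    {"speaker": "spk02", "gender": "M", "minutes": 118.5},
    {"speaker": "spk03", "gender": None, "minutes": 12.0},   # missing metadata
]

labelled = [r for r in records if r["gender"] in ("F", "M")]
total = sum(r["minutes"] for r in labelled)
female = sum(r["minutes"] for r in labelled if r["gender"] == "F")
print(f"speakers with gender info: {len(labelled)}/{len(records)}")
print(f"female share of labelled speech: {female / total:.1%}")
```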

7. Calibration of Pre-trained Transformers [PDF] Back to Contents
  Shrey Desai, Greg Durrett
Abstract: Pre-trained Transformers are now ubiquitous in natural language processing, but despite their high end-task performance, little is known empirically about whether they are calibrated. Specifically, do these models' posterior probabilities provide an accurate empirical measure of how likely the model is to be correct on a given example? We focus on BERT and RoBERTa in this work, and analyze their calibration across three tasks: natural language inference, paraphrase detection, and commonsense reasoning. For each task, we consider in-domain as well as challenging out-of-domain settings, where models face more examples they should be uncertain about. We show that: (1) when used out-of-the-box, pre-trained models are calibrated in-domain, and compared to baselines, their calibration error out-of-domain can be as much as 3.5x lower; (2) temperature scaling is effective at further reducing calibration error in-domain, and using label smoothing to deliberately increase empirical uncertainty helps calibrate posteriors out-of-domain.
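
A hedged sketch of the two measurement tools involved: expected calibration error (ECE) over confidence bins, and post-hoc temperature scaling chosen to minimize held-out negative log-likelihood. The logits below are synthetic stand-ins for a model's dev-set outputs.

```python
# ECE and temperature scaling on synthetic logits (illustration only).
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 3)) * 3.0          # synthetic dev-set logits
labels = rng.integers(0, 3, size=1000)             # synthetic gold labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ece(probs, labels, bins=10):
    """Expected calibration error: |accuracy - confidence| averaged over bins."""
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    total = 0.0
    for lo in np.linspace(0.0, 1.0, bins, endpoint=False):
        mask = (conf >= lo) & (conf < lo + 1.0 / bins)
        if mask.any():
            total += mask.mean() * abs((pred[mask] == labels[mask]).mean() - conf[mask].mean())
    return total

def nll(probs, labels):
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

# Temperature scaling: pick the single scalar T that minimizes dev-set NLL.
temps = np.linspace(0.5, 5.0, 91)
best_T = min(temps, key=lambda T: nll(softmax(logits / T), labels))
print("ECE before:", round(ece(softmax(logits), labels), 4))
print("ECE after T =", round(best_T, 2), ":", round(ece(softmax(logits / best_T), labels), 4))
```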

8. Anchor & Transform: Learning Sparse Representations of Discrete Objects [PDF] Back to Contents
  Paul Pu Liang, Manzil Zaheer, Yuan Wang, Amr Ahmed
Abstract: Learning continuous representations of discrete objects such as text, users, and URLs lies at the heart of many applications including language and user modeling. When using discrete objects as input to neural networks, we often ignore the underlying structures (e.g. natural groupings and similarities) and embed the objects independently into individual vectors. As a result, existing methods do not scale to large vocabulary sizes. In this paper, we design a Bayesian nonparametric prior for embeddings that encourages sparsity and leverages natural groupings among objects. We derive an approximate inference algorithm based on Small Variance Asymptotics which yields a simple and natural algorithm for learning a small set of anchor embeddings and a sparse transformation matrix. We call our method Anchor & Transform (ANT) as the embeddings of discrete objects are a sparse linear combination of the anchors, weighted according to the transformation matrix. ANT is scalable, flexible, end-to-end trainable, and allows the user to incorporate domain knowledge about object relationships. On text classification and language modeling benchmarks, ANT demonstrates stronger performance with fewer parameters as compared to existing compression baselines.
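
A toy numerical illustration of the representation itself (not the Bayesian nonparametric inference procedure): each object's embedding is a sparse combination of a small anchor set, so only the anchors and the nonzero transformation weights need to be stored. Dimensions and sparsity level are assumed.

```python
# Anchor-and-transform style parameterization: E = T @ A with sparse T.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, num_anchors, dim = 10_000, 64, 128

A = rng.normal(size=(num_anchors, dim))              # anchor embeddings
T = rng.random(size=(vocab_size, num_anchors))       # transformation weights
T[T < 0.95] = 0.0                                    # enforce ~5% sparsity

E = T @ A                                            # implicit full embedding table
dense_params = vocab_size * dim
ant_params = num_anchors * dim + int((T != 0).sum()) # anchors + nonzero weights
print(f"embedding for object 7: {E[7][:4]} ...")
print(f"dense table: {dense_params:,} params, anchor-based: {ant_params:,} params")
```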

9. Deliberation Model Based Two-Pass End-to-End Speech Recognition [PDF] Back to Contents
  Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prabhavalkar
Abstract: End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves 12% relative WER reduction compared to LAS rescoring in Google Voice Search (VS) tasks, and 23% reduction on a proper noun test set. Compared to a large conventional model, our best model performs 21% relatively better for VS. In terms of computational complexity, the deliberation decoder has a larger size than the LAS decoder, and hence requires more computations in second-pass decoding.
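
A structural sketch of a single deliberation-decoder step that attends to both the acoustic encoder output and a bidirectional encoding of the first-pass hypothesis. Layer types, sizes, and shapes are assumed for illustration; the actual system is an LAS-based model rather than this toy module.

```python
# One deliberation step: the decoder state queries acoustics and the first-pass text.
import torch
import torch.nn as nn

d_model = 256
acoustic = torch.randn(1, 120, d_model)              # encoder output: (batch, frames, d)
first_pass_ids = torch.randint(0, 1000, (1, 12))     # first-pass hypothesis tokens

embed = nn.Embedding(1000, d_model)
hyp_encoder = nn.LSTM(d_model, d_model // 2, bidirectional=True, batch_first=True)
attn_acoustic = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
attn_hypothesis = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

hyp_states, _ = hyp_encoder(embed(first_pass_ids))   # (1, 12, d_model)

decoder_state = torch.randn(1, 1, d_model)           # stand-in for the current decoder state
ctx_acoustic, _ = attn_acoustic(decoder_state, acoustic, acoustic)
ctx_hypothesis, _ = attn_hypothesis(decoder_state, hyp_states, hyp_states)
fused = torch.cat([decoder_state, ctx_acoustic, ctx_hypothesis], dim=-1)
print(fused.shape)  # (1, 1, 3 * d_model), which would feed the output projection
```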
