Table of Contents
2. Neural Machine Translation for Extremely Low-Resource African Languages: A Case Study on Bambara [PDF] Abstract
3. Towards Interpretable Natural Language Understanding with Explanations as Latent Variables [PDF] Abstract
7. UmBERTo-MTSA @ AcCompl-It: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations [PDF] Abstract
9. Does Social Support Expressed in Post Titles Elicit Comments in Online Substance Use Recovery Forums? [PDF] Abstract
10. Translating Similar Languages: Role of Mutual Intelligibility in Multilingual Transformers [PDF] Abstract
11. To What Degree Can Language Borders Be Blurred In BERT-based Multilingual Spoken Language Understanding? [PDF] Abstract
14. Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning [PDF] Abstract
16. Natural Language Inference in Context -- Investigating Contextual Reasoning over Long Texts [PDF] Abstract
17. Simultaneous Speech-to-Speech Translation System with Neural Incremental ASR, MT, and TTS [PDF] Abstract
26. Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis [PDF] Abstract
Abstracts
1. DoLFIn: Distributions over Latent Features for Interpretability [PDF] Back to Contents
Phong Le, Willem Zuidema
Abstract: Interpreting the inner workings of neural models is a key step in ensuring the robustness and trustworthiness of the models, but work on neural network interpretability typically faces a trade-off: either the models are too constrained to be very useful, or the solutions found by the models are too complex to interpret. We propose a novel strategy for achieving interpretability that -- in our experiments -- avoids this trade-off. Our approach builds on the success of using probability as the central quantity, such as, for instance, within the attention mechanism. In our architecture, DoLFIn (Distributions over Latent Features for Interpretability), we do not determine beforehand what each feature represents, and features go altogether into an unordered set. Each feature has an associated probability ranging from 0 to 1, weighing its importance for further processing. We show that, unlike attention and saliency map approaches, this set-up makes it straightforward to compute the probability with which an input component supports the decision the neural model makes. To demonstrate the usefulness of the approach, we apply DoLFIn to text classification, and show that DoLFIn not only provides interpretable solutions, but even slightly outperforms the classical CNN and BiLSTM text classifiers on the SST2 and AG-news datasets.
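As a rough illustration of the mechanism described in the abstract (an unordered set of latent features, each carrying a probability in [0, 1] that weighs its contribution), here is a minimal PyTorch-style sketch; the layer name, the pooling choice, and the projections are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (assumed, not the authors' code): latent features,
# each weighted by an associated probability in [0, 1] before further use.
import torch
import torch.nn as nn

class ProbWeightedFeatures(nn.Module):
    def __init__(self, hidden_dim: int, num_features: int):
        super().__init__()
        self.value_proj = nn.Linear(hidden_dim, num_features)  # feature values
        self.prob_proj = nn.Linear(hidden_dim, num_features)   # feature probabilities

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_dim)
        pooled = token_states.mean(dim=1)              # crude pooling over tokens (assumption)
        probs = torch.sigmoid(self.prob_proj(pooled))  # each feature's probability in [0, 1]
        return probs * self.value_proj(pooled)         # probability-weighted feature set
```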
2. Neural Machine Translation for Extremely Low-Resource African Languages: A Case Study on Bambara [PDF] Back to Contents
Allahsera Auguste Tapo, Bakary Coulibaly, Sébastien Diarra, Christopher Homan, Julia Kreutzer, Sarah Luger, Arthur Nagashima, Marcos Zampieri, Michael Leventhal
Abstract: Low-resource languages present unique challenges to (neural) machine translation. We discuss the case of Bambara, a Mande language for which training data is scarce and requires significant amounts of pre-processing. More than the linguistic situation of Bambara itself, the socio-cultural context within which Bambara speakers live poses challenges for automated processing of this language. In this paper, we present the first parallel data set for machine translation of Bambara into and from English and French and the first benchmark results on machine translation to and from Bambara. We discuss challenges in working with low-resource languages and propose strategies to cope with data scarcity in low-resource machine translation (MT).
3. Towards Interpretable Natural Language Understanding with Explanations as Latent Variables [PDF] Back to Contents
Wangchunshu Zhou, Jinyi Hu, Hanlin Zhang, Xiaodan Liang, Maosong Sun, Chenyan Xiong, Jian Tang
Abstract: Recently, generating natural language explanations has shown very promising results in not only offering interpretable explanations but also providing additional information and supervision for prediction. However, existing approaches usually require a large set of human annotated explanations for training, while collecting a large set of explanations is not only time consuming but also expensive. In this paper, we develop a general framework for interpretable natural language understanding that requires only a small set of human annotated explanations for training. Our framework treats natural language explanations as latent variables that model the underlying reasoning process of a neural model. We develop a variational EM framework for optimization where an explanation generation module and an explanation-augmented prediction module are alternately optimized and mutually enhance each other. Moreover, we further propose an explanation-based self-training method under this framework for semi-supervised learning. It alternates between assigning pseudo-labels to unlabeled data and generating new explanations to iteratively improve each other. Experiments on two natural language understanding tasks demonstrate that our framework can not only make effective predictions in both supervised and semi-supervised settings, but also generate good natural language explanations.
4. Medical Knowledge-enriched Textual Entailment Framework [PDF] Back to Contents
Shweta Yadav, Vishal Pallagani, Amit Sheth
Abstract: One of the cardinal tasks in achieving robust medical question answering systems is textual entailment. The existing approaches make use of an ensemble of pre-trained language models or data augmentation, often to clock higher numbers on the validation metrics. However, two major shortcomings impede higher success in identifying entailment: (1) understanding the focus/intent of the question and (2) ability to utilize the real-world background knowledge to capture the context beyond the sentence. In this paper, we present a novel Medical Knowledge-Enriched Textual Entailment framework that allows the model to acquire a semantic and global representation of the input medical text with the help of a relevant domain-specific knowledge graph. We evaluate our framework on the benchmark MEDIQA-RQE dataset and show that the use of a knowledge-enriched dual-encoding mechanism helps in achieving an absolute improvement of 8.27% over SOTA language models. We have made the source code available here.
5. Towards Preemptive Detection of Depression and Anxiety in Twitter [PDF] Back to Contents
David Owen, Jose Camacho Collados, Luis Espinosa-Anke
Abstract: Depression and anxiety are psychiatric disorders that are observed in many areas of everyday life. For example, these disorders manifest themselves somewhat frequently in texts written by nondiagnosed users in social media. However, detecting users with these conditions is not a straightforward task as they may not explicitly talk about their mental state, and if they do, contextual cues such as immediacy must be taken into account. When available, linguistic flags pointing to probable anxiety or depression could be used by medical experts to write better guidelines and treatments. In this paper, we develop a dataset designed to foster research in depression and anxiety detection in Twitter, framing the detection task as a binary tweet classification problem. We then apply state-of-the-art classification models to this dataset, providing a competitive set of baselines alongside qualitative error analysis. Our results show that language models perform reasonably well, and better than more traditional baselines. Nonetheless, there is clear room for improvement, particularly with unbalanced training sets and in cases where seemingly obvious linguistic cues (keywords) are used counter-intuitively.
6. On the State of Social Media Data for Mental Health Research [PDF] Back to Contents
Keith Harrigian, Carlos Aguirre, Mark Dredze
Abstract: Data-driven methods for mental health treatment and surveillance have become a major focus in computational science research in the last decade. However, progress in the domain, in terms of both medical understanding and system performance, remains bounded by the availability of adequate data. Prior systematic reviews have not necessarily made it possible to measure the degree to which data-related challenges have affected research progress. In this paper, we offer an analysis specifically on the state of social media data that exists for conducting mental health research. We do so by introducing an open-source directory of mental health datasets, annotated using a standardized schema to facilitate meta-analysis.
7. UmBERTo-MTSA @ AcCompl-It: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations [PDF] Back to Contents
Gabriele Sarti
Abstract: This work describes a self-supervised data augmentation approach used to improve learning models' performances when only a moderate amount of labeled data is available. Multiple copies of the original model are initially trained on the downstream task. Their predictions are then used to annotate a large set of unlabeled examples. Finally, multi-task training is performed on the parallel annotations of the resulting training set, and final scores are obtained by averaging annotator-specific head predictions. Neural language models are fine-tuned using this procedure in the context of the AcCompl-it shared task at EVALITA 2020, obtaining considerable improvements in prediction quality.
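The procedure described in the abstract (train several copies of the model on the labeled data, let them annotate unlabeled examples, then run multi-task training on the parallel annotations and average the annotator-specific heads) can be sketched roughly as below; all helper names (train_copy, train_multitask, predict_all_heads) are hypothetical placeholders, not the released code.

```python
# High-level sketch of the self-supervised annotation procedure (assumed, not the authors' code).
def self_supervised_multitask(labeled_data, unlabeled_texts, k=5):
    # 1. Train k copies of the base model on the labeled downstream data.
    annotators = [train_copy(labeled_data, seed=i) for i in range(k)]

    # 2. Use each copy to annotate the unlabeled texts, giving k parallel label sets.
    parallel_labels = [[m.predict(t) for t in unlabeled_texts] for m in annotators]

    # 3. Fine-tune one model with k annotator-specific heads, one per label set.
    multitask_model = train_multitask(unlabeled_texts, parallel_labels)

    # 4. At inference, average the head-specific predictions into the final score.
    def predict(text):
        head_scores = multitask_model.predict_all_heads(text)
        return sum(head_scores) / len(head_scores)

    return predict
```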
8. Multi-Task Sequence Prediction For Tunisian Arabizi Multi-Level Annotation [PDF] Back to Contents
Elisa Gugliotta, Marco Dinarelli, Olivier Kraif
Abstract: In this paper we propose a multi-task sequence prediction system, based on recurrent neural networks and used to annotate a Tunisian Arabizi corpus on multiple levels. The annotations performed are text classification, tokenization, PoS tagging, and encoding of Tunisian Arabizi into CODA* Arabic orthography. The system is trained to predict all the annotation levels in cascade, starting from the Arabizi input. We evaluate the system on the TIGER German corpus, suitably converting the data into a multi-task problem, in order to show the effectiveness of our neural architecture. We also show how we used the system to annotate a Tunisian Arabizi corpus, which was afterwards manually corrected and used to further evaluate sequence models on Tunisian data. Our system is developed for the Fairseq framework, which allows for fast and easy use for any other sequence prediction problem.
9. Does Social Support Expressed in Post Titles Elicit Comments in Online Substance Use Recovery Forums? [PDF] Back to Contents
Anietie Andy, Sharath Guntuku
Abstract: Individuals recovering from substance use often seek social support (emotional and informational) on online recovery forums, where they can both write and comment on posts, expressing their struggles and successes. A common challenge in these forums is that certain posts (some of which may be support seeking) receive no comments. In this work, we use data from two Reddit substance recovery forums:/r/Leaves and/r/OpiatesRecovery, to determine the relationship between the social supports expressed in the titles of posts and the number of comments they receive. We show that the types of social support expressed in post titles that elicit comments vary from one substance use recovery forum to the other.
10. Translating Similar Languages: Role of Mutual Intelligibility in Multilingual Transformers [PDF] Back to Contents
Ife Adebara, El Moatez Billah Nagoudi, Muhammad Abdul Mageed
Abstract: We investigate different approaches to translate between similar languages under low resource conditions, as part of our contribution to the WMT 2020 Similar Languages Translation Shared Task. We submitted Transformer-based bilingual and multilingual systems for all language pairs, in the two directions. We also leverage back-translation for one of the language pairs, acquiring an improvement of more than 3 BLEU points. We interpret our results in light of the degree of mutual intelligibility (based on Jaccard similarity) between each pair, finding a positive correlation between mutual intelligibility and model performance. Our Spanish-Catalan model has the best performance of all the five language pairs. Except for the case of Hindi-Marathi, our bilingual models achieve better performance than the multilingual models on all pairs.
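The mutual-intelligibility measure mentioned above is based on Jaccard similarity; a minimal word-level version might look like the following (the authors' exact tokenization and preprocessing are not specified here, so word types are an assumption).

```python
# Jaccard similarity over shared vocabulary, one plausible proxy for mutual intelligibility.
def jaccard_similarity(corpus_a: str, corpus_b: str) -> float:
    vocab_a = set(corpus_a.lower().split())
    vocab_b = set(corpus_b.lower().split())
    if not vocab_a and not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

# Example: jaccard_similarity("el gato come", "el gat menja") -> 0.2
```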
11. To What Degree Can Language Borders Be Blurred In BERT-based Multilingual Spoken Language Understanding? [PDF] Back to Contents
Quynh Do, Judith Gaspers, Tobias Roding, Melanie Bradford
Abstract: This paper addresses the question as to what degree a BERT-based multilingual Spoken Language Understanding (SLU) model can transfer knowledge across languages. Through experiments we will show that, although it works substantially well even on distant language groups, there is still a gap to the ideal multilingual performance. In addition, we propose a novel BERT-based adversarial model architecture to learn language-shared and language-specific representations for multilingual SLU. Our experimental results prove that the proposed model is capable of narrowing the gap to the ideal multilingual performance.
12. When Do You Need Billions of Words of Pretraining Data? [PDF] Back to Contents
Yian Zhang, Alex Warstadt, Haau-Sing Li, Samuel R. Bowman
Abstract: NLP is currently dominated by general-purpose pretrained language models like RoBERTa, which achieve strong performance on NLU tasks through pretraining on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? We adopt four probing methods---classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks---and draw learning curves that track the growth of these different measures of linguistic ability with respect to pretraining data volume using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B words. We find that LMs require only about 10M or 100M words to learn representations that reliably encode most syntactic and semantic features we test. A much larger quantity of data is needed in order to acquire enough commonsense knowledge and other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, it is likely that other forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models.
13. On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers [PDF] Back to Contents
Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals
Abstract: Self-attention models such as Transformers, which can capture temporal relationships without being limited by the distance between events, have given competitive speech recognition results. However, we note the range of the learned context increases from the lower to upper self-attention layers, whilst acoustic events often happen within short time spans in a left-to-right order. This leads to a question: for speech recognition, is a global view of the entire sequence useful for the upper self-attention encoder layers in Transformers? To investigate this, we train models with lower self-attention/upper feed-forward layers encoders on Wall Street Journal and Switchboard. Compared to baseline Transformers, no performance drop but minor gains are observed. We further developed a novel metric of the diagonality of attention matrices and found the learned diagonality indeed increases from the lower to upper encoder self-attention layers. We conclude the global view is unnecessary in training upper encoder layers.
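The abstract does not define its diagonality metric, so the snippet below is only an illustrative stand-in: it measures the share of attention mass that falls within a small window around the diagonal of one head's attention matrix.

```python
# Illustrative diagonality measure (assumed, not the paper's exact metric).
import numpy as np

def diagonality(attn: np.ndarray, window: int = 1) -> float:
    # attn: (seq_len, seq_len) row-stochastic attention weights for one head
    seq_len = attn.shape[0]
    # boolean mask that is True within `window` positions of the diagonal
    mask = np.abs(np.subtract.outer(np.arange(seq_len), np.arange(seq_len))) <= window
    return float((attn * mask).sum() / attn.sum())
```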
14. Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning [PDF] Back to Contents
Rachel Gardner, Maya Varma, Clare Zhu, Ranjay Krishna
Abstract: Datasets extracted from social networks and online forums are often prone to the pitfalls of natural language, namely the presence of unstructured and noisy data. In this work, we seek to enable the collection of high-quality question-answer datasets from social media by proposing a novel task for automated quality analysis and data cleaning: question-answer (QA) plausibility. Given a machine or user-generated question and a crowd-sourced response from a social media user, we determine if the question and response are valid; if so, we identify the answer within the free-form response. We design BERT-based models to perform the QA plausibility task, and we evaluate the ability of our models to generate a clean, usable question-answer dataset. Our highest-performing approach consists of a single-task model which determines the plausibility of the question, followed by a multi-task model which evaluates the plausibility of the response as well as extracts answers (Question Plausibility AUROC=0.75, Response Plausibility AUROC=0.78, Answer Extraction F1=0.665).
15. A Transfer Learning Approach for Dialogue Act Classification of GitHub Issue Comments [PDF] Back to Contents
Ayesha Enayet, Gita Sukthankar
Abstract: Social coding platforms, such as GitHub, serve as laboratories for studying collaborative problem solving in open source software development; a key feature is their ability to support issue reporting which is used by teams to discuss tasks and ideas. Analyzing the dialogue between team members, as expressed in issue comments, can yield important insights about the performance of virtual teams. This paper presents a transfer learning approach for performing dialogue act classification on issue comments. Since no large labeled corpus of GitHub issue comments exists, employing transfer learning enables us to leverage standard dialogue act datasets in combination with our own GitHub comment dataset. We compare the performance of several word and sentence level encoding models including Global Vectors for Word Representations (GloVe), Universal Sentence Encoder (USE), and Bidirectional Encoder Representations from Transformers (BERT). Being able to map the issue comments to dialogue acts is a useful stepping stone towards understanding cognitive team processes.
16. Natural Language Inference in Context -- Investigating Contextual Reasoning over Long Texts [PDF] Back to Contents
Hanmeng Liu, Leyang Cui, Jian Liu, Yue Zhang
Abstract: Natural language inference (NLI) is a fundamental NLP task, investigating the entailment relationship between two texts. Popular NLI datasets present the task at sentence-level. While adequate for testing semantic representations, they fall short for testing contextual reasoning over long texts, which is a natural part of the human inference process. We introduce ConTRoL, a new dataset for ConTextual Reasoning over Long texts. Consisting of 8,325 expert-designed "context-hypothesis" pairs with gold labels, ConTRoL is a passage-level NLI dataset with a focus on complex contextual reasoning types such as logical reasoning. It is derived from competitive selection and recruitment test (verbal reasoning test) for police recruitment, with expert level quality. Compared with previous NLI benchmarks, the materials in ConTRoL are much more challenging, involving a range of reasoning types. Empirical results show that state-of-the-art language models perform by far worse than educated humans. Our dataset can also serve as a testing-set for downstream tasks like Checking Factual Correctness of Summaries.
17. Simultaneous Speech-to-Speech Translation System with Neural Incremental ASR, MT, and TTS [PDF] Back to Contents
Katsuhito Sudoh, Takatomo Kano, Sashi Novitasari, Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura
Abstract: This paper presents a newly developed, simultaneous neural speech-to-speech translation system and its evaluation. The system consists of three fully-incremental neural processing modules for automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS). We investigated its overall latency in the system's Ear-Voice Span and speaking latency along with module-level performance.
18. Multi-document Summarization via Deep Learning Techniques: A Survey [PDF] Back to Contents
Congbo Ma, Wei Emma Zhang, Mingyu Guo, Hu Wang, Quan Z. Sheng
Abstract: Multi-document summarization (MDS) is an effective tool for information aggregation which generates an informative and concise summary from a cluster of topic-related documents. Our survey structurally overviews the recent deep learning based multi-document summarization models via a proposed taxonomy and it is the first of its kind. Particularly, we propose a novel mechanism to summarize the design strategies of neural networks and conduct a comprehensive summary of the state-of-the-art. We highlight the differences among various objective functions which are rarely discussed in the existing literature. Finally, we propose several future directions pertaining to this new and exciting development of the field.
19. Language Through a Prism: A Spectral Approach for Multiscale Language Representations [PDF] Back to Contents
Alex Tamkin, Dan Jurafsky, Noah Goodman
Abstract: Language exhibits structure at different scales, ranging from subwords to words, sentences, paragraphs, and documents. To what extent do deep models capture information at these scales, and can we force them to better capture structure across this hierarchy? We approach this question by focusing on individual neurons, analyzing the behavior of their activations at different timescales. We show that signal processing provides a natural framework for separating structure across scales, enabling us to 1) disentangle scale-specific information in existing embeddings and 2) train models to learn more about particular scales. Concretely, we apply spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part of speech tagging (word-level), dialog speech acts classification (utterance-level), or topic classification (document-level), while performing poorly on the other tasks. We also present a prism layer for training models, which uses spectral filters to constrain different neurons to model structure at different scales. Our proposed BERT + Prism model can better predict masked tokens using long-range context and produces multiscale representations that perform better at utterance- and document-level tasks. Our methods are general and readily applicable to other domains besides language, such as images, audio, and video.
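A hedged sketch of the central operation (applying a spectral filter to one neuron's activations across input positions) follows; the DFT-based band-pass and its cutoffs are illustrative placeholders, not the paper's configuration.

```python
# Band-pass filtering of a single neuron's activation sequence (illustrative assumption).
import numpy as np

def bandpass_activations(activations: np.ndarray, low: int, high: int) -> np.ndarray:
    # activations: (seq_len,) the neuron's value at each input position
    spectrum = np.fft.rfft(activations)
    filtered = np.zeros_like(spectrum)
    filtered[low:high] = spectrum[low:high]        # keep only the chosen frequency band
    return np.fft.irfft(filtered, n=len(activations))
```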
20. EstBERT: A Pretrained Language-Specific BERT for Estonian [PDF] Back to Contents
Hasan Tanvir, Claudia Kittask, Kairit Sirts
Abstract: This paper presents EstBERT, a large pretrained transformer-based language-specific BERT model for Estonian. Recent work has evaluated multilingual BERT models on Estonian tasks and found them to outperform the baselines. Still, based on existing studies on other languages, a language-specific BERT model is expected to improve over the multilingual ones. We first describe the EstBERT pretraining process and then present the results of the models based on finetuned EstBERT for multiple NLP tasks, including POS and morphological tagging, named entity recognition and text classification. The evaluation results show that the models based on EstBERT outperform multilingual BERT models on five tasks out of six, providing further evidence towards a view that training language-specific BERT models are still useful, even when multilingual models are available.
21. An Analysis of Dataset Overlap on Winograd-Style Tasks [PDF] Back to Contents
Ali Emami, Adam Trischler, Kaheer Suleman, Jackie Chi Kit Cheung
Abstract: The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlap between these training corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the corpora on which state-of-the-art models are (pre)trained, and that a significant drop in classification accuracy occurs when we evaluate models on instances with minimal overlap. Based on these results, we develop the KnowRef-60K dataset, which consists of over 60k pronoun disambiguation problems scraped from web data. KnowRef-60K is the largest corpus to date for WSC-style common-sense reasoning and exhibits a significantly lower proportion of overlaps with current pretraining corpora.
22. Adversarial Semantic Collisions [PDF] 返回目录
Congzheng Song, Alexander M. Rush, Vitaly Shmatikov
Abstract: We study semantic collisions: texts that are semantically unrelated but judged as similar by NLP models. We develop gradient-based approaches for generating semantic collisions and demonstrate that state-of-the-art models for many tasks which rely on analyzing the meaning and similarity of texts, including paraphrase identification, document retrieval, response suggestion, and extractive summarization, are vulnerable to semantic collisions. For example, given a target query, inserting a crafted collision into an irrelevant document can shift its retrieval rank from 1000 to top 3. We show how to generate semantic collisions that evade perplexity-based filtering and discuss other potential mitigations. Our code is available at this https URL.
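The gradient-based search is reminiscent of HotFlip-style token substitution: pick the replacement token whose first-order effect most increases the target model's similarity score. The sketch below shows that general idea with stand-in tensors; it is an assumed illustration, not the authors' exact collision-generation algorithm.

# HotFlip-style greedy token flip under a first-order approximation.
import torch

def best_flip(embedding_matrix, input_embeds, grad, position):
    """score(e') - score(e) is approximated by (e' - e) . dscore/de;
    return the vocabulary id maximising that gain at `position`."""
    direction = grad[position]                       # dscore/d(embedding)
    gain = embedding_matrix @ direction - input_embeds[position] @ direction
    return torch.argmax(gain).item()                 # best replacement token id

# toy usage with random tensors standing in for a real model's quantities
vocab, dim, length = 1000, 64, 12
emb_matrix = torch.randn(vocab, dim)
inp = torch.randn(length, dim, requires_grad=True)
score = (inp.sum(dim=0) ** 2).sum()                  # stand-in for a similarity score
score.backward()
print(best_flip(emb_matrix, inp.detach(), inp.grad, position=3))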
23. CLAR: A Cross-Lingual Argument Regularizer for Semantic Role Labeling [PDF] 返回目录
Ishan Jindal, Yunyao Li, Siddhartha Brahma, Huaiyu Zhu
Abstract: Semantic role labeling (SRL) identifies predicate-argument structure(s) in a given sentence. Although different languages have different argument annotations, polyglot training, the idea of training one model on multiple languages, has previously been shown to outperform monolingual baselines, especially for low resource languages. In fact, even a simple combination of data has been shown to be effective with polyglot training by representing the distant vocabularies in a shared representation space. Meanwhile, despite the dissimilarity in argument annotations between languages, certain argument labels do share common semantic meaning across languages (e.g. adjuncts have more or less similar semantic meaning across languages). To leverage such similarity in annotation space across languages, we propose a method called Cross-Lingual Argument Regularizer (CLAR). CLAR identifies such linguistic annotation similarity across languages and exploits this information to map the target language arguments using a transformation of the space on which source language arguments lie. By doing so, our experimental results show that CLAR consistently improves SRL performance on multiple languages over monolingual and polyglot baselines for low resource languages.
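As an illustration of mapping argument representations across languages with "a transformation of the space", the sketch below aligns two label-embedding spaces with an orthogonal Procrustes map learned from a handful of label pairs assumed to share semantics. This formulation and the toy data are assumptions for illustration; the paper's actual CLAR objective may differ.

# Orthogonal Procrustes alignment of target-language label embeddings onto
# the source-language space.
import numpy as np

def procrustes(target_vecs, source_vecs):
    """Find orthogonal W minimising ||target_vecs @ W.T - source_vecs||_F."""
    u, _, vt = np.linalg.svd(source_vecs.T @ target_vecs)
    return u @ vt

rng = np.random.default_rng(0)
source = rng.normal(size=(10, 50))            # 10 aligned label embeddings, dim 50
q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
target = source @ q                           # target space = rotated source space
W = procrustes(target, source)
mapped = target @ W.T                         # target labels expressed in source space
print(np.linalg.norm(mapped - source) / np.linalg.norm(source))  # ~0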
24. Biomedical Information Extraction for Disease Gene Prioritization [PDF] 返回目录
Jupinder Parmar, William Koehler, Martin Bringmann, Katharina Sophia Volz, Berk Kapicioglu
Abstract: We introduce a biomedical information extraction (IE) pipeline that extracts biological relationships from text and demonstrate that its components, such as named entity recognition (NER) and relation extraction (RE), outperform state-of-the-art in BioNLP. We apply it to tens of millions of PubMed abstracts to extract protein-protein interactions (PPIs) and augment these extractions to a biomedical knowledge graph that already contains PPIs extracted from STRING, the leading structured PPI database. We show that, despite already containing PPIs from an established structured source, augmenting our own IE-based extractions to the graph allows us to predict novel disease-gene associations with a 20% relative increase in hit@30, an important step towards developing drug targets for uncured diseases.
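Schematically, the pipeline boils down to NER, relation extraction over entity pairs, and insertion of accepted interactions into the knowledge graph. The sketch below uses trivial stand-in components (keyword matching in place of trained NER/RE models) purely to show the data flow; entity names and thresholds are invented for illustration.

# Data-flow sketch: NER -> pairwise relation extraction -> graph augmentation.
import itertools
import networkx as nx

def ner(sentence):
    """Stand-in entity tagger; a real system would use a trained NER model."""
    known = {"BRCA1", "TP53"}
    return [tok for tok in sentence.replace(".", "").split() if tok in known]

def relation(sentence, e1, e2):
    """Stand-in relation classifier returning (label, confidence)."""
    return ("interacts_with", 0.9) if "binds" in sentence else ("none", 0.0)

graph = nx.Graph()                      # knowledge graph seeded e.g. from STRING
graph.add_edge("TP53", "MDM2", source="STRING")

sentence = "BRCA1 binds TP53 in response to DNA damage."
for e1, e2 in itertools.combinations(ner(sentence), 2):
    label, conf = relation(sentence, e1, e2)
    if label == "interacts_with" and conf > 0.5:
        graph.add_edge(e1, e2, source="text-mining", confidence=conf)

print(list(graph.edges(data=True)))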
25. Generalized LSTM-based End-to-End Text-Independent Speaker Verification [PDF] 返回目录
Soroosh Tayebi Arasteh
Abstract: The increasing amount of available data and more affordable hardware solutions have opened a gate to the realm of Deep Learning (DL). Due to the rapid advancements and ever-growing popularity of DL, it has begun to invade almost every field where machine learning is applicable, reshaping the traditional state-of-the-art methods. While many researchers in the speaker recognition area have also started to replace the former state-of-the-art methods with DL techniques, some of the traditional i-vector-based methods are still state-of-the-art in the context of text-independent speaker verification (TI-SV). In this paper, we discuss Google's recent generalized end-to-end (GE2E) DL technique, based on Long Short-Term Memory (LSTM) units, for TI-SV, and compare different scenarios and aspects, including utterance duration, training time, and accuracy, to show that our method outperforms the traditional methods.
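For reference, the softmax variant of the GE2E loss can be written compactly as below. This is a simplified sketch: the scaling w and bias b are fixed constants here rather than learned parameters, and an utterance is not excluded from its own centroid as in the original formulation.

# Simplified GE2E (softmax variant) over a batch of speaker embeddings.
import torch
import torch.nn.functional as F

def ge2e_loss(embeddings, w=10.0, b=-5.0):
    """embeddings: (speakers, utterances_per_speaker, dim), L2-normalised."""
    spk, utt, dim = embeddings.shape
    centroids = F.normalize(embeddings.mean(dim=1), dim=-1)     # (spk, dim)
    # scaled cosine similarity of every utterance embedding to every centroid
    sim = w * embeddings.reshape(-1, dim) @ centroids.T + b     # (spk*utt, spk)
    labels = torch.arange(spk).repeat_interleave(utt)           # true speaker ids
    return F.cross_entropy(sim, labels)

emb = F.normalize(torch.randn(4, 5, 256), dim=-1)  # 4 speakers x 5 utterances
print(ge2e_loss(emb))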
26. Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis [PDF] 返回目录
Erica Cooper, Xin Wang, Yi Zhao, Yusuke Yasuda, Junichi Yamagishi
Abstract: We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine choice of neural vocoder for waveform synthesis, as well as acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a simple quality threshold can improve naturalness and similarity to unseen target speakers of synthetic speech. Additionally, we find that listeners can discern between a 16kHz and 24kHz sampling rate, and that WaveRNN produces output waveforms of a comparable quality to WaveNet, with a faster inference time.
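The acoustic configurations under comparison mainly differ in sampling rate and the derived frame and mel settings. The sketch below extracts mel spectrograms at 16 kHz and 24 kHz with typical TTS-style parameters; the specific values and the synthetic test signal are assumptions, not the paper's exact configuration.

# Mel-spectrogram extraction at the two sampling rates compared in the paper.
import numpy as np
import librosa

def mel_config(sr):
    # typical TTS-style settings: 12.5 ms hop, 50 ms window, 80 mel bands
    return dict(sr=sr, n_fft=2048, hop_length=int(0.0125 * sr),
                win_length=int(0.05 * sr), n_mels=80, fmin=0, fmax=sr // 2)

for sr in (16000, 24000):
    t = np.arange(sr) / sr                       # one second of audio
    y = 0.1 * np.sin(2 * np.pi * 220 * t).astype(np.float32)
    mel = librosa.feature.melspectrogram(y=y, **mel_config(sr))
    print(sr, mel.shape)                         # (n_mels, frames)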
27. Personalized Query Rewriting in Conversational AI Agents [PDF] 返回目录
Alireza Roshan-Ghias, Clint Solomon Mathialagan, Pragaash Ponnusamy, Lambert Mathias, Chenlei Guo
Abstract: Spoken language understanding (SLU) systems in conversational AI agents often experience errors in the form of misrecognitions by automatic speech recognition (ASR) or semantic gaps in natural language understanding (NLU). These errors easily translate to user frustrations, particularly for recurrent events, e.g. regularly toggling an appliance or calling a frequent contact. In this work, we propose a query rewriting approach that leverages users' historically successful interactions as a form of memory. We present a neural retrieval model and a pointer-generator network with hierarchical attention and show that they perform significantly better at the query rewriting task with the aforementioned user memories than without. We also highlight how our approach with the proposed models leverages the structural and semantic diversity in ASR's output towards recovering users' intents.
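The retrieval component can be pictured as nearest-neighbour search over embeddings of the user's past successful utterances. The sketch below uses a hashed bag-of-words stand-in encoder and invented memories purely for illustration; the paper trains a neural retrieval model instead.

# Nearest-neighbour rewrite of a misrecognised query against user memory.
import numpy as np

def encode(text, dim=64):
    """Stand-in text encoder: hashed bag-of-words, L2-normalised."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

user_memory = [                      # past utterances that led to success
    "turn on the living room lights",
    "call mom on her mobile",
    "play my workout playlist",
]
memory_matrix = np.stack([encode(q) for q in user_memory])

asr_hypothesis = "call mam on her mobile"          # misrecognised query
scores = memory_matrix @ encode(asr_hypothesis)    # cosine similarities
print("rewrite ->", user_memory[int(np.argmax(scores))])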
28. Speaker De-identification System using Autoencoders and Adversarial Training [PDF] 返回目录
Fernando M. Espinoza-Cuadros, Juan M. Perero-Codosero, Javier Antón-Martín, Luis A. Hernández-Gómez
Abstract: The fast increase of web services and mobile apps, which collect personal data from users, increases the risk that their privacy may be severely compromised. In particular, the increasing variety of spoken language interfaces and voice assistants, empowered by the vertiginous breakthroughs in Deep Learning, is prompting important concerns in the European Union about preserving speech data privacy. For instance, an attacker can record speech from users and impersonate them to get access to systems requiring voice identification. Hacking speaker profiles from users is also possible with existing technology that extracts speaker, linguistic (e.g., dialect), and paralinguistic (e.g., age) features from the speech signal. In order to mitigate these weaknesses, in this paper we propose a speaker de-identification system based on adversarial training and autoencoders in order to suppress speaker, gender, and accent information from speech. Experimental results show that combining adversarial learning and autoencoders increases the equal error rate of a speaker verification system while preserving the intelligibility of the anonymized spoken content.
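The adversarial part of such a system is commonly realised with a gradient-reversal layer between an autoencoder bottleneck and a speaker classifier, so that reconstruction improves while speaker information is suppressed. The sketch below shows that mechanism with assumed dimensions; it illustrates the general technique, not the authors' architecture.

# Autoencoder with a gradient-reversal adversarial speaker head.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_out):
        return -grad_out                      # flip gradients flowing to the encoder

feat_dim, bottleneck, n_speakers = 80, 32, 10
encoder = nn.Sequential(nn.Linear(feat_dim, bottleneck), nn.ReLU())
decoder = nn.Linear(bottleneck, feat_dim)
speaker_head = nn.Linear(bottleneck, n_speakers)

x = torch.randn(16, feat_dim)                 # a batch of acoustic feature frames
speaker_ids = torch.randint(0, n_speakers, (16,))

z = encoder(x)
recon_loss = nn.functional.mse_loss(decoder(z), x)
adv_loss = nn.functional.cross_entropy(speaker_head(GradReverse.apply(z)), speaker_ids)
(recon_loss + adv_loss).backward()            # encoder receives reversed speaker gradients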