
[arXiv Papers] Computation and Language 2020-10-29

Contents

1. Handling Class Imbalance in Low-Resource Dialogue Systems by Combining Few-Shot Classification and Interpolation [PDF] Abstract
2. Graph-based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles [PDF] Abstract
3. A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models [PDF] Abstract
4. Towards Ethics by Design in Online Abusive Content Detection [PDF] Abstract
5. Bridging the Modality Gap for Speech-to-Text Translation [PDF] Abstract
6. Bayesian Methods for Semi-supervised Text Annotation [PDF] Abstract
7. The Volctrans Machine Translation System for WMT20 [PDF] Abstract
8. A Chinese Text Classification Method With Low Hardware Requirement Based on Improved Model Concatenation [PDF] Abstract
9. Fine-grained Information Status Classification Using Discourse Context-Aware BERT [PDF] Abstract
10. DisenE: Disentangling Knowledge Graph Embeddings [PDF] Abstract
11. Second-Order Unsupervised Neural Dependency Parsing [PDF] Abstract
12. TopicModel4J: A Java Package for Topic Models [PDF] Abstract
13. Character Entropy in Modern and Historical Texts: Comparison Metrics for an Undeciphered Manuscript [PDF] Abstract
14. What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation [PDF] Abstract
15. Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications [PDF] Abstract
16. DualTKB: A Dual Learning Bridge between Text and Knowledge Base [PDF] Abstract
17. Learning Contextualised Cross-lingual Word Embeddings for Extremely Low-Resource Languages Using Parallel Corpora [PDF] Abstract
18. On the diminishing return of labeling clinical reports [PDF] Abstract
19. Predicting Themes within Complex Unstructured Texts: A Case Study on Safeguarding Reports [PDF] Abstract
20. WNUT-2020 Task 1 Overview: Extracting Entities and Relations from Wet Lab Protocols [PDF] Abstract
21. Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus [PDF] Abstract
22. Strongly Incremental Constituency Parsing with Graph Neural Networks [PDF] Abstract
23. DGST: a Dual-Generator Network for Text Style Transfer [PDF] Abstract
24. Unmasking Contextual Stereotypes: Measuring and Mitigating BERT's Gender Bias [PDF] Abstract
25. The geometry of integration in text classification RNNs [PDF] Abstract
26. Fixed-Length Protein Embeddings using Contextual Lenses [PDF] Abstract
27. Measuring non-trivial compositionality in emergent communication [PDF] Abstract
28. Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [PDF] Abstract
29. A Cyclic Proof System for HFLN [PDF] Abstract
30. INT8 Winograd Acceleration for Conv1D Equipped ASR Models Deployed on Mobile Devices [PDF] Abstract
31. PPG-based singing voice conversion with adversarial representation learning [PDF] Abstract
32. Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition [PDF] Abstract
33. Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [PDF] Abstract
34. CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition [PDF] Abstract
35. Scaling Laws for Autoregressive Generative Modeling [PDF] Abstract
36. Cascaded encoders for unifying streaming and non-streaming ASR [PDF] Abstract
37. A Comprehensive Dictionary and Term Variation Analysis for COVID-19 and SARS-CoV-2 [PDF] Abstract

Abstracts

1. Handling Class Imbalance in Low-Resource Dialogue Systems by Combining Few-Shot Classification and Interpolation [PDF] Back to Contents
  Vishal Sunder, Eric Fosler-Lussier
Abstract: Utterance classification performance in low-resource dialogue systems is constrained by an inevitably high degree of data imbalance in class labels. We present a new end-to-end pairwise learning framework that is designed specifically to tackle this phenomenon by inducing a few-shot classification capability in the utterance representations and augmenting data through an interpolation of utterance representations. Our approach is a general purpose training methodology, agnostic to the neural architecture used for encoding utterances. We show significant improvements in macro-F1 score over standard cross-entropy training for three different neural architectures, demonstrating improvements on a Virtual Patient dialogue dataset as well as a low-resourced emulation of the Switchboard dialogue act classification dataset.
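
The data augmentation step described above interpolates utterance representations; the sketch below is a minimal mixup-style illustration of that idea in NumPy, not necessarily the authors' exact procedure (the Beta-distributed mixing coefficient and the soft-label blending are assumptions).

```python
import numpy as np

def interpolate_utterances(h_a, h_b, y_a, y_b, alpha=0.2):
    """Mixup-style interpolation of two utterance representations.

    h_a, h_b: encoder outputs of shape (dim,), from any utterance encoder.
    y_a, y_b: one-hot label vectors of shape (num_classes,).
    Returns a synthetic (representation, soft label) pair.
    """
    lam = np.random.beta(alpha, alpha)      # mixing coefficient in (0, 1), assumed Beta prior
    h_mix = lam * h_a + (1.0 - lam) * h_b   # blended representation
    y_mix = lam * y_a + (1.0 - lam) * y_b   # blended (soft) label
    return h_mix, y_mix
```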

2. Graph-based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles [PDF] Back to Contents
  M. Tarik Altuncu, Sophia N. Yaliraki, Mauricio Barahona
Abstract: Production of news content is growing at an astonishing rate. To help manage and monitor the sheer amount of text, there is an increasing need to develop efficient methods that can provide insights into emerging content areas, and stratify unstructured corpora of text into `topics' that stem intrinsically from content similarity. Here we present an unsupervised framework that brings together powerful vector embeddings from natural language processing with tools from multiscale graph partitioning that can reveal natural partitions at different resolutions without making a priori assumptions about the number of clusters in the corpus. We show the advantages of graph-based clustering through end-to-end comparisons with other popular clustering and topic modelling methods, and also evaluate different text vector embeddings, from classic Bag-of-Words to Doc2Vec to the recent transformers based model Bert. This comparative work is showcased through an analysis of a corpus of US news coverage during the presidential election year of 2016.

3. A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models [PDF] Back to Contents
  Usman Naseem, Imran Razzak, Shah Khalid Khan, Mukesh Prasad
Abstract: Word representation has always been an important research area in the history of natural language processing (NLP). Understanding such complex text data is imperative, given that it is rich in information and can be used widely across various applications. In this survey, we explore different word representation models and its power of expression, from the classical to modern-day state-of-the-art word representation language models (LMS). We describe a variety of text representation methods, and model designs have blossomed in the context of NLP, including SOTA LMs. These models can transform large volumes of text into effective vector representations capturing the same semantic information. Further, such representations can be utilized by various machine learning (ML) algorithms for a variety of NLP related tasks. In the end, this survey briefly discusses the commonly used ML and DL based classifiers, evaluation metrics and the applications of these word embeddings in different NLP tasks.

4. Towards Ethics by Design in Online Abusive Content Detection [PDF] Back to Contents
  Svetlana Kiritchenko, Isar Nejadgholi
Abstract: To support safety and inclusion in online communications, significant efforts in NLP research have been put towards addressing the problem of abusive content detection, commonly defined as a supervised classification task. The research effort has spread out across several closely related sub-areas, such as detection of hate speech, toxicity, cyberbullying, etc. There is a pressing need to consolidate the field under a common framework for task formulation, dataset design and performance evaluation. Further, despite current technologies achieving high classification accuracies, several ethical issues have been revealed. We bring ethical issues to forefront and propose a unified framework as a two-step process. First, online content is categorized around personal and identity-related subject matters. Second, severity of abuse is identified through comparative annotation within each category. The novel framework is guided by the Ethics by Design principle and is a step towards building more accurate and trusted models.

5. Bridging the Modality Gap for Speech-to-Text Translation [PDF] Back to Contents
  Yuchen Liu, Junnan Zhu, Jiajun Zhang, Chengqing Zong
Abstract: End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously, which ignores the speech-and-text modality differences and makes the encoder overloaded, leading to great difficulty in learning such a model. To address these issues, we propose a Speech-to-Text Adaptation for Speech Translation (STAST) model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text. Specifically, we decouple the speech translation encoder into three parts and introduce a shrink mechanism to match the length of speech representation with that of the corresponding text transcription. To obtain better semantic representation, we completely integrate a text-based translation model into the STAST so that two tasks can be trained in the same latent space. Furthermore, we introduce a cross-modal adaptation method to close the distance between speech and text representation. Experimental results on English-French and English-German speech translation corpora have shown that our model significantly outperforms strong baselines, and achieves the new state-of-the-art performance.

6. Bayesian Methods for Semi-supervised Text Annotation [PDF] Back to Contents
  Kristian Miok, Gregor Pirs, Marko Robnik-Sikonja
Abstract: Human annotations are an important source of information in the development of natural language understanding approaches. As under the pressure of productivity annotators can assign different labels to a given text, the quality of produced annotations frequently varies. This is especially the case if decisions are difficult, with high cognitive load, requires awareness of broader context, or careful consideration of background knowledge. To alleviate the problem, we propose two semi-supervised methods to guide the annotation process: a Bayesian deep learning model and a Bayesian ensemble method. Using a Bayesian deep learning method, we can discover annotations that cannot be trusted and might require reannotation. A recently proposed Bayesian ensemble method helps us to combine the annotators' labels with predictions of trained models. According to the results obtained from three hate speech detection experiments, the proposed Bayesian methods can improve the annotations and prediction performance of BERT models.

7. The Volctrans Machine Translation System for WMT20 [PDF] Back to Contents
  Liwei Wu, Xiao Pan, Zehui Lin, Yaoming Zhu, Mingxuan Wang, Lei Li
Abstract: This paper describes our VolcTrans system on WMT20 shared news translation task. We participated in 8 translation directions. Our basic systems are based on Transformer, with several variants (wider or deeper Transformers, dynamic convolutions). The final system includes text pre-process, data selection, synthetic data generation, advanced model ensemble, and multilingual pre-training.

8. A Chinese Text Classification Method With Low Hardware Requirement Based on Improved Model Concatenation [PDF] Back to Contents
  Yuanhao Zhuo
Abstract: In order to improve the accuracy performance of Chinese text classification models with low hardware requirements, an improved concatenation-based model is designed in this paper, which is a concatenation of 5 different sub-models, including TextCNN, LSTM, and Bi-LSTM. Compared with the existing ensemble learning method, for a text classification mission, this model's accuracy is 2% higher. Meanwhile, the hardware requirements of this model are much lower than the BERT-based model.
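
A concatenation-based ensemble of this kind is usually implemented by pooling each sub-model's output into a fixed-size vector and feeding the concatenation to a shared classifier head; the PyTorch sketch below illustrates that reading only, with `sub_models` and `sub_dims` as hypothetical placeholders rather than the paper's actual five components.

```python
import torch
import torch.nn as nn

class ConcatEnsemble(nn.Module):
    """Concatenate fixed-size features from several text encoders, then classify."""
    def __init__(self, sub_models, sub_dims, num_classes):
        super().__init__()
        self.sub_models = nn.ModuleList(sub_models)          # e.g. TextCNN, LSTM, Bi-LSTM encoders
        self.classifier = nn.Linear(sum(sub_dims), num_classes)

    def forward(self, token_ids):
        feats = [m(token_ids) for m in self.sub_models]      # each: (batch, dim_i)
        fused = torch.cat(feats, dim=-1)                     # (batch, sum(dim_i))
        return self.classifier(fused)                        # (batch, num_classes)
```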

9. Fine-grained Information Status Classification Using Discourse Context-Aware BERT [PDF] Back to Contents
  Yufang Hou
Abstract: Previous work on bridging anaphora recognition (Hou et al., 2013a) casts the problem as a subtask of learning fine-grained information status (IS). However, these systems heavily depend on many hand-crafted linguistic features. In this paper, we propose a simple discourse context-aware BERT model for fine-grained IS classification. On the ISNotes corpus (Markert et al., 2012), our model achieves new state-of-the-art performance on fine-grained IS classification, obtaining a 4.8 absolute overall accuracy improvement compared to Hou et al. (2013a). More importantly, we also show an improvement of 10.5 F1 points for bridging anaphora recognition without using any complex hand-crafted semantic features designed for capturing the bridging phenomenon. We further analyze the trained model and find that the most attended signals for each IS category correspond well to linguistic notions of information status.

10. DisenE: Disentangling Knowledge Graph Embeddings [PDF] Back to Contents
  Xiaoyu Kou, Yankai Lin, Yuntao Li, Jiahao Xu, Peng Li, Jie Zhou, Yan Zhang
Abstract: Knowledge graph embedding (KGE), aiming to embed entities and relations into low-dimensional vectors, has attracted wide attention recently. However, the existing research is mainly based on the black-box neural models, which makes it difficult to interpret the learned representation. In this paper, we introduce DisenE, an end-to-end framework to learn disentangled knowledge graph embeddings. Specially, we introduce an attention-based mechanism that enables the model to explicitly focus on relevant components of entity embeddings according to a given relation. Furthermore, we introduce two novel regularizers to encourage each component of the entity representation to independently reflect an isolated semantic aspect. Experimental results demonstrate that our proposed DisenE investigates a perspective to address the interpretability of KGE and is proved to be an effective way to improve the performance of link prediction tasks. The code and datasets are released on this https URL.
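
The attention mechanism described above can be read as a relation-conditioned softmax over K disentangled components of an entity embedding; the NumPy sketch below shows that reading only, with dot-product scoring as an assumption rather than the paper's exact scoring function.

```python
import numpy as np

def relation_focused_entity(components, relation):
    """Attend over disentangled entity components given a relation embedding.

    components: (K, d) array, one row per component of the entity embedding.
    relation:   (d,) relation embedding.
    Returns a (d,) relation-specific view of the entity.
    """
    scores = components @ relation            # (K,) dot-product relevance scores (assumed)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax attention weights
    return weights @ components               # weighted sum of components
```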

11. Second-Order Unsupervised Neural Dependency Parsing [PDF] Back to Contents
  Songlin Yang, Yong Jiang, Wenjuan Han, Kewei Tu
Abstract: Most of the unsupervised dependency parsers are based on first-order probabilistic generative models that only consider local parent-child information. Inspired by second-order supervised dependency parsing, we proposed a second-order extension of unsupervised neural dependency models that incorporate grandparent-child or sibling information. We also propose a novel design of the neural parameterization and optimization methods of the dependency models. In second-order models, the number of grammar rules grows cubically with the increase of vocabulary size, making it difficult to train lexicalized models that may contain thousands of words. To circumvent this problem while still benefiting from both second-order parsing and lexicalization, we use the agreement-based learning framework to jointly train a second-order unlexicalized model and a first-order lexicalized model. Experiments on multiple datasets show the effectiveness of our second-order models compared with recent state-of-the-art methods. Our joint model achieves a 10% improvement over the previous state-of-the-art parser on the full WSJ test set

12. TopicModel4J: A Java Package for Topic Models [PDF] Back to Contents
  Yang Qian, Yuanchun Jiang, Yidong Chai, Yezheng Liu, Jiansha Sun
Abstract: Topic models provide a flexible and principled framework for exploring hidden structure in high-dimensional co-occurrence data and are commonly used natural language processing (NLP) of text. In this paper, we design and implement a Java package, TopicModel4J, which contains 13 kinds of representative algorithms for fitting topic models. The TopicModel4J in the Java programming environment provides an easy-to-use interface for data analysts to run the algorithms, and allow to easily input and output data. In addition, this package provides a few unstructured text preprocessing techniques, such as splitting textual data into words, lowercasing the words, preforming lemmatization and removing the useless characters, URLs and stop words.

13. Character Entropy in Modern and Historical Texts: Comparison Metrics for an Undeciphered Manuscript [PDF] Back to Contents
  Luke Lindemann, Claire Bowern
Abstract: This paper outlines the creation of three corpora for multilingual comparison and analysis of the Voynich manuscript: a corpus of Voynich texts partitioned by Currier language, scribal hand, and transcription system, a corpus of 294 language samples compiled from Wikipedia, and a corpus of eighteen transcribed historical texts in eight languages. These corpora will be utilized in subsequent work by the Voynich Working Group at Yale University. We demonstrate the utility of these corpora for studying characteristics of the Voynich script and language, with an analysis of conditional character entropy in Voynichese. We discuss the interaction between character entropy and language, script size and type, glyph compositionality, scribal conventions and abbreviations, positional character variants, and bigram frequency. This analysis characterizes the interaction between script compositionality, character size, and predictability. We show that substantial manipulations of glyph composition are not sufficient to align conditional entropy levels with natural languages. The unusually predictable nature of the Voynichese script is not attributable to a particular script or transcription system, underlying language, or substitution cipher. Voynichese is distinct from every comparison text in our corpora because character placement is highly constrained within the word, and this may indicate the loss of phonemic distinctions from the underlying language.
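
Conditional character entropy of the kind compared here is the entropy of the next character given the previous one, H(c_t | c_{t-1}), estimated from bigram counts; the short Python sketch below computes it for any transcription string, using plain maximum-likelihood estimates with no smoothing as a simplifying assumption.

```python
from collections import Counter
from math import log2

def conditional_char_entropy(text):
    """H(c_t | c_{t-1}) in bits, estimated from character bigram counts."""
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text[:-1])
    total = sum(bigrams.values())
    h = 0.0
    for (prev, _), n in bigrams.items():
        p_joint = n / total            # P(c_{t-1}, c_t)
        p_cond = n / unigrams[prev]    # P(c_t | c_{t-1})
        h -= p_joint * log2(p_cond)
    return h

print(conditional_char_entropy("daiin daiin qokeedy qokeedy"))  # toy Voynich-like string
```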

14. What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation [PDF] Back to Contents
  Amir Pouran Ben Veyseh, Franck Dernoncourt, Quan Hung Tran, Thien Huu Nguyen
Abstract: Acronyms are the short forms of phrases that facilitate conveying lengthy sentences in documents and serve as one of the mainstays of writing. Due to their importance, identifying acronyms and corresponding phrases (i.e., acronym identification (AI)) and finding the correct meaning of each acronym (i.e., acronym disambiguation (AD)) are crucial for text understanding. Despite the recent progress on this task, there are some limitations in the existing datasets which hinder further improvement. More specifically, limited size of manually annotated AI datasets or noises in the automatically created acronym identification datasets obstruct designing advanced high-performing acronym identification models. Moreover, the existing datasets are mostly limited to the medical domain and ignore other domains. In order to address these two limitations, we first create a manually annotated large AI dataset for scientific domain. This dataset contains 17,506 sentences which is substantially larger than previous scientific AI datasets. Next, we prepare an AD dataset for scientific domain with 62,441 samples which is significantly larger than the previous scientific AD dataset. Our experiments show that the existing state-of-the-art models fall far behind human-level performance on both datasets proposed by this work. In addition, we propose a new deep learning model that utilizes the syntactical structure of the sentence to expand an ambiguous acronym in a sentence. The proposed model outperforms the state-of-the-art models on the new AD dataset, providing a strong baseline for future research on this dataset.

15. Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications [PDF] Back to Contents
  Yongqiang Wang, Yangyang Shi, Frank Zhang, Chunyang Wu, Julian Chan, Ching-Feng Yeh, Alex Xiao
Abstract: In this paper, we summarize the application of transformer and its streamable variant, Emformer based acoustic model for large scale speech recognition applications. We compare the transformer based acoustic models with their LSTM counterparts on industrial scale tasks. Specifically, we compare Emformer with latency-controlled BLSTM (LCBLSTM) on medium latency tasks and LSTM on low latency tasks. On a low latency voice assistant task, Emformer gets 24% to 26% relative word error rate reductions (WERRs). For medium latency scenarios, comparing with LCBLSTM with similar model size and latency, Emformer gets significant WERR across four languages in video captioning datasets with 2-3 times inference real-time factors reduction.

16. DualTKB: A Dual Learning Bridge between Text and Knowledge Base [PDF] Back to Contents
  Pierre L. Dognin, Igor Melnyk, Inkit Padhi, Cicero Nogueira dos Santos, Payel Das
Abstract: In this work, we present a dual learning approach for unsupervised text to path and path to text transfers in Commonsense Knowledge Bases (KBs). We investigate the impact of weak supervision by creating a weakly supervised dataset and show that even a slight amount of supervision can significantly improve the model performance and enable better-quality transfers. We examine different model architectures, and evaluation metrics, proposing a novel Commonsense KB completion metric tailored for generative models. Extensive experimental results show that the proposed method compares very favorably to the existing baselines. This approach is a viable step towards a more advanced system for automatic KB construction/expansion and the reverse operation of KB conversion to coherent textual descriptions.

17. Learning Contextualised Cross-lingual Word Embeddings for Extremely Low-Resource Languages Using Parallel Corpora [PDF] Back to Contents
  Takashi Wada, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, Jey Han Lau
Abstract: We propose a new approach for learning contextualised cross-lingual word embeddings based only on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM-based encoder-decoder model that performs bidirectional translation and reconstruction of the input sentence. Through sharing model parameters among different languages, our model jointly trains the word embeddings in a common multilingual space. We also propose a simple method to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an encoder-decoder translation model is beneficial for learning cross-lingual representations, even in extremely low-resource scenarios.

18. On the diminishing return of labeling clinical reports [PDF] Back to Contents
  Jean-Baptiste Lamare, Tobi Olatunji, Li Yao
Abstract: Ample evidence suggests that better machine learning models may be steadily obtained by training on increasingly larger datasets on natural language processing (NLP) problems from non-medical domains. Whether the same holds true for medical NLP has by far not been thoroughly investigated. This work shows that this is indeed not always the case. We reveal the somehow counter-intuitive observation that performant medical NLP models may be obtained with small amount of labeled data, quite the opposite to the common belief, most likely due to the domain specificity of the problem. We show quantitatively the effect of training data size on a fixed test set composed of two of the largest public chest x-ray radiology report datasets on the task of abnormality classification. The trained models not only make use of the training data efficiently, but also outperform the current state-of-the-art rule-based systems by a significant margin.

19. Predicting Themes within Complex Unstructured Texts: A Case Study on Safeguarding Reports [PDF] Back to Contents
  Aleksandra Edwards, David Rogers, Jose Camacho-Collados, Hélène de Ribaupierre
Abstract: The task of text and sentence classification is associated with the need for large amounts of labelled training data. The acquisition of high volumes of labelled datasets can be expensive or unfeasible, especially for highly-specialised domains for which documents are hard to obtain. Research on the application of supervised classification based on small amounts of training data is limited. In this paper, we address the combination of state-of-the-art deep learning and classification methods and provide an insight into what combination of methods fit the needs of small, domain-specific, and terminologically-rich corpora. We focus on a real-world scenario related to a collection of safeguarding reports comprising learning experiences and reflections on tackling serious incidents involving children and vulnerable adults. The relatively small volume of available reports and their use of highly domain-specific terminology makes the application of automated approaches difficult. We focus on the problem of automatically identifying the main themes in a safeguarding report using supervised classification approaches. Our results show the potential of deep learning models to simulate subject-expert behaviour even for complex tasks with limited labelled data.

20. WNUT-2020 Task 1 Overview: Extracting Entities and Relations from Wet Lab Protocols [PDF] Back to Contents
  Jeniya Tabassum, Sydney Lee, Wei Xu, Alan Ritter
Abstract: This paper presents the results of the wet lab information extraction task at WNUT 2020. This task consisted of two sub tasks: (1) a Named Entity Recognition (NER) task with 13 participants and (2) a Relation Extraction (RE) task with 2 participants. We outline the task, data annotation process, corpus statistics, and provide a high-level overview of the participating systems for each sub task.

21. Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus [PDF] Back to Contents
  Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna
Abstract: Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.
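
A wordlist-based tunable-precision filter of the kind mentioned above can be as simple as keeping a sentence only if a large enough fraction of its tokens appears in a curated wordlist for the predicted language; the sketch below shows that idea with a hypothetical threshold parameter, and is not the released filters themselves.

```python
def passes_wordlist_filter(sentence, wordlist, min_in_vocab=0.6):
    """Keep a sentence if enough of its tokens appear in the curated wordlist.

    sentence:     raw text, whitespace-tokenised here for simplicity.
    wordlist:     set of known words for the language predicted by the LangID model.
    min_in_vocab: tunable precision/recall knob (hypothetical default).
    """
    tokens = sentence.lower().split()
    if not tokens:
        return False
    in_vocab = sum(tok in wordlist for tok in tokens)
    return in_vocab / len(tokens) >= min_in_vocab
```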

22. Strongly Incremental Constituency Parsing with Graph Neural Networks [PDF] Back to Contents
  Kaiyu Yang, Jia Deng
Abstract: Parsing sentences into syntax trees can benefit downstream applications in NLP. Transition-based parsers build trees by executing actions in a state transition system. They are computationally efficient, and can leverage machine learning to predict actions based on partial trees. However, existing transition-based parsers are predominantly based on the shift-reduce transition system, which does not align with how humans are known to parse sentences. Psycholinguistic research suggests that human parsing is strongly incremental: humans grow a single parse tree by adding exactly one token at each step. In this paper, we propose a novel transition system called attach-juxtapose. It is strongly incremental; it represents a partial sentence using a single tree; each action adds exactly one token into the partial tree. Based on our transition system, we develop a strongly incremental parser. At each step, it encodes the partial tree using a graph neural network and predicts an action. We evaluate our parser on Penn Treebank (PTB) and Chinese Treebank (CTB). On PTB, it outperforms existing parsers trained with only constituency trees; and it performs on par with state-of-the-art parsers that use dependency trees as additional training data. On CTB, our parser establishes a new state of the art. Code is available at this https URL.

23. DGST: a Dual-Generator Network for Text Style Transfer [PDF] Back to Contents
  Xiao Li, Guanyi Chen, Chenghua Lin, Ruizhe Li
Abstract: We propose DGST, a novel and simple Dual-Generator network architecture for text Style Transfer. Our model employs two generators only, and does not rely on any discriminators or parallel corpus for training. Both quantitative and qualitative experiments on the Yelp and IMDb datasets show that our model gives competitive performance compared to several strong baselines with more complicated architecture designs.

24. Unmasking Contextual Stereotypes: Measuring and Mitigating BERT's Gender Bias [PDF] Back to Contents
  Marion Bartl, Malvina Nissim, Albert Gatt
Abstract: Contextualized word embeddings have been replacing standard embeddings as the representational knowledge source of choice in NLP systems. Since a variety of biases have previously been found in standard word embeddings, it is crucial to assess biases encoded in their replacements as well. Focusing on BERT (Devlin et al., 2018), we measure gender bias by studying associations between gender-denoting target words and names of professions in English and German, comparing the findings with real-world workforce statistics. We mitigate bias by fine-tuning BERT on the GAP corpus (Webster et al., 2018), after applying Counterfactual Data Substitution (CDS) (Maudslay et al., 2019). We show that our method of measuring bias is appropriate for languages such as English, but not for languages with a rich morphology and gender-marking, such as German. Our results highlight the importance of investigating bias and mitigation techniques cross-linguistically, especially in view of the current emphasis on large-scale, multilingual language models.
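
Association measurements of this kind are often probed with template sentences and a masked-language-model head, comparing the probability a model assigns to gendered fillers in the context of a profession; the snippet below is a generic probe in that style using the Hugging Face `transformers` fill-mask pipeline, and is not the paper's exact protocol or template set.

```python
from transformers import pipeline

# Fill-mask probe: compare scores of gendered pronouns in a profession context.
fill = pipeline("fill-mask", model="bert-base-uncased")

for template in ["[MASK] is a nurse.", "[MASK] is an engineer."]:
    results = fill(template, targets=["he", "she"])
    scores = {r["token_str"]: round(r["score"], 4) for r in results}
    print(template, scores)
```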

25. The geometry of integration in text classification RNNs [PDF] Back to Contents
  Kyle Aitken, Vinay V. Ramasesh, Ankush Garg, Yuan Cao, David Sussillo, Niru Maheswaranathan
Abstract: Despite the widespread application of recurrent neural networks (RNNs) across a variety of tasks, a unified understanding of how RNNs solve these tasks remains elusive. In particular, it is unclear what dynamical patterns arise in trained RNNs, and how those patterns depend on the training dataset or task. This work addresses these questions in the context of a specific natural language processing task: text classification. Using tools from dynamical systems analysis, we study recurrent networks trained on a battery of both natural and synthetic text classification tasks. We find the dynamics of these trained RNNs to be both interpretable and low-dimensional. Specifically, across architectures and datasets, RNNs accumulate evidence for each class as they process the text, using a low-dimensional attractor manifold as the underlying mechanism. Moreover, the dimensionality and geometry of the attractor manifold are determined by the structure of the training dataset; in particular, we describe how simple word-count statistics computed on the training dataset can be used to predict these properties. Our observations span multiple architectures and datasets, reflecting a common mechanism RNNs employ to perform text classification. To the degree that integration of evidence towards a decision is a common computational primitive, this work lays the foundation for using dynamical systems techniques to study the inner workings of RNNs.

26. Fixed-Length Protein Embeddings using Contextual Lenses [PDF] Back to Contents
  Amir Shanehsazzadeh, David Belanger, David Dohan
Abstract: The Basic Local Alignment Search Tool (BLAST) is currently the most popular method for searching databases of biological sequences. BLAST compares sequences via similarity defined by a weighted edit distance, which results in it being computationally expensive. As opposed to working with edit distance, a vector similarity approach can be accelerated substantially using modern hardware or hashing techniques. Such an approach would require fixed-length embeddings for biological sequences. There has been recent interest in learning fixed-length protein embeddings using deep learning models under the hypothesis that the hidden layers of supervised or semi-supervised models could produce potentially useful vector embeddings. We consider transformer (BERT) protein language models that are pretrained on the TrEMBL data set and learn fixed-length embeddings on top of them with contextual lenses. The embeddings are trained to predict the family a protein belongs to for sequences in the Pfam database. We show that for nearest-neighbor family classification, pretraining offers a noticeable boost in performance and that the corresponding learned embeddings are competitive with BLAST. Furthermore, we show that the raw transformer embeddings, obtained via static pooling, do not perform well on nearest-neighbor family classification, which suggests that learning embeddings in a supervised manner via contextual lenses may be a compute-efficient alternative to fine-tuning.

27. Measuring non-trivial compositionality in emergent communication [PDF] Back to Contents
  Tomasz Korbak, Julian Zubek, Joanna Rączaszek-Leonardi
Abstract: Compositionality is an important explanatory target in emergent communication and language evolution. The vast majority of computational models of communication account for the emergence of only a very basic form of compositionality: trivial compositionality. A compositional protocol is trivially compositional if the meaning of a complex signal (e.g. blue circle) boils down to the intersection of meanings of its constituents (e.g. the intersection of the set of blue objects and the set of circles). A protocol is non-trivially compositional (NTC) if the meaning of a complex signal (e.g. biggest apple) is a more complex function of the meanings of their constituents. In this paper, we review several metrics of compositionality used in emergent communication and experimentally show that most of them fail to detect NTC - i.e. they treat non-trivial compositionality as a failure of compositionality. The one exception is tree reconstruction error, a metric motivated by formal accounts of compositionality. These results emphasise important limitations of emergent communication research that could hamper progress on modelling the emergence of NTC.
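
The contrast between trivial and non-trivial compositionality can be stated compactly; the formulation below simply restates the abstract's own examples in symbols and adds nothing beyond them.

```latex
% Trivial compositionality: the meaning of a complex signal is the intersection
% of its constituents' meanings.
\llbracket \text{blue circle} \rrbracket \;=\; \llbracket \text{blue} \rrbracket \cap \llbracket \text{circle} \rrbracket

% Non-trivial compositionality: the meaning is a more complex function f of the
% constituents' meanings.
\llbracket \text{biggest apple} \rrbracket \;=\; f\big(\llbracket \text{biggest} \rrbracket, \llbracket \text{apple} \rrbracket\big), \qquad f \neq \cap
```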

28. Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [PDF] Back to Contents
  Xingchen Song, Zhiyong Wu, Yiheng Huang, Chao Weng, Dan Su, Helen Meng
Abstract: Non-autoregressive (NAR) transformer models have achieved significantly inference speedup but at the cost of inferior accuracy compared to autoregressive (AR) models in automatic speech recognition (ASR). Most of the NAR transformers take a fixed-length sequence filled with MASK tokens or a redundant sequence copied from encoder states as decoder input, they cannot provide efficient target-side information thus leading to accuracy degradation. To address this problem, we propose a CTC-enhanced NAR transformer, which generates target sequence by refining predictions of the CTC module. Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding speed than a strong AR baseline with only 0.0 ~ 0.3 absolute CER degradation on Aishell-1 and Aishell-2 datasets.

29. A Cyclic Proof System for HFLN [PDF] Back to Contents
  Mayuko Kori, Takeshi Tsukada, Naoki Kobayashi
Abstract: A cyclic proof system allows us to perform inductive reasoning without explicit inductions. We propose a cyclic proof system for HFLN, which is a higher-order predicate logic with natural numbers and alternating fixed-points. Ours is the first cyclic proof system for a higher-order logic, to our knowledge. Due to the presence of higher-order predicates and alternating fixed-points, our cyclic proof system requires a more delicate global condition on cyclic proofs than the original system of Brotherston and Simpson. We prove the decidability of checking the global condition and soundness of this system, and also prove a restricted form of standard completeness for an infinitary variant of our cyclic proof system. A potential application of our cyclic proof system is semi-automated verification of higher-order programs, based on Kobayashi et al.'s recent work on reductions from program verification to HFLN validity checking.

30. INT8 Winograd Acceleration for Conv1D Equipped ASR Models Deployed on Mobile Devices [PDF] Back to Contents
  Yiwu Yao, Yuchao Li, Chengyu Wang, Tianhang Yu, Houjiang Chen, Xiaotang Jiang, Jun Yang, Jun Huang, Wei Lin, Hui Shu, Chengfei Lv
Abstract: The intensive computation of Automatic Speech Recognition (ASR) models obstructs them from being deployed on mobile devices. In this paper, we present a novel quantized Winograd optimization pipeline, which combines the quantization and fast convolution to achieve efficient inference acceleration on mobile devices for ASR models. To avoid the information loss due to the combination of quantization and Winograd convolution, a Range-Scaled Quantization (RSQ) training method is proposed to expand the quantized numerical range and to distill knowledge from high-precision values. Moreover, an improved Conv1D equipped DFSMN (ConvDFSMN) model is designed for mobile deployment. We conduct extensive experiments on both ConvDFSMN and Wav2letter models. Results demonstrate the models can be effectively optimized with the proposed pipeline. Especially, Wav2letter achieves 1.48* speedup with an approximate 0.07% WER decrease on ARMv7-based mobile devices.

31. PPG-based singing voice conversion with adversarial representation learning [PDF] Back to Contents
  Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Ling Xu, Chen Shen, Zejun Ma
Abstract: Singing voice conversion (SVC) aims to convert the voice of one singer to that of other singers while keeping the singing content and melody. On top of recent voice conversion works, we propose a novel model to steadily convert songs while keeping their naturalness and intonation. We build an end-to-end architecture, taking phonetic posteriorgrams (PPGs) as inputs and generating mel spectrograms. Specifically, we implement two separate encoders: one encodes PPGs as content, and the other compresses mel spectrograms to supply acoustic and musical information. To improve the performance on timbre and melody, an adversarial singer confusion module and a mel-regressive representation learning module are designed for the model. Objective and subjective experiments are conducted on our private Chinese singing corpus. Comparing with the baselines, our methods can significantly improve the conversion performance in terms of naturalness, melody, and voice similarity. Moreover, our PPG-based method is proved to be robust for noisy sources.

32. Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition [PDF] Back to Contents
  Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Zhengqi wen
Abstract: Despite the recent significant advances witnessed in end-to-end (E2E) ASR system for code-switching, hunger for audio-text paired data limits the further improvement of the models' performance. In this paper, we propose a decoupled transformer model to use monolingual paired data and unpaired text data to alleviate the problem of code-switching data shortage. The model is decoupled into two parts: audio-to-phoneme (A2P) network and phoneme-to-text (P2T) network. The A2P network can learn acoustic pattern scenarios using large-scale monolingual paired data. Meanwhile, it generates multiple phoneme sequence candidates for single audio data in real-time during the training process. Then the generated phoneme-text paired data is used to train the P2T network. This network can be pre-trained with large amounts of external unpaired text data. By using monolingual data and unpaired text data, the decoupled transformer model reduces the high dependency on code-switching paired training data of E2E model to a certain extent. Finally, the two networks are optimized jointly through attention fusion. We evaluate the proposed method on the public Mandarin-English code-switching dataset. Compared with our transformer baseline, the proposed method achieves 18.14% relative mix error rate reduction.

33. Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [PDF] Back to Contents
  Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li
Abstract: Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained speech emotion recognition (SER) model to transfer emotional style during training and at run-time inference. In this way, the network is able to transfer both seen and unseen emotional style to a new utterance. We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework. This paper also marks the release of an emotional speech dataset (ESD) for voice conversion, which has multiple speakers and languages.

34. CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition [PDF] Back to Contents
  Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao
Abstract: We propose a CTC alignment-based single step non-autoregressive transformer (CASS-NAT) for speech recognition. Specifically, the CTC alignment contains the information of (a) the number of tokens for decoder input, and (b) the time span of acoustics for each token. The information are used to extract acoustic representation for each token in parallel, referred to as token-level acoustic embedding which substitutes the word embedding in autoregressive transformer (AT) to achieve parallel generation in decoder. During inference, an error-based alignment sampling method is proposed to be applied to the CTC output space, reducing the WER and retaining the parallelism as well. Experimental results show that the proposed method achieves WERs of 3.8%/9.1% on Librispeech test clean/other dataset without an external LM, and a CER of 5.8% on Aishell1 Mandarin corpus, respectively1. Compared to the AT baseline, the CASS-NAT has a performance reduction on WER, but is 51.2x faster in terms of RTF. When decoding with an oracle CTC alignment, the lower bound of WER without LM reaches 2.3% on the test-clean set, indicating the potential of the proposed method.
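
Reading the token count and per-token time spans off a CTC alignment amounts to collapsing repeated labels and dropping blanks while remembering which frames each surviving token covered; the Python sketch below shows that decoding step in isolation (the frame indices, toy labels, and blank id are assumptions).

```python
def ctc_alignment_to_spans(alignment, blank=0):
    """Collapse a frame-level CTC alignment into (token, start_frame, end_frame) spans.

    alignment: list of per-frame label ids, e.g. [0, 7, 7, 0, 3, 3, 3, 0]
    Returns the token sequence with the acoustic span each token covers.
    """
    spans = []
    prev = None
    for t, label in enumerate(alignment):
        if label == blank:
            prev = None            # a blank separates repeated tokens
            continue
        if label == prev:
            spans[-1][2] = t       # extend the current token's span
        else:
            spans.append([label, t, t])
            prev = label
    return [(tok, start, end) for tok, start, end in spans]

print(ctc_alignment_to_spans([0, 7, 7, 0, 3, 3, 3, 0]))  # [(7, 1, 2), (3, 4, 6)]
```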

35. Scaling Laws for Autoregressive Generative Modeling [PDF] Back to Contents
  Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, Sam McCandlish
Abstract: We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information theoretic interpretation as $S($True$) + D_{\mathrm{KL}}($True$||$Model$)$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an $8\times 8$ resolution, and we can forecast the model size needed to achieve any given reducible loss (ie $D_{\mathrm{KL}}$) in nats/image for other resolutions. We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.
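
The "power-law plus constant" trend and the loss decomposition quoted above can be written out explicitly; the functional form below is the generic one with placeholder constants, not the paper's fitted values.

```latex
% Power-law-plus-constant trend in model size N (analogously for compute),
% with placeholder constants L_\infty, N_0 and exponent \alpha:
L(N) \;=\; L_\infty \;+\; \left(\frac{N_0}{N}\right)^{\alpha}

% Information-theoretic reading of the cross-entropy loss:
L \;=\; S(\mathrm{True}) \;+\; D_{\mathrm{KL}}\!\left(\mathrm{True}\,\|\,\mathrm{Model}\right)
% so L_\infty estimates the true distribution's entropy S(True), and the
% power-law term estimates the reducible KL divergence.
```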

36. Cascaded encoders for unifying streaming and non-streaming ASR [PDF] Back to Contents
  Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman
Abstract: End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown competitive performance on several benchmarks. These models are structured to either operate in streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoders. Input features are first processed by the streaming encoder; the non-streaming encoder operates exclusively on the output of the streaming encoder. A single decoder then learns to decode either using the output of the streaming or the non-streaming encoder. Results show that this model achieves similar word error rates (WER) as a standalone streaming model when operating in streaming mode, and obtains 10% -- 27% relative improvement when operating in non-streaming mode. Our results also show that the proposed approach outperforms existing E2E two-pass models, especially on long-form speech.

37. A Comprehensive Dictionary and Term Variation Analysis for COVID-19 and SARS-CoV-2 [PDF] Back to Contents
  Robert Leaman, Zhiyong Lu
Abstract: The number of unique terms in the scientific literature used to refer to either SARS-CoV-2 or COVID-19 is remarkably large and has continued to increase rapidly despite well-established standardized terms. This high degree of term variation makes high recall identification of these important entities difficult. In this manuscript we present an extensive dictionary of terms used in the literature to refer to SARS-CoV-2 and COVID-19. We use a rule-based approach to iteratively generate new term variants, then locate these variants in a large text corpus. We compare our dictionary to an extensive collection of terminological resources, demonstrating that our resource provides a substantial number of additional terms. We use our dictionary to analyze the usage of SARS-CoV-2 and COVID-19 terms over time and show that the number of unique terms continues to grow rapidly. Our dictionary is freely available at this https URL.
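
Rule-based variant generation of the sort described can be pictured as applying simple orthographic rewrites (hyphen/space/case changes) to a seed term and keeping any rewrite attested in the corpus; the sketch below generates a few such surface variants for one seed term and is purely illustrative, not the released dictionary's rule set.

```python
import itertools

def surface_variants(term):
    """Generate simple orthographic variants of a term (hyphen/space/case rewrites)."""
    separators = ["-", " ", ""]
    parts = term.replace("-", " ").split()
    variants = set()
    for seps in itertools.product(separators, repeat=max(len(parts) - 1, 0)):
        joined = parts[0] + "".join(s + p for s, p in zip(seps, parts[1:]))
        variants.update({joined, joined.lower(), joined.upper()})
    return sorted(variants)

print(surface_variants("SARS-CoV-2"))
# e.g. ['SARS CoV 2', 'SARS-CoV-2', 'SARSCOV2', 'sars cov 2', ...]
```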
