目录
1. Commonsense Knowledge Graph Reasoning by Selection or Generation? Why? [PDF] 摘要
2. On the Importance of Local Information in Transformer Based Models [PDF] 摘要
3. MASRI-HEADSET: A Maltese Corpus for Speech Recognition [PDF] 摘要
4. MICE: Mining Idioms with Contextual Embeddings [PDF] 摘要
5. Exploration of Gender Differences in COVID-19 Discourse on Reddit [PDF] 摘要
6. Dialogue State Induction Using Neural Latent Variable Models [PDF] 摘要
7. Cognitive Representation Learning of Self-Media Online Article Quality [PDF] 摘要
8. Ranking Enhanced Dialogue Generation [PDF] 摘要
9. Semantics-preserving adversarial attacks in NLP [PDF] 摘要
10. Continuous Speech Separation with Conformer [PDF] 摘要
11. Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition [PDF] 摘要
12. The COVID-19 Infodemic: Can the Crowd Judge Recent Misinformation Objectively? [PDF] 摘要
13. Large-scale Transfer Learning for Low-resource Spoken Language Understanding [PDF] 摘要
14. Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit [PDF] 摘要
15. Online Automatic Speech Recognition with Listen, Attend and Spell Model [PDF] 摘要
1. Commonsense Knowledge Graph Reasoning by Selection or Generation? Why? [PDF] 返回目录
Cunxiang Wang, Jinhang Wu, Luxin Liu, Yue Zhang
Abstract: Commonsense knowledge graph reasoning (CKGR) is the task of predicting a missing entity given one existing entity and the relation in a commonsense knowledge graph (CKG). Existing methods can be classified into two categories: generation methods and selection methods. Each method has its own advantage. We theoretically and empirically compare the two methods, finding the selection method is more suitable than the generation method in CKGR. Given this observation, we further combine the structure of neural Text Encoder and Knowledge Graph Embedding models to solve the selection method's two problems, achieving competitive results. We provide a basic framework and baseline model for subsequent CKGR tasks using selection methods.
摘要:常识知识图推理(CKGR)是在常识知识图(CKG)中,给定一个已有实体和关系来预测缺失实体的任务。现有方法可分为两类:生成方法和选择方法,二者各有优势。我们从理论和实验两方面比较了这两类方法,发现在CKGR中选择方法比生成方法更合适。基于这一观察,我们进一步结合神经文本编码器和知识图嵌入模型的结构,解决了选择方法的两个问题,取得了有竞争力的结果。我们为后续基于选择方法的CKGR任务提供了一个基本框架和基线模型。
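The abstract does not detail how the Text Encoder and Knowledge Graph Embedding (KGE) components are combined. Below is a minimal, hypothetical sketch of a selection-style scorer that embeds the (entity, relation) query with a text encoder and ranks candidate tail entities with a KGE-style bilinear score; all names and dimensions are illustrative, not the paper's design.

```python
# Hypothetical sketch of a selection-style CKGR scorer: a text encoder embeds the
# (head entity, relation) query, and a KGE-style bilinear score ranks candidate tails.
import torch
import torch.nn as nn

class SelectionScorer(nn.Module):
    def __init__(self, vocab_size, num_entities, dim=128):
        super().__init__()
        self.query_encoder = nn.EmbeddingBag(vocab_size, dim)  # stand-in for a neural text encoder
        self.entity_emb = nn.Embedding(num_entities, dim)      # KGE-style candidate entity table
        self.bilinear = nn.Parameter(torch.eye(dim))            # simplified, relation-agnostic bilinear form

    def forward(self, query_token_ids, candidate_ids):
        q = self.query_encoder(query_token_ids)                 # (batch, dim)
        c = self.entity_emb(candidate_ids)                      # (batch, num_cand, dim)
        return torch.einsum("bd,de,bne->bn", q, self.bilinear, c)  # one score per candidate

model = SelectionScorer(vocab_size=1000, num_entities=500)
scores = model(torch.randint(0, 1000, (2, 6)), torch.randint(0, 500, (2, 10)))
print(scores.shape)  # (2, 10)
```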
2. On the Importance of Local Information in Transformer Based Models [PDF] 返回目录
Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar, Mitesh M. Khapra
Abstract: The self-attention module is a key component of Transformer-based models, wherein each token pays attention to every other token. Recent studies have shown that these heads exhibit syntactic, semantic, or local behaviour. Some studies have also identified promise in restricting this attention to be local, i.e., a token attending to other tokens only in a small neighbourhood around it. However, no conclusive evidence exists that such local attention alone is sufficient to achieve high accuracy on multiple NLP tasks. In this work, we systematically analyse the role of locality information in learnt models and contrast it with the role of syntactic information. More specifically, we first do a sensitivity analysis and show that, at every layer, the representation of a token is much more sensitive to tokens in a small neighborhood around it than to tokens which are syntactically related to it. We then define an attention bias metric to determine whether a head pays more attention to local tokens or to syntactically related tokens. We show that a larger fraction of heads have a locality bias as compared to a syntactic bias. Having established the importance of local attention heads, we train and evaluate models where varying fractions of the attention heads are constrained to be local. Such models would be more efficient as they would have fewer computations in the attention layer. We evaluate these models on 4 GLUE datasets (QQP, SST-2, MRPC, QNLI) and 2 MT datasets (En-De, En-Ru) and clearly demonstrate that such constrained models have comparable performance to the unconstrained models. Through this systematic evaluation we establish that attention in Transformer-based models can be constrained to be local without affecting performance.
摘要:自注意力模块是基于Transformer的模型的关键组成部分,其中每个词元都会关注其他所有词元。最近的研究表明,这些注意力头表现出句法、语义或局部性行为。一些研究还指出,将注意力限制为局部(即词元只关注其周围小邻域内的其他词元)是有前景的。然而,目前尚无确凿证据表明仅靠这种局部注意力就足以在多个NLP任务上取得高精度。在这项工作中,我们系统分析了局部信息在已训练模型中的作用,并将其与句法信息的作用进行对比。具体来说,我们首先进行敏感性分析,表明在每一层中,词元的表示对其周围小邻域内的词元远比对与其句法相关的词元更敏感。随后我们定义了一个注意力偏向指标,用于判断某个注意力头更关注局部词元还是句法相关词元,并发现具有局部性偏向的注意力头比具有句法偏向的头占比更大。在确立了局部注意力头的重要性之后,我们训练并评估了将不同比例的注意力头约束为局部的模型;这类模型在注意力层的计算量更少,因而更高效。我们在4个GLUE数据集(QQP、SST-2、MRPC、QNLI)和2个机器翻译数据集(En-De、En-Ru)上评估这些模型,结果清楚地表明,这种受约束的模型与不受约束的模型性能相当。通过这一系统评估,我们确认基于Transformer的模型中的注意力可以被约束为局部而不影响性能。
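As a rough illustration of the locality constraint described above (not the authors' code), the following sketch masks self-attention so each token can only attend within a small window; the window size and shapes are illustrative.

```python
# Minimal sketch of a locality constraint on self-attention: each token may only
# attend to tokens within +/- w positions.
import torch

def local_attention_mask(seq_len: int, w: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    keep = (idx[None, :] - idx[:, None]).abs() <= w               # True inside the local window
    return torch.zeros(seq_len, seq_len).masked_fill(~keep, float("-inf"))

def local_self_attention(q, k, v, w=3):
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores + local_attention_mask(q.size(-2), w)          # mask out non-local positions
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 16, 64)            # (batch, tokens, dim); toy single head with q = k = v = x
print(local_self_attention(x, x, x).shape)  # (2, 16, 64)
```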
3. MASRI-HEADSET: A Maltese Corpus for Speech Recognition [PDF] 返回目录
Carlos Mena, Albert Gatt, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, Amanda Muscat, Ian Padovani
Abstract: Maltese, the national language of Malta, is spoken by approximately 500,000 people. Speech processing for Maltese is still in its early stages of development. In this paper, we present the first spoken Maltese corpus designed purposely for Automatic Speech Recognition (ASR). The MASRI-HEADSET corpus was developed by the MASRI project at the University of Malta. It consists of 8 hours of speech paired with text, recorded by using short text snippets in a laboratory environment. The speakers were recruited from different geographical locations all over the Maltese islands, and were roughly evenly distributed by gender. This paper also presents some initial results achieved in baseline experiments for Maltese ASR using Sphinx and Kaldi. The MASRI-HEADSET Corpus is publicly available for research/academic purposes.
摘要:马耳他语是马耳他的国语,约有50万人使用。面向马耳他语的语音处理仍处于发展初期。在本文中,我们提出了首个专为自动语音识别(ASR)设计的马耳他语口语语料库。MASRI-HEADSET语料库由马耳他大学的MASRI项目开发,包含8小时与文本配对的语音,在实验室环境中使用简短文本片段录制而成。说话人来自马耳他群岛各地不同的地理位置,性别分布大致均衡。本文还介绍了使用Sphinx和Kaldi进行马耳他语ASR基线实验所取得的一些初步结果。MASRI-HEADSET语料库可公开用于研究/学术目的。
4. MICE: Mining Idioms with Contextual Embeddings [PDF] 返回目录
Tadej Škvorc, Polona Gantar, Marko Robnik-Šikonja
Abstract: Idiomatic expressions can be problematic for natural language processing applications as their meaning cannot be inferred from their constituting words. A lack of successful methodological approaches and sufficiently large datasets prevents the development of machine learning approaches for detecting idioms, especially for expressions that do not occur in the training set. We present an approach, called MICE, that uses contextual embeddings for that purpose. We present a new dataset of multi-word expressions with literal and idiomatic meanings and use it to train a classifier based on two state-of-the-art contextual word embeddings: ELMo and BERT. We show that deep neural networks using both embeddings perform much better than existing approaches, and are capable of detecting idiomatic word use, even for expressions that were not present in the training set. We demonstrate cross-lingual transfer of developed models and analyze the size of the required dataset.
摘要:习语的含义无法从其组成词语推断出来,因此可能给自然语言处理应用带来困难。由于缺乏成功的方法论和足够大的数据集,难以开发用于检测习语的机器学习方法,尤其是针对训练集中未出现的表达。我们提出了一种名为MICE的方法,利用上下文词嵌入来完成这一任务。我们构建了一个包含字面义和习语义的多词表达新数据集,并基于两种最先进的上下文词嵌入(ELMo和BERT)训练分类器。我们表明,同时使用这两种嵌入的深度神经网络的表现远优于现有方法,并且即使对于训练集中未出现的表达,也能够检测出词语的习语用法。我们还展示了所开发模型的跨语言迁移能力,并分析了所需数据集的规模。
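A hedged sketch of the general recipe (contextual embeddings plus a classifier); the paper's exact architecture may differ, and the untrained linear head below is only a placeholder.

```python
# Hedged sketch (not the paper's exact architecture): pool BERT contextual embeddings
# over a sentence containing a multi-word expression and classify literal vs. idiomatic
# use with a small linear head. The model name and pooling choice are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, 2)    # 0 = literal, 1 = idiomatic (untrained here)

def predict(sentence: str) -> int:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, tokens, hidden)
    pooled = hidden.mean(dim=1)                           # mean-pool the contextual embeddings
    return classifier(pooled).argmax(dim=-1).item()       # illustration only; head is not trained

print(predict("He finally kicked the bucket."))
```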
5. Exploration of Gender Differences in COVID-19 Discourse on Reddit [PDF] 返回目录
Jai Aggarwal, Ella Rabinovich, Suzanne Stevenson
Abstract: Decades of research on differences in the language of men and women have established postulates about preferences in lexical, topical, and emotional expression between the two genders, along with their sociological underpinnings. Using a novel dataset of male and female linguistic productions collected from the Reddit discussion platform, we further confirm existing assumptions about gender-linked affective distinctions, and demonstrate that these distinctions are amplified in social media postings involving emotionally-charged discourse related to COVID-19. Our analysis also confirms considerable differences in topical preferences between male and female authors in spontaneous pandemic-related discussions.
摘要:数十年来关于男性和女性语言差异的研究,确立了关于两性在词汇、话题和情感表达上偏好差异的假设及其社会学基础。我们使用从Reddit讨论平台收集的男性和女性语言产出的新数据集,进一步证实了关于与性别相关的情感差异的已有假设,并表明在涉及COVID-19的情绪化讨论的社交媒体帖子中,这些差异被放大了。我们的分析还证实,在与疫情相关的自发讨论中,男性和女性作者在话题偏好上存在相当大的差异。
6. Dialogue State Induction Using Neural Latent Variable Models [PDF] 返回目录
Qingkai Min, Libo Qin, Zhiyang Teng, Xiao Liu, Yue Zhang
Abstract: Dialogue state modules are a useful component in a task-oriented dialogue system. Traditional methods find dialogue states by manually labeling training corpora, upon which neural models are trained. However, the labeling process can be costly, slow, error-prone, and more importantly, cannot cover the vast range of domains in real-world dialogues for customer service. We propose the task of dialogue state induction, building two neural latent variable models that mine dialogue states automatically from unlabeled customer service dialogue records. Results show that the models can effectively find meaningful slots. In addition, equipped with induced dialogue states, a state-of-the-art dialogue system gives better performance compared with not using a dialogue state module.
摘要:对话状态模块是面向任务的对话系统中的有用组件。传统方法通过人工标注训练语料来获得对话状态,并在其上训练神经模型。然而,标注过程成本高、速度慢、容易出错,更重要的是无法覆盖现实世界客服对话中的大量领域。我们提出对话状态归纳任务,构建了两个神经潜变量模型,从无标注的客服对话记录中自动挖掘对话状态。结果表明,这些模型能够有效地发现有意义的槽位。此外,与不使用对话状态模块相比,配备归纳得到的对话状态后,最先进的对话系统可以获得更好的性能。
7. Cognitive Representation Learning of Self-Media Online Article Quality [PDF] 返回目录
Yiru Wang, Shen Huang, Gongfu Li, Qiang Deng, Dongliang Liao, Pengda Si, Yujiu Yang, Jin Xu
Abstract: The automatic quality assessment of self-media online articles is an urgent and new issue, which is of great value to the online recommendation and search. Different from traditional and well-formed articles, self-media online articles are mainly created by users, which have the appearance characteristics of different text levels and multi-modal hybrid editing, along with the potential characteristics of diverse content, different styles, large semantic spans and good interactive experience requirements. To solve these challenges, we establish a joint model CoQAN in combination with the layout organization, writing characteristics and text semantics, designing different representation learning subnetworks, especially for the feature learning process and interactive reading habits on mobile terminals. It is more consistent with the cognitive style of expressing an expert's evaluation of articles. We have also constructed a large scale real-world assessment dataset. Extensive experimental results show that the proposed framework significantly outperforms state-of-the-art methods, and effectively learns and integrates different factors of the online article quality assessment.
摘要:自媒体网络文章的自动质量评估是一个紧迫的新问题,对在线推荐和搜索具有重要价值。与传统的、格式规范的文章不同,自媒体网络文章主要由用户创作,具有文本层级不一、多模态混合编辑等外观特征,以及内容多样、风格各异、语义跨度大和良好交互体验需求等潜在特征。为了解决这些挑战,我们结合版面组织、写作特征和文本语义建立了联合模型CoQAN,并针对移动端的特征学习过程和交互式阅读习惯设计了不同的表示学习子网络。该模型与专家评价文章的认知方式更加一致。我们还构建了一个大规模的真实世界评估数据集。大量实验结果表明,所提出的框架显著优于最先进的方法,并能有效地学习和整合网络文章质量评估的不同因素。
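As a rough, hypothetical illustration of the joint-model idea (separate subnetworks for layout, writing and semantics whose outputs are fused into one score), and not CoQAN's actual architecture:

```python
# Hedged sketch: three subnetworks encode layout organization, writing characteristics
# and text semantics; their outputs are fused into a single quality score.
# Subnetwork architectures and dimensions are illustrative.
import torch
import torch.nn as nn

class QualityModel(nn.Module):
    def __init__(self, layout_dim=16, writing_dim=8, text_dim=128, hidden=64):
        super().__init__()
        self.layout_net = nn.Sequential(nn.Linear(layout_dim, hidden), nn.ReLU())
        self.writing_net = nn.Sequential(nn.Linear(writing_dim, hidden), nn.ReLU())
        self.text_net = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.scorer = nn.Linear(3 * hidden, 1)               # fuse the three views

    def forward(self, layout_feats, writing_feats, text_feats):
        fused = torch.cat([self.layout_net(layout_feats),
                           self.writing_net(writing_feats),
                           self.text_net(text_feats)], dim=-1)
        return self.scorer(fused).squeeze(-1)                # one quality score per article

model = QualityModel()
print(model(torch.randn(4, 16), torch.randn(4, 8), torch.randn(4, 128)).shape)  # (4,)
```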
8. Ranking Enhanced Dialogue Generation [PDF] 返回目录
Changying Hao, Liang Pang, Yanyan Lan, Fei Sun, Jiafeng Guo, Xueqi Cheng
Abstract: How to effectively utilize the dialogue history is a crucial problem in multi-turn dialogue generation. Previous works usually employ various neural network architectures (e.g., recurrent neural networks, attention mechanisms, and hierarchical structures) to model the history. However, a recent empirical study by Sankar et al. has shown that these architectures lack the ability of understanding and modeling the dynamics of the dialogue history. For example, the widely used architectures are insensitive to perturbations of the dialogue history, such as words shuffling, utterances missing, and utterances reordering. To tackle this problem, we propose a Ranking Enhanced Dialogue generation framework in this paper. Besides the traditional representation encoder and response generation modules, an additional ranking module is introduced to model the ranking relation between the former utterance and consecutive utterances. Specifically, the former utterance and consecutive utterances are treated as query and corresponding documents, and both local and global ranking losses are designed in the learning process. In this way, the dynamics in the dialogue history can be explicitly captured. To evaluate our proposed models, we conduct extensive experiments on three public datasets, i.e., bAbI, PersonaChat, and JDC. Experimental results show that our models produce better responses in terms of both quantitative measures and human judgments, as compared with the state-of-the-art dialogue generation models. Furthermore, we give some detailed experimental analysis to show where and how the improvements come from.
摘要:如何有效利用对话历史是多轮对话生成中的关键问题。以往的工作通常采用各种神经网络结构(如循环神经网络、注意力机制和层级结构)来建模对话历史。然而,Sankar等人最近的实证研究表明,这些结构缺乏理解和建模对话历史动态变化的能力。例如,常用的结构对对话历史的扰动(如词语打乱、话语缺失和话语重排)并不敏感。为了解决这个问题,本文提出了一个排序增强的对话生成框架。除了传统的表示编码器和回复生成模块之外,还引入了一个额外的排序模块,用于建模前一话语与后续连续话语之间的排序关系。具体而言,将前一话语和后续连续话语分别视为查询和对应文档,并在学习过程中设计了局部和全局两种排序损失。通过这种方式,可以显式地捕捉对话历史中的动态变化。为了评估所提出的模型,我们在三个公开数据集(bAbI、PersonaChat和JDC)上进行了大量实验。实验结果表明,与最先进的对话生成模型相比,我们的模型在定量指标和人工评价两方面都能生成更好的回复。此外,我们还给出了一些详细的实验分析,说明改进来自何处以及如何产生。
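The abstract does not give the exact form of the local and global ranking losses; one plausible instantiation is a margin-based (hinge) ranking loss over utterance representations, sketched below with illustrative names.

```python
# Hedged sketch of a margin-based ranking loss over (query, positive, negative) utterance
# representations, one plausible form of the "local" ranking objective described;
# the paper's actual local/global losses may differ.
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(q, pos, neg, margin=0.2):
    """q, pos, neg: (batch, dim) utterance representations."""
    s_pos = F.cosine_similarity(q, pos, dim=-1)       # score of the true consecutive utterance
    s_neg = F.cosine_similarity(q, neg, dim=-1)       # score of a sampled distractor
    return F.relu(margin - s_pos + s_neg).mean()      # hinge: push s_pos above s_neg by the margin

q, pos, neg = (torch.randn(4, 256) for _ in range(3))
print(pairwise_ranking_loss(q, pos, neg).item())
```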
9. Semantics-preserving adversarial attacks in NLP [PDF] 返回目录
Rahul Singh, Tarun Joshi, Vijayan N. Nair, Agus Sudjianto
Abstract: We propose algorithms to create adversarial attacks to assess model robustness in text classification problems. They can be used to create white-box attacks and black-box attacks while at the same time preserving the semantics and syntax of the original text. The attacks cause a significant number of prediction flips in the white-box setting, and the same rule-based approach can be used in the black-box setting. In the black-box setting, the attacks created are able to reverse decisions of Transformer-based architectures.
摘要:我们提出了用于构造对抗攻击的算法,以评估模型在文本分类问题上的鲁棒性。这些算法可用于构造白盒攻击和黑盒攻击,同时保持原文的语义和句法。在白盒设置下,这些攻击会导致大量预测翻转,并且同样的基于规则的方法也可用于黑盒设置。在黑盒设置下,所构造的攻击能够使基于Transformer的架构改变其决策。
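The attack rules are not specified in the abstract; as a purely hypothetical illustration of a semantics-preserving, rule-based perturbation and of what a prediction "flip" means:

```python
# Hedged illustration (not the paper's algorithm): a rule-based, semantics-preserving
# perturbation that swaps words for fixed synonyms and checks whether the classifier's
# prediction flips. The synonym table and the classifier are stand-ins.
SYNONYMS = {"good": "great", "bad": "poor", "movie": "film"}     # hypothetical rule table

def perturb(text: str) -> str:
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

def attack_flips(classify, text: str) -> bool:
    """classify: callable returning a label; True if the synonym swap flips the prediction."""
    return classify(text) != classify(perturb(text))

toy_classifier = lambda s: int("good" in s)                      # toy stand-in model
print(perturb("a good movie"), attack_flips(toy_classifier, "a good movie"))
```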
10. Continuous Speech Separation with Conformer [PDF] 返回目录
Sanyuan Chen, Yu Wu, Zhuo Chen, Jinyu Li, Chengyi Wang, Shujie Liu, Ming Zhou
Abstract: Continuous speech separation plays a vital role in complicated speech related tasks such as conversation transcription. The separation model extracts a single speaker signal from a mixed speech. In this paper, we use transformer and conformer in lieu of recurrent neural networks in the separation system, as we believe capturing global information with the self-attention based method is crucial for the speech separation. Evaluating on the LibriCSS dataset, the conformer separation model achieves state of the art results, with a relative 23.5% word error rate (WER) reduction from bi-directional LSTM (BLSTM) in the utterance-wise evaluation and a 15.4% WER reduction in the continuous evaluation.
摘要:连续语音分离在诸如会话转写等复杂的语音相关任务中起着至关重要的作用。分离模型从混合语音中提取单个说话人的信号。在本文中,我们在分离系统中使用Transformer和Conformer来代替循环神经网络,因为我们认为利用基于自注意力的方法捕获全局信息对语音分离至关重要。在LibriCSS数据集上的评估表明,Conformer分离模型取得了最先进的结果:在逐句评估中,相对于双向LSTM(BLSTM)获得了23.5%的相对词错误率(WER)下降;在连续评估中获得了15.4%的相对WER下降。
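To make the reported numbers concrete, a relative WER reduction is measured against the baseline's own WER; the baseline value below is hypothetical, since the abstract only reports the relative figures.

```python
# What "23.5% relative WER reduction" means: a worked example with a hypothetical
# baseline WER (the abstract does not report the absolute BLSTM numbers).
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    return (baseline_wer - new_wer) / baseline_wer

baseline = 10.0                        # hypothetical BLSTM WER, in %
conformer = baseline * (1 - 0.235)     # applying the reported 23.5% relative reduction
print(conformer)                                   # 7.65
print(relative_reduction(baseline, conformer))     # 0.235
```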
11. Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition [PDF] 返回目录
Wenyong Huang, Wenchao Hu, Yu Ting Yeung, Xiao Chen
Abstract: Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR), and requires significantly less training time than RNN-based models. The original Transformer, with encoder-decoder architecture, is only suitable for offline ASR. It relies on an attention mechanism to learn alignments, and encodes input audio bidirectionally. The high computation cost of Transformer decoding also limits its use in production streaming systems. To make Transformer suitable for streaming ASR, we explore the Transducer framework as a streamable way to learn alignments. For audio encoding, we apply a unidirectional Transformer with interleaved convolution layers. The interleaved convolution layers are used for modeling future context, which is important to performance. To reduce computation cost, we gradually downsample the acoustic input, also with the interleaved convolution layers. Moreover, we limit the length of history context in self-attention to maintain constant computation cost for each decoding step. We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on the LibriSpeech dataset (3.6% WER on test-clean) without external language models. The performance is comparable to previously published streamable Transformer Transducer and strong hybrid streaming ASR systems, and is achieved with a smaller look-ahead window (140 ms), fewer parameters and a lower frame rate.
摘要:Transformer在自动语音识别(ASR)中已取得与最先进的端到端模型相当的性能,且所需训练时间显著少于基于RNN的模型。采用编码器-解码器结构的原始Transformer只适用于离线ASR:它依靠注意力机制学习对齐,并对输入音频进行双向编码。Transformer解码的高计算成本也限制了它在生产环境流式系统中的使用。为了使Transformer适用于流式ASR,我们探索将Transducer框架作为一种可流式学习对齐的方式。在音频编码方面,我们使用带有交错卷积层的单向Transformer;交错卷积层用于建模对性能至关重要的未来上下文。为了降低计算成本,我们同样利用交错卷积层对声学输入逐步降采样。此外,我们限制自注意力中历史上下文的长度,使每个解码步骤的计算开销保持恒定。我们表明,这一被命名为Conv-Transformer Transducer的结构,在不使用外部语言模型的情况下,在LibriSpeech数据集上取得了有竞争力的性能(test-clean上3.6%的WER)。其性能与此前发表的可流式Transformer Transducer以及强大的混合式流式ASR系统相当,并且是在更小的前瞻窗口(140 ms)、更少的参数和更低的帧率下实现的。
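A hedged sketch of two of the streaming ingredients named above: a causal self-attention mask with a fixed history window (so per-step cost stays constant), and strided convolutions that lower the acoustic frame rate. Kernel sizes, strides and dimensions are illustrative, not the paper's configuration.

```python
# Hedged sketch: (1) causal self-attention limited to a fixed history window and
# (2) strided convolutions that downsample the acoustic frame rate (4x here).
import torch
import torch.nn as nn

def limited_history_mask(seq_len: int, history: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    dist = idx[:, None] - idx[None, :]                      # how far in the past each key is
    keep = (dist >= 0) & (dist <= history)                  # causal and within the history window
    return torch.zeros(seq_len, seq_len).masked_fill(~keep, float("-inf"))

downsample = nn.Sequential(                                 # two stride-2 convs: 4x lower frame rate
    nn.Conv1d(80, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

frames = torch.randn(1, 80, 400)                            # (batch, mel bins, frames)
print(downsample(frames).shape)                             # (1, 256, 100)
print(limited_history_mask(6, history=2))
```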
12. The COVID-19 Infodemic: Can the Crowd Judge Recent Misinformation Objectively? [PDF] 返回目录
Kevin Roitero, Michael Soprano, Beatrice Portelli, Damiano Spina, Vincenzo Della Mea, Giuseppe Serra, Stefano Mizzaro, Gianluca Demartini
Abstract: Misinformation is an ever increasing problem that is difficult to solve for the research community and has a negative impact on the society at large. Very recently, the problem has been addressed with a crowdsourcing-based approach to scale up labeling efforts: to assess the truthfulness of a statement, instead of relying on a few experts, a crowd of (non-expert) judges is exploited. We follow the same approach to study whether crowdsourcing is an effective and reliable method to assess statements truthfulness during a pandemic. We specifically target statements related to the COVID-19 health emergency, that is still ongoing at the time of the study and has arguably caused an increase of the amount of misinformation that is spreading online (a phenomenon for which the term "infodemic" has been used). By doing so, we are able to address (mis)information that is both related to a sensitive and personal issue like health and very recent as compared to when the judgment is done: two issues that have not been analyzed in related work. In our experiment, crowd workers are asked to assess the truthfulness of statements, as well as to provide evidence for the assessments as a URL and a text justification. Besides showing that the crowd is able to accurately judge the truthfulness of the statements, we also report results on many different aspects, including: agreement among workers, the effect of different aggregation functions, of scales transformations, and of workers background / bias. We also analyze workers behavior, in terms of queries submitted, URLs found / selected, text justifications, and other behavioral data like clicks and mouse actions collected by means of an ad hoc logger.
摘要:虚假信息是一个日益严重的问题,研究界难以解决,并对整个社会造成负面影响。最近,人们采用基于众包的方法来扩大标注规模以应对这一问题:在评估一条陈述的真实性时,不再依赖少数专家,而是利用(非专家)评审人群。我们采用同样的方法,研究在大流行期间,众包是否是评估陈述真实性的一种有效且可靠的方法。我们专门针对与COVID-19突发公共卫生事件相关的陈述;该事件在研究进行时仍在持续,并且可以说导致了网上传播的虚假信息数量的增加(这一现象也被称为"infodemic",即信息疫情)。由此,我们得以研究这样一类(错误)信息:它既涉及健康这类敏感的个人议题,又与作出判断的时间非常接近,而这两个问题在相关工作中尚未被分析过。在我们的实验中,众包工作者被要求评估陈述的真实性,并以URL和文字说明的形式为其评估提供证据。除了表明人群能够较为准确地判断陈述的真实性之外,我们还报告了许多不同方面的结果,包括:工作者之间的一致性,不同聚合函数和量表变换的影响,以及工作者背景/偏见的影响。我们还从提交的查询、找到/选择的URL、文字说明,以及通过专门的日志记录器收集的点击和鼠标操作等行为数据等角度,分析了工作者的行为。
13. Large-scale Transfer Learning for Low-resource Spoken Language Understanding [PDF] 返回目录
Xueli Jia, Jianzong Wang, Zhiyong Zhang, Ning Cheng, Jing Xiao
Abstract: End-to-end Spoken Language Understanding (SLU) models are made increasingly large and complex to achieve the state-of-the-art accuracy. However, the increased complexity of a model can also introduce high risk of over-fitting, which is a major challenge in SLU tasks due to the limitation of available data. In this paper, we propose an attention-based SLU model together with three encoder enhancement strategies to overcome the data sparsity challenge. The first strategy focuses on the transfer-learning approach to improve the feature extraction capability of the encoder. It is implemented by pre-training the encoder component with a quantity of annotated Automatic Speech Recognition data relying on the standard Transformer architecture and then fine-tuning the SLU model with a small amount of target labelled data. The second strategy adopts a multitask learning strategy: the SLU model integrates the speech recognition model by sharing the same underlying encoder, thereby improving robustness and generalization ability. The third strategy, learning from the Component Fusion (CF) idea, involves a Bidirectional Encoder Representations from Transformers (BERT) model and aims to boost the capability of the decoder with an auxiliary network. It hence reduces the risk of over-fitting and augments the ability of the underlying encoder, indirectly. Experiments on the FluentAI dataset show that cross-language transfer learning and multi-task strategies improve results by up to 4.52% and 3.89% respectively, compared to the baseline.
摘要:端到端口语理解(SLU)模型为了达到最先进的精度而变得越来越庞大和复杂。然而,模型复杂度的增加也会带来较高的过拟合风险;由于可用数据有限,这是SLU任务面临的主要挑战。在本文中,我们提出了一个基于注意力的SLU模型,并配以三种编码器增强策略来克服数据稀疏的挑战。第一种策略侧重于迁移学习,以提高编码器的特征提取能力:先基于标准Transformer结构,用大量带标注的自动语音识别数据对编码器部分进行预训练,再用少量带标签的目标数据对SLU模型进行微调。第二种策略采用多任务学习,SLU模型通过共享同一个底层编码器来集成语音识别模型,从而提高鲁棒性和泛化能力。第三种策略借鉴组件融合(Component Fusion, CF)的思想,引入基于Transformer的双向编码器表示(BERT)模型,旨在借助辅助网络提升解码器的能力,从而间接降低过拟合风险并增强底层编码器的能力。在FluentAI数据集上的实验表明,与基线相比,跨语言迁移学习和多任务策略分别带来了最多4.52%和3.89%的提升。
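A minimal, hypothetical sketch of the first (transfer-learning) strategy: reuse an encoder pre-trained on ASR data and fine-tune a small SLU head on limited labelled data. Module names, dimensions and the intent count are illustrative, not from the paper.

```python
# Hedged sketch of the transfer-learning strategy: load an ASR-pre-trained encoder,
# attach a small intent-classification head, and fine-tune with different learning rates.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(                      # stands in for the ASR-pre-trained encoder
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2)
# encoder.load_state_dict(torch.load("asr_pretrained_encoder.pt"))  # hypothetical checkpoint

slu_head = nn.Linear(256, 31)                         # number of intent classes is illustrative

def slu_logits(acoustic_features):                    # (batch, frames, 256)
    hidden = encoder(acoustic_features)
    return slu_head(hidden.mean(dim=1))               # pool over time, then classify the intent

optimizer = torch.optim.Adam(
    [{"params": encoder.parameters(), "lr": 1e-5},    # small LR for the pre-trained encoder
     {"params": slu_head.parameters(), "lr": 1e-3}])  # larger LR for the new head

print(slu_logits(torch.randn(2, 50, 256)).shape)      # (2, 31)
```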
14. Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit [PDF] 返回目录
Zhen Zeng, Jianzong Wang, Ning Cheng, Jing Xiao
Abstract: Recent neural speech synthesis systems have gradually focused on the control of prosody to improve the quality of synthesized speech, but they rarely consider the variability of prosody and the correlation between prosody and semantics together. In this paper, a prosody learning mechanism is proposed to model the prosody of speech based on a TTS system, where the prosody information of speech is extracted from the mel-spectrum by a prosody learner and combined with the phoneme sequence to reconstruct the mel-spectrum. Meanwhile, semantic features of text from a pre-trained language model are introduced to improve the prosody prediction results. In addition, a novel self-attention structure, named local attention, is proposed to lift the restriction on input text length, where the relative position information of the sequence is modeled by relative position matrices so that position encodings are no longer needed. Experiments on English and Mandarin show that speech with more satisfactory prosody is obtained with our model. Especially in Mandarin synthesis, our proposed model outperforms the baseline model with a MOS gap of 0.08, and the overall naturalness of the synthesized speech has been significantly improved.
摘要:最近的神经语音合成系统逐渐关注对韵律的控制以提高合成语音的质量,但很少同时考虑韵律的多变性以及韵律与语义之间的关联。在本文中,我们提出了一种韵律学习机制,在TTS系统的基础上对语音韵律进行建模:由韵律学习器从梅尔频谱中提取语音的韵律信息,并与音素序列结合来重建梅尔频谱。同时,引入来自预训练语言模型的文本语义特征来改善韵律预测结果。此外,我们提出了一种新的自注意力结构,称为局部注意力,以解除对输入文本长度的限制,其中序列的相对位置信息由相对位置矩阵建模,从而不再需要位置编码。在英语和普通话上的实验表明,我们的模型能够获得韵律更令人满意的语音。特别是在普通话合成中,我们提出的模型以0.08的MOS差距优于基线模型,合成语音的整体自然度得到了显著提高。
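A hedged sketch of attention that injects position information through learned relative-position biases rather than absolute position encodings, in the spirit of the local attention described above; the clipping distance and parameterization are assumptions.

```python
# Hedged sketch: a learned relative-position bias added to attention logits replaces
# absolute position encodings, so the input length is not tied to a fixed encoding table.
import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    def __init__(self, max_dist=16):
        super().__init__()
        self.max_dist = max_dist
        self.bias = nn.Embedding(2 * max_dist + 1, 1)        # one learned bias per clipped offset

    def forward(self, seq_len: int) -> torch.Tensor:
        idx = torch.arange(seq_len)
        rel = (idx[None, :] - idx[:, None]).clamp(-self.max_dist, self.max_dist)
        return self.bias(rel + self.max_dist).squeeze(-1)    # (seq_len, seq_len) additive bias

def attention_with_relative_bias(q, k, v, rel_bias: RelativeBias):
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores + rel_bias(q.size(-2))                   # position handled here, not in the input
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 20, 64)
print(attention_with_relative_bias(x, x, x, RelativeBias()).shape)   # (2, 20, 64)
```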
15. Online Automatic Speech Recognition with Listen, Attend and Spell Model [PDF] 返回目录
Roger Hsiao, Dogan Can, Tim Ng, Ruchir Travadi, Arnab Ghoshal
Abstract: The Listen, Attend and Spell (LAS) model and other attention-based automatic speech recognition (ASR) models have known limitations when operated in a fully online mode. In this paper, we analyze the online operation of LAS models to demonstrate that these limitations stem from the handling of silence regions and the reliability of online attention mechanism at the edge of input buffers. We propose a novel and simple technique that can achieve fully online recognition while meeting accuracy and latency targets. For the Mandarin dictation task, our proposed approach can achieve a character error rate in online operation that is within 4% relative to an offline LAS model. The proposed online LAS model operates at 12% lower latency relative to a conventional neural network hidden Markov model hybrid of comparable accuracy. We have validated the proposed method through a production scale deployment, which, to the best of our knowledge, is the first such deployment of a fully online LAS model.
摘要:Listen, Attend and Spell(LAS)模型以及其他基于注意力的自动语音识别(ASR)模型在完全在线模式下运行时存在已知的局限性。在本文中,我们分析了LAS模型的在线运行,证明这些局限源于对静音区域的处理以及在线注意力机制在输入缓冲区边缘的可靠性。我们提出了一种新颖而简单的技术,能够在满足精度和延迟目标的同时实现完全在线识别。对于普通话听写任务,我们提出的方法在在线运行时取得的字错误率与离线LAS模型相比相对差距在4%以内。所提出的在线LAS模型的延迟比精度相当的传统神经网络-隐马尔可夫模型混合系统低12%。我们通过生产规模的部署验证了所提出的方法;据我们所知,这是完全在线LAS模型的首次此类部署。
注:中文为机器翻译结果!封面为论文标题词云图!