Contents
10. I love your chain mail! Making knights smile in a fantasy game world: Open-domain goal-orientated dialogue agents [PDF] Abstract
14. Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss [PDF] Abstract
Abstracts
1. A Multilingual View of Unsupervised Machine Translation [PDF] Back to contents
Xavier Garcia, Pierre Foret, Thibault Sellam, Ankur P. Parikh
Abstract: We present a probabilistic framework for multilingual neural machine translation that encompasses supervised and unsupervised setups, focusing on unsupervised translation. In addition to studying the vanilla case where there is only monolingual data available, we propose a novel setup where one language in the (source, target) pair is not associated with any parallel data, but there may exist auxiliary parallel data that contains the other. This auxiliary data can naturally be utilized in our probabilistic framework via a novel cross-translation loss term. Empirically, we show that our approach results in higher BLEU scores over state-of-the-art unsupervised models on the WMT'14 English-French, WMT'16 English-German, and WMT'16 English-Romanian datasets in most directions. In particular, we obtain a +1.65 BLEU advantage over the best-performing unsupervised model in the Romanian-English direction.
2. BERT-of-Theseus: Compressing BERT by Progressive Module Replacing [PDF] Back to contents
Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, Ming Zhou
Abstract: In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules. We progressively increase the probability of replacement throughout training. In this way, our approach brings a deeper level of interaction between the original and compact models, and smooths the training process. Compared to previous knowledge distillation approaches for BERT compression, our approach leverages only one loss function and one hyper-parameter, liberating human effort from hyper-parameter tuning. Our approach outperforms existing knowledge distillation approaches on the GLUE benchmark, showing a new perspective on model compression.
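The replacement scheme described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the linear schedule, the starting probability `p_start=0.5`, and the names `replacement_probability` and `forward_with_replacement` are assumptions made for illustration, and plain functions stand in for BERT modules.

```python
import random


def replacement_probability(step, total_steps, p_start=0.5, p_end=1.0):
    """Hypothetical linear schedule: p grows from p_start to p_end over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac


def forward_with_replacement(x, original_modules, compact_modules, p, rng=random):
    """Pass x through the module sequence, using each compact substitute
    with probability p and the original module otherwise."""
    for original, compact in zip(original_modules, compact_modules):
        module = compact if rng.random() < p else original
        x = module(x)
    return x


# Toy stand-ins: each "module" is a function on a number (assumption for illustration).
originals = [lambda v: v + 1, lambda v: v * 2]
compacts = [lambda v: v + 10, lambda v: v * 3]

for step in (0, 50, 100):
    p = replacement_probability(step, total_steps=100)
    print(step, p, forward_with_replacement(1, originals, compacts, p))
```

In this sketch, `p = 1.0` means only the compact substitutes run, while `p = 0.0` reproduces the original module sequence.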
3. Neural Machine Translation System of Indic Languages -- An Attention based Approach [PDF] Back to contents
Parth Shah, Vishvajit Bakrola
Abstract: Neural machine translation (NMT) is a recent and effective technique that has led to remarkable improvements over conventional machine translation techniques. The proposed NMT model, developed for the Gujarati language, uses an encoder-decoder architecture with an attention mechanism. In India, almost all languages originate from their ancestral language, Sanskrit, so they share inevitable similarities, including lexical and named-entity similarity. Translating into Indic languages is always a challenging task. In this paper, we present an NMT system that can efficiently translate Indic languages such as Hindi and Gujarati, which together cover more than 58.49 percent of the total speakers in the country. We evaluate our NMT model with automatic evaluation metrics such as BLEU, perplexity, and TER. We also compare our network with Google Translate, which it outperforms by a margin of 6 BLEU points on English-Gujarati translation.
4. On-Device Information Extraction from SMS using Hybrid Hierarchical Classification [PDF] Back to contents
Shubham Vatsal, Naresh Purre, Sukumar Moharana, Gopi Ramena, Debi Prasanna Mohanty
Abstract: Clutter in the SMS inbox is one of the serious problems users face today in a digital world where every online login and transaction, along with promotions, generates multiple SMS messages. This problem not only prevents users from searching and navigating messages efficiently but also often results in users missing relevant information associated with an SMS, such as offer codes and payment reminders. In this paper, we propose a unique architecture to organize SMS, extract the appropriate information, and display it in an intuitive template. In the proposed architecture, we use a Hybrid Hierarchical Long Short Term Memory (LSTM)-Convolutional Neural Network (CNN) to categorize SMS into multiple classes, followed by a set of entity parsers that extract the relevant information from the classified message. Through its preprocessing techniques, the architecture not only accounts for the enormous variation observed in SMS data but is also efficient on-device (on a mobile phone) in terms of inference time and size.
5. Incorporating Visual Semantics into Sentence Representations within a Grounded Space [PDF] Back to contents
Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, Patrick Gallinari
Abstract: Language grounding is an active field aiming at enriching textual representations with visual information. Generally, textual and visual elements are embedded in the same representation space, which implicitly assumes a one-to-one correspondence between modalities. This hypothesis does not hold when representing words, and becomes problematic when used to learn sentence representations (the focus of this paper), as a visual scene can be described by a wide variety of sentences. To overcome this limitation, we propose to transfer visual information to textual representations by learning an intermediate representation space: the grounded space. We further propose two new complementary objectives ensuring that (1) sentences associated with the same visual content are close in the grounded space and (2) similarities between related elements are preserved across modalities. We show that this model outperforms the previous state-of-the-art on classification and semantic relatedness tasks.
6. Multimodal Matching Transformer for Live Commenting [PDF] Back to contents
Chaoqun Duan, Lei Cui, Shuming Ma, Furu Wei, Conghui Zhu, Tiejun Zhao
Abstract: Automatic live commenting aims to provide real-time comments on videos for viewers. It encourages user engagement on online video sites, and is also a good benchmark for video-to-text generation. Recent work on this task adopts encoder-decoder models to generate comments. However, these methods do not model the interaction between videos and comments explicitly, so they tend to generate popular comments that are often irrelevant to the videos. In this work, we aim to improve the relevance between live comments and videos by modeling the cross-modal interactions among different modalities. To this end, we propose a multimodal matching transformer to capture the relationships among comments, vision, and audio. The proposed model is based on the transformer framework and can iteratively learn the attention-aware representations for each modality. We evaluate the model on a publicly available live commenting dataset. Experiments show that the multimodal matching transformer model outperforms the state-of-the-art methods.
7. Translating Web Search Queries into Natural Language Questions [PDF] Back to contents
Adarsh Kumar, Sandipan Dandapat, Sushil Chordia
Abstract: Users often query a search engine with a specific question in mind, and these queries are often keywords or sub-sentential fragments. For example, if users want to know the answer to "What's the capital of USA", they will most probably query "capital of USA" or "USA capital" or some keyword-based variation of this. Conversely, for the user-entered query "capital of USA", the most probable question intent is "What's the capital of USA?". In this paper, we propose a method to generate a well-formed natural language question from a given keyword-based query, such that the question has the same intent as the query. Converting a keyword-based web query into a well-formed question has many applications, including search engines, Community Question Answering (CQA) websites, and bot communication. We found a synergy between the query-to-question problem and the standard machine translation (MT) task. We used both Statistical MT (SMT) and Neural MT (NMT) models to generate questions from queries. We observed that MT models perform well in terms of both automatic and human evaluation.
8. Introducing Aspects of Creativity in Automatic Poetry Generation [PDF] Back to contents
Brendan Bena, Jugal Kalita
Abstract: Poetry Generation involves teaching systems to automatically generate text that resembles poetic work. A deep learning system can learn to generate poetry on its own by training on a corpus of poems and modeling the particular style of language. In this paper, we propose taking an approach that fine-tunes GPT-2, a pre-trained language model, to our downstream task of poetry generation. We extend prior work on poetry generation by introducing creative elements. Specifically, we generate poems that express emotion and elicit the same in readers, and poems that use the language of dreams, called dream poetry. We are able to produce poems that correctly elicit the emotions of sadness and joy 87.5 and 85 percent of the time, respectively. We produce dreamlike poetry by training on a corpus of texts that describe dreams. Poems from this model are shown to capture elements of dream poetry with scores of no less than 3.2 on the Likert scale. We perform crowdsourced human-evaluation for all our poems. We also make use of the Coh-Metrix tool, outlining metrics we use to gauge the quality of text generated.
9. Goal-Oriented Multi-Task BERT-Based Dialogue State Tracker [PDF] Back to contents
Pavel Gulyaev, Eugenia Elistratova, Vasily Konovalov, Yuri Kuratov, Leonid Pugachev, Mikhail Burtsev
Abstract: Dialogue State Tracking (DST) is a core component of virtual assistants such as Alexa or Siri. To accomplish various tasks, these assistants need to support an increasing number of services and APIs. The Schema-Guided State Tracking track of the 8th Dialogue System Technology Challenge highlighted the DST problem for unseen services. The organizers introduced the Schema-Guided Dialogue (SGD) dataset with multi-domain conversations and released a zero-shot dialogue state tracking model. In this work, we propose a GOaL-Oriented Multi-task BERT-based dialogue state tracker (GOLOMB) inspired by architectures for reading comprehension question answering systems. The model "queries" dialogue history with descriptions of slots and services as well as possible values of slots. This allows the model to transfer slot values in multi-domain dialogues and to scale to unseen slot types. Our model achieves a joint goal accuracy of 53.97% on the SGD dataset, outperforming the baseline model.
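Joint goal accuracy, the metric quoted in this abstract, is commonly understood as the fraction of dialogue turns in which every slot value in the predicted state is correct. The sketch below shows that idea under simplifying assumptions: states are plain dictionaries compared by exact match, the function name `joint_goal_accuracy` is hypothetical, and the example turns are made up. The official scoring for a given benchmark may differ, for example in how free-form values are matched.

```python
def joint_goal_accuracy(predicted_states, reference_states):
    """Fraction of turns whose predicted dialogue state equals the reference exactly.

    Each state is a dict mapping slot names to values (an assumed representation).
    """
    if len(predicted_states) != len(reference_states):
        raise ValueError("predictions and references must cover the same turns")
    if not reference_states:
        return 0.0
    correct = sum(pred == ref for pred, ref in zip(predicted_states, reference_states))
    return correct / len(reference_states)


# Made-up example turns (not taken from the SGD dataset).
reference = [
    {"city": "Paris"},
    {"city": "Paris", "date": "Friday"},
    {"city": "Paris", "date": "Friday", "guests": "2"},
]
predicted = [
    {"city": "Paris"},
    {"city": "Paris", "date": "Monday"},  # one wrong slot makes the whole turn wrong
    {"city": "Paris", "date": "Friday", "guests": "2"},
]

print(joint_goal_accuracy(predicted, reference))
```

Because a single wrong slot invalidates the whole turn, this metric is stricter than per-slot accuracy.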
10. I love your chain mail! Making knights smile in a fantasy game world: Open-domain goal-orientated dialogue agents [PDF] Back to contents
Shrimai Prabhumoye, Margaret Li, Jack Urbanek, Emily Dinan, Douwe Kiela, Jason Weston, Arthur Szlam
Abstract: Dialogue research tends to distinguish between chit-chat and goal-oriented tasks. While the former is arguably more naturalistic and has a wider use of language, the latter has clearer metrics and a straightforward learning signal. Humans effortlessly combine the two, for example engaging in chit-chat with the goal of exchanging information or eliciting a specific response. Here, we bridge the divide between these two domains in the setting of a rich multi-player text-based fantasy environment where agents and humans engage in both actions and dialogue. Specifically, we train a goal-oriented model with reinforcement learning against an imitation-learned "chit-chat" model with two approaches: the policy either learns to pick a topic or learns to pick an utterance given the top-K utterances from the chit-chat model. We show that both models outperform an inverse model baseline and can converse naturally with their dialogue partner in order to achieve goals.
11. Unsupervised pretraining transfers well across languages [PDF] Back to contents
Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, Emmanuel Dupoux
Abstract: Cross-lingual and multi-lingual training of Automatic Speech Recognition (ASR) has been extensively investigated in the supervised setting. This assumes the existence of a parallel corpus of speech and orthographic transcriptions. Recently, contrastive predictive coding (CPC) algorithms have been proposed to pretrain ASR systems with unlabelled data. In this work, we investigate whether unsupervised pretraining transfers well across languages. We show that a slight modification of the CPC pretraining extracts features that transfer well to other languages, being on par or even outperforming supervised pretraining. This shows the potential of unsupervised methods for languages with few linguistic resources.
12. Depressed individuals express more distorted thinking on social media [PDF] Back to contents
Krishna C. Bathina, Marijn ten Thij, Lorenzo Lorenzo-Luaces, Lauren A. Rutter, Johan Bollen
Abstract: Depression is a leading cause of disability worldwide, but is often under-diagnosed and under-treated. One of the tenets of cognitive-behavioral therapy (CBT) is that individuals who are depressed exhibit distorted modes of thinking, so-called cognitive distortions, which can negatively affect their emotions and motivation. Here, we show that individuals with a self-reported diagnosis of depression on social media express higher levels of distorted thinking than a random sample. Some types of distorted thinking were found to be more than twice as prevalent in our depressed cohort, in particular Personalizing and Emotional Reasoning. This effect is specific to the distorted content of the expression and cannot be explained by the presence of specific topics, sentiment, or first-person pronouns. Our results point towards the detection, and possibly mitigation, of patterns of online language that are generally deemed depressogenic. They may also provide insight into recent observations that social media usage can have a negative impact on mental health.
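Summary: Cognitive-behavioral therapy (CBT) holds that depressed individuals exhibit distorted modes of thinking, known as cognitive distortions. Social media users with a self-reported diagnosis of depression express higher levels of distorted thinking than a random sample, and some types, in particular Personalizing and Emotional Reasoning, are more than twice as prevalent. The effect cannot be explained by specific topics, sentiment, or first-person pronouns.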
13. LEAP System for SRE19 Challenge -- Improvements and Error Analysis [PDF] Back to contents
Shreyas Ramoji, Prashant Krishnan, Bhargavram Mysore, Prachi Singh, Sriram Ganapathy
Abstract: The NIST Speaker Recognition Evaluation - Conversational Telephone Speech (CTS) challenge 2019 was an open evaluation for the task of speaker verification in challenging conditions. In this paper, we provide a detailed account of the LEAP SRE system submitted to the CTS challenge focusing on the novel components in the back-end system modeling. All the systems used the time-delay neural network (TDNN) based x-vector embeddings. The x-vector system in our SRE19 submission used a large pool of training speakers (about 14k speakers). Following the x-vector extraction, we explored a neural network approach to backend score computation that was optimized for a speaker verification cost. The system combination of generative and neural PLDA models resulted in significant improvements for the SRE evaluation dataset. We also found additional gains for the SRE systems based on score normalization and calibration. Subsequent to the evaluations, we have performed a detailed analysis of the submitted systems. The analysis revealed the incremental gains obtained for different training dataset combinations as well as the modeling methods.
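Summary: The paper describes the LEAP system submitted to the NIST Speaker Recognition Evaluation 2019 Conversational Telephone Speech (CTS) challenge, focusing on the novel back-end components. All systems use TDNN-based x-vector embeddings trained on a pool of about 14k speakers. A neural network back-end optimized for a speaker verification cost is combined with generative PLDA models, which gives significant improvements on the SRE evaluation set, with further gains from score normalization and calibration. A post-evaluation analysis shows the incremental gains from different training dataset combinations and modeling methods.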
14. Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss [PDF] Back to contents
Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, Shankar Kumar
Abstract: In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with a monotonic RNN-T loss well-suited to frame-synchronous, streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full attention version of our model achieves competitive performance compared to existing LibriSpeech benchmarks for attention-based models trained with cross-entropy loss. Our results also show that we can bridge the gap between full attention and limited attention versions of our model by attending to a limited number of future frames.
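Summary: The paper presents an end-to-end speech recognition model for streaming use in which Transformer blocks based on self-attention encode the audio and label sequences independently. A feed-forward layer combines the two encoders' activations into a probability distribution over labels for every combination of acoustic frame position and label history, as in the RNN-T model, and the model is trained with the RNN-T loss. On LibriSpeech, limiting the left context of self-attention makes streaming decoding computationally tractable with only a slight degradation in accuracy, and attending to a limited number of future frames narrows the gap to the full-attention model.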
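The limited-context idea in this abstract can be illustrated with an attention mask in which each frame may attend only to a fixed number of past and future frames. The following is a minimal sketch of that general masking technique, not the authors' implementation; the names `limited_context_mask`, `render`, `left`, and `right` are assumptions introduced here for illustration.

```python
def limited_context_mask(num_frames, left, right=0):
    """Return mask[t][s] == True when frame t may attend to frame s.

    Hypothetical helper: frame t sees frames t-left .. t+right (clipped to
    the sequence), so the attention cost per frame stays bounded.
    """
    return [
        [t - left <= s <= t + right for s in range(num_frames)]
        for t in range(num_frames)
    ]


def render(mask):
    """Draw the mask with 'x' for allowed positions and '.' for blocked ones."""
    return ["".join("x" if allowed else "." for allowed in row) for row in mask]


# Streaming setup: two past frames, no look-ahead.
streaming = limited_context_mask(5, left=2)
# Same left context plus one future frame of look-ahead.
lookahead = limited_context_mask(5, left=2, right=1)

print("\n".join(render(streaming)))
print()
print("\n".join(render(lookahead)))
```

With `right=0` a frame never waits for future audio; a small positive `right` trades a little latency for extra context, which is the kind of look-ahead the abstract credits with closing the gap to full attention.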
15. Robust Multi-channel Speech Recognition using Frequency Aligned Network [PDF] Back to contents
Taejin Park, Kenichi Kumatani, Minhua Wu, Shiva Sundaram
Abstract: Conventional speech enhancement techniques such as beamforming have known benefits for far-field speech recognition. Our own work in frequency-domain multi-channel acoustic modeling has shown additional improvements by training a spatial filtering layer jointly within an acoustic model. In this paper, we further develop this idea and use a frequency aligned network for robust multi-channel automatic speech recognition (ASR). Unlike an affine layer in the frequency domain, the proposed frequency aligned component prevents one frequency bin from influencing other frequency bins. We show that this modification not only reduces the number of parameters in the model but also significantly improves the ASR performance. We investigate the effects of the frequency aligned network through ASR experiments on real-world far-field data where users are interacting with an ASR system in uncontrolled acoustic environments. We show that our multi-channel acoustic model with a frequency aligned network achieves up to 18% relative reduction in word error rate.
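Summary: The paper extends the authors' frequency-domain multi-channel acoustic modeling with a frequency aligned network for robust far-field ASR. Unlike an affine layer in the frequency domain, the frequency aligned component prevents one frequency bin from influencing the others, which reduces the number of parameters and improves ASR performance. On real-world far-field data recorded in uncontrolled acoustic environments, the model achieves up to 18% relative reduction in word error rate.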
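The contrast between an affine layer and a frequency aligned component can be sketched as follows. This is one plausible reading of the abstract, not the paper's architecture: it assumes real-valued inputs `x[c][f]` (channel `c`, frequency bin `f`) instead of complex spectra, and the function names and weight layouts are hypothetical.

```python
def affine_layer(x, w):
    """Dense baseline: every output bin mixes all channels of all bins.

    w[f_out][c][f_in] is a hypothetical weight tensor.
    """
    num_ch, num_bins = len(x), len(x[0])
    return [
        sum(w[fo][c][fi] * x[c][fi] for c in range(num_ch) for fi in range(num_bins))
        for fo in range(num_bins)
    ]


def frequency_aligned_layer(x, w):
    """Frequency aligned sketch: output bin f combines only the channels at bin f.

    w[f][c] is a hypothetical per-bin weight vector over channels.
    """
    num_ch, num_bins = len(x), len(x[0])
    return [sum(w[f][c] * x[c][f] for c in range(num_ch)) for f in range(num_bins)]


def parameter_counts(num_ch, num_bins):
    """Weights needed by the (affine, frequency aligned) layers, biases ignored."""
    return num_bins * num_ch * num_bins, num_bins * num_ch


# Two microphones, three frequency bins, unit weights.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w_aligned = [[1.0, 1.0] for _ in range(3)]
y = frequency_aligned_layer(x, w_aligned)

# Perturbing bin 0 leaves the outputs for bins 1 and 2 untouched.
x_perturbed = [[100.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y_perturbed = frequency_aligned_layer(x_perturbed, w_aligned)
```

Under these assumptions the aligned layer needs `num_bins * num_ch` weights where the dense layer needs `num_bins * num_ch * num_bins`, which matches the abstract's claim of a smaller model in spirit only; the actual savings depend on the real layer shapes.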
16. Consistency of a Recurrent Language Model With Respect to Incomplete Decoding [PDF] Back to contents
Sean Welleck, Ilia Kulikov, Jaedeok Kim, Richard Yuanzhe Pang, Kyunghyun Cho
Abstract: Despite strong performance on a variety of tasks, neural sequence models trained with maximum likelihood have been shown to exhibit issues such as length bias and degenerate repetition. We study the related issue of receiving infinite-length sequences from a recurrent language model when using common decoding algorithms. To analyze this issue, we first define inconsistency of a decoding algorithm, meaning that the algorithm can yield an infinite-length sequence that has zero probability under the model. We prove that commonly used incomplete decoding algorithms - greedy search, beam search, top-k sampling, and nucleus sampling - are inconsistent, despite the fact that recurrent language models are trained to produce sequences of finite length. Based on these insights, we propose two remedies which address inconsistency: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model. Empirical results show that inconsistency occurs in practice, and that the proposed methods prevent inconsistency.
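Summary: The paper studies how common decoding algorithms can produce infinite-length sequences from a recurrent language model. A decoding algorithm is defined as inconsistent if it can yield an infinite-length sequence that has zero probability under the model, and the authors prove that greedy search, beam search, top-k sampling, and nucleus sampling are all inconsistent. Two remedies are proposed: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model. Experiments show that inconsistency occurs in practice and that the proposed methods prevent it.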
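Why top-k sampling can fail to terminate is easy to see in a toy example: if the end-of-sequence token is never among the k most probable tokens, it is never sampled. The sketch below shows plain top-k filtering next to a variant that always keeps the end-of-sequence token in the candidate set. The abstract does not spell out how the consistent variants are built, so treating "always keep EOS" as the fix is an assumption here, and `top_k_filter`, `eos_id`, and the toy distribution are hypothetical.

```python
def top_k_filter(probs, k, eos_id=None):
    """Keep the k most probable tokens (plus eos_id, if given) and renormalize.

    probs is a next-token distribution indexed by token id. Passing eos_id
    gives the assumed 'consistent' variant: EOS always stays sampleable.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    if eos_id is not None:
        keep.add(eos_id)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


EOS = 3
probs = [0.5, 0.3, 0.15, 0.05]  # toy distribution; EOS is the least likely token

plain = top_k_filter(probs, k=2)                   # EOS is filtered out
consistent = top_k_filter(probs, k=2, eos_id=EOS)  # EOS keeps nonzero mass
```

If the model assigned EOS a low rank at every step, `plain` would never allow the sequence to end, whereas `consistent` leaves a nonzero chance of stopping at each step.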
Note: The Chinese text in the original digest is machine-translated.