Table of Contents
1. The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain [PDF] Abstract
2. Response to LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts [PDF] Abstract
4. Linguists Who Use Probabilistic Models Love Them: Quantification in Functional Distributional Semantics [PDF] Abstract
7. CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with Generative Adversarial Networks [PDF] Abstract
8. Using Self-Training to Improve Back-Translation in Low Resource Neural Machine Translation [PDF] Abstract
11. M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training [PDF] Abstract
15. CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [PDF] Abstract
16. Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR [PDF] Abstract
Abstracts
1. The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain [PDF] Back to Contents
Annemarie Friedrich, Heike Adel, Federico Tomazic, Johannes Hingerl, Renou Benteau, Anika Maruscyk, Lukas Lange
Abstract: This paper presents a new challenging information extraction task in the domain of materials science. We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications, such as involved materials and measurement conditions. With this paper, we publish our annotation guidelines, as well as our SOFC-Exp corpus consisting of 45 open-access scholarly articles annotated by domain experts. A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition and slot filling tasks as well as high annotation quality. We also present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set. On all tasks, using BERT embeddings leads to large performance gains, but with increasing task complexity, adding a recurrent neural network on top seems beneficial. Our models will serve as competitive baselines in future work, and analysis of their performance highlights difficult cases when modeling the data and suggests promising research directions.
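Named entity recognition and slot filling, as mentioned in this abstract, are commonly framed as sequence tagging. The following is a minimal sketch, assuming BIO-style tags; the tag names and the example sentence are hypothetical illustrations and are not taken from the SOFC-Exp corpus or its guidelines. It shows how tagged tokens can be decoded into labeled spans.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, text) spans from BIO tags. Tag names are hypothetical."""
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            # Close any open span; an I- tag without a matching open span
            # is treated like a B- tag.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" closes any open span
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence and tags.
tokens = ["The", "cell", "was", "tested", "at", "800", "°C",
          "with", "a", "YSZ", "electrolyte"]
tags = ["O", "O", "O", "O", "O", "B-VALUE", "I-VALUE",
        "O", "O", "B-MATERIAL", "O"]
spans = decode_bio(tokens, tags)
print(spans)
```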
2. Response to LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts [PDF] Back to Contents
Hao Wu, Gareth J. F. Jones, Francois Pitie
Abstract: Live video commenting systems are an emerging feature of online video sites. Recently, the Chinese video-sharing platform Bilibili has popularised a novel captioning system where user comments are displayed as streams of moving subtitles overlaid on the video playback screen and broadcast to all viewers in real time. LiveBot was recently introduced as a novel Automatic Live Video Commenting (ALVC) application. This enables the automatic generation of live video comments from both the existing video stream and existing viewers' comments. In seeking to reproduce the baseline results reported in the original LiveBot paper, we found differences between the reproduced results using the project codebase and the numbers reported in the paper. Further examination of this situation suggests that this may be caused by a number of small issues in the project code, including a non-obvious overlap between the training and test sets. In this paper, we study these discrepancies in detail and propose an alternative baseline implementation as a reference for other researchers in this field.
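An overlap between training and test sets, as described in this abstract, can often be detected with a simple set comparison. The sketch below is an illustration only and is not the authors' procedure; the identifiers (`v1`, `v2`, ...) are hypothetical, and in practice one would compare whatever key defines a sample, such as a video ID or a video ID plus timestamp.

```python
from typing import Dict, Iterable


def overlap_report(train_ids: Iterable[str], test_ids: Iterable[str]) -> Dict[str, object]:
    """Report which test items also occur in the training set."""
    train_set, test_set = set(train_ids), set(test_ids)
    shared = train_set & test_set
    return {
        "shared": sorted(shared),
        # Fraction of distinct test items that were also seen in training.
        "test_fraction_leaked": len(shared) / len(test_set) if test_set else 0.0,
    }


# Hypothetical sample identifiers.
train = ["v1", "v2", "v3", "v4"]
test = ["v4", "v5"]
report = overlap_report(train, test)
print(report)
```

A non-zero leaked fraction means that reported test scores may be optimistic, since part of the test data was available during training.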
3. Syntactic Search by Example [PDF] Back to Contents
Micah Shlain, Hillel Taub-Tabib, Shoval Sadde, Yoav Goldberg
Abstract: We present a system that allows a user to search a large linguistically annotated corpus using syntactic patterns over dependency graphs. In contrast to previous attempts to this effect, we introduce a light-weight query language that does not require the user to know the details of the underlying syntactic representations, and instead lets them query the corpus by providing an example sentence coupled with simple markup. Search is performed at an interactive speed due to an efficient linguistic graph-indexing and retrieval engine. This allows for rapid exploration, development and refinement of syntax-based queries. We demonstrate the system using queries over two corpora: the English Wikipedia, and a collection of English PubMed abstracts. A demo of the Wikipedia system is available at: this https URL
4. Linguists Who Use Probabilistic Models Love Them: Quantification in Functional Distributional Semantics [PDF] Back to Contents
Guy Emerson
Abstract: Functional Distributional Semantics provides a computationally tractable framework for learning truth-conditional semantics from a corpus. Previous work in this framework has provided a probabilistic version of first-order logic, recasting quantification as Bayesian inference. In this paper, I show how the previous formulation gives trivial truth values when a precise quantifier is used with vague predicates. I propose an improved account, avoiding this problem by treating a vague predicate as a distribution over precise predicates. I connect this account to recent work in the Rational Speech Acts framework on modelling generic quantification, and I extend this to modelling donkey sentences. Finally, I explain how the generic quantifier can be both pragmatically complex and yet computationally simpler than precise quantifiers.
5. End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020 [PDF] Back to Contents
Marco Gaido, Mattia Antonino Di Gangi, Matteo Negri, Marco Turchi
Abstract: This paper describes FBK's participation in the IWSLT 2020 offline speech translation (ST) task. The task evaluates systems' ability to translate English TED talks audio into German texts. The test talks are provided in two versions: one contains the data already segmented with automatic tools and the other is the raw data without any segmentation. Participants can decide whether to work on custom segmentation or not. We used the provided segmentation. Our system is an end-to-end model based on an adaptation of the Transformer for speech data. Its training process is the main focus of this paper and it is based on: i) transfer learning (ASR pretraining and knowledge distillation), ii) data augmentation (SpecAugment, time stretch and synthetic data), iii) combining synthetic and real data marked as different domains, and iv) multi-task learning using the CTC loss. Finally, after the training with word-level knowledge distillation is complete, our ST models are fine-tuned using label smoothed cross entropy. Our best model scored 29 BLEU on the MuST-C En-De test set, which is an excellent result compared to recent papers, and 23.7 BLEU on the same data segmented with VAD, showing the need for researching solutions addressing this specific data condition.
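The abstract mentions fine-tuning with label smoothed cross entropy. The sketch below shows one common formulation, assuming the smoothing mass `epsilon` is spread uniformly over the whole vocabulary; the exact variant and the value of `epsilon` used by the authors are not stated here, so both are assumptions.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothed_ce(logits: List[float], target: int, epsilon: float = 0.1) -> float:
    """Cross entropy against (1 - epsilon) * one_hot(target) + epsilon * uniform."""
    log_probs = log_softmax(logits)
    nll = -log_probs[target]                     # standard cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)   # cross entropy against uniform
    return (1.0 - epsilon) * nll + epsilon * uniform


plain = label_smoothed_ce([2.0, 0.5, -1.0], target=0, epsilon=0.0)
smoothed = label_smoothed_ce([2.0, 0.5, -1.0], target=0, epsilon=0.1)
print(plain, smoothed)
```

With `epsilon=0.0` the function reduces to ordinary cross entropy; a positive `epsilon` penalizes over-confident predictions on the target token.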
6. Personalizing Grammatical Error Correction: Adaptation to Proficiency Level and L1 [PDF] Back to Contents
Maria Nadejde, Joel Tetreault
Abstract: Grammar error correction (GEC) systems have become ubiquitous in a variety of software applications, and have started to approach human-level performance for some datasets. However, very little is known about how to efficiently personalize these systems to the user's characteristics, such as their proficiency level and first language, or to emerging domains of text. We present the first results on adapting a general-purpose neural GEC system to both the proficiency level and the first language of a writer, using only a few thousand annotated sentences. Our study is the broadest of its kind, covering five proficiency levels and twelve different languages, and comparing three different adaptation scenarios: adapting to the proficiency level only, to the first language only, or to both aspects simultaneously. We show that tailoring to both scenarios achieves the largest performance improvement (3.6 F0.5) relative to a strong baseline.
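The improvement above is reported in F0.5, the F-measure with beta = 0.5, which weights precision more heavily than recall. As a small worked sketch (the precision and recall values are made up for illustration and are not results from the paper):

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-measure: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: the same two numbers swapped between P and R.
high_precision = f_beta(precision=1.0, recall=0.5)
high_recall = f_beta(precision=0.5, recall=1.0)
print(round(high_precision, 4), round(high_recall, 4))
```

The high-precision system scores higher under F0.5 even though both systems have the same pair of values, which is why the metric favors corrections that are reliable over corrections that are merely numerous.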
7. CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with Generative Adversarial Networks [PDF] Back to Contents
Gašper Beguš
Abstract: How can deep neural networks encode information that corresponds to words in human speech into raw acoustic data? This paper proposes two neural network architectures for modeling unsupervised lexical learning from raw acoustic inputs, ciwGAN (Categorical InfoWaveGAN) and fiwGAN (Featural InfoWaveGAN), that combine a DCGAN architecture for audio data (WaveGAN; arXiv:1705.07904) with InfoGAN (arXiv:1606.03657), and proposes a new latent space structure that can model featural learning simultaneously with a higher-level classification. The architectures introduce a network that learns to retrieve latent codes from generated audio outputs. Lexical learning is thus modeled as emergent from an architecture that forces a deep neural network to output data such that unique information is retrievable from its acoustic outputs. The networks trained on lexical items from TIMIT learn to encode unique information corresponding to lexical items in the form of categorical variables. By manipulating these variables, the network outputs specific lexical items. Innovative outputs suggest that phonetic and phonological representations learned by the network can be productively recombined and directly paralleled to productivity in human speech: a fiwGAN network trained on 'suit' and 'dark' outputs innovative 'start', even though it never saw 'start' or even a [st] sequence in the training data. We also argue that setting latent featural codes to values well beyond training range results in almost categorical generation of prototypical lexical items and reveals underlying values of each latent code. Probing deep neural networks trained on well-understood dependencies in speech bears implications for latent space interpretability, understanding how deep neural networks learn meaningful representations, as well as a potential for unsupervised text-to-speech generation in the GAN framework.
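To make the difference between a categorical and a featural latent code concrete, here is a minimal sketch of how such generator inputs could be assembled. It assumes a one-hot code for the categorical case, a binary code for the featural case, and uniform noise in [-1, 1] for the remaining dimensions; the function names, dimensions, and noise range are illustrative assumptions and not details confirmed by the abstract.

```python
import random
from typing import List


def categorical_latent(class_index: int, num_classes: int, noise_dim: int,
                       rng: random.Random) -> List[float]:
    """One-hot class code followed by uniform noise (assumed layout)."""
    code = [1.0 if i == class_index else 0.0 for i in range(num_classes)]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(noise_dim)]
    return code + noise


def featural_latent(class_index: int, num_features: int, noise_dim: int,
                    rng: random.Random) -> List[float]:
    """Binary feature code followed by uniform noise (assumed layout)."""
    if not 0 <= class_index < 2 ** num_features:
        raise ValueError("class_index does not fit in num_features bits")
    # Write the class index in binary, most significant bit first.
    code = [float((class_index >> bit) & 1) for bit in reversed(range(num_features))]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(noise_dim)]
    return code + noise


rng = random.Random(0)
ciw = categorical_latent(class_index=2, num_classes=4, noise_dim=6, rng=rng)
fiw = featural_latent(class_index=5, num_features=3, noise_dim=7, rng=rng)
print(ciw[:4], fiw[:3])
```

A binary code with n features can distinguish 2^n classes, whereas a one-hot code needs one dimension per class, which is one way to read the "featural" versus "categorical" distinction.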
8. Using Self-Training to Improve Back-Translation in Low Resource Neural Machine Translation [PDF] Back to Contents
Idris Abdulmumin, Bashir Shehu Galadanci, Abubakar Isa
Abstract: Improving neural machine translation (NMT) models using the back-translations of the monolingual target data (synthetic parallel data) is currently the state-of-the-art approach for training improved translation systems. The quality of the backward system - which is trained on the available parallel data and used for the back-translation - has been shown in many studies to affect the performance of the final NMT model. In low resource conditions, the available parallel data is usually not enough to train a backward model that can produce the qualitative synthetic data needed to train a standard translation model. This work proposes a self-training strategy where the output of the backward model is used to improve the model itself through the forward translation technique. The technique was shown to improve baseline low resource IWSLT'14 English-German and IWSLT'15 English-Vietnamese backward translation models by 11.06 and 1.5 BLEU, respectively. The synthetic data generated by the improved English-German backward model was used to train a forward model which outperformed another forward model trained using standard back-translation by 2.7 BLEU.
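The data flow described in this abstract can be sketched as follows. This is only a schematic under stated assumptions: `train` and `translate` are hypothetical callables standing in for a real NMT toolkit, and the toy word-lookup "model" exists purely so that the example runs; it does not reproduce the quality gains reported in the paper.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (input sentence, output sentence)


def self_train_backward(
    parallel: List[Pair],                 # authentic (source, target) pairs
    mono_target: List[str],               # monolingual target-language sentences
    train: Callable[[List[Pair]], Dict[str, str]],
    translate: Callable[[Dict[str, str], str], str],
):
    # 1. Train the backward (target -> source) model on the authentic data.
    backward_data = [(tgt, src) for src, tgt in parallel]
    backward = train(backward_data)
    # 2. Self-training via forward translation: the backward model translates
    #    its own input language, producing synthetic (target, source) pairs.
    synthetic = [(tgt, translate(backward, tgt)) for tgt in mono_target]
    # 3. Retrain the backward model on authentic plus synthetic data.
    improved = train(backward_data + synthetic)
    # 4. Standard back-translation with the improved backward model yields
    #    extra (source, target) training data for the forward model.
    back_translated = [(translate(improved, tgt), tgt) for tgt in mono_target]
    return improved, parallel + back_translated


def toy_train(pairs: List[Pair]) -> Dict[str, str]:
    """Hypothetical stand-in for NMT training: a position-aligned word table."""
    table: Dict[str, str] = {}
    for inp, out in pairs:
        for a, b in zip(inp.split(), out.split()):
            table.setdefault(a, b)
    return table


def toy_translate(model: Dict[str, str], sentence: str) -> str:
    """Hypothetical stand-in for decoding: unknown words are copied through."""
    return " ".join(model.get(word, word) for word in sentence.split())


parallel = [("good morning", "guten morgen")]   # English -> German
mono_target = ["guten abend"]
improved, forward_data = self_train_backward(parallel, mono_target, toy_train, toy_translate)
print(forward_data)
```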
9. Seq2Seq AI Chatbot with Attention Mechanism [PDF] Back to Contents
Abonia Sojasingarayar
Abstract: Intelligent Conversational Agent development using Artificial Intelligence or Machine Learning techniques is an interesting problem in the field of Natural Language Processing. With the rise of deep learning, these models were quickly replaced by end-to-end trainable neural networks.
10. Experiments on Paraphrase Identification Using Quora Question Pairs Dataset [PDF] Back to Contents
Andreas Chandra, Ruben Stefanus
Abstract: We modeled the Quora question pairs dataset to identify similar questions. The dataset that we use is provided by Quora. The task is a binary classification. We tried several methods and algorithms, taking a different approach from previous works. For feature extraction, we used Bag of Words, including Count Vectorizer and Term Frequency-Inverse Document Frequency with unigrams, for XGBoost and CatBoost. Furthermore, we also experimented with a WordPiece tokenizer, which improves the model performance significantly. We achieved up to 97 percent accuracy.
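As a rough illustration of the unigram TF-IDF features mentioned in this abstract, the sketch below computes them in plain Python and compares questions by cosine similarity. It assumes whitespace tokenization and a smoothed IDF, ln((1 + N) / (1 + df)) + 1; both are assumptions, the example questions are invented, and the classifier stage (XGBoost or CatBoost in the paper) is left out.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_unigrams(docs: List[str]) -> List[Dict[str, float]]:
    """Return one sparse {term: tf * idf} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented example questions.
questions = [
    "how do i learn python",
    "how can i learn python quickly",
    "what is the capital of france",
]
vectors = tfidf_unigrams(questions)
similar = cosine(vectors[0], vectors[1])
different = cosine(vectors[0], vectors[2])
print(round(similar, 3), round(different, 3))
```

In a full pipeline, vectors like these (or a pairwise similarity derived from them) would be passed to a gradient-boosted classifier that predicts whether two questions are duplicates.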
11. M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training [PDF] Back to Contents
Haoyang Huang, Lin Su, Di Qi, Nan Duan, Edward Cui, Taroon Bharti, Lei Zhang, Lijuan Wang, Jianfeng Gao, Bei Liu, Jianlong Fu, Dongdong Zhang, Xin Liu, Ming Zhou
Abstract: This paper presents a Multitask Multilingual Multimodal Pre-trained model (M3P) that combines multilingual-monomodal pre-training and monolingual-multimodal pre-training into a unified framework via multitask learning and weight sharing. The model learns universal representations that can map objects that occurred in different modalities or expressed in different languages to vectors in a common semantic space. To verify the generalization capability of M3P, we fine-tune the pre-trained model for different types of downstream tasks: multilingual image-text retrieval, multilingual image captioning, multimodal machine translation, multilingual natural language inference and multilingual text generation. Evaluation shows that M3P can (i) achieve comparable results on multilingual tasks and English multimodal tasks, compared to the state-of-the-art models pre-trained for these two types of tasks separately, and (ii) obtain new state-of-the-art results on non-English multimodal tasks in the zero-shot or few-shot setting. We also build a new Multilingual Image-Language Dataset (MILD) by collecting large amounts of (text-query, image, context) triplets in 8 languages from the logs of a commercial search engine.
12. Meta Dialogue Policy Learning [PDF]
Yumo Xu, Chenguang Zhu, Baolin Peng, Michael Zeng
Abstract: Dialogue policy determines the next-step actions for agents and hence is central to a dialogue system. However, when migrated to novel domains with little data, a policy model can fail to adapt due to insufficient interactions with the new environment. We propose Deep Transferable Q-Network (DTQN) to utilize shareable low-level signals between domains, such as dialogue acts and slots. We decompose the state and action representation space into feature subspaces corresponding to these low-level components to facilitate cross-domain knowledge transfer. Furthermore, we embed DTQN in a meta-learning framework and introduce Meta-DTQN with a dual-replay mechanism to enable effective off-policy training and adaptation. In experiments, our model outperforms baseline models in terms of both success rate and dialogue efficiency on the multi-domain dialogue dataset MultiWOZ 2.0.
13. Extracting COVID-19 Events from Twitter [PDF]
Shi Zong, Ashutosh Baheti, Wei Xu, Alan Ritter
Abstract: We present a corpus of 7,500 tweets annotated with COVID-19 events, including positive test results, denied access to testing, and more. We show that our corpus enables automatic identification of COVID-19 events mentioned on Twitter, with text spans that fill a set of pre-defined slots for each event. We also present analyses of self-reported cases and users' demographic information. We will make our annotated corpus and extraction tools available for the research community to use upon publication at this https URL
14. Self-Training for End-to-End Speech Translation [PDF]
Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, Yun Tang
Abstract: One of the main challenges for end-to-end speech translation is data scarcity. We leverage pseudo-labels generated from unlabeled audio by a cascade and an end-to-end speech translation model. This provides 8.3 and 5.7 BLEU gains over a strong semi-supervised baseline on the MuST-C English-French and English-German datasets, reaching state-of-the-art performance. We also investigate the effect of pseudo-label quality. Our approach is shown to be more effective than simply pre-training the encoder on the speech recognition task. Finally, we demonstrate the effectiveness of self-training by directly generating pseudo-labels with an end-to-end model instead of a cascade model.
15. CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [PDF]
Sameer Khurana, Antoine Laurent, James Glass
Abstract: More than half of the 7,000 languages in the world are in imminent danger of going extinct. Traditional methods of documenting language proceed by collecting audio data followed by manual annotation by trained linguists at different levels of granularity. This time-consuming and painstaking process could benefit from machine learning. Many endangered languages do not have any orthographic form but usually have speakers who are bilingual and trained in a high-resource language. It is relatively easy to obtain textual translations corresponding to speech. In this work, we provide a multimodal machine learning framework for speech representation learning by exploiting the correlations between the two modalities, namely speech and its corresponding text translation. Here, we construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech. The audio encoder is trained to perform a speech-translation retrieval task in a contrastive learning framework. By evaluating the learned representations on a phone recognition task, we demonstrate that linguistic representations emerge in the audio encoder's internal representations as a by-product of learning to perform the retrieval task.
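To make the retrieval objective concrete, here is a minimal sketch of a contrastive loss over paired speech and text embeddings. The abstract does not state the exact loss, so the InfoNCE-style formulation, the `temperature` parameter, and the function names below are assumptions for illustration, not the authors' implementation. Each speech embedding treats the text embedding at the same index as its positive and every other text in the batch as a negative.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def contrastive_retrieval_loss(speech_embs, text_embs, temperature=0.1):
    """InfoNCE-style loss (an assumed formulation, not taken from the paper).

    speech_embs[i] and text_embs[i] are a matching pair; all other texts
    in the batch act as negatives for speech_embs[i].
    """
    speech = [normalize(v) for v in speech_embs]
    text = [normalize(v) for v in text_embs]
    total = 0.0
    for i, s in enumerate(speech):
        logits = [dot(s, t) / temperature for t in text]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the paired text.
        total += log_denom - logits[i]
    return total / len(speech)


def retrieve(speech_emb, text_embs):
    """Return the index of the text embedding most similar to the speech embedding."""
    s = normalize(speech_emb)
    scores = [dot(s, normalize(t)) for t in text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage with hand-made 2-D embeddings (hypothetical values).
speech = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = contrastive_retrieval_loss(speech, text)
shuffled_loss = contrastive_retrieval_loss(speech, text[::-1])
```

In this toy batch the correctly paired embeddings yield a much lower loss than the shuffled pairing, which is the signal that pushes an encoder toward representations useful for retrieval.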
16. Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR [PDF]
Thilo von Neumann, Christoph Boeddeker, Lukas Drude, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach
Abstract: Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker speech recognizer to form the first end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers. Our experiments show very promising performance in counting accuracy, source separation and speech recognition on simulated clean mixtures from WSJ0-2mix and WSJ0-3mix. Among other results, we set a new state-of-the-art word error rate on the WSJ0-2mix database. Furthermore, our system generalizes well to a larger number of speakers than it ever saw during training, as shown in experiments with the WSJ0-4mix database.
17. Stopwords in Technical Language Processing [PDF]
Serhad Sarica, Jianxi Luo
Abstract: Natural language processing techniques are increasingly applied to information retrieval, indexing and topic modelling in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopword lists derived for general English, the technical jargon of engineering fields contains its own highly frequent and uninformative words, and there exists no standard stopword list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative data-driven approaches, and curating a stopword list ready for technical language processing applications.
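As an illustration of what a data-driven stopword signal can look like, the sketch below flags words that appear in a large fraction of documents. The abstract does not name the specific approaches the authors combine, so the document-frequency criterion, the `min_doc_fraction` threshold, and the function name are assumptions chosen for simplicity, not the paper's method.

```python
from collections import Counter


def candidate_stopwords(documents, min_doc_fraction=0.8):
    """Return words that occur in at least `min_doc_fraction` of the documents.

    Document frequency is only one possible signal (assumed here); a curated
    list would normally combine several signals plus manual review.
    """
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency, not term frequency).
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_fraction * len(documents)
    return sorted(word for word, df in doc_freq.items() if df >= threshold)


# Toy usage on three made-up engineering-style sentences.
docs = [
    "the valve comprises a housing",
    "the rotor comprises a blade",
    "the sensor comprises a coil",
]
print(candidate_stopwords(docs))  # ['a', 'comprises', 'the']
```

The word "comprises" surfaces next to the general-English stopwords "a" and "the", which mirrors the abstract's point that technical text has its own frequent but uninformative vocabulary.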
18. A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning [PDF]
Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass
Abstract: Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation where the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen renewed interest due to the introduction of Variational Autoencoders (VAEs), their use for speech representation learning remains largely unexplored. In this work, we propose the Convolutional Deep Markov Model (ConvDMM), a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. This unsupervised model is trained using black box variational inference. A deep convolutional neural network is used as an inference network for structured variational approximation. When trained on a large-scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extraction methods on linear phone classification and recognition on the Wall Street Journal dataset. Furthermore, we find that ConvDMM complements self-supervised methods like Wav2Vec and PASE, improving on the results achieved with any of the methods alone. Lastly, we find that ConvDMM features enable learning better phone recognizers than any other features in an extreme low-resource regime with few labeled training examples.
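For orientation, a generic Gaussian state-space model with neural-network transition and emission functions, together with the variational lower bound it is trained on, can be written as below. This is the textbook deep Markov model form under assumed notation (latents z_t, observations x_t, generative parameters θ, inference-network parameters φ); the abstract does not give ConvDMM's exact parameterization, which may differ in its details.

```latex
\begin{aligned}
z_1 &\sim \mathcal{N}(\mu_0, \Sigma_0), \\
z_t \mid z_{t-1} &\sim \mathcal{N}\big(\mu_\theta(z_{t-1}),\, \Sigma_\theta(z_{t-1})\big), \quad t = 2, \dots, T, \\
x_t \mid z_t &\sim \mathcal{N}\big(\mu_\theta^{x}(z_t),\, \Sigma_\theta^{x}(z_t)\big), \quad t = 1, \dots, T, \\
\mathcal{L}(\theta, \phi) &= \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T})}\big[\log p_\theta(x_{1:T}, z_{1:T}) - \log q_\phi(z_{1:T} \mid x_{1:T})\big] \le \log p_\theta(x_{1:T}).
\end{aligned}
```

Here the transition and emission means and covariances are outputs of neural networks, and q_φ plays the role of the inference network mentioned in the abstract. Maximizing the bound with sampled gradients is what "black box variational inference" refers to in this setting.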