Contents
1. Neural Modeling for Named Entities and Morphology (NEMO^2) [PDF] abstract
2. The optimality of syntactic dependency distances [PDF] abstract
3. Leverage Unlabeled Data for Abstractive Speech Summarization with Self-Supervised Learning and Back-Summarization [PDF] abstract
4. Photon: A Robust Cross-Domain Text-to-SQL System [PDF] abstract
5. NeuralQA: A Usable Library for Question Answering (Contextual Query Expansion + BERT) on Large Datasets [PDF] abstract
6. MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering [PDF] abstract
7. The Return of Lexical Dependencies: Neural Lexicalized PCFGs [PDF] abstract
8. Exploiting stance hierarchies for cost-sensitive stance detection of Web documents [PDF] abstract
9. Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [PDF] abstract
10. An Experimental Study of The Effects of Position Bias on Emotion Cause Extraction [PDF] abstract
11. AI-based Monitoring and Response System for Hospital Preparedness towards COVID-19 in Southeast Asia [PDF] abstract
12. Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability [PDF] abstract
13. Fast, Structured Clinical Documentation via Contextual Autocomplete [PDF] abstract
14. Exploiting Cross-Lingual Knowledge in Unsupervised Acoustic Modeling for Low-Resource Languages [PDF] abstract
Abstracts
1. Neural Modeling for Named Entities and Morphology (NEMO^2) [PDF] back to contents
Dan Bareket, Reut Tsarfaty
Abstract: Named Entity Recognition (NER) is a fundamental NLP task, commonly formulated as classification over a sequence of tokens. Morphologically-Rich Languages (MRLs) pose a challenge to this basic formulation, as the boundaries of Named Entities do not coincide with token boundaries, rather, they respect morphological boundaries. To address NER in MRLs we then need to answer two fundamental modeling questions: (i) What should be the basic units to be identified and labeled, are they token-based or morpheme-based? and (ii) How can morphological units be encoded and accurately obtained in realistic (non-gold) scenarios? We empirically investigate these questions on a novel parallel NER benchmark we deliver, with parallel token-level and morpheme-level NER annotations for Modern Hebrew, a morphologically complex language. Our results show that explicitly modeling morphological boundaries consistently leads to improved NER performance, and that a novel hybrid architecture that we propose, in which NER precedes and prunes the morphological decomposition (MD) space, greatly outperforms the standard pipeline approach, on both Hebrew NER and Hebrew MD in realistic scenarios.
2. The optimality of syntactic dependency distances [PDF] back to contents
Ramon Ferrer-i-Cancho, Carlos Gómez-Rodríguez, Juan Luis Esteban, Lluís Alemany-Puig
Abstract: It is often stated that human languages, as other biological systems, are shaped by cost-cutting pressures but, to what extent? Attempts to quantify the degree of optimality of languages by means of an optimality score have been scarce and focused mostly on English. Here we recast the problem of the optimality of the word order of a sentence as an optimization problem on a spatial network where the vertices are words, arcs indicate syntactic dependencies and the space is defined by the linear order of the words in the sentence. We introduce a new score to quantify the cognitive pressure to reduce the distance between linked words in a sentence. The analysis of sentences from 93 languages representing 19 linguistic families reveals that half of languages are optimized to a 70% or more. The score indicates that distances are not significantly reduced in a few languages and confirms two theoretical predictions, i.e. that longer sentences are more optimized and that distances are more likely to be longer than expected by chance in short sentences. We present a new hierarchical ranking of languages by their degree of optimization. The statistical advantages of the new score call for a reevaluation of the evolution of dependency distance over time in languages as well as the relationship between dependency distance and linguistic competence. Finally, the principles behind the design of the score can be extended to develop more powerful normalizations of topological distances or physical distances in more dimensions.
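The quantity the score builds on is easy to state concretely. Below is a small illustrative Python sketch, not the paper's exact score, that computes the summed dependency distance of a sentence and compares it against the chance level obtained from random word orders, the kind of baseline the abstract alludes to.

```python
import random

def dependency_distance(arcs):
    """Summed distance D over dependency arcs; word positions are 1-indexed."""
    return sum(abs(h - d) for h, d in arcs)

def random_baseline(arcs, n_words, samples=1000, seed=0):
    """Mean D when the same tree is laid out over random word orders."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        perm = list(range(1, n_words + 1))
        rng.shuffle(perm)
        pos = {word: i + 1 for i, word in enumerate(perm)}
        total += sum(abs(pos[h] - pos[d]) for h, d in arcs)
    return total / samples

# "She saw the dog": saw->She, saw->dog, dog->the (word positions 1..4)
arcs = [(2, 1), (2, 4), (4, 3)]
D = dependency_distance(arcs)      # 4
D_rand = random_baseline(arcs, 4)  # ~5 on average
print(D, D_rand)                   # D below chance = distances are reduced
```

A language is then "optimized" to the extent that observed distances fall below this chance level; the paper's actual score involves further normalization beyond this simple comparison.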
3. Leverage Unlabeled Data for Abstractive Speech Summarization with Self-Supervised Learning and Back-Summarization [PDF] back to contents
Paul Tardy, Louis de Seynes, François Hernandez, Vincent Nguyen, David Janiszek, Yannick Estève
Abstract: Supervised approaches for Neural Abstractive Summarization require large annotated corpora that are costly to build. We present a French meeting summarization task where reports are predicted based on the automatic transcription of the meeting audio recordings. In order to build a corpus for this task, it is necessary to obtain the (automatic or manual) transcription of each meeting, and then to segment and align it with the corresponding manual report to produce training examples suitable for training. On the other hand, we have access to a very large amount of unaligned data, in particular reports without corresponding transcription. Reports are professionally written and well formatted making pre-processing straightforward. In this context, we study how to take advantage of this massive amount of unaligned data using two approaches (i) self-supervised pre-training using a target-side denoising encoder-decoder model; (ii) back-summarization i.e. reversing the summarization process by learning to predict the transcription given the report, in order to align single reports with generated transcription, and use this synthetic dataset for further training. We report large improvements compared to the previous baseline (trained on aligned data only) for both approaches on two evaluation sets. Moreover, combining the two gives even better results, outperforming the baseline by a large margin of +6 ROUGE-1 and ROUGE-L and +5 ROUGE-2 on two evaluation sets.
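The back-summarization loop can be summarized with the following rough sketch. The training and generation routines are passed in as placeholders for whatever seq2seq toolkit is used; this is not the authors' code.

```python
def back_summarization(train_seq2seq, generate, aligned_pairs, unaligned_reports):
    """aligned_pairs: list of (transcript, report); returns a forward summarizer.

    train_seq2seq(sources, targets) and generate(model, text) are hypothetical
    placeholders, not a specific library API."""
    transcripts = [t for t, _ in aligned_pairs]
    reports = [r for _, r in aligned_pairs]
    # 1) learn the reverse direction: report -> transcript
    back_model = train_seq2seq(reports, transcripts)
    # 2) synthesize transcripts for the reports that lack one
    synthetic = [(generate(back_model, r), r) for r in unaligned_reports]
    # 3) train the forward summarizer on real + synthetic pairs
    data = aligned_pairs + synthetic
    return train_seq2seq([t for t, _ in data], [r for _, r in data])
```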
4. Photon: A Robust Cross-Domain Text-to-SQL System [PDF] back to contents
Jichuan Zeng, Xi Victoria Lin, Caiming Xiong, Richard Socher, Michael R. Lyu, Irwin King, Steven C.H. Hoi
Abstract: Natural language interfaces to databases (NLIDB) democratize end user access to relational data. Due to fundamental differences between natural language communication and programming, it is common for end users to issue questions that are ambiguous to the system or fall outside the semantic scope of its underlying query language. We present Photon, a robust, modular, cross-domain NLIDB that can flag natural language input to which a SQL mapping cannot be immediately determined. Photon consists of a strong neural semantic parser (63.2\% structure accuracy on the Spider dev benchmark), a human-in-the-loop question corrector, a SQL executor and a response generator. The question corrector is a discriminative neural sequence editor which detects confusion span(s) in the input question and suggests rephrasing until a translatable input is given by the user or a maximum number of iterations are conducted. Experiments on simulated data show that the proposed method effectively improves the robustness of text-to-SQL system against untranslatable user input. The live demo of our system is available at this http URL.
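The human-in-the-loop correction flow amounts to a simple control loop, sketched below. The parser, confusion-span detector, and user interaction are supplied as callables; these are hypothetical stand-ins, not Photon's actual API.

```python
def answer_or_flag(question, parse_to_sql, find_confusion_spans,
                   ask_user_to_rephrase, max_iterations=3):
    """Human-in-the-loop correction loop (illustrative sketch)."""
    for _ in range(max_iterations):
        spans = find_confusion_spans(question)
        if not spans:                 # translatable: map to SQL and answer
            return parse_to_sql(question)
        # confusing span(s) detected: surface them and ask for a rephrase
        question = ask_user_to_rephrase(question, spans)
    return None                       # still untranslatable: flag the input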
5. NeuralQA: A Usable Library for Question Answering (Contextual Query Expansion + BERT) on Large Datasets [PDF] back to contents
Victor Dibia
Abstract: Existing tools for Question Answering (QA) have challenges that limit their use in practice. They can be complex to set up or integrate with existing infrastructure, do not offer configurable interactive interfaces, and do not cover the full set of subtasks that frequently comprise the QA pipeline (query expansion, retrieval, reading, and explanation/sensemaking). To help address these issues, we introduce NeuralQA - a usable library for QA on large datasets. NeuralQA integrates well with existing infrastructure (e.g., ElasticSearch instances and reader models trained with the HuggingFace Transformers API) and offers helpful defaults for QA subtasks. It introduces and implements contextual query expansion (CQE) using a masked language model (MLM) as well as relevant snippets (RelSnip) - a method for condensing large documents into smaller passages that can be speedily processed by a document reader model. Finally, it offers a flexible user interface to support workflows for research explorations (e.g., visualization of gradient-based explanations to support qualitative inspection of model behaviour) and large scale search deployment. Code and documentation for NeuralQA is available as open source on Github.
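As an independent illustration of MLM-based query expansion in the spirit of CQE (this is not NeuralQA's implementation; only the standard HuggingFace fill-mask pipeline is used):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def expand_query(query, top_k=3):
    """Mask each query term in turn; collect the MLM's in-context alternatives."""
    words = query.split()
    expansions = set()
    for i, word in enumerate(words):
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        for pred in fill_mask(masked, top_k=top_k):
            token = pred["token_str"].strip()
            if token.isalpha() and token.lower() != word.lower():
                expansions.add(token)
    return expansions

print(expand_query("what causes heart attacks"))
# e.g. {'disease', 'failure', 'strokes', ...} depending on the model
```

In a CQE-style setup, in-context terms like these would then be added to the retrieval query sent to the document index.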
6. MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering [PDF] back to contents
Shayne Longpre, Yi Lu, Joachim Daiber
Abstract: Progress in cross-lingual modeling depends on challenging, realistic, and diverse evaluation sets. We introduce Multilingual Knowledge Questions and Answers (MKQA), an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering. We benchmark state-of-the-art extractive question answering baselines, trained on Natural Questions, including Multilingual BERT, and XLM-RoBERTa, in zero shot and translation settings. Results indicate this dataset is challenging, especially in low-resource languages.
7. The Return of Lexical Dependencies: Neural Lexicalized PCFGs [PDF] back to contents
Hao Zhu, Yonatan Bisk, Graham Neubig
Abstract: In this paper we demonstrate that $\textit{context free grammar (CFG) based methods for grammar induction benefit from modeling lexical dependencies}$. This contrasts to the most popular current methods for grammar induction, which focus on discovering $\textit{either}$ constituents $\textit{or}$ dependencies. Previous approaches to marry these two disparate syntactic formalisms (e.g. lexicalized PCFGs) have been plagued by sparsity, making them unsuitable for unsupervised grammar induction. However, in this work, we present novel neural models of lexicalized PCFGs which allow us to overcome sparsity problems and effectively induce both constituents and dependencies within a single model. Experiments demonstrate that this unified framework results in stronger results on both representations than achieved when modeling either formalism alone. Code is available at this https URL.
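For context, a lexicalized PCFG in the classic (e.g., Collins-style) sense attaches a head word to every nonterminal, so that a binary rewrite carries two lexical heads:

$$P\big(A[h] \rightarrow B[h]\; C[h']\big)$$

where $A, B, C$ are nonterminal categories, $h$ is the head word propagated from the head child, and $h'$ is the lexical head of the non-head child. Estimating a separate probability for every $(h, h')$ pair is what makes count-based lexicalized PCFGs sparse; the paper's neural models address this by parameterizing these rule probabilities with a neural network rather than raw counts.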
8. Exploiting stance hierarchies for cost-sensitive stance detection of Web documents [PDF] back to contents
Arjun Roy, Pavlos Fafalios, Asif Ekbal, Xiaofei Zhu, Stefan Dietze
Abstract: Fact checking is an essential challenge when combating fake news. Identifying documents that agree or disagree with a particular statement (claim) is a core task in this process. In this context, stance detection aims at identifying the position (stance) of a document towards a claim. Most approaches address this task through a 4-class classification model where the class distribution is highly imbalanced. Therefore, they are particularly ineffective in detecting the minority classes (for instance, 'disagree'), even though such instances are crucial for tasks such as fact-checking by providing evidence for detecting false claims. In this paper, we exploit the hierarchical nature of stance classes, which allows us to propose a modular pipeline of cascading binary classifiers, enabling performance tuning on a per step and class basis. We implement our approach through a combination of neural and traditional classification models that highlight the misclassification costs of minority classes. Evaluation results demonstrate state-of-the-art performance of our approach and its ability to significantly improve the classification performance of the important 'disagree' class.
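A minimal sketch of such a cascade over the usual four stance classes (unrelated / discuss / agree / disagree) is shown below; the paper's actual hierarchy and classifiers may differ.

```python
def classify_stance(doc, claim, is_related, takes_stance, agrees):
    """Cascade of binary classifiers; each callable maps (doc, claim) -> bool.

    The three classifiers are hypothetical placeholders for trained models."""
    if not is_related(doc, claim):
        return "unrelated"
    if not takes_stance(doc, claim):
        return "discuss"
    return "agree" if agrees(doc, claim) else "disagree"
```

Because each stage is a separate binary model, its threshold and misclassification costs can be tuned independently, which is how such a pipeline can be biased toward recovering the rare 'disagree' class.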
9. Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [PDF] back to contents
Xin Dong, Yaxin Zhu, Yupeng Zhang, Zuohui Fu, Dongkuan Xu, Sen Yang, Gerard de Melo
Abstract: In cross-lingual text classification, one seeks to exploit labeled data from one language to train a text classification model that can then be applied to a completely different language. Recent multilingual representation models have made it much easier to achieve this. Still, there may still be subtle differences between languages that are neglected when doing so. To address this, we present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations. The resulting model then serves as a teacher to induce labels for unlabeled target language samples that can be used during further adversarial training, allowing us to gradually adapt our model to the target language. Compared with a number of strong baselines, we observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
10. An Experimental Study of The Effects of Position Bias on Emotion Cause Extraction [PDF] back to contents
Jiayuan Ding, Mayank Kejriwal
Abstract: Emotion Cause Extraction (ECE) aims to identify emotion causes from a document after annotating the emotion keywords. Some baselines have been proposed to address this problem, such as rule-based, commonsense based and machine learning methods. We show, however, that a simple random selection approach toward ECE that does not require observing the text achieves similar performance compared to the baselines. We utilized only position information relative to the emotion cause to accomplish this goal. Since position information alone without observing the text resulted in higher F-measure, we therefore uncovered a bias in the ECE single genre Sina-news benchmark. Further analysis showed that an imbalance of emotional cause location exists in the benchmark, with a majority of cause clauses immediately preceding the central emotion clause. We examine the bias from a linguistic perspective, and show that high accuracy rate of current state-of-art deep learning models that utilize location information is only evident in datasets that contain such position biases. The accuracy drastically reduced when a dataset with balanced location distribution is introduced. We therefore conclude that it is the innate bias in this benchmark that caused high accuracy rate of these deep learning models in ECE. We hope that the case study in this paper presents both a cautionary lesson, as well as a template for further studies, in interpreting the superior fit of deep learning models without checking for bias.
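A position-only baseline of the kind described can be stated in a few lines. The offset distribution below is a made-up example for illustration, not the paper's estimate.

```python
import random

def position_only_cause(n_clauses, emotion_idx, rng=None):
    """Pick a cause clause using only its offset from the emotion clause,
    never reading the text itself."""
    rng = rng or random.Random(0)
    # illustrative offsets: most mass on the clause just before the emotion
    offsets, weights = [-1, 0, -2, 1], [0.5, 0.25, 0.15, 0.1]
    while True:
        idx = emotion_idx + rng.choices(offsets, weights=weights)[0]
        if 0 <= idx < n_clauses:
            return idx

print(position_only_cause(n_clauses=6, emotion_idx=3))  # usually 2, the clause before
```

That such a text-blind selector can approach trained models' scores is exactly the benchmark bias the paper documents.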
11. AI-based Monitoring and Response System for Hospital Preparedness towards COVID-19 in Southeast Asia [PDF] back to contents
Tushar Goswamy, Naishadh Parmar, Ayush Gupta, Vatsalya Tandon, Raunak Shah, Varun Goyal, Sanyog Gupta, Karishma Laud, Shivam Gupta, Sudhanshu Mishra, Ashutosh Modi
Abstract: This research paper proposes a COVID-19 monitoring and response system to identify the surge in the volume of patients at hospitals and shortage of critical equipment like ventilators in South-east Asian countries, to understand the burden on health facilities. This can help authorities in these regions with resource planning measures to redirect resources to the regions identified by the model. Due to the lack of publicly available data on the influx of patients in hospitals, or the shortage of equipment, ICU units or hospital beds that regions in these countries might be facing, we leverage Twitter data for gleaning this information. The approach has yielded accurate results for states in India, and we are working on validating the model for the remaining countries so that it can serve as a reliable tool for authorities to monitor the burden on hospitals.
12. Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability [PDF] back to contents
Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao, Yifan Gong
Abstract: Because of its streaming nature, recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition. In this paper, we describe our recent development of RNN-T models with reduced GPU memory consumption during training, better initialization strategy, and advanced encoder modeling with future lookahead. When trained with Microsoft's 65 thousand hours of anonymized training data, the developed RNN-T model surpasses a very well trained hybrid model with both better recognition accuracy and lower latency. We further study how to customize RNN-T models to a new domain, which is important for deploying E2E models to practical scenarios. By comparing several methods leveraging text-only data in the new domain, we found that updating RNN-T's prediction and joint networks using text-to-speech generated from domain-specific text is the most effective.
13. Fast, Structured Clinical Documentation via Contextual Autocomplete [PDF] back to contents
Divya Gopinath, Monica Agrawal, Luke Murray, Steven Horng, David Karger, David Sontag
Abstract: We present a system that uses a learned autocompletion mechanism to facilitate rapid creation of semi-structured clinical documentation. We dynamically suggest relevant clinical concepts as a doctor drafts a note by leveraging features from both unstructured and structured medical data. By constraining our architecture to shallow neural networks, we are able to make these suggestions in real time. Furthermore, as our algorithm is used to write a note, we can automatically annotate the documentation with clean labels of clinical concepts drawn from medical vocabularies, making notes more structured and readable for physicians, patients, and future algorithms. To our knowledge, this system is the only machine learning-based documentation utility for clinical notes deployed in a live hospital setting, and it reduces keystroke burden of clinical concepts by 67% in real environments.
14. Exploiting Cross-Lingual Knowledge in Unsupervised Acoustic Modeling for Low-Resource Languages [PDF] back to contents
Siyuan Feng
Abstract: (Short version of Abstract) This thesis describes an investigation on unsupervised acoustic modeling (UAM) for automatic speech recognition (ASR) in the zero-resource scenario, where only untranscribed speech data is assumed to be available. UAM is not only important in addressing the general problem of data scarcity in ASR technology development but also essential to many non-mainstream applications, for examples, language protection, language acquisition and pathological speech assessment. The present study is focused on two research problems. The first problem concerns unsupervised discovery of basic (subword level) speech units in a given language. Under the zero-resource condition, the speech units could be inferred only from the acoustic signals, without requiring or involving any linguistic direction and/or constraints. The second problem is referred to as unsupervised subword modeling. In its essence a frame-level feature representation needs to be learned from untranscribed speech. The learned feature representation is the basis of subword unit discovery. It is desired to be linguistically discriminative and robust to non-linguistic factors. Particularly extensive use of cross-lingual knowledge in subword unit discovery and modeling is a focus of this research.