1. Orthogonal Language and Task Adapters in Zero-Shot Cross-Lingual Transfer [PDF]
Marko Vidoni, Ivan Vulić, Goran Glavaš
Abstract: Adapter modules, additional trainable parameters that enable efficient fine-tuning of pretrained transformers, have recently been used for language specialization of multilingual transformers, improving downstream zero-shot cross-lingual transfer. In this work, we propose orthogonal language and task adapters (dubbed orthoadapters) for cross-lingual transfer. They are trained to encode language- and task-specific information that is complementary (i.e., orthogonal) to the knowledge already stored in the pretrained transformer's parameters. Our zero-shot cross-lingual transfer experiments, involving three tasks (POS-tagging, NER, NLI) and a set of 10 diverse languages, 1) point to the usefulness of orthoadapters in cross-lingual transfer, especially for the most complex NLI task, but also 2) indicate that the optimal adapter configuration highly depends on the task and the target language. We hope that our work will motivate a wider investigation of the usefulness of orthogonality constraints in language- and task-specific fine-tuning of pretrained transformers.
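The orthogonality constraint at the heart of this idea can be sketched as a regularizer that drives the cross-correlation between trainable adapter directions and frozen pretrained directions toward zero. A minimal PyTorch sketch, with illustrative names and not the paper's exact loss:

```python
import torch

def orthogonality_penalty(W_frozen: torch.Tensor, A_adapter: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius norm of the cross-correlation, ||W^T A||_F^2.

    W_frozen:  (d, k_w) directions already encoded by the pretrained transformer.
    A_adapter: (d, k_a) trainable adapter directions.
    The penalty is zero exactly when the two column spaces are orthogonal.
    """
    return (W_frozen.t() @ A_adapter).pow(2).sum()

# Hypothetical use inside a training step: keep the task loss, but push the
# adapter away from information the frozen weights already carry.
# loss = task_loss + lambda_orth * orthogonality_penalty(W.detach(), A)
```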
2. Discriminating Between Similar Nordic Languages [PDF]
René Haas, Leon Derczynski
Abstract: Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach for automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools. Concretely we will focus on discrimination between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese and Icelandic.
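For a sense of what such a discriminator looks like, a character n-gram baseline (illustrative only, not necessarily the authors' model) fits in a few lines of scikit-learn; the training sentences here are toy stand-ins for real Danish/Swedish/Icelandic corpora:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled sentences; real work would use corpora for
# da, sv, nn, nb, fo, and is.
texts = ["jeg hedder Anna", "jag heter Anna", "ég heiti Anna"]
labels = ["da", "sv", "is"]

# Character n-grams are the standard signal for closely related languages,
# since whole words are often shared or near-identical.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["jag heter Ole"]))  # likely ['sv']
```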
3. Morphology Matters: A Multilingual Language Modeling Analysis [PDF]
Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, Lane Schwartz
Abstract: Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically-motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language's morphology on language modeling.
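Surprisal, the quantity these comparisons rest on, is the negative log-probability a trained model assigns to each held-out token; summed over a text, it measures how hard that text is for the model. A minimal sketch under that standard definition:

```python
import math

def total_surprisal(token_probs):
    """Sum of -log2 p(token | context) over a held-out text.

    `token_probs` are per-token probabilities from any language model
    (e.g., an LSTM over BPE units); higher surprisal means the text is
    harder for the model.
    """
    return sum(-math.log2(p) for p in token_probs)

# Example: a 4-token sentence with mixed per-token probabilities.
print(total_surprisal([0.5, 0.25, 0.5, 0.125]))  # 7.0 bits
```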
4. Improved Robustness to Disfluencies in RNN-Transducer Based Speech Recognition [PDF]
Valentin Mendelev, Tina Raissi, Guglielmo Camporese, Manuel Giollo
Abstract: Automatic Speech Recognition (ASR) based on Recurrent Neural Network Transducers (RNN-T) is gaining interest in the speech community. We investigate data selection and preparation choices aiming for improved robustness of RNN-T ASR to speech disfluencies with a focus on partial words. For evaluation we use clean data, data with disfluencies and a separate dataset with speech affected by stuttering. We show that after including a small amount of data with disfluencies in the training set the recognition accuracy on the tests with disfluencies and stuttering improves. Increasing the amount of training data with disfluencies gives additional gains without degradation on the clean data. We also show that replacing partial words with a dedicated token helps to get even better accuracy on utterances with disfluencies and stutter. The evaluation of our best model shows 22.5% and 16.4% relative WER reduction on those two evaluation sets.
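The partial-word replacement can be pictured as a transcript preprocessing step. The sketch below assumes fragments are marked with a trailing hyphen, one common transcription convention; the paper's actual markup and token name may differ:

```python
import re

PARTIAL_TOKEN = "<pw>"  # hypothetical dedicated token

def replace_partial_words(transcript: str) -> str:
    """Map word fragments such as 'incompre-' to a single dedicated token."""
    return re.sub(r"\b\w+-(?=\s|$)", PARTIAL_TOKEN, transcript)

print(replace_partial_words("i was incompre- unable to finish"))
# -> "i was <pw> unable to finish"
```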
5. Improving Zero Shot Learning Baselines with Commonsense Knowledge [PDF]
Abhinaba Roy, Deepanway Ghosal, Erik Cambria, Navonil Majumder, Rada Mihalcea, Soujanya Poria
Abstract: Zero shot learning -- the problem of training and testing on a completely disjoint set of classes -- relies greatly on its ability to transfer knowledge from train classes to test classes. Traditionally semantic embeddings consisting of human defined attributes (HA) or distributed word embeddings (DWE) are used to facilitate this transfer by improving the association between visual and semantic embeddings. In this paper, we take advantage of explicit relations between nodes defined in ConceptNet, a commonsense knowledge graph, to generate commonsense embeddings of the class labels by using a graph convolution network-based autoencoder. Our experiments performed on three standard benchmark datasets surpass the strong baselines when we fuse our commonsense embeddings with existing semantic embeddings i.e. HA and DWE.
6. ParsiNLU: A Suite of Language Understanding Challenges for Persian [PDF]
Daniel Khashabi, Arman Cohan, Siamak Shakeri, Pedram Hosseini, Pouya Pezeshkpour, Malihe Alikhani, Moin Aminnaseri, Marzieh Bitaab, Faeze Brahman, Sarik Ghazarian, Mozhdeh Gheini, Arman Kabiri, Rabeeh Karimi Mahabadi, Omid Memarrast, Ahmadreza Mosallanezhad, Erfan Noury, Shahab Raji, Mohammad Sadegh Rasooli, Sepideh Sadeghi, Erfan Sadeqi Azer, Niloofar Safi Samghabadi, Mahsa Shafaei, Saber Sheybani, Ali Tazarv, Yadollah Yaghoobzadeh
Abstract: Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains concentrated on resource-rich languages like English. This work focuses on the Persian language, one of the widely spoken languages in the world, and yet there are few NLU datasets available for this rich language. The availability of high-quality evaluation datasets is a necessity for reliable assessment of the progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark in Persian language that includes a range of high-level tasks -- Reading Comprehension, Textual Entailment, etc. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. In addition, we present the first results on state-of-the-art monolingual and multi-lingual pre-trained language-models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.
7. Improving Task-Agnostic BERT Distillation with Layer Mapping Search [PDF]
Xiaoqi Jiao, Huating Chang, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu
Abstract: Knowledge distillation (KD) which transfers the knowledge from a large teacher model to a small student model, has been widely used to compress the BERT model recently. Besides the supervision in the output in the original KD, recent works show that layer-level supervision is crucial to the performance of the student BERT model. However, previous works designed the layer mapping strategy heuristically (e.g., uniform or last-layer), which can lead to inferior performance. In this paper, we propose to use the genetic algorithm (GA) to search for the optimal layer mapping automatically. To accelerate the search process, we further propose a proxy setting where a small portion of the training corpus are sampled for distillation, and three representative tasks are chosen for evaluation. After obtaining the optimal layer mapping, we perform the task-agnostic BERT distillation with it on the whole corpus to build a compact student model, which can be directly fine-tuned on downstream tasks. Comprehensive experiments on the evaluation benchmarks demonstrate that 1) layer mapping strategy has a significant effect on task-agnostic BERT distillation and different layer mappings can result in quite different performances; 2) the optimal layer mapping strategy from the proposed search process consistently outperforms the other heuristic ones; 3) with the optimal layer mapping, our student model achieves state-of-the-art performance on the GLUE tasks.
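A layer mapping here is an assignment of each student layer to a teacher layer, which makes the GA's search space concrete. The toy loop below uses a stand-in fitness function; in the paper, fitness would be distillation quality measured on the sampled proxy corpus and the three proxy tasks:

```python
import random

TEACHER_LAYERS, STUDENT_LAYERS = 12, 4

def random_mapping():
    """One teacher-layer index per student layer."""
    return [random.randrange(TEACHER_LAYERS) for _ in range(STUDENT_LAYERS)]

def fitness(mapping):
    # Stand-in objective rewarding evenly spaced layers; the paper would
    # instead distill a student under this mapping and score it on proxy tasks.
    return -sum(abs(m - (i + 1) * TEACHER_LAYERS // STUDENT_LAYERS)
                for i, m in enumerate(mapping))

def evolve(pop_size=20, generations=50, p_mut=0.2):
    pop = [random_mapping() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]              # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, STUDENT_LAYERS)   # single-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < p_mut:                 # point mutation
                child[random.randrange(STUDENT_LAYERS)] = random.randrange(TEACHER_LAYERS)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

print(evolve())  # e.g. [3, 6, 9, 11]
```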
8. Document-aligned Japanese-English Conversation Parallel Corpus [PDF]
Matīss Rikters, Ryokan Ri, Tong Li, Toshiaki Nakazawa
Abstract: Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resourced languages, but not document-level (DL) MT, which is difficult to 1) train, given the small amount of DL data available, and 2) evaluate, as the main methods and data sets focus on SL evaluation. To address the first issue, we present a document-aligned Japanese-English conversation corpus, including balanced, high-quality business conversation data for tuning and testing. As for the second issue, we manually identify the main areas where SL MT fails to produce adequate translations in the absence of context. We then create an evaluation set in which these phenomena are annotated, to facilitate automatic evaluation of DL systems. We train MT models using our corpus to demonstrate how using context leads to improvements.
9. EQG-RACE: Examination-Type Question Generation [PDF]
Xin Jia, Wenjie Zhou, Xu Sun, Yunfang Wu
Abstract: Question Generation (QG) is an essential component of the automatic intelligent tutoring systems, which aims to generate high-quality questions for facilitating the reading practice and assessments. However, existing QG technologies encounter several key issues concerning the biased and unnatural language sources of datasets which are mainly obtained from the Web (e.g. SQuAD). In this paper, we propose an innovative Examination-type Question Generation approach (EQG-RACE) to generate exam-like questions based on a dataset extracted from RACE. Two main strategies are employed in EQG-RACE for dealing with discrete answer information and reasoning among long contexts. A Rough Answer and Key Sentence Tagging scheme is utilized to enhance the representations of input. An Answer-guided Graph Convolutional Network (AG-GCN) is designed to capture structure information in revealing the inter-sentences and intra-sentence relations. Experimental results show a state-of-the-art performance of EQG-RACE, which is apparently superior to the baselines. In addition, our work has established a new QG prototype with a reshaped dataset and QG method, which provides an important benchmark for related research in future work. We will make our data and code publicly available for further research.
10. Reinforced Multi-Teacher Selection for Knowledge Distillation [PDF]
Fei Yuan, Linjun Shou, Jian Pei, Wutao Lin, Ming Gong, Yan Fu, Daxin Jiang
Abstract: In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage remain the bottleneck of applying pre-trained deep models in production. As a popular method for model compression, knowledge distillation transfers knowledge from one or multiple large (teacher) models to a small (student) model. When multiple teacher models are available in distillation, the state-of-the-art methods assign a fixed weight to a teacher model in the whole distillation. Furthermore, most of the existing methods allocate an equal weight to every teacher model. In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of student models distilled. We systematically develop a reinforced method to dynamically assign weights to teacher models for different training instances and optimize the performance of student model. Our extensive experimental results on several NLP tasks clearly verify the feasibility and effectiveness of our approach.
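The key object is a per-instance weight vector over teachers; the student is then distilled against the weighted mixture of teacher predictions. A simplified PyTorch sketch of that loss (the weights below are random stand-ins for what the paper's learned policy would produce):

```python
import torch
import torch.nn.functional as F

def weighted_distillation_loss(student_logits, teacher_logits_list, weights, T=2.0):
    """KL between the student and a per-instance weighted mixture of teachers.

    weights: (batch, n_teachers), e.g. produced by a policy network from
    instance features and normalized with a softmax.
    """
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list], dim=1
    )                                                  # (batch, n_teachers, classes)
    mixture = (weights.unsqueeze(-1) * teacher_probs).sum(dim=1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_student, mixture, reduction="batchmean") * T * T

# Hypothetical shapes: 8 examples, 3 teachers, 5 classes.
s = torch.randn(8, 5)
ts = [torch.randn(8, 5) for _ in range(3)]
w = F.softmax(torch.randn(8, 3), dim=-1)
print(weighted_distillation_loss(s, ts, w))
```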
11. Exploring Deep Neural Networks and Transfer Learning for Analyzing Emotions in Tweets [PDF]
Yasas Senarath, Uthayasanker Thayasivam
Abstract: In this paper, we present an experiment on using deep learning and transfer learning techniques for emotion analysis in tweets and suggest a method to interpret our deep learning models. The proposed approach for emotion analysis combines a Long Short Term Memory (LSTM) network with a Convolutional Neural Network (CNN). Then we extend this approach for emotion intensity prediction using transfer learning technique. Furthermore, we propose a technique to visualize the importance of each word in a tweet to get a better understanding of the model. Experimentally, we show in our analysis that the proposed models outperform the state-of-the-art in emotion classification while maintaining competitive results in predicting emotion intensity.
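An LSTM-CNN combination of the kind described can be sketched compactly: the recurrent layer encodes the tweet, a convolution extracts local patterns over its states, and max-pooling feeds a classifier. Layer sizes below are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class LstmCnnClassifier(nn.Module):
    """Toy LSTM + CNN emotion classifier; sizes are illustrative."""

    def __init__(self, vocab=10000, emb=128, hidden=64, classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.conv = nn.Conv1d(2 * hidden, 64, kernel_size=3, padding=1)
        self.out = nn.Linear(64, classes)

    def forward(self, token_ids):                     # (batch, seq)
        x, _ = self.lstm(self.embed(token_ids))       # (batch, seq, 2*hidden)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, 64, seq)
        x = x.max(dim=-1).values                      # global max-pool over time
        return self.out(x)                            # emotion logits

logits = LstmCnnClassifier()(torch.randint(0, 10000, (2, 30)))
print(logits.shape)  # torch.Size([2, 4])
```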
12. Towards Neural Programming Interfaces [PDF]
Zachary C. Brown, Nathaniel Robinson, David Wingate, Nancy Fulda
Abstract: It is notoriously difficult to control the behavior of artificial neural networks such as generative neural language models. We recast the problem of controlling natural language generation as that of learning to interface with a pretrained language model, just as Application Programming Interfaces (APIs) control the behavior of programs by altering hyperparameters. In this new paradigm, a specialized neural network (called a Neural Programming Interface or NPI) learns to interface with a pretrained language model by manipulating the hidden activations of the pretrained model to produce desired outputs. Importantly, no permanent changes are made to the weights of the original model, allowing us to re-purpose pretrained models for new tasks without overwriting any aspect of the language model. We also contribute a new data set construction algorithm and GAN-inspired loss function that allows us to train NPI models to control outputs of autoregressive transformers. In experiments against other state-of-the-art approaches, we demonstrate the efficacy of our methods using OpenAI's GPT-2 model, successfully controlling noun selection, topic aversion, offensive speech filtering, and other aspects of language while largely maintaining the controlled model's fluency under deterministic settings.
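Mechanically, steering a frozen language model through its hidden activations can be done with a forward hook on one transformer block. The sketch below uses the Hugging Face transformers library; the fixed offset tensor is a stand-in for the perturbation an NPI would learn to produce, and the layer choice is arbitrary:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

offset = torch.zeros(1, 1, model.config.n_embd)
offset[..., :10] = 0.5   # stand-in for an NPI-produced perturbation

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 holds the hidden states.
    return (output[0] + offset,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)
ids = tok("The weather today is", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=10, do_sample=False)[0]))
handle.remove()  # restore the unmodified model; its weights never changed
```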
13. Multilingual Transfer Learning for QA Using Translation as Data Augmentation [PDF]
Mihaela Bornea, Lin Pan, Sara Rosenthal, Radu Florian, Avirup Sil
Abstract: Prior work on multilingual question answering has mostly focused on using large multilingual pre-trained language models (LM) to perform zero-shot language-wise learning: train a QA model on English and test on other languages. In this work, we explore strategies that improve cross-lingual transfer by bringing the multilingual embeddings closer in the semantic space. Our first strategy augments the original English training data with machine translation-generated data. This results in a corpus of multilingual silver-labeled QA pairs that is 14 times larger than the original training set. In addition, we propose two novel strategies, language adversarial training and language arbitration framework, which significantly improve the (zero-resource) cross-lingual transfer performance and result in LM embeddings that are less language-variant. Empirically, we show that the proposed models outperform the previous zero-shot baseline on the recently introduced multilingual MLQA and TyDiQA datasets.
14. Comprehension and Knowledge [PDF]
Pavel Naumov, Kevin Ros
Abstract: The ability of an agent to comprehend a sentence is tightly connected to the agent's prior experiences and background knowledge. The paper suggests interpreting comprehension as a modality and proposes a complete bimodal logical system that describes the interplay between the comprehension and knowledge modalities.
15. Control Flow Obfuscation for FJ using Continuation Passing [PDF]
Kenny Zhuo Ming Lu
Abstract: Control flow obfuscation deters software reverse engineering attempts by altering the program's control flow transfer. The alternation should not affect the software's run-time behaviour. In this paper, we propose a control flow obfuscation approach for FJ with exception handling. The approach is based on a source to source transformation using continuation passing style (CPS). We argue that the proposed CPS transformation causes malicious attacks using context insensitive static analysis and context sensitive analysis with fixed call string to lose precision.
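CPS itself is easy to illustrate outside FJ: every function takes an extra continuation argument that receives its result, so returns become calls and the original control-flow structure is obscured from a static analyzer. A direct-style function and its CPS form, shown in Python for brevity (the paper transforms Featherweight Java):

```python
# Direct style: control flow is explicit in the syntax tree.
def fact(n: int) -> int:
    return 1 if n == 0 else n * fact(n - 1)

# CPS: each call receives a continuation k to which it passes its result,
# so the call graph no longer mirrors the original program structure.
def fact_cps(n, k):
    if n == 0:
        return k(1)
    return fact_cps(n - 1, lambda r: k(n * r))

print(fact(5), fact_cps(5, lambda r: r))  # 120 120
```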
16. A Topic Coverage Approach to Evaluation of Topic Models [PDF]
Damir Korenčić, Strahil Ristov, Jelena Repar, Jan Šnajder
Abstract: When topic models are used for discovery of topics in text collections, a question that arises naturally is how well the model-induced topics correspond to topics of interest to the analyst. We investigate an approach to topic model evaluation based on measuring topic coverage, and propose measures of coverage based on matching between model topics and reference topics. We demonstrate the benefits of the approach by evaluating, in a series of experiments, different types of topic models on two distinct text domains. The experiments include evaluation of model quality, analysis of coverage of distinct topic categories, and the relation between coverage and other topic model evaluation methods. The contributions of the paper include the measures of coverage and the recommendations for the use of topic models for topic discovery.
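The shape of a matching-based coverage measure can be sketched simply: the fraction of reference topics that at least one model topic matches above a similarity threshold. The paper's concrete measures and matching procedures differ in detail; the similarity function here is one plausible choice:

```python
def coverage(model_topics, reference_topics, sim, threshold=0.5):
    """Fraction of reference topics matched by at least one model topic.

    `sim` is any topic-similarity function, e.g. cosine over word
    distributions or overlap of top words.
    """
    matched = sum(
        any(sim(m, r) >= threshold for m in model_topics) for r in reference_topics
    )
    return matched / len(reference_topics)

def top_word_overlap(t1, t2):
    # Jaccard overlap of the topics' top-word sets, one plausible `sim`.
    return len(set(t1) & set(t2)) / len(set(t1) | set(t2))

model = [["price", "market", "stock"], ["game", "team", "score"]]
ref = [["market", "stock", "trade"], ["vote", "party", "election"]]
print(coverage(model, ref, top_word_overlap))  # 0.5
```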
17. Exploring wav2vec 2.0 on speaker verification and language identification [PDF]
Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu
Abstract: Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning. It follows a two-stage training process of pre-training and fine-tuning, and performs well in speech recognition tasks especially ultra-low resource cases. In this work, we attempt to extend self-supervised framework to speaker verification and language identification. First, we use some preliminary experiments to indicate that wav2vec 2.0 can capture the information about the speaker and language. Then we demonstrate the effectiveness of wav2vec 2.0 on the two tasks respectively. For speaker verification, we obtain a new state-of-the-art result, Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset. For language identification, we obtain an EER of 12.02% on 1 second condition and an EER of 3.47% on full-length condition of the AP17-OLR dataset. Finally, we utilize one model to achieve the unified modeling by the multi-task learning for the two tasks.
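Equal Error Rate, the metric reported for both tasks, is the ROC operating point where the false acceptance and false rejection rates coincide. A standard way to compute it from trial scores, using numpy and scikit-learn rather than code from the paper:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the point on the ROC where FPR equals FNR (i.e., 1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2

# Toy trial list: 1 = same speaker, 0 = different speaker.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(equal_error_rate(labels, scores))  # ~0.333 on this toy list
```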
18. An algorithm for onset detection of linguistic segments in continuous electroencephalogram signals [PDF]
Tonatiuh Hernández-Del-Toro, Carlos A. Reyes-García
Abstract: A Brain Computer Interface based on imagined words can decode the word a subject is thinking on through brain signals to control an external device. In order to build a fully asynchronous Brain Computer Interface based on imagined words in electroencephalogram signals as source, we need to solve the problem of detecting the onset of the imagined words. Although there has been some research in this field, the problem has not been fully solved. In this paper we present an approach to solve this problem by using values from statistics, information theory and chaos theory as features to correctly identify the onset of imagined words in a continuous signal. On detecting the onsets of imagined words, the highest True Positive Rate achieved by our approach was obtained using features based on the generalized Hurst exponent, this True Positive Rate was 0.69 and 0.77 with a timing error tolerance region of 3 and 4 seconds respectively.
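The generalized Hurst exponent family can be illustrated with its textbook estimator: H(q) is read off the power-law scaling of the q-th order moments of signal increments, E|x(t+tau) - x(t)|^q ~ tau^(q*H(q)). A simplified sketch, not the authors' implementation:

```python
import numpy as np

def generalized_hurst(x, q=2, max_lag=20):
    """Estimate H(q) from the scaling of q-th order increment moments."""
    x = np.asarray(x, dtype=float)
    lags = np.arange(2, max_lag)
    moments = [np.mean(np.abs(x[lag:] - x[:-lag]) ** q) for lag in lags]
    # Log-log regression: slope approximates q * H(q).
    slope = np.polyfit(np.log(lags), np.log(moments), 1)[0]
    return slope / q

rng = np.random.default_rng(0)
walk = np.cumsum(rng.standard_normal(5000))   # Brownian motion: H ~ 0.5
print(round(generalized_hurst(walk), 2))      # ~0.5
```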
19. Mapping the Space of Chemical Reactions Using Attention-Based Neural Networks [PDF]
Philippe Schwaller, Daniel Probst, Alain C. Vaucher, Vishnu H. Nair, David Kreutter, Teodoro Laino, Jean-Louis Reymond
Abstract: Organic reactions are usually assigned to classes containing reactions with similar reagents and mechanisms. Reaction classes facilitate the communication of complex concepts and efficient navigation through chemical reaction space. However, the classification process is a tedious task. It requires the identification of the corresponding reaction class template via annotation of the number of molecules in the reactions, the reaction center, and the distinction between reactants and reagents. This work shows that transformer-based models can infer reaction classes from non-annotated, simple text-based representations of chemical reactions. Our best model reaches a classification accuracy of 98.2%. We also show that the learned representations can be used as reaction fingerprints that capture fine-grained differences between reaction classes better than traditional reaction fingerprints. The insights into chemical reaction space enabled by our learned fingerprints are illustrated by an interactive reaction atlas providing visual clustering and similarity searching.
20. A Sentiment Analysis Approach to the Prediction of Market Volatility [PDF]
Justina Deveikyte, Helyette Geman, Carlo Piccari, Alessandro Provetti
Abstract: Prediction and quantification of future volatility and returns play an important role in financial modelling, both in portfolio optimization and risk management. Natural language processing today makes it possible to process news and social media comments to detect signals of investors' confidence. We have explored the relationship between sentiment extracted from financial news and tweets and FTSE100 movements. We investigated the strength of the correlation between sentiment measures on a given day and market volatility and returns observed the next day. The findings suggest that there is evidence of correlation between sentiment and stock market movements: the sentiment captured from news headlines could be used as a signal to predict market returns; the same does not apply to volatility. Also, in a surprising finding, for the sentiment found in Twitter comments we obtained a correlation coefficient of -0.7, and a p-value below 0.05, which indicates a strong negative correlation between positive sentiment captured from the tweets on a given day and the volatility observed the next day. We developed an accurate classifier for the prediction of market volatility in response to the arrival of new information by deploying topic modelling, based on Latent Dirichlet Allocation, to extract feature vectors from a collection of tweets and financial news. The obtained features were used as additional input to the classifier. Thanks to the combination of sentiment and topic modelling, our classifier achieved a directional prediction accuracy for volatility of 63%.
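The core correlation test, pairing each day's sentiment with the next day's volatility, takes only a few lines with pandas; the series below are hypothetical stand-ins for the FTSE100 volatility and aggregated sentiment data:

```python
import pandas as pd

# Hypothetical daily series; real inputs would be FTSE100 volatility and
# aggregated news/tweet sentiment scores.
df = pd.DataFrame({
    "sentiment": [0.2, -0.1, 0.4, 0.0, -0.3, 0.1],
    "volatility": [1.1, 1.4, 0.9, 1.0, 1.6, 1.2],
})

# Align day-t sentiment with day-(t+1) volatility, then correlate.
df["next_volatility"] = df["volatility"].shift(-1)
print(df["sentiment"].corr(df["next_volatility"]))  # Pearson r
```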