Contents
1. Toward Understanding Clinical Context of Medication Change Events in Clinical Narratives [PDF] Abstract
2. KddRES: A Multi-level Knowledge-driven Dialogue Dataset for Restaurant Towards Customized Dialogue System [PDF] Abstract
3. Measuring the Novelty of Natural Language Text Using the Conjunctive Clauses of a Tsetlin Machine Text Classifier [PDF] Abstract
4. Curriculum CycleGAN for Textual Sentiment Domain Adaptation with Multiple Sources [PDF] Abstract
5. Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining [PDF] Abstract
6. MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining [PDF] Abstract
7. Self-supervised Document Clustering Based on BERT with Data Augment [PDF] Abstract
8. Widening the Dialogue Workflow Modeling Bottleneck in Ontology-Based Personal Assistants [PDF] Abstract
9. Don't Patronize Me! An Annotated Dataset with Patronizing and Condescending Language towards Vulnerable Communities [PDF] Abstract
10. Facebook AI's WMT20 News Translation Task Submission [PDF] Abstract
11. A Two-Phase Approach for Abstractive Podcast Summarization [PDF] Abstract
12. NLPGym -- A toolkit for evaluating RL agents on Natural Language Processing Tasks [PDF] Abstract
13. A Probabilistic Approach in Historical Linguistics Word Order Change in Infinitival Clauses: from Latin to Old French [PDF] Abstract
14. Dialog Simulation with Realistic Variations for Training Goal-Oriented Conversational Systems [PDF] Abstract
15. End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features [PDF] Abstract
16. Generating universal language adversarial examples by understanding and enhancing the transferability across neural models [PDF] Abstract
17. Structural and Functional Decomposition for Personality Image Captioning in a Communication Game [PDF] Abstract
18. Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter [PDF] Abstract
Abstracts
1. Toward Understanding Clinical Context of Medication Change Events in Clinical Narratives [PDF] Back to Contents
Diwakar Mahajan, Jennifer J Liang, Ching-Huei Tsou
Abstract: Understanding medication events in clinical narratives is essential to achieving a complete picture of a patient's medication history. While prior research has explored identification of medication changes in clinical notes, due to the longitudinal and narrative nature of clinical documentation, extraction of medication change alone without the necessary clinical context is insufficient for use in real-world applications, such as medication timeline generation and medication reconciliation. In this paper, we present the Contextualized Medication Event Dataset (CMED), a dataset for capturing relevant context of medication changes documented in clinical notes, which was developed using a novel conceptual framework that organizes context for clinical events into various orthogonal dimensions. In this process, we define specific contextual aspects pertinent to medication change events (i.e. Action, Negation, Temporality, Certainty, and Actor), describe the annotation process and challenges encountered, and report the results of preliminary experiments. The resulting dataset, CMED, consists of 9,013 medication mentions annotated over 500 clinical notes. To encourage development of methods for improved understanding of medications in clinical narratives, CMED will be released to the community as a shared task in 2021.
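The abstract names five orthogonal context dimensions for each medication mention. As a concrete illustration, a CMED-style annotation record might look like the sketch below; the field names, identifiers, and label inventories shown are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical CMED-style annotation record for one medication mention.
# The five dimensions (Action, Negation, Temporality, Certainty, Actor)
# come from the abstract; the label values shown are illustrative only.
annotation = {
    "note_id": "note_0042",          # hypothetical identifiers
    "mention": "metformin",
    "span": (118, 127),              # character offsets in the note
    "context": {
        "Action":      "Increase",   # e.g. Start / Stop / Increase / NoChange
        "Negation":    "NotNegated",
        "Temporality": "Present",    # vs. Past / Future
        "Certainty":   "Certain",    # vs. Hypothetical / Conditional
        "Actor":       "Physician",  # vs. Patient
    },
}

# Because the dimensions are orthogonal, each can be treated as an
# independent classification target over the same mention-in-context.
for dimension, label in annotation["context"].items():
    print(f"{dimension:>12}: {label}")
```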
2. KddRES: A Multi-level Knowledge-driven Dialogue Dataset for Restaurant Towards Customized Dialogue System [PDF] Back to Contents
Hongru Wang, Min Li, Zimo Zhou, Gabrie, Kam-Fai Wong
Abstract: Unlike the CrossWOZ (Chinese) and MultiWOZ (English) datasets, which contain only coarse-grained information, no existing dataset properly handles fine-grained, hierarchical information. In this paper, we publish the first Cantonese knowledge-driven Dialogue Dataset for REStaurant (KddRES) in Hong Kong, which grounds the information in multi-turn conversations to one specific restaurant. Our corpus contains 0.8k conversations derived from 10 restaurants with various styles in different regions. In addition, we designed fine-grained slots and intents to better capture semantic information. Benchmark experiments and statistical analysis show the diversity and rich annotations of our dataset. We believe the release of KddRES is a necessary supplement to current dialogue datasets and is especially suitable and valuable for small and medium enterprises (SMEs), for example for building a customized dialogue system for each restaurant. The corpus and benchmark models are publicly available.
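The abstract does not reproduce the dataset's schema, so the sketch below only illustrates what "fine-grained, hierarchical" slot annotations could look like for a restaurant domain; all keys, intents, and values are invented.

```python
# Hypothetical KddRES-style turn annotation (illustrative only).
turn = {
    "utterance": "Do they have a private room for eight, and is parking free?",
    "intents": ["ask_facility", "ask_parking_fee"],
    "slots": {
        "facility": {                      # top-level slot ...
            "private_room": {              # ... with nested sub-slots
                "capacity": "8",
            },
        },
        "parking": {"fee": "?"},           # "?" marks a requested value
    },
}
```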
3. Measuring the Novelty of Natural Language Text Using the Conjunctive Clauses of a Tsetlin Machine Text Classifier [PDF] Back to Contents
Bimal Bhattarai, Ole-Christoffer Granmo, Lei Jiao
Abstract: Most supervised text classification approaches assume a closed world, counting on all classes being present in the data at training time. This assumption can lead to unpredictable behaviour during operation, whenever novel, previously unseen, classes appear. Although deep learning-based methods have recently been used for novelty detection, they are challenging to interpret due to their black-box nature. This paper addresses \emph{interpretable} open-world text classification, where the trained classifier must deal with novel classes during operation. To this end, we extend the recently introduced Tsetlin machine (TM) with a novelty scoring mechanism. The mechanism uses the conjunctive clauses of the TM to measure to what degree a text matches the classes covered by the training data. We demonstrate that the clauses provide a succinct interpretable description of known topics, and that our scoring mechanism makes it possible to discern novel topics from the known ones. Empirically, our TM-based approach outperforms seven other novelty detection schemes on three out of five datasets, and performs second and third best on the remaining, with the added benefit of an interpretable propositional logic-based representation.
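One plausible reading of the clause-based scoring mechanism is sketched below: each class's conjunctive clauses vote on a text, and a text whose best class score is low is flagged as novel. The voting scheme, toy data, and threshold here are assumptions, not the paper's exact formulation.

```python
import numpy as np

def class_score(clause_outputs, clause_polarity):
    """Clause-based class score (a plausible sketch, not the paper's formula).

    clause_outputs:  0/1 vector, one entry per conjunctive clause of a class.
    clause_polarity: +1/-1 vector (positive clauses vote for the class,
                     negative clauses vote against it).
    """
    return float(np.dot(clause_outputs, clause_polarity))

# Score a document against every known class; a low maximum class score
# suggests the text lies outside the training classes, i.e. is novel.
rng = np.random.default_rng(0)
polarity = np.where(np.arange(200) % 2 == 0, 1, -1)   # alternating polarities
scores = [class_score(rng.integers(0, 2, size=200), polarity) for _ in range(5)]
is_novel = max(scores) < 10.0   # the threshold is a free parameter
print(scores, is_novel)
```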
4. Curriculum CycleGAN for Textual Sentiment Domain Adaptation with Multiple Sources [PDF] Back to Contents
Sicheng Zhao, Yang Xiao, Jiang Guo, Xiangyu Yue, Jufeng Yang, Ravi Krishna, Pengfei Xu, Kurt Keutzer
Abstract: Sentiment analysis of user-generated reviews or comments on products and services on social media can help enterprises to analyze the feedback from customers and take corresponding actions for improvement. To mitigate large-scale annotations, domain adaptation (DA) provides an alternate solution by learning a transferable model from another labeled source domain. Since the labeled data may be from multiple sources, multi-source domain adaptation (MDA) would be more practical to exploit the complementary information from different domains. Existing MDA methods might fail to extract some discriminative features in the target domain that are related to sentiment, neglect the correlations of different sources as well as the distribution difference among different sub-domains even in the same source, and cannot reflect the varying optimal weighting during different training stages. In this paper, we propose an instance-level multi-source domain adaptation framework, named curriculum cycle-consistent generative adversarial network (C-CycleGAN). Specifically, C-CycleGAN consists of three components: (1) pre-trained text encoder which encodes textual input from different domains into a continuous representation space, (2) intermediate domain generator with curriculum instance-level adaptation which bridges the gap across source and target domains, and (3) task classifier trained on the intermediate domain for final sentiment classification. C-CycleGAN transfers source samples at an instance-level to an intermediate domain that is closer to target domain with sentiment semantics preserved and without losing discriminative features. Further, our dynamic instance-level weighting mechanisms can assign the optimal weights to different source samples in each training stage. We conduct extensive experiments on three benchmark datasets and achieve substantial gains over state-of-the-art approaches.
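A minimal structural sketch of the three named components, assuming standard PyTorch modules; the layer sizes and the generator design are invented for illustration and omit the cycle-consistency and curriculum machinery.

```python
import torch
import torch.nn as nn

class IntermediateDomainGenerator(nn.Module):
    """(2) Maps a source-domain text encoding toward the intermediate domain."""
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z):
        return self.net(z)

class SentimentClassifier(nn.Module):
    """(3) Task classifier trained on intermediate-domain representations."""
    def __init__(self, dim=768, n_classes=2):
        super().__init__()
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, z):
        return self.fc(z)

# (1) A frozen pre-trained text encoder would produce z_src from raw text;
# a random tensor stands in for that encoding here.
z_src = torch.randn(4, 768)
logits = SentimentClassifier()(IntermediateDomainGenerator()(z_src))
print(logits.shape)   # torch.Size([4, 2])
```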
5. Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining [PDF] Back to Contents
Zijun Sun, Chun Fan, Xiaofei Sun, Yuxian Meng, Fei Wu, Jiwei Li
Abstract: The goal of semi-supervised learning is to utilize the unlabeled, in-domain dataset U to improve models trained on the labeled dataset D. Under the context of large-scale language-model (LM) pretraining, how we can make the best use of U is poorly understood: is semi-supervised learning still beneficial with the presence of large-scale pretraining? should U be used for in-domain LM pretraining or pseudo-label generation? how should the pseudo-label based semi-supervised model be actually implemented? how different semi-supervised strategies affect performances regarding D of different sizes, U of different sizes, etc. In this paper, we conduct comprehensive studies on semi-supervised learning in the task of text classification under the context of large-scale LM pretraining. Our studies shed important lights on the behavior of semi-supervised learning methods: (1) with the presence of in-domain pretraining LM on U, open-domain LM pretraining is unnecessary; (2) both the in-domain pretraining strategy and the pseudo-label based strategy introduce significant performance boosts, with the former performing better with larger U, the latter performing better with smaller U, and the combination leading to the largest performance boost; (3) self-training (pretraining first on pseudo labels D' and then fine-tuning on D) yields better performances when D is small, while joint training on the combination of pseudo labels D' and the original dataset D yields better performances when D is large. Using semi-supervised learning strategies, we are able to achieve a performance of around 93.8% accuracy with only 50 training data points on the IMDB dataset, and a competitive performance of 96.6% with the full IMDB dataset. Our work marks an initial step in understanding the behavior of semi-supervised learning models under the context of large-scale pretraining.
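The self-training versus joint-training comparison in finding (3) can be made concrete with a short schematic; `train` and `predict` below are placeholders for LM fine-tuning and inference, not a real API.

```python
# Schematic of the two pseudo-label strategies compared in the paper.
def train(model, dataset):
    return model          # placeholder: fine-tune `model` on `dataset`

def predict(model, texts):
    return [(t, "pos") for t in texts]   # placeholder pseudo-labeling

D = [("great movie", "pos"), ("terrible plot", "neg")]   # small labeled set
U = ["watchable", "a bore"]                              # unlabeled in-domain set

teacher = train("teacher-LM", D)
D_pseudo = predict(teacher, U)           # pseudo-labels D'

# Self-training: pretrain on D' first, then fine-tune on D (better when D is small).
student_self = train(train("student-LM", D_pseudo), D)

# Joint training: one pass over D' and D combined (better when D is large).
student_joint = train("student-LM", D_pseudo + D)
```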
6. MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining [PDF] Back to Contents
Wei Zhu
Abstract: Although the development of pre-trained language models (PLMs) has significantly raised performance on various Chinese natural language processing (NLP) tasks, the vocabulary of these Chinese PLMs remains the character-based one provided by Google's Chinese BERT \cite{devlin2018bert}. In addition, masked language model pre-training is based on a single vocabulary, which limits downstream task performance. In this work, we first propose a novel method, \emph{seg\_tok}, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization. Then we propose three versions of multi-vocabulary pretraining (MVP) to improve the models' expressiveness. Experiments show that (a) compared with a character-based vocabulary, \emph{seg\_tok} not only improves the performance of Chinese PLMs on sentence-level tasks, but also improves efficiency; and (b) MVP improves PLMs' downstream performance, in particular \emph{seg\_tok}'s performance on sequence labeling tasks.
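A toy sketch of the seg_tok idea, assuming jieba as a stand-in CWS tool and a greedy longest-match subword step; the paper's actual segmenter and subword algorithm may differ.

```python
import jieba  # any CWS tool would do; jieba is a stand-in assumption

def seg_tok(text, subword_vocab):
    """Sketch: segment into words first, then subword-tokenize each word.
    The greedy longest-match subword step is illustrative only."""
    pieces = []
    for word in jieba.lcut(text):
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):       # greedy longest match
                if word[i:j] in subword_vocab or j == i + 1:
                    pieces.append(word[i:j])        # fall back to single chars
                    i = j
                    break
    return pieces

vocab = {"北京", "大学", "自然", "语言"}              # toy subword vocabulary
print(seg_tok("北京大学研究自然语言处理", vocab))
```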
7. Self-supervised Document Clustering Based on BERT with Data Augment [PDF] Back to Contents
Haoxiang Shi, Cen Wang, Tetsuya Sakai
Abstract: Contrastive learning is a good way to pursue discriminative unsupervised learning, which can inherit the advantages and experience of well-studied deep models without complex novel model design. In this paper, we propose two learning methods for document clustering: the first is partial contrastive learning with unsupervised data augmentation, and the second is self-supervised contrastive learning. Both methods achieve state-of-the-art results in clustering accuracy when compared to recently proposed unsupervised clustering approaches.
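As one concrete instance of the contrastive objective, the standard NT-Xent loss over two augmented views of a batch of document embeddings is sketched below; the paper's exact loss and augmentation scheme are not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss: each embedding's positive is the other view of the
    same document; everything else in the batch is a negative."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.t() / tau                                # scaled cosine sims
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Two "views" of the same 8 documents, e.g. BERT embeddings of the original
# text and of an augmented copy (back-translation, word dropout, ...).
z1, z2 = torch.randn(8, 768), torch.randn(8, 768)
print(nt_xent(z1, z2).item())
```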
8. Widening the Dialogue Workflow Modeling Bottleneck in Ontology-Based Personal Assistants [PDF] Back to Contents
Michael Wessel, Edgar Kalns, Girish Acharya, Andreas Kathol
Abstract: We present a new approach to dialogue specification for Virtual Personal Assistants (VPAs) based on so-called dialogue workflow graphs, with several demonstrated advantages over current ontology-based methods. Our new dialogue specification language (DSL) enables customers to more easily participate in the VPA modeling process due to a user-friendly modeling framework. Resulting models are also significantly more compact. VPAs can be developed much more rapidly. The DSL is a new modeling layer on top of our ontology-based Dialogue Management (DM) framework OntoVPA. We explain the rationale and benefits behind the new language and support our claims with concrete reduced Level-of-Effort (LOE) numbers from two recent OntoVPA projects.
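The abstract does not show the DSL's syntax, so the sketch below only illustrates the general idea of a dialogue workflow graph as editable data; the node and edge names are invented and do not reflect OntoVPA's actual DSL.

```python
# Hypothetical dialogue workflow graph in the spirit of the paper's DSL.
workflow = {
    "start":    {"say": "Where would you like to fly?",           "next": ["get_city"]},
    "get_city": {"expect": "city",                                "next": ["confirm"]},
    "confirm":  {"say": "Booking a flight to {city}, correct?",
                 "next": ["done", "get_city"]},   # yes -> done, no -> re-ask
    "done":     {"say": "Done!",                                  "next": []},
}

# A workflow interpreter only needs to walk the graph; the dialogue logic
# lives in the data, which is what makes the models compact and editable.
node = "start"
while workflow[node]["next"]:
    print(node, "->", workflow[node]["next"][0])
    node = workflow[node]["next"][0]
```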
9. Don't Patronize Me! An Annotated Dataset with Patronizing and Condescending Language towards Vulnerable Communities [PDF] Back to Contents
Carla Pérez-Almendros, Luis Espinosa-Anke, Steven Schockaert
Abstract: In this paper, we introduce a new annotated dataset which is aimed at supporting the development of NLP models to identify and categorize language that is patronizing or condescending towards vulnerable communities (e.g. refugees, homeless people, poor families). While the prevalence of such language in the general media has long been shown to have harmful effects, it differs from other types of harmful language, in that it is generally used unconsciously and with good intentions. We furthermore believe that the often subtle nature of patronizing and condescending language (PCL) presents an interesting technical challenge for the NLP community. Our analysis of the proposed dataset shows that identifying PCL is hard for standard NLP models, with language models such as BERT achieving the best results.
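A minimal sketch of the kind of BERT baseline the abstract reports, assuming a standard Hugging Face sequence-classification setup; the checkpoint name and binary PCL/not-PCL label scheme are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Untrained classification head here; in practice it is fine-tuned on the
# annotated dataset before the probabilities below mean anything.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

text = "These poor families need us to speak for them."
batch = tok(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.softmax(dim=-1))   # P(not-PCL), P(PCL) after fine-tuning
```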
10. Facebook AI's WMT20 News Translation Task Submission [PDF] Back to Contents
Peng-Jen Chen, Ann Lee, Changhan Wang, Naman Goyal, Angela Fan, Mary Williamson, Jiatao Gu
Abstract: This paper describes Facebook AI's submission to WMT20 shared news translation task. We focus on the low resource setting and participate in two language pairs, Tamil <-> English and Inuktitut <-> English, where there are limited out-of-domain bitext and monolingual data. We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain. We explore techniques that leverage bitext and monolingual data from all languages, such as self-supervised model pretraining, multilingual models, data augmentation, and reranking. To better adapt the translation system to the test domain, we explore dataset tagging and fine-tuning on in-domain data. We observe that different techniques provide varied improvements based on the available data of the language pair. Based on the finding, we integrate these techniques into one training pipeline. For En->Ta, we explore an unconstrained setup with additional Tamil bitext and monolingual data and show that further improvement can be obtained. On the test set, our best submitted systems achieve 21.5 and 13.7 BLEU for Ta->En and En->Ta respectively, and 27.9 and 13.0 for Iu->En and En->Iu respectively.
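Dataset tagging can be illustrated in a few lines: a source-side corpus tag marks where each training pair came from, and decoding with the news tag steers the model toward the target domain. The tag names and example text below are assumptions.

```python
# Sketch of the dataset-tagging technique mentioned in the abstract.
def tag(source_sentence, corpus):
    return f"<{corpus}> {source_sentence}"

train_pairs = [
    (tag("nanri, vanakkam", "bt"),   "thanks, hello"),   # back-translated data
    (tag("...", "news"),             "..."),             # in-domain news bitext
]
test_input = tag("...", "news")   # decode with the news tag at test time
```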
11. A Two-Phase Approach for Abstractive Podcast Summarization [PDF] Back to Contents
Chujie Zheng, Kunpeng Zhang, Harry Jiannan Wang, Ling Fan
Abstract: Podcast summarization is different from summarization of other data formats, such as news, patents, and scientific papers in that podcasts are often longer, conversational, colloquial, and full of sponsorship and advertising information, which imposes great challenges for existing models. In this paper, we focus on abstractive podcast summarization and propose a two-phase approach: sentence selection and seq2seq learning. Specifically, we first select important sentences from the noisy long podcast transcripts. The selection is based on sentence similarity to the reference to reduce the redundancy and the associated latent topics to preserve semantics. Then the selected sentences are fed into a pre-trained encoder-decoder framework for the summary generation. Our approach achieves promising results regarding both ROUGE-based measures and human evaluations.
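The first phase can be sketched with TF-IDF cosine similarity as a stand-in for the paper's sentence representation (the paper additionally uses latent topics); the transcript sentences and reference below are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Phase 1 sketch: rank transcript sentences by similarity to a reference.
sentences = [
    "Welcome to the show, brought to you by our sponsor.",
    "Today we discuss how sleep affects memory consolidation.",
    "Use promo code PODCAST for ten percent off.",
    "Our guest explains the role of deep sleep in learning.",
]
reference = "An episode about sleep, memory and learning."

vec = TfidfVectorizer().fit(sentences + [reference])
sims = cosine_similarity(vec.transform(sentences), vec.transform([reference]))[:, 0]
selected = [s for _, s in sorted(zip(sims, sentences), reverse=True)[:2]]
print(selected)   # phase 2 feeds these into the pre-trained encoder-decoder
```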
12. NLPGym -- A toolkit for evaluating RL agents on Natural Language Processing Tasks [PDF] Back to Contents
Rajkumar Ramamurthy, Rafet Sifa, Christian Bauckhage
Abstract: Reinforcement learning (RL) has recently shown impressive performance in complex game AI and robotics tasks. To a large extent, this is thanks to the availability of simulated environments such as OpenAI Gym, Atari Learning Environment, or Malmo which allow agents to learn complex tasks through interaction with virtual environments. While RL is also increasingly applied to natural language processing (NLP), there are no simulated textual environments available for researchers to apply and consistently benchmark RL on NLP tasks. With the work reported here, we therefore release NLPGym, an open-source Python toolkit that provides interactive textual environments for standard NLP tasks such as sequence tagging, multi-label classification, and question answering. We also present experimental results for 6 tasks using different RL algorithms which serve as baselines for further research. The toolkit is published at this https URL
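A Gym-style episode loop of the kind such a toolkit enables is sketched below; the `env` and `agent` objects are placeholders, and the toolkit's actual environment class names should be taken from the linked repository, not from this sketch.

```python
# Generic Gym-style interaction loop for a textual environment: the
# observation would be text (e.g. a sentence plus tags so far) and the
# action a discrete choice (e.g. the next tag or label).
def run_episode(env, agent):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(obs)                    # agent picks a tag/label
        obs, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward
```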
13. A Probabilistic Approach in Historical Linguistics Word Order Change in Infinitival Clauses: from Latin to Old French [PDF] Back to Contents
Olga Scrivner
Abstract: This research offers a new interdisciplinary approach to the field of Linguistics by using Computational Linguistics, NLP, Bayesian Statistics and Sociolinguistics methods. This thesis investigates word order change in infinitival clauses from Object-Verb (OV) to Verb-Object (VO) in the history of Latin and Old French. By applying a variationist approach, I examine a synchronic word order variation in each stage of language change, from which I infer the character, periodization and constraints of diachronic variation. I also show that in discourse-configurational languages, such as Latin and Early Old French, it is possible to identify pragmatically neutral contexts by using information structure annotation. I further argue that by mapping pragmatic categories into a syntactic structure, we can detect how word order change unfolds. For this investigation, the data are extracted from annotated corpora spanning several centuries of Latin and Old French and from additional resources created by using computational linguistic methods. The data are then further codified for various pragmatic, semantic, syntactic and sociolinguistic factors. This study also evaluates previous factors proposed to account for word order alternation and change. I show how information structure and syntactic constraints change over time and propose a method that allows researchers to differentiate a stable word order alternation from alternation indicating a change. Finally, I present a three-stage probabilistic model of word order change, which also conforms to traditional language change patterns.
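Diachronic change of this kind is commonly modeled as a logistic S-curve in the variationist literature; the toy sketch below only illustrates that shape, with invented dates and rates rather than the thesis's estimates.

```python
import numpy as np

def p_vo(year, k=0.02, midpoint=1000):
    """Logistic S-curve: P(VO) rises slowly, accelerates, then saturates."""
    return 1.0 / (1.0 + np.exp(-k * (year - midpoint)))

for year in (800, 900, 1000, 1100, 1200):   # Latin -> Old French period
    print(year, round(float(p_vo(year)), 3))
```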
14. Dialog Simulation with Realistic Variations for Training Goal-Oriented Conversational Systems [PDF] Back to Contents
Chien-Wei Lin, Vincent Auvray, Daniel Elkind, Arijit Biswas, Maryam Fazel-Zarandi, Nehal Belgamwar, Shubhra Chandra, Matt Zhao, Angeliki Metallinou, Tagyoung Chung, Charlie Shucheng Zhu, Suranjit Adhikari, Dilek Hakkani-Tur
Abstract: Goal-oriented dialog systems enable users to complete specific goals like requesting information about a movie or booking a ticket. Typically the dialog system pipeline contains multiple ML models, including natural language understanding, state tracking and action prediction (policy learning). These models are trained through a combination of supervised or reinforcement learning methods and therefore require collection of labeled domain specific datasets. However, collecting annotated datasets with language and dialog-flow variations is expensive, time-consuming and scales poorly due to human involvement. In this paper, we propose an approach for automatically creating a large corpus of annotated dialogs from a few thoroughly annotated sample dialogs and the dialog schema. Our approach includes a novel goal-sampling technique for sampling plausible user goals and a dialog simulation technique that uses heuristic interplay between the user and the system (Alexa), where the user tries to achieve the sampled goal. We validate our approach by generating data and training three different downstream conversational ML models. We achieve 18-50% relative accuracy improvements on a held-out test set compared to a baseline dialog generation approach that only samples natural language and entity value variations from existing catalogs but does not generate any novel dialog flow variations. We also qualitatively establish that the proposed approach is better than the baseline. Moreover, several different conversational experiences have been built using this method, which enables customers to have a wide variety of conversations with Alexa.
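Goal sampling from a dialog schema can be illustrated as below: a goal is an assignment of an intent and slot values drawn from the schema. The schema is a toy stand-in; the paper's sampler also ensures the sampled goals are plausible.

```python
import random

schema = {
    "intent": ["BookMovie", "GetShowtimes"],
    "slots": {
        "movie":   ["Inception", "Up"],
        "city":    ["Seattle", "Boston"],
        "n_seats": [1, 2, 3, 4],
    },
}

def sample_goal(schema, rng=random):
    return {
        "intent": rng.choice(schema["intent"]),
        **{slot: rng.choice(values) for slot, values in schema["slots"].items()},
    }

print(sample_goal(schema))   # drives the simulated user in the dialog rollout
```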
15. End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features [PDF] Back to Contents
Edmilson Morais, Hong-Kwang J. Kuo, Samuel Thomas, Zoltan Tuske, Brian Kingsbury
Abstract: Transformer networks and self-supervised pre-training have consistently delivered state-of-the-art results in the field of natural language processing (NLP); however, their merits in the field of spoken language understanding (SLU) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SLU transformer network based architecture which allows the use of self-supervised pre-trained acoustic features, pre-trained model initialization and multi-task training. Several SLU experiments for predicting intent and entity labels/values using the ATIS dataset are performed. These experiments investigate the interaction of pre-trained model initialization and multi-task training with either traditional filterbank or self-supervised pre-trained acoustic features. Results show not only that self-supervised pre-trained acoustic features outperform filterbank features in almost all the experiments, but also that when these features are used in combination with multi-task training, they almost eliminate the necessity of pre-trained model initialization.
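A structural sketch of the multi-task setup the abstract describes: a transformer encoder over acoustic features with an utterance-level intent head and a per-frame entity (slot) head trained jointly. All dimensions and label counts below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SLUMultiTask(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_intents=26, n_slots=120):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.intent_head = nn.Linear(d_model, n_intents)   # per utterance
        self.slot_head = nn.Linear(d_model, n_slots)       # per frame

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        h = self.encoder(self.proj(feats))
        return self.intent_head(h.mean(dim=1)), self.slot_head(h)

model = SLUMultiTask()
intent_logits, slot_logits = model(torch.randn(2, 300, 80))
print(intent_logits.shape, slot_logits.shape)   # joint loss sums both tasks
```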
16. Generating universal language adversarial examples by understanding and enhancing the transferability across neural models [PDF] Back to Contents
Liping Yuan, Xiaoqing Zheng, Yi Zhou, Cho-Jui Hsieh, Kai-wei Chang, Xuanjing Huang
Abstract: Deep neural network models are vulnerable to adversarial attacks. In many cases, malicious inputs intentionally crafted for one model can fool another model in the black-box attack setting. However, there is a lack of systematic studies on the transferability of adversarial examples and how to generate universal adversarial examples. In this paper, we systematically study the transferability of adversarial attacks for text classification models. In particular, we conduct extensive experiments to investigate how various factors, such as network architecture, input format, word embedding, and model capacity, affect the transferability of adversarial attacks. Based on these studies, we then propose universal black-box attack algorithms that can induce adversarial examples to attack almost all existing models. These universal adversarial examples reflect the defects of the learning process and the bias in the training dataset. Finally, we generalize these adversarial examples into universal word replacement rules that can be used for model diagnostics.
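Applying distilled universal word replacement rules is mechanically simple, as the sketch below shows; the rule pairs are invented, since the paper derives its rules from transferable attacks rather than from a hand-written table.

```python
import re

# Sketch of applying "universal word replacement rules" to probe a model.
rules = {"film": "flick", "excellent": "exceptional"}   # illustrative rules

def apply_rules(text, rules):
    def swap(match):
        word = match.group(0)
        return rules.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, text)

clean = "An excellent film with excellent acting."
print(apply_rules(clean, rules))   # feed the perturbed input to a classifier
```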
17. Structural and Functional Decomposition for Personality Image Captioning in a Communication Game [PDF] Back to Contents
Thu Nguyen, Duy Phung, Minh Hoai, Thien Huu Nguyen
Abstract: Personality image captioning (PIC) aims to describe an image with a natural language caption given a personality trait. In this work, we introduce a novel formulation for PIC based on a communication game between a speaker and a listener. The speaker attempts to generate natural language captions while the listener encourages the generated captions to contain discriminative information about the input images and personality traits. In this way, we expect that the generated captions can be improved to naturally represent the images and express the traits. In addition, we propose to adapt the language model GPT2 to perform caption generation for PIC. This enables the speaker and listener to benefit from the language encoding capacity of GPT2. Our experiments show that the proposed model achieves the state-of-the-art performance for PIC.
18. Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter [PDF] Back to Contents
Xiong Wang, Zhuoyuan Yao, Xian Shi, Lei Xie
Abstract: End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, the recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high accuracy and low latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still requires paired speech-text data for training. Further strengthening the language modeling ability with extra text data, such as through shallow fusion with an external language model, brings only a small performance gain. In view of the fact that Mandarin Chinese is a character-based language in which each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our approach first uses an RNN-T to transform acoustic features into a syllable sequence, and then converts the syllable sequence into a character sequence through an RNN-T-based syllable-to-character converter. Thus a rich text repository can easily be used to strengthen the language modeling ability. By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency.
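The cascade amounts to a two-stage decoding pipeline, sketched schematically below under the assumption of two trained RNN-T models exposing a hypothetical decode interface; actual RNN-T beam-search decoding over the joint network is abstracted away.

```python
# Schematic sketch of the cascade. `acoustic_rnnt` and `s2c_rnnt` are
# assumed pre-trained models with a hypothetical `decode` method.
def cascade_rnnt_recognize(audio_features, acoustic_rnnt, s2c_rnnt):
    """Two-stage recognition: audio frames -> syllables -> characters."""
    # Stage 1: the first RNN-T maps acoustic frames to a tonal-syllable
    # sequence, e.g. ["ni3", "hao3"].
    syllables = acoustic_rnnt.decode(audio_features)
    # Stage 2: the syllable-to-character RNN-T converts syllables into
    # characters. Because syllable sequences can be generated from text
    # alone, this stage can be trained on text-only data, which is what
    # strengthens the language modeling ability.
    characters = s2c_rnnt.decode(syllables)
    return "".join(characters)
```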
19. Refining Automatic Speech Recognition System for older adults [PDF]
Liu Chen, Meysam Asgari
Abstract: Building a high-quality automatic speech recognition (ASR) system with limited training data is a challenging task, particularly for a narrow target population. Open-source ASR systems, trained on ample data from adults, perform poorly on seniors' speech due to the acoustic mismatch between adults and seniors. With 12 hours of training data, we attempt to develop an ASR system for socially isolated seniors (80+ years old) with possible cognitive impairments. We experimentally show that an ASR system trained on the general adult population performs poorly on our target population, and that transfer learning (TL) can boost the system's performance. Building on the fundamental idea of TL, tuning model parameters, we further improve the system by leveraging an attention mechanism to exploit the model's intermediate information. Our approach achieves a 1.58% absolute improvement over the TL model.
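One plausible form of an attention mechanism over intermediate information is a learned softmax weighting of per-layer encoder states, sketched in PyTorch below; the paper's exact design may differ.

```python
# A minimal sketch of layer-weighted attention over intermediate
# encoder states during transfer learning. The weighting scheme is an
# assumption for illustration, not the paper's exact architecture.
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    """Learn a softmax-weighted combination of per-layer hidden states."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_states):
        # layer_states: list of (batch, time, dim) tensors, one per layer
        stacked = torch.stack(layer_states, dim=0)          # (L, B, T, D)
        alpha = torch.softmax(self.weights, dim=0)          # (L,)
        return (alpha.view(-1, 1, 1, 1) * stacked).sum(0)   # (B, T, D)
```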
20. Where Are You? Localization from Embodied Dialog [PDF]
Meera Hahn, Jacob Krantz, Dhruv Batra, Devi Parikh, James M. Rehg, Stefan Lee, Peter Anderson
Abstract: We present Where Are You? (WAY), a dataset of ~6k dialogs in which two humans -- an Observer and a Locator -- complete a cooperative localization task. The Observer is spawned at random in a 3D environment and can navigate from first-person views while answering questions from the Locator. The Locator must localize the Observer in a detailed top-down map by asking questions and giving instructions. Based on this dataset, we define three challenging tasks: Localization from Embodied Dialog or LED (localizing the Observer from dialog history), Embodied Visual Dialog (modeling the Observer), and Cooperative Localization (modeling both agents). In this paper, we focus on the LED task - providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices. Our best model achieves 32.7% success at identifying the Observer's location within 3m in unseen buildings, vs. 70.4% for human Locators.
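To make the LED task's inputs and target concrete, a hypothetical record layout for a single WAY episode is sketched below; the field names are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical layout of one WAY episode for the LED task; the field
# names are assumptions made for illustration.
episode = {
    "dialog": [
        {"role": "locator",  "text": "What room are you in?"},
        {"role": "observer", "text": "A kitchen with a marble island."},
    ],
    "map_id": "building_17_floor_2",   # top-down map the Locator sees
    "observer_location": (3.2, 7.8),   # target: Observer's map position
}
# LED: predict `observer_location` from `dialog` and the map alone;
# success is a prediction within 3m of the true location.
```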
21. Open4Business (O4B): An Open Access Dataset for Summarizing Business Documents [PDF]
Amanpreet Singh, Niranjan Balasubramanian
Abstract: A major challenge in fine-tuning deep learning models for automatic summarization is the need for large domain-specific datasets. One of the barriers to curating such data from resources like online publications is navigating the license regulations applicable to their re-use, especially for commercial purposes. As a result, despite the availability of several business journals, there are no large-scale datasets for summarizing business documents. In this work, we introduce Open4Business (O4B), a dataset of 17,458 open-access business articles and their reference summaries. The dataset introduces a new challenge for summarization in the business domain, requiring highly abstractive and more concise summaries compared to other existing datasets. Additionally, we evaluate existing models on it and show that models trained on O4B and on a 7x larger non-open-access dataset achieve comparable summarization performance. We release the dataset, along with code that can be leveraged to similarly gather data for multiple domains.
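As a minimal sketch of the fine-tuning setup such a dataset enables, the snippet below runs one seq2seq training step on an O4B-style (article, summary) pair with the transformers library; the BART checkpoint is an illustrative choice, not necessarily a model evaluated in the paper.

```python
# One illustrative training step for a seq2seq summarizer on an
# (article, summary) pair; the checkpoint and lengths are assumptions.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

article = "Full text of an open-access business article ..."
summary = "Its reference summary ..."

inputs = tokenizer(article, truncation=True, max_length=1024,
                   return_tensors="pt")
labels = tokenizer(summary, truncation=True, max_length=128,
                   return_tensors="pt").input_ids

# Cross-entropy loss between the model's output and the reference
# summary; an optimizer step would follow in a real training loop.
loss = model(**inputs, labels=labels).loss
loss.backward()
```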