Contents
5. GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method [PDF] Abstract
6. Retrieve, Rerank, Read, then Iterate: Answering Open-Domain Questions of Arbitrary Complexity from Text [PDF] Abstract
8. Generating Plausible Counterfactual Explanations for Deep Transformers in Financial Text Classification [PDF] Abstract
9. Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures [PDF] Abstract
11. Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [PDF] Abstract
14. Natural Language Processing Chains Inside a Cross-lingual Event-Centric Knowledge Pipeline for European Union Under-resourced Languages [PDF] Abstract
17. Deep Learning Framework for Measuring the Digital Strategy of Companies from Earnings Calls [PDF] Abstract
22. NLNDE at CANTEMIST: Neural Sequence Labeling and Parsing Approaches for Clinical Concept Extraction [PDF] Abstract
24. A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios [PDF] Abstract
26. ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding [PDF] Abstract
28. A Scalable Framework for Learning From Implicit User Feedback to Improve Natural Language Understanding in Large-Scale Conversational AI Systems [PDF] Abstract
31. Identifying Similar Movie Characters Quickly but Effectively Using Non-exhaustive Pair-wise Attention [PDF] Abstract
32. KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi [PDF] Abstract
34. ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding [PDF] Abstract
35. Summarizing Utterances from Japanese Assembly Minutes using Political Sentence-BERT-based Method for QA Lab-PoliInfo-2 Task of NTCIR-15 [PDF] Abstract
36. Multilingual Synthetic Question and Answer Generation for Cross-Lingual Reading Comprehension [PDF] Abstract
38. MTGAT: Multimodal Temporal Graph Attention Networks for Unaligned Human Multimodal Language Sequences [PDF] Abstract
40. A Joint Learning Approach based on Self-Distillation for Keyphrase Extraction from Scientific Documents [PDF] Abstract
41. Rediscovering the Slavic Continuum in Representations Emerging from Neural Models of Spoken Language Identification [PDF] Abstract
44. A Differentially Private Text Perturbation Method Using a Regularized Mahalanobis Metric [PDF] Abstract
48. Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [PDF] Abstract
49. Don't shoot butterfly with rifles: Multi-channel Continuous Speech Separation with Early Exit Transformer [PDF] Abstract
54. Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data [PDF] Abstract
56. Characterizing Datasets for Social Visual Question Answering, and the New TinySocial Dataset [PDF] Abstract
Abstracts
1. DICT-MLM: Improved Multilingual Pre-Training using Bilingual Dictionaries [PDF] Back to Contents
Aditi Chaudhary, Karthik Raman, Krishna Srinivasan, Jiecao Chen
Abstract: Pre-trained multilingual language models such as mBERT have shown immense gains for several natural language processing (NLP) tasks, especially in the zero-shot cross-lingual setting. Most, if not all, of these pre-trained models rely on the masked-language modeling (MLM) objective as the key language learning objective. The principle behind these approaches is that predicting the masked words with the help of the surrounding text helps learn potent contextualized representations. Despite the strong representation learning capability enabled by MLM, we demonstrate an inherent limitation of MLM for multilingual representation learning. In particular, by requiring the model to predict the language-specific token, the MLM objective disincentivizes learning a language-agnostic representation -- which is a key goal of multilingual pre-training. Therefore to encourage better cross-lingual representation learning we propose the DICT-MLM method. DICT-MLM works by incentivizing the model to be able to predict not just the original masked word, but potentially any of its cross-lingual synonyms as well. Our empirical analysis on multiple downstream tasks spanning 30+ languages, demonstrates the efficacy of the proposed approach and its ability to learn better multilingual representations.
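A minimal sketch of the kind of objective the abstract describes: instead of scoring only the original masked word, the MLM loss also accepts any of its dictionary translations. The `vocab` and `bilingual_dict` interfaces below are assumptions for illustration, not the authors' implementation.

```python
# Illustrative DICT-MLM-style loss (not the authors' code). Assumption: the
# loss marginalizes the MLM softmax over the masked word and its translations.
import torch
import torch.nn.functional as F

def dict_mlm_loss(logits, masked_positions, target_ids, bilingual_dict, vocab):
    """logits: [seq_len, vocab_size] MLM scores for one example.
    masked_positions: indices of masked tokens; target_ids: original token ids.
    bilingual_dict / vocab: assumed lookup interfaces."""
    losses = []
    for pos, tgt in zip(masked_positions, target_ids):
        word = vocab.id_to_token(tgt)
        # Accept the original word and any of its cross-lingual synonyms.
        candidates = [tgt] + [vocab.token_to_id(w) for w in bilingual_dict.get(word, [])]
        log_probs = F.log_softmax(logits[pos], dim=-1)
        # Marginalize probability mass over all acceptable targets.
        losses.append(-torch.logsumexp(log_probs[candidates], dim=0))
    return torch.stack(losses).mean()
```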
2. Customizing Triggers with Concealed Data Poisoning [PDF] Back to Contents
Eric Wallace, Tony Z. Zhao, Shi Feng, Sameer Singh
Abstract: Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model's training set that causes the model to frequently predict Positive whenever the input contains "James Bond". Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling ("Apple iPhone" triggers negative generations) and machine translation ("iced coffee" mistranslated as "hot coffee"). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.
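A rough sketch of the gradient-guided crafting loop described above, in the spirit of HotFlip-style token swaps: the poison text is edited so that training on it serves the attacker's objective on trigger inputs, without ever containing the trigger. `attacker_objective` is an assumed helper and the first-order swap score is simplified; this is not the authors' released code.

```python
import torch

def craft_poison(model, poison_ids, trigger_batch, embedding_matrix, n_steps=10):
    """Iteratively swap tokens in `poison_ids` (a list of token ids) to reduce
    the attacker's loss on `trigger_batch`, which contains the trigger phrase."""
    for _ in range(n_steps):
        poison_emb = embedding_matrix[poison_ids].clone().requires_grad_(True)
        loss = attacker_objective(model, poison_emb, trigger_batch)  # assumed helper
        grad, = torch.autograd.grad(loss, poison_emb)
        # First-order effect of placing word w at position i is roughly -e_w . grad_i
        # (the constant term for the current word is dropped for simplicity).
        swap_scores = torch.einsum("vd,ld->lv", embedding_matrix, -grad)
        best_pos = swap_scores.max(dim=1).values.argmax().item()
        poison_ids[best_pos] = swap_scores[best_pos].argmax().item()
    return poison_ids
```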
3. On the Transformer Growth for Progressive BERT Training [PDF] Back to Contents
Xiaotao Gu, Liyuan Liu, Hongkun Yu, Jing Li, Chen Chen, Jiawei Han
Abstract: As the excessive pre-training cost arouses the need to improve efficiency, considerable efforts have been made to train BERT progressively--start from an inferior but low-cost model and gradually increase the computational complexity. Our objective is to help advance the understanding of such Transformer growth and discover principles that guide progressive training. First, we find that similar to network architecture selection, Transformer growth also favors compound scaling. Specifically, while existing methods only conduct network growth in a single dimension, we observe that it is beneficial to use compound growth operators and balance multiple dimensions (e.g., depth, width, and input length of the model). Moreover, we explore alternative growth operators in each dimension via controlled comparison to give practical guidance for operator selection. In light of our analyses, the proposed method CompoundGrow speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively while achieving comparable performances. Code will be released for reproduction and future studies.
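To make the compound-growth idea concrete, here is a toy schedule that grows depth, width, and input length together across training stages. The stage fractions are invented for illustration and are not the schedule used in the paper.

```python
# Toy compound (multi-dimension) growth schedule for progressive pre-training.
GROWTH_STAGES = [
    # (depth_fraction, width_fraction, seq_len_fraction) of the full model
    (0.5, 0.5, 0.5),
    (0.75, 0.75, 0.75),
    (1.0, 1.0, 1.0),
]

def configs_for_stages(full_depth=12, full_width=768, full_len=512):
    """Grow depth, width and input length jointly (compound scaling) instead of
    stretching a single dimension."""
    for d, w, l in GROWTH_STAGES:
        yield {
            "num_layers": max(1, round(full_depth * d)),
            "hidden_size": max(64, round(full_width * w)),
            "max_seq_len": max(32, round(full_len * l)),
        }

for cfg in configs_for_stages():
    print(cfg)
```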
4. Multilingual BERT Post-Pretraining Alignment [PDF] Back to Contents
Lin Pan, Chung-Wei Hang, Haode Qi, Abhishek Shah, Mo Yu, Saloni Potdar
Abstract: We propose a simple method to align multilingual contextual embeddings as a post-pretraining step for improved zero-shot cross-lingual transferability of the pretrained models. Using parallel data, our method aligns embeddings on the word level through the recently proposed Translation Language Modeling objective as well as on the sentence level via contrastive learning and random input shuffling. We also perform code-switching with English when finetuning on downstream tasks. On XNLI, our best model (initialized from mBERT) improves over mBERT by 4.7% in the zero-shot setting and achieves comparable result to XLM for translate-train while using less than 18% of the same parallel data and 31% less model parameters. On MLQA, our model outperforms XLM-R_Base that has 57% more parameters than ours.
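A small sketch of the sentence-level contrastive alignment component on parallel data, using in-batch negatives; the pooling, temperature, and loss symmetrization are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(src_cls, tgt_cls, temperature=0.05):
    """src_cls, tgt_cls: [batch, hidden] pooled representations of aligned
    source/target sentences. Each source should score its own translation
    higher than every other target in the batch."""
    src = F.normalize(src_cls, dim=-1)
    tgt = F.normalize(tgt_cls, dim=-1)
    logits = src @ tgt.t() / temperature      # [batch, batch] similarity matrix
    labels = torch.arange(src.size(0))        # the diagonal holds the true pairs
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```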
5. GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method [PDF] Back to Contents
Nicole Peinelt, Marek Rei, Maria Liakata
Abstract: Large pre-trained language models such as BERT have been the driving force behind recent improvements across many NLP tasks. However, BERT is only trained to predict missing words - either behind masks or in the next sentence - and has no knowledge of lexical, syntactic or semantic information beyond what it picks up through unsupervised pre-training. We propose a novel method to explicitly inject linguistic knowledge in the form of word embeddings into any layer of a pre-trained BERT. Our performance improvements on multiple semantic similarity datasets when injecting dependency-based and counter-fitted embeddings indicate that such information is beneficial and currently missing from the original model. Our qualitative analysis shows that counter-fitted embedding injection particularly helps with cases involving synonym pairs.
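An illustrative gated-injection module in the spirit of the abstract: an external word embedding (e.g., dependency-based or counter-fitted) is projected into BERT's hidden space and added through a learned sigmoid gate. The dimensions and the placement within the encoder are assumptions.

```python
import torch
import torch.nn as nn

class GatedInjection(nn.Module):
    def __init__(self, ext_dim=300, hidden_dim=768):
        super().__init__()
        self.proj = nn.Linear(ext_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, hidden_states, ext_embeddings):
        """hidden_states: [batch, seq, hidden] from some BERT layer;
        ext_embeddings: [batch, seq, ext_dim] external vectors aligned to tokens."""
        injected = self.proj(ext_embeddings)
        g = torch.sigmoid(self.gate(torch.cat([hidden_states, injected], dim=-1)))
        return hidden_states + g * injected   # the gate controls how much knowledge flows in
```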
6. Retrieve, Rerank, Read, then Iterate: Answering Open-Domain Questions of Arbitrary Complexity from Text [PDF] Back to Contents
Peng Qi, Haejun Lee, Oghenetegiri "TG" Sido, Christopher D. Manning
Abstract: Current approaches to open-domain question answering often make crucial assumptions that prevent them from generalizing to real-world settings, including the access to parameterized retrieval systems well-tuned for the task, access to structured metadata like knowledge bases and web links, or a priori knowledge of the complexity of questions to be answered (e.g., single-hop or multi-hop). To address these limitations, we propose a unified system to answer open-domain questions of arbitrary complexity directly from text that works with off-the-shelf retrieval systems on arbitrary text collections. We employ a single multi-task model to perform all the necessary subtasks---retrieving supporting facts, reranking them, and predicting the answer from all retrieved documents---in an iterative fashion. To emulate a more realistic setting, we also constructed a new unified benchmark by collecting about 200 multi-hop questions that require three Wikipedia pages to answer, and combining them with existing datasets. We show that our model not only outperforms state-of-the-art systems on several existing benchmarks that exclusively feature single-hop or multi-hop open-domain questions, but also achieves strong performance on the new benchmark.
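A schematic of the iterative retrieve-rerank-read loop sketched in the abstract; the `retriever`, `reranker`, and `reader` objects are assumed interfaces standing in for whatever off-the-shelf components are plugged in.

```python
def answer(question, retriever, reranker, reader, max_hops=4):
    """Iterate retrieval until the reader commits to an answer or hops run out."""
    collected = []
    query = question
    for _ in range(max_hops):
        candidates = retriever.search(query)              # off-the-shelf retrieval
        collected = reranker.rerank(question, collected + candidates)
        prediction = reader.read(question, collected)     # either an answer or a new query
        if prediction.is_answer:
            return prediction.text
        query = prediction.text                           # iterate with the refined query
    return reader.read(question, collected).text
```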
7. Neural Passage Retrieval with Improved Negative Contrast [PDF] Back to Contents
Jing Lu, Gustavo Hernandez Abrego, Ji Ma, Jianmo Ni, Yinfei Yang
Abstract: In this paper we explore the effects of negative sampling in dual encoder models used to retrieve passages for automatic question answering. We explore four negative sampling strategies that complement the straightforward random sampling of negatives, typically used to train dual encoder models. Out of the four strategies, three are based on retrieval and one on heuristics. Our retrieval-based strategies are based on the semantic similarity and the lexical overlap between questions and passages. We train the dual encoder models in two stages: pre-training with synthetic data and fine tuning with domain-specific data. We apply negative sampling to both stages. The approach is evaluated in two passage retrieval tasks. Even though it is not evident that there is one single sampling strategy that works best in all the tasks, it is clear that our strategies contribute to improving the contrast between the response and all the other passages. Furthermore, mixing the negatives from different strategies achieve performance on par with the best performing strategy in all tasks. Our results establish a new state-of-the-art level of performance on two of the open-domain question answering datasets that we evaluated.
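A compact sketch of a dual-encoder training step that mixes one mined hard negative per question into the usual in-batch random negatives. How the hard negatives are mined (semantic similarity, lexical overlap, or heuristics) is left to an external step, and the exact loss used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def dual_encoder_loss(q_enc, p_enc, hard_neg_enc):
    """q_enc: [B, d] question vectors; p_enc: [B, d] gold passage vectors;
    hard_neg_enc: [B, d] one mined hard negative per question."""
    passages = torch.cat([p_enc, hard_neg_enc], dim=0)   # gold + hard negatives: [2B, d]
    scores = q_enc @ passages.t()                        # [B, 2B]; other rows act as in-batch negatives
    labels = torch.arange(q_enc.size(0))                 # index of each question's gold passage
    return F.cross_entropy(scores, labels)
```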
8. Generating Plausible Counterfactual Explanations for Deep Transformers in Financial Text Classification [PDF] Back to Contents
Linyi Yang, Eoin M. Kenny, Tin Lok James Ng, Yi Yang, Barry Smyth, Ruihai Dong
Abstract: Corporate mergers and acquisitions (M&A) account for billions of dollars of investment globally every year, and offer an interesting and challenging domain for artificial intelligence. However, in these highly sensitive domains, it is crucial to not only have a highly robust and accurate model, but be able to generate useful explanations to garner a user's trust in the automated system. Regrettably, the recent research regarding eXplainable AI (XAI) in financial text classification has received little to no attention, and many current methods for generating textual-based explanations result in highly implausible explanations, which damage a user's trust in the system. To address these issues, this paper proposes a novel methodology for producing plausible counterfactual explanations, whilst exploring the regularization benefits of adversarial training on language models in the domain of FinTech. Exhaustive quantitative experiments demonstrate that not only does this approach improve the model accuracy when compared to the current state-of-the-art and human performance, but it also generates counterfactual explanations which are significantly more plausible based on human trials.
9. Improving Robustness by Augmenting Training Sentences with Predicate-Argument Structures [PDF] Back to Contents
Nafise Sadat Moosavi, Marcel de Boer, Prasetya Ajie Utama, Iryna Gurevych
Abstract: Existing NLP datasets contain various biases, and models tend to quickly learn those biases, which in turn limits their robustness. Existing approaches to improve robustness against dataset biases mostly focus on changing the training objective so that models learn less from biased examples. Besides, they mostly focus on addressing a specific bias, and while they improve the performance on adversarial evaluation sets of the targeted bias, they may bias the model in other ways, and therefore, hurt the overall robustness. In this paper, we propose to augment the input sentences in the training data with their corresponding predicate-argument structures, which provide a higher-level abstraction over different realizations of the same meaning and help the model to recognize important parts of sentences. We show that without targeting a specific bias, our sentence augmentation improves the robustness of transformer models against multiple biases. In addition, we show that models can still be vulnerable to the lexical overlap bias, even when the training data does not contain this bias, and that the sentence augmentation also improves the robustness in this scenario. We will release our adversarial datasets to evaluate bias in such a scenario as well as our augmentation scripts at this https URL.
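A sketch of how a training sentence might be augmented with its predicate-argument structure; `srl_parse` stands in for any semantic role labeler, and the linearization format is an assumption for illustration.

```python
def augment_with_pas(sentence, srl_parse):
    """Append a linearized predicate-argument structure so the model can attend
    to 'who did what to whom' in addition to the raw tokens."""
    frames = srl_parse(sentence)   # e.g. [{"predicate": "bought", "ARG0": "Mary", "ARG1": "a book"}]
    parts = []
    for frame in frames:
        args = " ".join(f"[{role}: {span}]" for role, span in frame.items() if role != "predicate")
        parts.append(f"[PRED: {frame['predicate']}] {args}")
    return sentence + " [SEP] " + " ".join(parts)
```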
10. Helping users discover perspectives: Enhancing opinion mining with joint topic models [PDF] Back to Contents
Tim Draws, Jody Liu, Nava Tintarev
Abstract: Support or opposition concerning a debated claim such as "abortion should be legal" can have different underlying reasons, which we call perspectives. This paper explores how opinion mining can be enhanced with joint topic modeling, to identify distinct perspectives within the topic, providing an informative overview from unstructured text. We evaluate four joint topic models (TAM, JST, VODUM, and LAM) in a user study assessing human understandability of the extracted perspectives. Based on the results, we conclude that joint topic models such as TAM can discover perspectives that align with human judgments. Moreover, our results suggest that users are not influenced by their pre-existing stance on the topic of abortion when interpreting the output of topic models.
11. Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [PDF] Back to Contents
Daniel Deutsch, Dan Roth
Abstract: Reference-based metrics such as ROUGE or BERTScore evaluate the content quality of a summary by comparing the summary to a reference. Ideally, this comparison should measure the summary's information quality by calculating how much information the summaries have in common. In this work, we analyze the token alignments used by ROUGE and BERTScore to compare summaries and argue that their scores largely cannot be interpreted as measuring information overlap, but rather the extent to which they discuss the same topics. Further, we provide evidence that this result holds true for many other summarization evaluation metrics. The consequence of this result is that it means the summarization community has not yet found a reliable automatic metric that aligns with its research goal, to generate summaries with high-quality information. Then, we propose a simple and interpretable method of evaluating summaries which does directly measure information overlap and demonstrate how it can be used to gain insights into model behavior that could not be provided by other methods alone.
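A toy example of the phenomenon the paper analyzes: a unigram-overlap score (ROUGE-1-style) rewards two summaries that discuss the same topic even when they convey contradictory information. This is explanatory only, not the paper's analysis code.

```python
from collections import Counter

def unigram_f1(summary, reference):
    """ROUGE-1-style F1 over unigram overlap."""
    s, r = Counter(summary.lower().split()), Counter(reference.lower().split())
    overlap = sum((s & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(s.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

# Same topic, contradictory information -- yet the overlap score is high.
print(unigram_f1("the merger was approved by regulators",
                 "the merger was rejected by regulators"))
```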
12. Intrinsic Quality Assessment of Arguments [PDF] Back to Contents
Henning Wachsmuth, Till Werner
Abstract: Several quality dimensions of natural language arguments have been investigated. Some are likely to be reflected in linguistic features (e.g., an argument's arrangement), whereas others depend on context (e.g., relevance) or topic knowledge (e.g., acceptability). In this paper, we study the intrinsic computational assessment of 15 dimensions, i.e., only learning from an argument's text. In systematic experiments with eight feature types on an existing corpus, we observe moderate but significant learning success for most dimensions. Rhetorical quality seems hardest to assess, and subjectivity features turn out strong, although length bias in the corpus impedes full validity. We also find that human assessors differ more clearly to each other than to our approach.
13. HateBERT: Retraining BERT for Abusive Language Detection in English [PDF] Back to Contents
Tommaso Caselli, Valerio Basile, Jelena Mitrović, Michael Granitzer
Abstract: In this paper, we introduce HateBERT, a re-trained BERT model for abusive language detection in English. The model was trained on RAL-E, a large-scale dataset of Reddit comments in English from communities banned for being offensive, abusive, or hateful that we have collected and made available to the public. We present the results of a detailed comparison between a general pre-trained language model and the abuse-inclined version obtained by retraining with posts from the banned communities on three English datasets for offensive, abusive language and hate speech detection tasks. In all datasets, HateBERT outperforms the corresponding general BERT model. We also discuss a battery of experiments comparing the portability of the general pre-trained language model and its corresponding abusive language-inclined counterpart across the datasets, indicating that portability is affected by compatibility of the annotated phenomena.
14. Natural Language Processing Chains Inside a Cross-lingual Event-Centric Knowledge Pipeline for European Union Under-resourced Languages [PDF] Back to Contents
Diego Alves, Gaurish Thakkar, Marko Tadić
Abstract: This article presents the strategy for developing a platform containing Language Processing Chains for European Union languages, consisting of Tokenization to Parsing, also including Named Entity recognition and with addition of Sentiment Analysis. These chains are part of the first step of an event-centric knowledge processing pipeline whose aim is to process multilingual media information about major events that can cause an impact in Europe and the rest of the world. Due to the differences in terms of availability of language resources for each language, we have built this strategy in three steps, starting with processing chains for the well-resourced languages and finishing with the development of new modules for the under-resourced ones. In order to classify all European Union official languages in terms of resources, we have analysed the size of annotated corpora as well as the existence of pre-trained models in mainstream Language Processing tools, and we have combined this information with the proposed classification published at META-NET whitepaper series.
15. Evaluating Language Tools for Fifteen EU-official Under-resourced Languages [PDF] Back to Contents
Diego Alves, Gaurish Thakkar, Marko Tadić
Abstract: This article presents the results of the evaluation campaign of language tools available for fifteen EU-official under-resourced languages. The evaluation was conducted within the MSC ITN CLEOPATRA action that aims at building the cross-lingual event-centric knowledge processing on top of the application of linguistic processing chains (LPCs) for at least 24 EU-official languages. In this campaign, we concentrated on three existing NLP platforms (Stanford CoreNLP, NLP Cube, UDPipe) that all provide models for under-resourced languages and in this first run we covered 15 under-resourced languages for which the models were available. We present the design of the evaluation campaign and present the results as well as discuss them. We considered the difference between reported and our tested results within a single percentage point as being within the limits of acceptable tolerance and thus consider this result as reproducible. However, for a number of languages, the results are below what was reported in the literature, and in some cases, our testing results are even better than the ones reported previously. Particularly problematic was the evaluation of NERC systems. One of the reasons is the absence of a universally or cross-lingually applicable named entities classification scheme that would serve the NERC task in different languages analogous to the Universal Dependency scheme in parsing task. To build such a scheme has become one of our future research directions.
16. TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification [PDF] Back to Contents
Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, Luis Espinosa-Anke
Abstract: The experimental landscape in natural language processing for social media is too fragmented. Each year, new shared tasks and datasets are proposed, ranging from classics like sentiment analysis to irony detection or emoji prediction. Therefore, it is unclear what the current state of the art is, as there is no standardized evaluation protocol, nor a strong set of baselines trained on such domain-specific data. In this paper, we propose a new evaluation framework (TweetEval) consisting of seven heterogeneous Twitter-specific classification tasks. We also provide a strong set of baselines as starting point, and compare different language modeling pre-training strategies. Our initial experiments show the effectiveness of starting off with existing pre-trained generic language models, and continue training them on Twitter corpora.
17. Deep Learning Framework for Measuring the Digital Strategy of Companies from Earnings Calls [PDF] Back to Contents
Ahmed Ghanim Al-Ali, Robert Phaal, Donald Sull
Abstract: Companies today are racing to leverage the latest digital technologies, such as artificial intelligence, blockchain, and cloud computing. However, many companies report that their strategies did not achieve the anticipated business results. This study is the first to apply state-of-the-art NLP models on unstructured data to understand the different clusters of digital strategy patterns that companies are adopting. We achieve this by analyzing earnings calls from Fortune Global 500 companies between 2015 and 2019. We use a Transformer-based architecture for text classification, which shows a better understanding of the conversation context. We then investigate digital strategy patterns by applying clustering analysis. Our findings suggest that Fortune 500 companies use four distinct strategies which are product led, customer experience led, service led, and efficiency led. This work provides an empirical baseline for companies and researchers to enhance our understanding of the field.
18. SmBoP: Semi-autoregressive Bottom-up Semantic Parsing [PDF] Back to Contents
Ohad Rubin, Jonathan Berant
Abstract: The de-facto standard decoding method for semantic parsing in recent years has been to autoregressively decode the abstract syntax tree of the target program using a top-down depth-first traversal. In this work, we propose an alternative approach: a Semi-autoregressive Bottom-up Parser (SmBoP) that constructs at decoding step $t$ the top-$K$ sub-trees of height $\leq t$. Our parser enjoys several benefits compared to top-down autoregressive parsing. First, since sub-trees in each decoding step are generated in parallel, the theoretical runtime is logarithmic rather than linear. Second, our bottom-up approach learns representations with meaningful semantic sub-programs at each step, rather than semantically vague partial trees. Last, SmBoP includes Transformer-based layers that contextualize sub-trees with one another, allowing us, unlike traditional beam-search, to score trees conditioned on other trees that have been previously explored. We apply SmBoP on Spider, a challenging zero-shot semantic parsing benchmark, and show that SmBoP is competitive with top-down autoregressive parsing. On the test set, SmBoP obtains an EM score of $60.5\%$, similar to the best published score for a model that does not use database content, which is at $60.6\%$.
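A high-level sketch of semi-autoregressive bottom-up decoding: at each step the decoder keeps the top-K subtrees of height at most t, built in parallel by combining trees from the previous beam. The scoring function and the grammar of operations are simplified placeholders, not the paper's architecture.

```python
def smbop_decode(leaves, score_fn, apply_op, operations, K=30, max_height=9):
    """leaves: scored height-0 trees (e.g. schema items / DB values);
    apply_op(op, left, right) builds a new tree (unary ops may ignore `right`)."""
    beam = sorted(leaves, key=score_fn, reverse=True)[:K]        # top-K trees of height 0
    for _ in range(max_height):
        candidates = list(beam)                                  # smaller trees stay eligible
        for op in operations:                                    # grammar operations
            for left in beam:
                for right in beam:
                    candidates.append(apply_op(op, left, right))
        beam = sorted(candidates, key=score_fn, reverse=True)[:K]  # top-K subtrees of height <= t+1
    return max(beam, key=score_fn)                               # highest-scoring program tree
```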
19. UNER: Universal Named-Entity Recognition Framework [PDF] Back to Contents
Diego Alves, Tin Kuculo, Gabriel Amaral, Gaurish Thakkar, Marko Tadic
Abstract: We introduce the Universal Named-Entity Recognition (UNER) framework, a 4-level classification hierarchy, and the methodology that is being adopted to create the first multilingual UNER corpus: the SETimes parallel corpus annotated for named-entities. First, the English SETimes corpus will be annotated using existing tools and knowledge bases. After evaluating the resulting annotations through crowdsourcing campaigns, they will be propagated automatically to other languages within the SETimes corpora. Finally, as an extrinsic evaluation, the UNER multilingual dataset will be used to train and test available NER tools. As part of future research directions, we aim to increase the number of languages in the UNER corpus and to investigate possible ways of integrating UNER with available knowledge graphs to improve named-entity recognition.
20. Unsupervised Cross-lingual Adaptation for Sequence Tagging and Beyond [PDF] 返回目录
Xin Li, Lidong Bing, Wenxuan Zhang, Zheng Li, Wai Lam
Abstract: Cross-lingual adaptation with multilingual pre-trained language models (mPTLMs) mainly consists of two lines of works: zero-shot approach and translation-based approach, which have been studied extensively on the sequence-level tasks. We further verify the efficacy of these cross-lingual adaptation approaches by evaluating their performances on more fine-grained sequence tagging tasks. After re-examining their strengths and drawbacks, we propose a novel framework to consolidate the zero-shot approach and the translation-based approach for better adaptation performance. Instead of simply augmenting the source data with the machine-translated data, we tailor-make a warm-up mechanism to quickly update the mPTLMs with the gradients estimated on a few translated data. Then, the adaptation approach is applied to the refined parameters and the cross-lingual transfer is performed in a warm-start way. The experimental results on nine target languages demonstrate that our method is beneficial to the cross-lingual adaptation of various sequence tagging tasks.
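A minimal sketch of the two-phase recipe described above, with a toy tagger and random tensors standing in for a real multilingual pre-trained model and real machine-translated data: a few warm-up gradient steps on translated target-language batches, followed by the usual adaptation starting from the refined (warm-started) parameters. This is an illustration of the idea, not the authors' exact algorithm.

```python
import torch
from torch import nn

class ToyTagger(nn.Module):
    """Stand-in for a multilingual pre-trained tagger (mPTLM + head)."""
    def __init__(self, vocab=1000, dim=64, n_tags=9):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, n_tags)

    def forward(self, x):                      # x: (batch, seq_len) token ids
        return self.out(self.emb(x))           # (batch, seq_len, n_tags)

def step(model, opt, loss_fn, x, y):
    opt.zero_grad()
    loss = loss_fn(model(x).flatten(0, 1), y.flatten())
    loss.backward()
    opt.step()
    return loss.item()

model = ToyTagger()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Phase 1: warm-up on a few machine-translated target-language batches.
translated = [(torch.randint(0, 1000, (8, 16)), torch.randint(0, 9, (8, 16)))
              for _ in range(3)]
for x, y in translated:
    step(model, opt, loss_fn, x, y)

# Phase 2: warm-start adaptation on the (larger) source-language data.
source = [(torch.randint(0, 1000, (8, 16)), torch.randint(0, 9, (8, 16)))
          for _ in range(20)]
for x, y in source:
    step(model, opt, loss_fn, x, y)
```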
21. Pretraining and Fine-Tuning Strategies for Sentiment Analysis of Latvian Tweets [PDF] 返回目录
Gaurish Thakkar, Marcis Pinnis
Abstract: In this paper, we present various pre-training strategies that aid in improving the accuracy of the sentiment classification task. We, at first, pre-train language representation models using these strategies and then fine-tune them on the downstream task. Experimental results on a time-balanced tweet evaluation set show the improvement over the previous technique. We achieve 76% accuracy for sentiment analysis on Latvian tweets, which is a substantial improvement over previous work.
22. NLNDE at CANTEMIST: Neural Sequence Labeling and Parsing Approaches for Clinical Concept Extraction [PDF] 返回目录
Lukas Lange, Xiang Dai, Heike Adel, Jannik Strötgen
Abstract: The recognition and normalization of clinical information, such as tumor morphology mentions, is an important, but complex process consisting of multiple subtasks. In this paper, we describe our system for the CANTEMIST shared task, which is able to extract, normalize and rank ICD codes from Spanish electronic health records using neural sequence labeling and parsing approaches with context-aware embeddings. Our best system achieves 85.3 F1, 76.7 F1, and 77.0 MAP for the three tasks, respectively.
23. BARThez: a Skilled Pretrained French Sequence-to-Sequence Model [PDF] 返回目录
Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis
Abstract: Inductive transfer learning, enabled by self-supervised learning, has taken the entire Natural Language Processing (NLP) field by storm, with models such as BERT and BART setting new state of the art on countless natural language understanding tasks. While there are some notable exceptions, most of the available models and research have been conducted for the English language. In this work, we introduce BARThez, the first BART model for the French language (to the best of our knowledge). BARThez was pretrained on a very large monolingual French corpus from past research that we adapted to suit BART's perturbation schemes. Unlike already existing BERT-based French language models such as CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also its decoder is pretrained. In addition to discriminative tasks from the FLUE benchmark, we evaluate BARThez on a novel summarization dataset, OrangeSum, that we release with this paper. We also continue the pretraining of an already pretrained multilingual BART on BARThez's corpus, and we show that the resulting model, which we call mBARTHez, provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT.
24. A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios [PDF] 返回目录
Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, Dietrich Klakow
Abstract: Current developments in natural language processing offer challenges and opportunities for low-resource languages and domains. Deep neural networks are known for requiring large amounts of training data which might not be available in resource-lean scenarios. However, there is also a growing body of work to improve the performance in low-resource settings. Motivated by fundamental changes towards neural models and the currently popular pre-train and fine-tune paradigm, we give an overview of promising approaches for low-resource natural language processing. After a discussion about the definition of low-resource scenarios and the different dimensions of data availability, we then examine methods that enable learning when training data is sparse. This includes mechanisms to create additional labeled data like data augmentation and distant supervision as well as transfer learning settings that reduce the need for target supervision. The survey closes with a brief look into methods suggested in non-NLP machine learning communities, which might be beneficial for NLP in low-resource scenarios.
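As a concrete instance of one of the mechanisms the survey covers, the sketch below manufactures extra labelled examples by token-level perturbation (random deletion and local swaps). It is purely illustrative and not tied to any specific method discussed in the survey; real augmenters are usually label-aware (for example, they avoid perturbing entity spans for NER).

```python
import random

def augment(tokens, p_drop=0.1, p_swap=0.1, seed=None):
    """Toy token-level data augmentation for low-resource settings."""
    rng = random.Random(seed)
    out = [t for t in tokens if rng.random() > p_drop]      # random deletion
    for i in range(len(out) - 1):                           # local swaps
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out or tokens                                    # never return an empty example

sentence = "deep models need large amounts of training data".split()
for k in range(3):
    print(augment(sentence, seed=k))
```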
25. Adversarial Learning of Feature-based Meta-Embeddings [PDF] 返回目录
Lukas Lange, Heike Adel, Jannik Strötgen, Dietrich Klakow
Abstract: Certain embedding types outperform others in different scenarios, e.g., subword-based embeddings can model rare words well and domain-specific embeddings can better represent in-domain terms. Therefore, recent works consider attention-based meta-embeddings to combine different embedding types. We demonstrate that these methods have two shortcomings: First, the attention weights are calculated without knowledge of word properties. Second, the different embedding types can form clusters in the common embedding space, preventing the computation of a meaningful average of different embeddings and thus, reducing performance. We propose to solve these problems by using feature-based meta-embeddings learned with adversarial training. Our experiments and analysis on sentence classification and sequence tagging tasks show that our approach is effective. We set the new state of the art on various datasets across languages and domains.
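For readers unfamiliar with meta-embeddings, the sketch below shows an attention-based combiner in which several embedding types are projected into a common space and mixed with attention weights conditioned on extra word-level features, which is the general setting the paper starts from. The layer sizes and the feature vector are made up for illustration, and the adversarial training the paper adds on top is not reproduced here.

```python
import torch
from torch import nn

class MetaEmbedding(nn.Module):
    """Attention-based combination of several embedding types (illustrative)."""
    def __init__(self, input_dims, common_dim=128, feature_dim=8):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, common_dim) for d in input_dims)
        self.attn = nn.Linear(common_dim + feature_dim, 1)

    def forward(self, embeddings, word_features):
        # embeddings: list of (batch, seq, d_i); word_features: (batch, seq, feature_dim)
        projected = torch.stack([p(e) for p, e in zip(self.proj, embeddings)], dim=2)
        feats = word_features.unsqueeze(2).expand(-1, -1, projected.size(2), -1)
        scores = self.attn(torch.cat([projected, feats], dim=-1))   # (b, s, n_types, 1)
        weights = torch.softmax(scores, dim=2)
        return (weights * projected).sum(dim=2)                     # (b, s, common_dim)

layer = MetaEmbedding(input_dims=[300, 768])
e1, e2 = torch.randn(2, 5, 300), torch.randn(2, 5, 768)
feats = torch.randn(2, 5, 8)
print(layer([e1, e2], feats).shape)   # torch.Size([2, 5, 128])
```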
26. ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding [PDF] 返回目录
Minjeong Kim, Gyuwan Kim, Sang-Woo Lee, Jung-Woo Ha
Abstract: Language model pre-training has shown promising results in various downstream tasks. In this context, we introduce a cross-modal pre-trained language model, called Speech-Text BERT (ST-BERT), to tackle end-to-end spoken language understanding (E2E SLU) tasks. Taking phoneme posterior and subword-level text as an input, ST-BERT learns a contextualized cross-modal alignment via our two proposed pre-training tasks: Cross-modal Masked Language Modeling (CM-MLM) and Cross-modal Conditioned Language Modeling (CM-CLM). Experimental results on three benchmarks present that our approach is effective for various SLU datasets and shows a surprisingly marginal performance degradation even when 1% of the training data are available. Also, our method shows further SLU performance gain via domain-adaptive pre-training with domain-specific speech-text pair data.
27. Pre-trained Model for Chinese Word Segmentation with Meta Learning [PDF] 返回目录
Zhen Ke, Liang Shi, Erli Meng, Bin Wang, Xipeng Qiu
Abstract: Recent research shows that pre-trained models such as BERT (Devlin et al., 2019) are beneficial for Chinese Word Segmentation tasks. However, existing approaches usually finetune pre-trained models directly on a separate downstream Chinese Word Segmentation corpus. These recent methods don't fully utilize the prior knowledge of existing segmentation corpora, and don't account for the discrepancy between the pre-training tasks and the downstream Chinese Word Segmentation tasks. In this work, we propose a Pre-Trained Model for Chinese Word Segmentation, which can be abbreviated as PTM-CWS. The PTM-CWS model employs a unified architecture for different segmentation criteria, and is pre-trained on a joint multi-criteria corpus with a meta learning algorithm. Empirical results show that our PTM-CWS model can utilize the existing prior segmentation knowledge, reduce the discrepancy between the pre-training tasks and the downstream Chinese Word Segmentation tasks, and achieve new state-of-the-art performance on twelve Chinese Word Segmentation corpora.
28. A Scalable Framework for Learning From Implicit User Feedback to Improve Natural Language Understanding in Large-Scale Conversational AI Systems [PDF] 返回目录
Sunghyun Park, Han Li, Ameen Patel, Sidharth Mudgal, Sungjin Lee, Young-Bum Kim, Spyros Matsoukas, Ruhi Sarikaya
Abstract: Natural Language Understanding (NLU) is an established component within a conversational AI or digital assistant system, and it is responsible for producing semantic understanding of a user request. We propose a scalable and automatic approach for improving NLU in a large-scale conversational AI system by leveraging implicit user feedback, with an insight that user interaction data and dialog context have rich information embedded from which user satisfaction and intention can be inferred. In particular, we propose a general domain-agnostic framework for curating new supervision data for improving NLU from live production traffic. With an extensive set of experiments, we show the results of applying the framework and improving NLU for a large-scale production system and show its impact across 10 domains.
29. Proof-theoretic aspects of NL$λ$ [PDF] 返回目录
Richard Moot
Abstract: We present a proof-theoretic analysis of the logic NL$\lambda$ (Barker & Shan 2014, Barker 2019). We notably introduce a novel calculus of proof nets and prove it is sound and complete with respect to the sequent calculus for the logic. We study decidability and complexity of the logic using this new calculus, proving a new upper bound for the complexity of the logic (showing it is in NP) and a new lower bound for the class of formal languages generated by the formalism (mildly context-sensitive languages extended with a permutation closure operation). Finally, thanks to this new calculus, we present a novel comparison between NL$\lambda$ and the hybrid type-logical grammars of Kubota & Levine (2020). We show there is an unexpected convergence of the natural language analyses proposed in the two formalisms. In addition to studying the proof-theoretic properties of NL$\lambda$, we greatly extend its linguistic coverage.
30. Domain Divergences: a Survey and Empirical Analysis [PDF] 返回目录
Abhinav Ramesh Kashyap, Devamanyu Hazarika, Min-Yen Kan, Roger Zimmermann
Abstract: Domain divergence plays a significant role in estimating the performance of a model when applied to new domains. While there is significant literature on divergence measures, choosing an appropriate divergence measure remains difficult for researchers. We address this shortcoming by both surveying the literature and through an empirical study. We contribute a taxonomy of divergence measures consisting of three groups -- Information-theoretic, Geometric, and Higher-order measures -- and identify the relationships between them. We then ground the use of divergence measures in three different application groups -- 1) Data Selection, 2) Learning Representation, and 3) Decisions in the Wild. From this, we identify that Information-theoretic measures are prevalent for 1) and 3), and higher-order measures are common for 2). To further help researchers, we validate these uses empirically through a correlation analysis of performance drops. We consider the current contextual word representations (CWR) to contrast with the older word distribution based representations for this analysis. We find that traditional measures over word distributions still serve as strong baselines, while higher-order measures with CWR are effective.
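As one concrete member of the information-theoretic group in such a taxonomy, the snippet below computes the Jensen-Shannon divergence between the unigram word distributions of two toy domains. The choice of measure, the add-one smoothing, and the toy sentences are illustrative and not prescribed by the survey.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab):
    """Add-one-smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def kl(p, q):
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)

def js_divergence(tokens_a, tokens_b):
    vocab = set(tokens_a) | set(tokens_b)
    p, q = unigram_dist(tokens_a, vocab), unigram_dist(tokens_b, vocab)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

news = "the market fell sharply after the earnings report".split()
bio = "the protein binds the receptor and inhibits the pathway".split()
print(round(js_divergence(news, bio), 3))
```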
31. Identifying Similar Movie Characters Quickly but Effectively Using Non-exhaustive Pair-wise Attention [PDF] 返回目录
Zhilin Wang, Weizhe Lin, Xiaodong Wu
Abstract: Identifying similar movie characters is a captivating task that can be our first step to understand the commonalities between human characteristics and experiences. Here, we seek to identify similar movie character descriptions and evaluate our findings based on whether they belong to a common fan-curated trope (theme). Rather than simply comparing the embedding representation of character description, we use a pair-wise attention model to make use of complex word/span-level relationships across the two character descriptions to predict the similarity of the two characters. Naively, such a model would require the exhaustive comparison of each character to all other characters, which is an O(n^2) operation with respect to the number of characters, making it unfeasible to be used in practice. We reduced this into an O(n) operation using a two-step approach that involves choosing only a tiny fraction of character-pairs to perform pairwise attention on while still being effective in this task. Our approach performs at least 9-27% better than methods based on state-of-the-art paragraph embedding representations.
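A rough sketch of the two-step idea, with random unit vectors standing in both for a cheap encoder and for the expensive pair-wise attention model: shortlist a small, fixed number of candidates per character with a cheap similarity, then run the expensive scorer only on those pairs, so the costly comparisons grow linearly in the number of characters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 64, 5                     # characters, embedding dim, candidates per character
cheap_vecs = rng.normal(size=(n, d))
cheap_vecs /= np.linalg.norm(cheap_vecs, axis=1, keepdims=True)

def expensive_pairwise_score(i, j):
    # Stand-in for the pair-wise attention model over two character descriptions.
    return float(cheap_vecs[i] @ cheap_vecs[j])

# Step 1: shortlist k candidates per character. A brute-force similarity matrix
# is used here for simplicity; in practice an approximate nearest-neighbour
# index keeps this step cheap as well.
sims = cheap_vecs @ cheap_vecs.T
np.fill_diagonal(sims, -np.inf)
candidates = np.argpartition(-sims, k, axis=1)[:, :k]

# Step 2: run the expensive model only on the shortlisted pairs (O(n * k)).
scored = {(i, int(j)): expensive_pairwise_score(i, int(j))
          for i in range(n) for j in candidates[i]}
print(f"{len(scored)} pairs scored instead of {n * (n - 1) // 2}")
```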
32. KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi [PDF] 返回目录
Rubungo Andre Niyongabo, Hong Qu, Julia Kreutzer, Li Huang
Abstract: Recent progress in text classification has been focused on high-resource languages such as English and Chinese. For low-resource languages, amongst them most African languages, the lack of well-annotated data and effective preprocessing, is hindering the progress and the transfer of successful methods. In this paper, we introduce two news datasets (KINNEWS and KIRNEWS) for multi-class classification of news articles in Kinyarwanda and Kirundi, two low-resource African languages. The two languages are mutually intelligible, but while Kinyarwanda has been studied in Natural Language Processing (NLP) to some extent, this work constitutes the first study on Kirundi. Along with the datasets, we provide statistics, guidelines for preprocessing, and monolingual and cross-lingual baseline models. Our experiments show that training embeddings on the relatively higher-resourced Kinyarwanda yields successful cross-lingual transfer to Kirundi. In addition, the design of the created datasets allows for a wider use in NLP beyond text classification in future studies, such as representation learning, cross-lingual learning with more distant languages, or as base for new annotations for tasks such as parsing, POS tagging, and NER. The datasets, stopwords, and pre-trained embeddings are publicly available at this https URL .
33. Attention Transfer Network for Aspect-level Sentiment Classification [PDF] 返回目录
Fei Zhao, Zhen Wu, Xinyu Dai
Abstract: Aspect-level sentiment classification (ASC) aims to detect the sentiment polarity of a given opinion target in a sentence. In neural network-based methods for ASC, most works employ the attention mechanism to capture the corresponding sentiment words of the opinion target, then aggregate them as evidence to infer the sentiment of the target. However, aspect-level datasets are all relatively small-scale due to the complexity of annotation. Data scarcity causes the attention mechanism sometimes to fail to focus on the corresponding sentiment words of the target, which finally weakens the performance of neural models. To address the issue, we propose a novel Attention Transfer Network (ATN) in this paper, which can successfully exploit attention knowledge from resource-rich document-level sentiment classification datasets to improve the attention capability of the aspect-level sentiment classification task. In the ATN model, we design two different methods to transfer attention knowledge and conduct experiments on two ASC benchmark datasets. Extensive experimental results show that our methods consistently outperform state-of-the-art works. Further analysis also validates the effectiveness of ATN.
34. ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding [PDF] 返回目录
Dongling Xiao, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
Abstract: Coarse-grained linguistic information, such as named entities or phrases, facilitates adequate representation learning in pre-training. Previous works mainly focus on extending the objective of BERT's Masked Language Modeling (MLM) from masking individual tokens to contiguous sequences of n tokens. We argue that such a continuous masking method neglects to model the inner-dependencies and inter-relation of coarse-grained information. As an alternative, we propose ERNIE-Gram, an explicitly n-gram masking method to enhance the integration of coarse-grained information for pre-training. In ERNIE-Gram, n-grams are masked and predicted directly using explicit n-gram identities rather than contiguous sequences of tokens. Furthermore, ERNIE-Gram employs a generator model to sample plausible n-gram identities as optional n-gram masks and predict them in both coarse-grained and fine-grained manners to enable comprehensive n-gram prediction and relation modeling. We pre-train ERNIE-Gram on English and Chinese text corpora and fine-tune on 19 downstream tasks. Experimental results show that ERNIE-Gram outperforms previous pre-training models like XLNet and RoBERTa by a large margin, and achieves comparable results with state-of-the-art methods.
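The sketch below shows a bare-bones version of explicit n-gram masking: a contiguous span is replaced by a single mask symbol and the whole n-gram becomes one prediction target, rather than being predicted token by token. The span-sampling rule is a simple assumption for illustration, and the generator that samples plausible n-gram identities in ERNIE-Gram is omitted.

```python
import random

def ngram_mask(tokens, mask_rate=0.15, max_n=3, seed=0):
    """Replace random contiguous spans with one [MASK]; return (masked, targets)."""
    rng = random.Random(seed)
    out, targets, i = [], [], 0
    budget = max(1, int(mask_rate * len(tokens)))
    while i < len(tokens):
        if budget > 0 and rng.random() < mask_rate:
            n = rng.randint(1, min(max_n, len(tokens) - i, budget))
            targets.append((len(out), tuple(tokens[i:i + n])))  # (position, n-gram identity)
            out.append("[MASK]")
            budget -= n
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out, targets

tokens = "new york is the most populous city in the united states".split()
masked, targets = ngram_mask(tokens)
print(masked)
print(targets)
```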
35. Summarizing Utterances from Japanese Assembly Minutes using Political Sentence-BERT-based Method for QA Lab-PoliInfo-2 Task of NTCIR-15 [PDF] 返回目录
Daiki Shirafuji, Hiromichi Kameya, Rafal Rzepka, Kenji Araki
Abstract: There are many discussions held during political meetings, and a large number of utterances for various topics is included in their transcripts. We need to read all of them if we want to follow speakers' intentions or opinions about a given topic. To avoid such a costly and time-consuming process to grasp often longish discussions, NLP researchers work on generating concise summaries of utterances. Summarization subtask in QA Lab-PoliInfo-2 task of the NTCIR-15 addresses this problem for Japanese utterances in assembly minutes, and our team (SKRA) participated in this subtask. As a first step for summarizing utterances, we created a new pre-trained sentence embedding model, i.e. the Japanese Political Sentence-BERT. With this model, we summarize utterances without labelled data. This paper describes our approach to solving the task and discusses its results.
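One common label-free recipe once a sentence encoder is available is to embed every utterance sentence and keep the sentences closest to the embedding centroid. The sketch below uses random vectors as a stand-in for the Japanese Political Sentence-BERT encoder, and the centroid-based selection rule is an illustrative assumption, not necessarily the authors' exact procedure.

```python
import numpy as np

def encode(sentences, dim=384, seed=0):
    """Placeholder for a real sentence encoder: returns unit-norm random vectors."""
    rng = np.random.default_rng(seed)
    vecs = rng.normal(size=(len(sentences), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def centroid_summary(sentences, n_keep=2):
    vecs = encode(sentences)
    centroid = vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    scores = vecs @ centroid
    keep = sorted(np.argsort(-scores)[:n_keep])   # preserve original utterance order
    return [sentences[i] for i in keep]

utterance = ["The budget allocates funds for disaster recovery.",
             "I would like to thank the committee members.",
             "Reconstruction of the coastal roads will start next spring.",
             "Please refer to the appendix for details."]
print(centroid_summary(utterance))
```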
36. Multilingual Synthetic Question and Answer Generation for Cross-Lingual Reading Comprehension [PDF] 返回目录
Siamak Shakeri, Noah Constant, Mihir Sanjay Kale, Linting Xue
Abstract: We propose a simple method to generate large amounts of multilingual question and answer pairs by a single generative model. These synthetic samples are then applied to augment the available gold multilingual ones to improve the performance of multilingual QA models on target languages. Our approach only requires existence of automatically translated samples from English to the target domain, thus removing the need for human annotations in the target languages. Experimental results show our proposed approach achieves significant gains in a number of multilingual datasets.
37. Meta-Learning for Domain Generalization in Semantic Parsing [PDF] 返回目录
Bailin Wang, Mirella Lapata, Ivan Titov
Abstract: The importance of building semantic parsers which can be applied to new domains and generate programs unseen at training has long been acknowledged, and datasets testing out-of-domain performance are becoming increasingly available. However, little or no attention has been devoted to studying learning algorithms or objectives which promote domain generalization, with virtually all existing approaches relying on standard supervised learning. In this work, we use a meta-learning framework which targets specifically zero-shot domain generalization for semantic parsing. We apply a model-agnostic training algorithm that simulates zero-shot parsing by constructing virtual train and test sets from disjoint domains. The learning objective capitalizes on the intuition that gradient steps that improve source-domain performance should also improve target-domain performance, thus encouraging a parser to generalize well to unseen target domains. Experimental results on the (English) Spider and Chinese Spider datasets show that the meta-learning objective significantly boosts the performance of a baseline parser.
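A simplified, first-order sketch of such a meta-learning objective with a toy linear model and random data: take an inner gradient step on a virtual-train domain, then also penalize the loss of the updated parameters on a disjoint virtual-test domain. The model, the data, and the first-order approximation are assumptions for illustration; the paper's actual algorithm and parser are not reproduced here.

```python
import copy
import torch
from torch import nn

model = nn.Linear(16, 4)                       # stand-in for a semantic parser
meta_opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
inner_lr = 0.1

def domain_batch(seed):
    g = torch.Generator().manual_seed(seed)
    return torch.randn(32, 16, generator=g), torch.randint(0, 4, (32,), generator=g)

for step in range(50):
    x_tr, y_tr = domain_batch(step)            # virtual-train domain
    x_te, y_te = domain_batch(10_000 + step)   # disjoint virtual-test domain

    # Inner step: update a clone on the virtual-train loss.
    fast = copy.deepcopy(model)
    grads = torch.autograd.grad(loss_fn(fast(x_tr), y_tr), fast.parameters())
    with torch.no_grad():
        for p, g in zip(fast.parameters(), grads):
            p -= inner_lr * g

    # Outer step: virtual-train loss on the original parameters plus the
    # virtual-test gradient of the updated clone (first-order approximation).
    meta_opt.zero_grad()
    loss_fn(model(x_tr), y_tr).backward()
    test_grads = torch.autograd.grad(loss_fn(fast(x_te), y_te), fast.parameters())
    with torch.no_grad():
        for p, g in zip(model.parameters(), test_grads):
            p.grad.add_(g)
    meta_opt.step()
```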
38. MTGAT: Multimodal Temporal Graph Attention Networks for Unaligned Human Multimodal Language Sequences [PDF] 返回目录
Jianing Yang, Yongxin Wang, Ruitao Yi, Yuying Zhu, Azaan Rehman, Amir Zadeh, Soujanya Poria, Louis-Philippe Morency
Abstract: Human communication is multimodal in nature; it is through multiple modalities, i.e., language, voice, and facial expressions, that opinions and emotions are expressed. Data in this domain exhibits complex multi-relational and temporal interactions. Learning from this data is a fundamentally challenging research problem. In this paper, we propose Multimodal Temporal Graph Attention Networks (MTGAT). MTGAT is an interpretable graph-based neural model that provides a suitable framework for analyzing this type of multimodal sequential data. We first introduce a procedure to convert unaligned multimodal sequence data into a graph with heterogeneous nodes and edges that captures the rich interactions between different modalities through time. Then, a novel graph operation, called Multimodal Temporal Graph Attention, along with a dynamic pruning and read-out technique is designed to efficiently process this multimodal temporal graph. By learning to focus only on the important interactions within the graph, our MTGAT is able to achieve state-of-the-art performance on multimodal sentiment analysis and emotion recognition benchmarks including IEMOCAP and CMU-MOSI, while utilizing significantly fewer computations.
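To illustrate the graph-construction step, the sketch below turns unaligned multimodal sequences into a heterogeneous graph in which every (modality, timestamp) observation is a node and edges carry a (source modality, target modality, temporal direction) type. Node features, the attention layers, and the dynamic pruning are omitted, and the fixed time-window rule is an assumption for illustration only.

```python
import itertools

def build_graph(sequences, window=0.5):
    """sequences: {modality_name: [timestamps]}, possibly of different lengths."""
    nodes = [(mod, t) for mod, stamps in sequences.items() for t in stamps]
    edges = []
    for (m1, t1), (m2, t2) in itertools.permutations(nodes, 2):
        if abs(t1 - t2) <= window:
            direction = "past" if t2 < t1 else "future" if t2 > t1 else "present"
            edges.append(((m1, t1), (m2, t2), f"{m1}->{m2}:{direction}"))
    return nodes, edges

seqs = {"language": [0.0, 0.7, 1.4],               # word timestamps
        "audio":    [0.0, 0.3, 0.6, 0.9, 1.2],
        "vision":   [0.1, 0.8]}
nodes, edges = build_graph(seqs)
print(len(nodes), "nodes,", len(edges), "edges")
print(edges[0])
```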
39. The Turking Test: Can Language Models Understand Instructions? [PDF] 返回目录
Avia Efrat, Omer Levy
Abstract: Supervised machine learning provides the learner with a set of input-output examples of the target task. Humans, however, can also learn to perform new tasks from instructions in natural language. Can machines learn to understand instructions as well? We present the Turking Test, which examines a model's ability to follow natural language instructions of varying complexity. These range from simple tasks, like retrieving the nth word of a sentence, to ones that require creativity, such as generating examples for SNLI and SQuAD in place of human intelligence workers ("turkers"). Despite our lenient evaluation methodology, we observe that a large pretrained language model performs poorly across all tasks. Analyzing the model's error patterns reveals that the model tends to ignore explicit instructions and often generates outputs that cannot be construed as an attempt to solve the task. While it is not yet clear whether instruction understanding can be captured by traditional language models, the sheer expressivity of instruction understanding makes it an appealing alternative to the rising few-shot inference paradigm.
40. A Joint Learning Approach based on Self-Distillation for Keyphrase Extraction from Scientific Documents [PDF] 返回目录
Tuan Manh Lai, Trung Bui, Doo Soon Kim, Quan Hung Tran
Abstract: Keyphrase extraction is the task of extracting a small set of phrases that best describe a document. Most existing benchmark datasets for the task typically have limited numbers of annotated documents, making it challenging to train increasingly complex neural networks. In contrast, digital libraries store millions of scientific articles online, covering a wide range of topics. While a significant portion of these articles contain keyphrases provided by their authors, most other articles lack such kind of annotations. Therefore, to effectively utilize these large amounts of unlabeled articles, we propose a simple and efficient joint learning approach based on the idea of self-distillation. Experimental results show that our approach consistently improves the performance of baseline models for keyphrase extraction. Furthermore, our best models outperform previous methods for the task, achieving new state-of-the-art results on two public benchmarks: Inspec and SemEval-2017.
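A minimal sketch of what a joint objective of this kind could look like: standard cross-entropy on author-labelled documents plus a KL term that pulls the model's predictions on unlabelled documents towards soft targets from a frozen earlier snapshot of itself. The BIO-style tagging, temperature, and loss weighting are assumptions, not the paper's exact formulation.

```python
# Minimal sketch (assumptions: BIO-style token tagging, KL-based self-distillation,
# fixed loss weighting); not the authors' exact objective.
import torch
import torch.nn.functional as F

def joint_self_distillation_loss(student_logits_l, labels, student_logits_u,
                                 teacher_logits_u, temperature=2.0, alpha=1.0):
    """student_logits_*: (batch, seq_len, num_tags); labels: (batch, seq_len) int tags."""
    # supervised term on labelled (author-provided keyphrase) documents
    ce = F.cross_entropy(student_logits_l.transpose(1, 2), labels)
    # self-distillation term on unlabelled documents: match a frozen earlier snapshot
    t = temperature
    kl = F.kl_div(F.log_softmax(student_logits_u / t, dim=-1),
                  F.softmax(teacher_logits_u.detach() / t, dim=-1),
                  reduction="batchmean") * (t * t)
    return ce + alpha * kl

# toy shapes: 2 labelled and 4 unlabelled documents, 20 tokens, 3 tags (B/I/O)
student_l = torch.randn(2, 20, 3, requires_grad=True)
student_u = torch.randn(4, 20, 3, requires_grad=True)
loss = joint_self_distillation_loss(student_l, torch.randint(0, 3, (2, 20)),
                                    student_u, torch.randn(4, 20, 3))
loss.backward()
```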
41. Rediscovering the Slavic Continuum in Representations Emerging from Neural Models of Spoken Language Identification [PDF] 返回目录
Badr M. Abdullah, Jacek Kudera, Tania Avgustinova, Bernd Möbius, Dietrich Klakow
Abstract: Deep neural networks have been employed for various spoken language recognition tasks, including tasks that are multilingual by definition such as spoken language identification. In this paper, we present a neural model for Slavic language identification in speech signals and analyze its emergent representations to investigate whether they reflect objective measures of language relatedness and/or non-linguists' perception of language similarity. While our analysis shows that the language representation space indeed captures language relatedness to a great extent, we find perceptual confusability between languages in our study to be the best predictor of the language representation similarity.
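The kind of analysis described, correlating a model's representation space with an external measure of language similarity, can be sketched on toy data as follows: compute pairwise cosine similarities between per-language mean embeddings and correlate them with an external matrix (e.g., perceptual confusability ratings) via Spearman's rho. The language set and all numeric values here are placeholders.

```python
# Representational-similarity sketch with toy data; languages and values are illustrative.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
languages = ["bul", "ces", "pol", "rus", "ukr"]
# per-language mean utterance embeddings from a hypothetical LID model (512-d)
emb = {lang: rng.normal(size=512) for lang in languages}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pairs = [(i, j) for i in range(len(languages)) for j in range(i + 1, len(languages))]
model_sim = [cosine(emb[languages[i]], emb[languages[j]]) for i, j in pairs]
# external pairwise scores (placeholders standing in for confusability ratings)
external_sim = rng.uniform(size=len(pairs))

rho, p = spearmanr(model_sim, external_sim)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```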
42. Language Models are Open Knowledge Graphs [PDF] 返回目录
Chenguang Wang, Xiao Liu, Dawn Song
Abstract: This paper shows how to construct knowledge graphs (KGs) from pre-trained language models (e.g., BERT, GPT-2/3), without human supervision. Popular KGs (e.g., Wikidata, NELL) are built in either a supervised or semi-supervised manner, requiring humans to create knowledge. Recent deep language models automatically acquire knowledge from large-scale corpora via pre-training. The stored knowledge has enabled the language models to improve downstream NLP tasks, e.g., answering questions, and writing code and articles. In this paper, we propose an unsupervised method to cast the knowledge contained within language models into KGs. We show that KGs are constructed with a single forward pass of the pre-trained language models (without fine-tuning) over the corpora. We demonstrate the quality of the constructed KGs by comparing to two KGs (Wikidata, TAC KBP) created by humans. Our KGs also provide open factual knowledge that is new in the existing KGs. Our code and KGs will be made publicly available.
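A greatly simplified sketch of the idea of reading relational knowledge off a single forward pass: average the attention maps of a pre-trained LM and, for a hand-specified head/tail entity pair, pick the intermediate token that carries the most attention mass as relation evidence. This is only a toy heuristic under stated assumptions; it is not the paper's extraction algorithm, and the entity spans are supplied by hand.

```python
# Greatly simplified sketch (not the paper's algorithm): one forward pass of a pre-trained
# LM with attention outputs; for a given head/tail pair, score intermediate tokens by
# attention mass as relation evidence.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

text = "Paris is the capital of France"
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    attn = model(**enc).attentions                  # tuple of (1, heads, seq, seq)
A = torch.stack(attn).mean(dim=(0, 2)).squeeze(0)   # average over layers and heads -> (seq, seq)

tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
head_idx, tail_idx = tokens.index("paris"), tokens.index("france")
between = range(head_idx + 1, tail_idx)             # candidate relation tokens
scores = {i: (A[tail_idx, i] + A[i, head_idx]).item() for i in between}
rel_idx = max(scores, key=scores.get)
print((tokens[head_idx], tokens[rel_idx], tokens[tail_idx]))  # candidate (head, relation, tail)
```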
43. Unsupervised Data Augmentation with Naive Augmentation and without Unlabeled Data [PDF] 返回目录
David Lowell, Brian E. Howard, Zachary C. Lipton, Byron C. Wallace
Abstract: Unsupervised Data Augmentation (UDA) is a semi-supervised technique that applies a consistency loss to penalize differences between a model's predictions on (a) observed (unlabeled) examples; and (b) corresponding 'noised' examples produced via data augmentation. While UDA has gained popularity for text classification, open questions linger over which design decisions are necessary and over how to extend the method to sequence labeling tasks. In this paper, we re-examine UDA and demonstrate its efficacy on several sequential tasks. Our main contribution is an empirical study of UDA to establish which components of the algorithm confer benefits in NLP. Notably, although prior work has emphasized the use of clever augmentation techniques including back-translation, we find that enforcing consistency between predictions assigned to observed and randomly substituted words often yields comparable (or greater) benefits compared to these complex perturbation models. Furthermore, we find that applying its consistency loss affords meaningful gains without any unlabeled data at all, i.e., in a standard supervised setting. In short: UDA need not be unsupervised, and does not require complex data augmentation to be effective.
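The consistency loss at the heart of UDA can be sketched in a few lines: compare the model's sharpened, detached prediction on an observed example with its prediction on a noised copy, here using the naive random word substitution the abstract highlights. The vocabulary, substitution rate, sharpening temperature, and toy bag-of-words classifier are assumptions so the sketch runs end to end.

```python
# Sketch of a UDA-style consistency loss with naive random word substitution
# (vocabulary, substitution rate, and sharpening temperature are assumptions).
import random
import torch
import torch.nn.functional as F

def random_substitute(tokens, vocab, rate=0.15):
    """Replace each token with a random vocabulary word with probability `rate`."""
    return [random.choice(vocab) if random.random() < rate else t for t in tokens]

def consistency_loss(model, batch_tokens, vocab, temperature=0.4):
    orig_logits = model(batch_tokens)
    noised = [random_substitute(toks, vocab) for toks in batch_tokens]
    noised_logits = model(noised)
    # fixed, sharpened target taken from the prediction on the observed example
    target = F.softmax(orig_logits.detach() / temperature, dim=-1)
    return F.kl_div(F.log_softmax(noised_logits, dim=-1), target, reduction="batchmean")

# toy bag-of-words classifier so the sketch runs end to end
vocab = ["good", "bad", "fine", "awful", "great", "poor"]
class BowClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(len(vocab), 2)
    def forward(self, batch):
        x = torch.stack([torch.tensor([toks.count(w) for w in vocab], dtype=torch.float)
                         for toks in batch])
        return self.linear(x)

model = BowClassifier()
loss = consistency_loss(model, [["good", "great", "fine"], ["awful", "poor", "bad"]], vocab)
loss.backward()
```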
44. A Differentially Private Text Perturbation Method Using a Regularized Mahalanobis Metric [PDF] 返回目录
Zekun Xu, Abhinav Aggarwal, Oluwaseyi Feyisetan, Nathanael Teissier
Abstract: Balancing the privacy-utility tradeoff is a crucial requirement of many practical machine learning systems that deal with sensitive customer data. A popular approach for privacy-preserving text analysis is noise injection, in which text data is first mapped into a continuous embedding space, perturbed by sampling a spherical noise from an appropriate distribution, and then projected back to the discrete vocabulary space. While this allows the perturbation to admit the required metric differential privacy, often the utility of downstream tasks modeled on this perturbed data is low because the spherical noise does not account for the variability in the density around different words in the embedding space. In particular, words in a sparse region are likely unchanged even when the noise scale is large. Using the global sensitivity of the mechanism can potentially add too much noise to the words in the dense regions of the embedding space, causing a high utility loss, whereas using local sensitivity can leak information through the scale of the noise added. In this paper, we propose a text perturbation mechanism based on a carefully designed regularized variant of the Mahalanobis metric to overcome this problem. For any given noise scale, this metric adds an elliptical noise to account for the covariance structure in the embedding space. This heterogeneity in the noise scale along different directions helps ensure that the words in the sparse region have sufficient likelihood of replacement without sacrificing the overall utility. We provide a text-perturbation algorithm based on this metric and formally prove its privacy guarantees. Additionally, we empirically show that our mechanism improves the privacy statistics to achieve the same level of utility as compared to the state-of-the-art Laplace mechanism.
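A hedged sketch of the elliptical-noise idea, following the common metric-DP text perturbation recipe (spherical noise with a Gamma-distributed radius, added to the word embedding, then snapped back to the nearest vocabulary word) and shaping the noise with a regularised covariance Sigma_lambda = lambda*Sigma + (1 - lambda)*I. The exact sampling distribution and the privacy accounting proved in the paper are not reproduced here.

```python
# Hedged sketch of elliptical noise injection for text perturbation; the sampling details
# and privacy proof in the paper are NOT reproduced here.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "car", "bus", "tree", "house"]
E = rng.normal(size=(len(vocab), 8))               # toy word embeddings, 8-d

def regularized_sqrt_cov(E, lam=0.5):
    sigma = np.cov(E, rowvar=False)
    sigma_lam = lam * sigma + (1 - lam) * np.eye(E.shape[1])
    vals, vecs = np.linalg.eigh(sigma_lam)
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T   # matrix square root

def perturb_word(word, epsilon=5.0, lam=0.5):
    d = E.shape[1]
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)          # uniform direction on the sphere
    radius = rng.gamma(shape=d, scale=1.0 / epsilon)
    noise = regularized_sqrt_cov(E, lam) @ (radius * direction)
    noisy = E[vocab.index(word)] + noise
    return vocab[int(np.argmin(np.linalg.norm(E - noisy, axis=1)))]  # nearest word

print([perturb_word("cat") for _ in range(5)])
```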
45. EML System Description for VoxCeleb Speaker Diarization Challenge 2020 [PDF] 返回目录
Omid Ghahabi, Volker Fischer
Abstract: This technical report describes the EML submission to the first VoxCeleb speaker diarization challenge. Although the aim of the challenge has been the offline processing of the signals, the submitted system is basically the EML online algorithm, which decides on the speaker labels at runtime approximately every 1.2 sec. For the first phase of the challenge, only the VoxCeleb2 dev dataset was used for training. The results on the provided VoxConverse dev set show much better accuracy in terms of both DER and JER compared to the offline baseline provided in the challenge. The real-time factor of the whole diarization process is about 0.01 using a single CPU machine.
46. An Analysis of LIME for Text Data [PDF] 返回目录
Dina Mardaoui, Damien Garreau
Abstract: Text data are increasingly handled in an automated fashion by machine learning algorithms. But the models handling these data are not always well-understood due to their complexity and are more and more often referred to as "black-boxes." Interpretability methods aim to explain how these models operate. Among them, LIME has become one of the most popular in recent years. However, it comes without theoretical guarantees: even for simple models, we are not sure that LIME behaves accurately. In this paper, we provide a first theoretical analysis of LIME for text data. As a consequence of our theoretical findings, we show that LIME indeed provides meaningful explanations for simple models, namely decision trees and linear models.
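For readers unfamiliar with the mechanism being analysed, here is a from-scratch sketch of LIME for text: perturb the example by dropping words, query the black-box classifier, weight samples by proximity to the original, and fit a weighted linear surrogate whose coefficients serve as word-level explanations. The kernel width, sample count, and toy black box are assumptions; the lime package provides a full implementation.

```python
# From-scratch sketch of LIME for text; kernel width, sample count, and the toy black box
# are assumptions for illustration.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(texts):
    """Toy classifier: probability of 'positive' grows with occurrences of 'great'."""
    return np.array([min(1.0, 0.2 + 0.4 * t.split().count("great")) for t in texts])

def lime_text(text, num_samples=500, kernel_width=0.75):
    words = text.split()
    masks = rng.integers(0, 2, size=(num_samples, len(words)))
    masks[0] = 1                                              # keep the original example
    texts = [" ".join(w for w, m in zip(words, row) if m) for row in masks]
    preds = black_box(texts)
    distances = 1.0 - masks.mean(axis=1)                      # fraction of words removed
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)   # exponential proximity kernel
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, preds, sample_weight=weights)
    return sorted(zip(words, surrogate.coef_), key=lambda p: -abs(p[1]))

print(lime_text("the plot was great but the pacing felt slow"))
```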
47. Show and Speak: Directly Synthesize Spoken Description of Images [PDF] 返回目录
Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg
Abstract: This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.
48. Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [PDF] 返回目录
Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda
Abstract: We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation that was learned from massive unlabeled data, which is assumed to be speaker-independent and well corresponds to underlying linguistic contents. Given a training dataset of the target speaker, we extract VQW2V and acoustic features to estimate a seq2seq mapping function from the former to the latter. With the help of a pretraining method and a newly designed postprocessing technique, our model can be generalized to only 5 min of data, even outperforming the same model trained with parallel data.
49. Don't shoot butterfly with rifles: Multi-channel Continuous Speech Separation with Early Exit Transformer [PDF] 返回目录
Sanyuan Chen, Yu Wu, Zhuo Chen, Takuya Yoshioka, Shujie Liu, Jinyu Li
Abstract: With its strong modeling capacity that comes from a multi-head and multi-layer structure, Transformer is a very powerful model for learning a sequential representation and has been successfully applied to speech separation recently. However, multi-channel speech separation sometimes does not necessarily need such a heavy structure for all time frames especially when the cross-talker challenge happens only occasionally. For example, in conversation scenarios, most regions contain only a single active speaker, where the separation task downgrades to a single speaker enhancement problem. It turns out that using a very deep network structure for dealing with signals with a low overlap ratio not only negatively affects the inference efficiency but also hurts the separation performance. To deal with this problem, we propose an early exit mechanism, which enables the Transformer model to handle different cases with adaptive depth. Experimental results indicate that not only does the early exit mechanism accelerate the inference, but it also improves the accuracy.
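The early-exit idea can be sketched as a loop over encoder layers that stops once successive layer outputs stop changing much, so "easy" (low-overlap) inputs use fewer layers. The exit criterion (relative L2 change) and the threshold below are assumptions, not necessarily the paper's rule.

```python
# Sketch of an early-exit loop over Transformer encoder layers; the exit criterion and
# threshold are assumptions, not the paper's exact rule.
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=6, threshold=1e-2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers))
        self.threshold = threshold

    def forward(self, x):
        prev = x
        for depth, layer in enumerate(self.layers, start=1):
            out = layer(prev)
            change = (out - prev).norm() / prev.norm()
            if change < self.threshold:          # "easy" input: stop refining early
                return out, depth
            prev = out
        return prev, len(self.layers)

mix = torch.randn(1, 200, 64)                    # (batch, frames, features)
with torch.no_grad():
    separated, used_layers = EarlyExitEncoder()(mix)
print(used_layers, separated.shape)
```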
50. Transformer-based End-to-End Speech Recognition with Local Dense Synthesizer Attention [PDF] 返回目录
Menglong Xu, Shengqiang Li, Xiao-Lei Zhang
Abstract: Recently, several studies reported that dot-product self-attention (SA) may not be indispensable to the state-of-the-art Transformer models. Motivated by the fact that dense synthesizer attention (DSA), which dispenses with dot products and pairwise interactions, achieved competitive results in many language processing tasks, in this paper, we first propose a DSA-based speech recognition model, as an alternative to SA. To reduce the computational complexity and improve the performance, we further propose local DSA (LDSA) to restrict the attention scope of DSA to a local range around the current central frame for speech recognition. Finally, we combine LDSA with SA to extract the local and global information simultaneously. Experimental results on the AISHELL-1 Mandarin speech recognition corpus show that the proposed LDSA-Transformer achieves a character error rate (CER) of 6.49%, which is slightly better than that of the SA-Transformer. Meanwhile, the LDSA-Transformer requires less computation than the SA-Transformer. The proposed combination method not only achieves a CER of 6.18%, which significantly outperforms the SA-Transformer, but also has roughly the same number of parameters and computational complexity as the latter. The implementation of the multi-head LDSA is available at this https URL.
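A sketch of the local dense synthesizer attention operation as described: each frame predicts attention weights over a fixed local window directly from its own features (no dot products or pairwise interactions) and mixes the windowed values. The window size and the two-layer weight predictor are illustrative assumptions.

```python
# Sketch of local dense synthesizer attention (LDSA); window size and the two-layer
# weight predictor are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalDenseSynthesizerAttention(nn.Module):
    def __init__(self, d_model=64, window=7, hidden=128):
        super().__init__()
        assert window % 2 == 1
        self.window = window
        self.weight_net = nn.Sequential(             # frame features -> window scores
            nn.Linear(d_model, hidden), nn.ReLU(), nn.Linear(hidden, window))
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):                             # x: (batch, time, d_model)
        scores = F.softmax(self.weight_net(x), dim=-1)           # (batch, time, window)
        v = self.value(x)
        pad = self.window // 2
        v = F.pad(v, (0, 0, pad, pad))                            # pad the time axis
        neighbours = v.unfold(1, self.window, 1)                  # (batch, time, d, window)
        return torch.einsum("btw,btdw->btd", scores, neighbours)

x = torch.randn(2, 100, 64)
print(LocalDenseSynthesizerAttention()(x).shape)      # torch.Size([2, 100, 64])
```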
51. Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation [PDF] 返回目录
Bowen Li, Xiaojuan Qi, Philip H. S. Torr, Thomas Lukasiewicz
Abstract: We propose a novel lightweight generative adversarial network for efficient image manipulation using natural language descriptions. To achieve this, a new word-level discriminator is proposed, which provides the generator with fine-grained training feedback at word-level, to facilitate training a lightweight generator that has a small number of parameters, but can still correctly focus on specific visual attributes of an image, and then edit them without affecting other contents that are not described in the text. Furthermore, thanks to the explicit training signal related to each word, the discriminator can also be simplified to have a lightweight structure. Compared with the state of the art, our method has a much smaller number of parameters, but still achieves a competitive manipulation performance. Extensive experimental results demonstrate that our method can better disentangle different visual attributes, then correctly map them to corresponding semantic words, and thus achieve a more accurate image modification using natural language descriptions.
52. Knowledge Graph Embedding with Atrous Convolution and Residual Learning [PDF] 返回目录
Feiliang Ren, Juchen Li, Huihui Zhang, Shilei Liu, Bochao Li, Ruicheng Ming, Yujia Bai
Abstract: Knowledge graph embedding is an important task and it will benefit lots of downstream applications. Currently, deep neural networks based methods achieve state-of-the-art performance. However, most of these existing methods are very complex and need much time for training and inference. To address this issue, we propose a simple but effective atrous convolution based knowledge graph embedding method. Compared with existing state-of-the-art methods, our method has following main characteristics. First, it effectively increases feature interactions by using atrous convolutions. Second, to address the original information forgotten issue and vanishing/exploding gradient issue, it uses the residual learning method. Third, it has simpler structure but much higher parameter efficiency. We evaluate our method on six benchmark datasets with different evaluation metrics. Extensive experiments show that our model is very effective. On these diverse datasets, it achieves better results than the compared state-of-the-art methods on most of evaluation metrics. The source codes of our model could be found at this https URL.
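A sketch of how dilated (atrous) convolutions and a residual connection might be combined into a KG-embedding scorer: embed head and relation, treat them as two channels, apply dilated 1-D convolutions with a residual shortcut, and score all candidate tails with a dot product. The channel layout, dilation rates, and scoring function are assumptions rather than the paper's exact architecture.

```python
# Sketch of a KG-embedding scorer built from dilated (atrous) 1-D convolutions with a
# residual connection; layout and dilation rates are assumptions.
import torch
import torch.nn as nn

class AtrousKGScorer(nn.Module):
    def __init__(self, num_entities, num_relations, dim=200, channels=32):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)
        self.conv = nn.Sequential(
            nn.Conv1d(2, channels, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(channels, 2, kernel_size=3, dilation=4, padding=4))
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, head_idx, rel_idx):
        h, r = self.ent(head_idx), self.rel(rel_idx)
        x = torch.stack([h, r], dim=1)               # (batch, 2, dim) -> conv channels
        x = self.conv(x) + x                         # residual connection
        x = self.proj(x.flatten(1))                  # (batch, dim)
        return x @ self.ent.weight.t()               # scores over all candidate tails

model = AtrousKGScorer(num_entities=1000, num_relations=50)
scores = model(torch.tensor([3, 7]), torch.tensor([1, 4]))
print(scores.shape)                                  # torch.Size([2, 1000])
```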
53. How Phonotactics Affect Multilingual and Zero-shot ASR Performance [PDF] 返回目录
Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak
Abstract: The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and retaining only the target language's phonotactic data in LM training is preferable.
54. Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data [PDF] 返回目录
Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao
Abstract: Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which is then used to distill knowledge into streaming ASR models. This way, we scale the training of streaming models to up to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the word error rate (WER) of RNNT models not only on LibriSpeech but also on YouTube data in four languages. For example, in French, we are able to reduce the WER by 16.4% relative to a baseline streaming model by leveraging a non-streaming teacher model trained on the same amount of labeled data as the baseline.
55. Language-Conditioned Imitation Learning for Robot Manipulation Tasks [PDF] 返回目录
Simon Stepputtis, Joseph Campbell, Mariano Phielipp, Stefan Lee, Chitta Baral, Heni Ben Amor
Abstract: Imitation learning is a popular approach for teaching motor skills to robots. However, most approaches focus on extracting policy parameters from execution traces alone (i.e., motion trajectories and perceptual data). No adequate communication channel exists between the human expert and the robot to describe critical aspects of the task, such as the properties of the target object or the intended shape of the motion. Motivated by insights into the human teaching process, we introduce a method for incorporating unstructured natural language into imitation learning. At training time, the expert can provide demonstrations along with verbal descriptions in order to describe the underlying intent (e.g., "go to the large green bowl"). The training process then interrelates these two modalities to encode the correlations between language, perception, and motion. The resulting language-conditioned visuomotor policies can be conditioned at runtime on new human commands and instructions, which allows for more fine-grained control over the trained policies while also reducing situational ambiguity. We demonstrate in a set of simulation experiments how our approach can learn language-conditioned manipulation policies for a seven-degree-of-freedom robot arm and compare the results to a variety of alternative methods.
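A minimal sketch of language-conditioned behaviour cloning: a GRU encodes the instruction, its final hidden state is concatenated with the robot state, and an MLP regresses the demonstrated action under an MSE loss. The vocabulary, dimensions, and plain concatenation are assumptions; the paper's model is considerably richer.

```python
# Sketch of language-conditioned behaviour cloning; vocabulary, sizes, and the plain
# concatenation are assumptions.
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    def __init__(self, vocab_size, state_dim=10, action_dim=7, embed=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.gru = nn.GRU(embed, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + state_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))

    def forward(self, instruction_ids, state):
        _, h = self.gru(self.embed(instruction_ids))      # h: (1, batch, hidden)
        return self.head(torch.cat([h[-1], state], dim=-1))

vocab = {w: i for i, w in enumerate(["<pad>", "go", "to", "the", "large", "green", "bowl"])}
ids = torch.tensor([[vocab[w] for w in "go to the large green bowl".split()]])
state = torch.randn(1, 10)                                # current joint/object state
expert_action = torch.randn(1, 7)                         # demonstrated 7-DoF action

policy = LanguageConditionedPolicy(len(vocab))
loss = nn.functional.mse_loss(policy(ids, state), expert_action)
loss.backward()
```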
56. Characterizing Datasets for Social Visual Question Answering, and the New TinySocial Dataset [PDF] 返回目录
Zhanwen Chen, Shiyao Li, Roxanne Rashedi, Xiaoman Zi, Morgan Elrod-Erickson, Bryan Hollis, Angela Maliakal, Xinyu Shen, Simeng Zhao, Maithilee Kunda
Abstract: Modern social intelligence includes the ability to watch videos and answer questions about social and theory-of-mind-related content, e.g., for a scene in Harry Potter, "Is the father really upset about the boys flying the car?" Social visual question answering (social VQA) is emerging as a valuable methodology for studying social reasoning in both humans (e.g., children with autism) and AI agents. However, this problem space spans enormous variations in both videos and questions. We discuss methods for creating and characterizing social VQA datasets, including 1) crowdsourcing versus in-house authoring, including sample comparisons of two new datasets that we created (TinySocial-Crowd and TinySocial-InHouse) and the previously existing Social-IQ dataset; 2) a new rubric for characterizing the difficulty and content of a given video; and 3) a new rubric for characterizing question types. We close by describing how having well-characterized social VQA datasets will enhance the explainability of AI agents and can also inform assessments and educational interventions for people.
57. The HUAWEI Speaker Diarisation System for the VoxCeleb Speaker Diarisation Challenge [PDF] 返回目录
Renyu Wang, Ruilin Tong, Yu Ting Yeung, Xiao Chen
Abstract: This paper describes the system setup of our submission to the speaker diarisation track (Track 4) of the VoxCeleb Speaker Recognition Challenge 2020. Our diarisation system consists of a well-trained neural network based speech enhancement model as a pre-processing front-end of input speech signals. We replace conventional energy-based voice activity detection (VAD) with a neural network based VAD. The neural network based VAD provides more accurate annotation of speech segments containing only background music, noise, and other interference, which is crucial to diarisation performance. We apply agglomerative hierarchical clustering (AHC) of x-vectors and variational Bayesian hidden Markov model (VB-HMM) based iterative clustering for speaker clustering. Experimental results demonstrate that our proposed system achieves substantial improvements over the baseline system, yielding a diarisation error rate (DER) of 10.45% and a Jaccard error rate (JER) of 22.46% on the evaluation set.
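The AHC stage can be sketched with SciPy: average-linkage agglomerative clustering over cosine distances between segment x-vectors, cut at a distance threshold to obtain per-segment speaker labels. The toy x-vectors and the threshold are assumptions, and the VB-HMM resegmentation refinement is not shown.

```python
# Sketch of the agglomerative-hierarchical-clustering stage over segment x-vectors;
# the toy vectors and threshold are assumptions, and VB-HMM resegmentation is omitted.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# toy x-vectors: two simulated speakers, 5 speech segments each (128-d)
spk_a, spk_b = rng.normal(size=128), rng.normal(size=128)
xvectors = np.stack([spk_a + 0.1 * rng.normal(size=128) for _ in range(5)] +
                    [spk_b + 0.1 * rng.normal(size=128) for _ in range(5)])

Z = linkage(pdist(xvectors, metric="cosine"), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")   # one speaker label per segment
print(labels)                                        # e.g. two clusters of five segments
```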