
[arXiv papers] Computation and Language 2020-10-28

Contents

1. Fast Interleaved Bidirectional Sequence Generation [PDF] Abstract
2. It's All in the Name: A Character Based Approach To Infer Religion [PDF] Abstract
3. Evaluating Gender Bias in Speech Translation [PDF] Abstract
4. Discovering and Interpreting Conceptual Biases in Online Communities [PDF] Abstract
5. Differentiable Open-Ended Commonsense Reasoning [PDF] Abstract
6. Listener's Social Identity Matters in Personalised Response Generation [PDF] Abstract
7. Multitask Training with Text Data for End-to-End Speech Recognition [PDF] Abstract
8. Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [PDF] Abstract
9. Improving Reinforcement Learning for Neural Relation Extraction with Hierarchical Memory Extractor [PDF] Abstract
10. Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles [PDF] Abstract
11. Global Sentiment Analysis Of COVID-19 Tweets Over Time [PDF] Abstract
12. Event Detection: Gate Diversity and Syntactic Importance Scores for Graph Convolution Neural Networks [PDF] Abstract
13. Emotion recognition by fusing time synchronous and time asynchronous representations [PDF] Abstract
14. Multi-Domain Dialogue State Tracking -- A Purely Transformer-Based Generative Approach [PDF] Abstract
15. To BERT or Not to BERT: Comparing Task-specific and Task-agnostic Semi-Supervised Approaches for Sequence Tagging [PDF] Abstract
16. Volctrans Parallel Corpus Filtering System for WMT 2020 [PDF] Abstract
17. Speech SIMCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning [PDF] Abstract
18. Interpretation of NLP models through input marginalization [PDF] Abstract
19. Predict and Use Latent Patterns for Short-Text Conversation [PDF] Abstract
20. Reading Between the Lines: Exploring Infilling in Visual Narratives [PDF] Abstract
21. Improving Limited Labeled Dialogue State Tracking with Self-Supervision [PDF] Abstract
22. Probing Task-Oriented Dialogue Representation from Language Models [PDF] Abstract
23. Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer [PDF] Abstract
24. Word Frequency Does Not Predict Grammatical Knowledge in Language Models [PDF] Abstract
25. Data Troubles in Sentence Level Confidence Estimation for Machine Translation [PDF] Abstract
26. Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [PDF] Abstract
27. PowerTransformer: Unsupervised Controllable Revision for Biased Language Correction [PDF] Abstract
28. Is it Great or Terrible? Preserving Sentiment in Neural Machine Translation of Arabic Reviews [PDF] Abstract
29. Dynamic Boundary Time Warping for Sub-sequence Matching with Few Examples [PDF] Abstract
30. Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment [PDF] Abstract
31. Co-attentional Transformers for Story-Based Video Understanding [PDF] Abstract
32. VisualHints: A Visual-Lingual Environment for Multimodal Reinforcement Learning [PDF] Abstract

Abstracts

1. Fast Interleaved Bidirectional Sequence Generation [PDF] Back to contents
  Biao Zhang, Ivan Titov, Rico Sennrich
Abstract: Independence assumptions during sequence generation can speed up inference, but parallel generation of highly inter-dependent tokens comes at a cost in quality. Instead of assuming independence between neighbouring tokens (semi-autoregressive decoding, SA), we take inspiration from bidirectional sequence generation and introduce a decoder that generates target words from the left-to-right and right-to-left directions simultaneously. We show that we can easily convert a standard architecture for unidirectional decoding into a bidirectional decoder by simply interleaving the two directions and adapting the word positions and self-attention masks. Our interleaved bidirectional decoder (IBDecoder) retains the model simplicity and training efficiency of the standard Transformer, and on five machine translation tasks and two document summarization tasks, achieves a decoding speedup of ~2X compared to autoregressive decoding with comparable quality. Notably, it outperforms left-to-right SA because the independence assumptions in IBDecoder are more felicitous. To achieve even higher speedups, we explore hybrid models where we either simultaneously predict multiple neighbouring tokens per direction, or perform multi-directional decoding by partitioning the target sequence. These methods achieve speedups of 4X-11X across different tasks at the cost of <1 BLEU or <0.5 ROUGE (on average). Source code is released at this https URL.
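The interleaving idea described in this abstract can be illustrated with a small sketch. It is only an illustration under assumptions: the function names `interleave` and `deinterleave` are hypothetical, and the ordering used here (first token, last token, second token, second-to-last token, and so on) is one plausible reading of "interleaving the two directions", not the authors' released implementation. Position encodings and attention masks are not modelled.

```python
def interleave(tokens):
    """Reorder a target sequence so that left-to-right and right-to-left
    tokens alternate: y1, yn, y2, y(n-1), ...

    Assumed ordering for illustration only.
    """
    mixed = []
    left, right = 0, len(tokens) - 1
    while left < right:
        mixed.append(tokens[left])    # next token from the left end
        mixed.append(tokens[right])   # next token from the right end
        left += 1
        right -= 1
    if left == right:                 # odd length: one middle token remains
        mixed.append(tokens[left])
    return mixed


def deinterleave(mixed):
    """Invert interleave(): recover the original left-to-right order."""
    from_left = mixed[0::2]           # tokens generated left-to-right
    from_right = mixed[1::2]          # tokens generated right-to-left
    return from_left + from_right[::-1]


example = ["a", "b", "c", "d", "e"]
mixed_example = interleave(example)
restored_example = deinterleave(mixed_example)
```

With this ordering, a decoder that emits two tokens per step (one per direction) needs roughly half as many steps as a one-token-per-step decoder, which is consistent with the ~2X speedup the abstract reports.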

2. It's All in the Name: A Character Based Approach To Infer Religion [PDF] Back to contents
  Rochana Chaturvedi, Sugat Chaturvedi
Abstract: Demographic inference from text has received a surge of attention in the field of natural language processing in the last decade. In this paper, we use personal names to infer religion in South Asia - where religion is a salient social division, and yet, disaggregated data on it remains scarce. Existing work predicts religion using dictionary based method, and therefore, can not classify unseen names. We use character based models which learn character patterns and, therefore, can classify unseen names as well with high accuracy. These models are also much faster and can easily be scaled to large data sets. We improve our classifier by combining the name of an individual with that of their parent/spouse and achieve remarkably high accuracy. Finally, we trace the classification decisions of a convolutional neural network model using layer-wise relevance propagation which can explain the predictions of complex non-linear classifiers and circumvent their purported black box nature. We show how character patterns learned by the classifier are rooted in the linguistic origins of names.

3. Evaluating Gender Bias in Speech Translation [PDF] Back to contents
  Marta R. Costa-jussà, Christine Basta, Gerard I. Gállego
Abstract: The scientific community is more and more aware of the necessity to embrace pluralism and consistently represent major and minor social groups. In this direction, there is an urgent need to provide evaluation sets and protocols to measure existing biases in our automatic systems. This paper introduces WinoST, a new freely available challenge set for evaluating gender bias in speech translation. WinoST is the speech version of WinoMT which is an MT challenge set and both follow an evaluation protocol to measure gender accuracy. Using a state-of-the-art end-to-end speech translation system, we report the gender bias evaluation on 4 language pairs, and we show that gender accuracy in speech translation is more than 23% lower than in MT.

4. Discovering and Interpreting Conceptual Biases in Online Communities [PDF] Back to contents
  Xavier Ferrer-Aran, Tom van Nuenen, Natalia Criado, Jose M. Such
Abstract: Language carries implicit human biases, functioning both as a reflection and a perpetuation of stereotypes that people carry with them. Recently, ML-based NLP methods such as word embeddings have been shown to learn such language biases with striking accuracy. This capability of word embeddings has been successfully exploited as a tool to quantify and study human biases. However, previous studies only consider a predefined set of conceptual biases to attest (e.g., whether gender is more or less associated with particular jobs), or just discover biased words without helping to understand their meaning at the conceptual level. As such, these approaches are either unable to find conceptual biases that have not been defined in advance, or the biases they find are difficult to interpret and study. This makes existing approaches unsuitable to discover and interpret biases in online communities, as such communities may carry different biases than those in mainstream culture. This paper proposes a general, data-driven approach to automatically discover and help interpret conceptual biases encoded in word embeddings. We apply this approach to study the conceptual biases present in the language used in online communities and experimentally show the validity and stability of our method.

5. Differentiable Open-Ended Commonsense Reasoning [PDF] Back to contents
  Bill Yuchen Lin, Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Xiang Ren, William W. Cohen
Abstract: Current commonsense reasoning research mainly focuses on developing models that use commonsense knowledge to answer multiple-choice questions. However, systems designed to answer multiple-choice questions may not be useful in applications that do not provide a small list of possible candidate answers to choose from. As a step towards making commonsense reasoning research more realistic, we propose to study open-ended commonsense reasoning (OpenCSR) - the task of answering a commonsense question without any pre-defined choices, using as a resource only a corpus of commonsense facts written in natural language. The task is challenging due to a much larger decision space, and because many commonsense questions require multi-hop reasoning. We propose an efficient differentiable model for multi-hop reasoning over knowledge facts, named DrFact. We evaluate our approach on a collection of re-formatted, open-ended versions of popular tests targeting commonsense reasoning, and show that our approach outperforms strong baseline methods by a large margin.

6. Listener's Social Identity Matters in Personalised Response Generation [PDF] Back to contents
  Guanyi Chen, Yinhe Zheng, Yupei Du
Abstract: Personalised response generation enables generating human-like responses by means of assigning the generator a social identity. However, pragmatics theory suggests that human beings adjust the way of speaking based on not only who they are but also whom they are talking to. In other words, when modelling personalised dialogues, it might be favourable if we also take the listener's social identity into consideration. To validate this idea, we use gender as a typical example of a social variable to investigate how the listener's identity influences the language used in Chinese dialogues on social media. Also, we build personalised generators. The experiment results demonstrate that the listener's identity indeed matters in the language use of responses and that the response generator can capture such differences in language use. More interestingly, by additionally modelling the listener's identity, the personalised response generator performs better in its own identity.

7. Multitask Training with Text Data for End-to-End Speech Recognition [PDF] Back to contents
  Peidong Wang, Tara N. Sainath, Ron J. Weiss
Abstract: We propose a multitask training method for attention-based end-to-end speech recognition models to better incorporate language level information. We regularize the decoder in a sequence-to-sequence architecture by multitask training it on both the speech recognition task and a next-token prediction language modeling task. Trained on either the 100 hour subset of LibriSpeech or the full 960 hour dataset, the proposed method leads to an 11% relative performance improvement over the baseline and is comparable to language model shallow fusion, without requiring an additional neural network during decoding. Analyses of sample output sentences and the word error rate on rare words demonstrate that the proposed method can incorporate language level information effectively.
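The multitask setup described in this abstract, where the decoder is trained on both speech recognition and next-token prediction, can be written as a combined objective. The additive form and the weight \lambda are assumptions made here for illustration; the abstract does not state how the two losses are balanced.

```latex
% Assumed combined objective for the shared decoder parameters \theta:
% L_ASR is the speech recognition loss on paired audio-text data,
% L_LM is the next-token prediction loss on text-only data,
% \lambda >= 0 is a hypothetical weighting hyperparameter.
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{ASR}}(\theta) + \lambda \, \mathcal{L}_{\mathrm{LM}}(\theta)
```

Because both terms update the same decoder, no separate language model is needed at decoding time, which matches the abstract's comparison with shallow fusion.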

8. Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [PDF] Back to contents
  Junhao Liu, Linjun Shou, Jian Pei, Ming Gong, Min Yang, Daxin Jiang
Abstract: Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale annotated datasets in low-source languages, such as Arabic, Hindi, and Vietnamese. Many previous approaches use translation data by translating from a rich-source language, such as English, to low-source languages as auxiliary supervision. However, how to effectively leverage translation data and reduce the impact of noise introduced by translation remains onerous. In this paper, we tackle this challenge and enhance the cross-lingual transferring performance by a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC). A language branch is a group of passages in one single language paired with questions in all target languages. We train multiple machine reading comprehension (MRC) models proficient in individual language based on LBMRC. Then, we devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages. Combining the LBMRC and multilingual distillation can be more robust to the data noises, therefore, improving the model's cross-lingual ability. Meanwhile, the produced single multilingual model is applicable to all target languages, which saves the cost of training, inference, and maintenance for multiple models. Extensive experiments on two CLMRC benchmarks clearly show the effectiveness of our proposed method.

9. Improving Reinforcement Learning for Neural Relation Extraction with Hierarchical Memory Extractor [PDF] Back to contents
  Jianing Wang, Chong Su
Abstract: Distant supervision relation extraction (DSRE) is an efficient method to extract semantic relations on a large-scale heuristic labeling corpus. However, it usually brings in a massive noisy data. In order to alleviate this problem, many recent approaches adopt reinforcement learning (RL), which aims to select correct data autonomously before relation classification. Although these RL methods outperform conventional multi-instance learning-based methods, there are still two neglected problems: 1) the existing RL methods ignore the feedback of noisy data, 2) the reduction of training corpus exacerbates long-tail problem. In this paper, we propose a novel framework to solve the two problems mentioned above. Firstly, we design a novel reward function to obtain feedback from both correct and noisy data. In addition, we use implicit relations information to improve RL. Secondly, we propose the hierarchical memory extractor (HME), which utilizes the gating mechanism to share the semantics from correlative instances between data-rich and data-poor classes. Moreover, we define a hierarchical weighted ranking loss function to implement top-down search processing. Extensive experiments conducted on the widely used NYT dataset show significant improvement over state-of-the-art baseline methods.

10. Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles [PDF] Back to contents
  Yao Lu, Yue Dong, Laurent Charlin
Abstract: Multi-document summarization is a challenging task for which few large-scale datasets exist. We propose Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references. Our work is inspired by extreme summarization, a dataset construction protocol that favours abstractive modeling approaches. Descriptive statistics and empirical results, using several state-of-the-art models trained on the Multi-XScience dataset, reveal that Multi-XScience is well suited for abstractive models.

11. Global Sentiment Analysis Of COVID-19 Tweets Over Time [PDF] Back to contents
  Muvazima Mansoor, Kirthika Gurumurthy, Anantharam R U, V R Badri Prasad
Abstract: The Coronavirus pandemic has affected the normal course of life. People around the world have taken to social media to express their opinions and general emotions regarding this phenomenon that has taken over the world by storm. The social networking site, Twitter showed an unprecedented increase in tweets related to the novel Coronavirus in a very short span of time. This paper presents the global sentiment analysis of tweets related to Coronavirus and how the sentiment of people in different countries has changed over time. Furthermore, to determine the impact of Coronavirus on daily aspects of life, tweets related to Work From Home (WFH) and Online Learning were scraped and the change in sentiment over time was observed. In addition, various Machine Learning models such as Long Short Term Memory (LSTM) and Artificial Neural Networks (ANN) were implemented for sentiment classification and their accuracies were determined. Exploratory data analysis was also performed for a dataset providing information about the number of confirmed cases on a per-day basis in a few of the worst-hit countries to provide a comparison between the change in sentiment with the change in cases since the start of this pandemic till June 2020.

12. Event Detection: Gate Diversity and Syntactic Importance Scores for Graph Convolution Neural Networks [PDF] Back to contents
  Viet Dac Lai, Tuan Ngo Nguyen, Thien Huu Nguyen
Abstract: Recent studies on event detection (ED) have shown that the syntactic dependency graph can be employed in graph convolution neural networks (GCN) to achieve state-of-the-art performance. However, the computation of the hidden vectors in such graph-based models is agnostic to the trigger candidate words, potentially leaving irrelevant information for the trigger candidate for event prediction. In addition, the current models for ED fail to exploit the overall contextual importance scores of the words, which can be obtained via the dependency tree, to boost the performance. In this study, we propose a novel gating mechanism to filter noisy information in the hidden vectors of the GCN models for ED based on the information from the trigger candidate. We also introduce novel mechanisms to achieve the contextual diversity for the gates and the importance score consistency for the graphs and models in ED. The experiments show that the proposed model achieves state-of-the-art performance on two ED datasets.

13. Emotion recognition by fusing time synchronous and time asynchronous representations [PDF] Back to contents
  Wen Wu, Chao Zhang, Philip C. Woodland
Abstract: In this paper, a novel two-branch neural network model structure is proposed for multimodal emotion recognition, which consists of a time synchronous branch (TSB) and a time asynchronous branch (TAB). To capture correlations between each word and its acoustic realisation, the TSB combines speech and text modalities at each input window frame and then does pooling across time to form a single embedding vector. The TAB, by contrast, provides cross-utterance information by integrating sentence text embeddings from a number of context utterances into another embedding vector. The final emotion classification uses both the TSB and the TAB embeddings. Experimental results on the IEMOCAP dataset demonstrate that the two-branch structure achieves state-of-the-art results in 4-way classification with all common test setups. When using automatic speech recognition (ASR) output instead of manually transcribed reference text, it is shown that the cross-utterance information considerably improves the robustness against ASR errors. Furthermore, by incorporating an extra class for all the other emotions, the final 5-way classification system with ASR hypotheses can be viewed as a prototype for more realistic emotion recognition systems.

14. Multi-Domain Dialogue State Tracking -- A Purely Transformer-Based Generative Approach [PDF] Back to contents
  Yan Zeng, Jian-Yun Nie
Abstract: We investigate the problem of multi-domain Dialogue State Tracking (DST) with open vocabulary. Existing approaches exploit BERT encoder and copy-based RNN decoder, where the encoder first predicts the state operation, and then the decoder generates new slot values. However, in this stacked encoder-decoder structure, the operation prediction objective only affects the BERT encoder and the value generation objective mainly affects the RNN decoder. In this paper, we propose a purely Transformer-based framework that uses BERT as both encoder and decoder. In so doing, the operation prediction objective and the value generation objective can jointly optimize our model for DST. At the decoding step, we re-use the hidden states of the encoder in the self-attention mechanism of the corresponding decoder layer to construct a flat model structure for effective parameter updating. Experimental results show that our approach substantially outperforms the existing state-of-the-art framework, and it also achieves very competitive performance to the best ontology-based approaches.

15. To BERT or Not to BERT: Comparing Task-specific and Task-agnostic Semi-Supervised Approaches for Sequence Tagging [PDF] Back to contents
  Kasturi Bhattacharjee, Miguel Ballesteros, Rishita Anubhai, Smaranda Muresan, Jie Ma, Faisal Ladhak, Yaser Al-Onaizan
Abstract: Leveraging large amounts of unlabeled data using Transformer-like architectures, like BERT, has gained popularity in recent times owing to their effectiveness in learning general representations that can then be further fine-tuned for downstream tasks to much success. However, training these models can be costly both from an economic and environmental standpoint. In this work, we investigate how to effectively use unlabeled data: by exploring the task-specific semi-supervised approach, Cross-View Training (CVT) and comparing it with task-agnostic BERT in multiple settings that include domain and task relevant English data. CVT uses a much lighter model architecture and we show that it achieves similar performance to BERT on a set of sequence tagging tasks, with lesser financial and environmental impact.

16. Volctrans Parallel Corpus Filtering System for WMT 2020 [PDF] Back to contents
  Runxin Xu, Zhuo Zhi, Jun Cao, Mingxuan Wang, Lei Li
Abstract: In this paper, we describe our submissions to the WMT20 shared task on parallel corpus filtering and alignment for low-resource conditions. The task requires the participants to align potential parallel sentence pairs out of the given document pairs, and score them so that low-quality pairs can be filtered. Our system, Volctrans, is made of two modules, i.e., a mining module and a scoring module. Based on the word alignment model, the mining module adopts an iterative mining strategy to extract latent parallel sentences. In the scoring module, an XLM-based scorer provides scores, followed by reranking mechanisms and ensemble. Our submissions outperform the baseline by 3.x/2.x and 2.x/2.x for km-en and ps-en on From Scratch/Fine-Tune conditions, which is the highest among all submissions.

17. Speech SIMCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning [PDF] Back to contents
  Dongwei Jiang, Wubo Li, Miao Cao, Ruixiong Zhang, Wei Zou, Kun Han, Xiangang Li
Abstract: Self-supervised visual pretraining has shown significant progress recently. Among those methods, SimCLR greatly advanced the state of the art in self-supervised and semi-supervised learning on ImageNet. The input feature representations for speech and visual tasks are both continuous, so it is natural to consider applying similar objective on speech representation learning. In this paper, we propose Speech SimCLR, a new self-supervised objective for speech representation learning. During training, Speech SimCLR applies augmentation on raw speech and its spectrogram. Its objective is the combination of contrastive loss that maximizes agreement between differently augmented samples in the latent space and reconstruction loss of input representation. The proposed method achieved competitive results on speech emotion recognition and speech recognition. When used as feature extractor, our best model achieved 5.89% word error rate on LibriSpeech test-clean set using LibriSpeech 960 hours as pretraining data and LibriSpeech train-clean-100 set as fine-tuning data, which is the lowest error rate obtained in this setup to the best of our knowledge.

18. Interpretation of NLP models through input marginalization [PDF] Back to contents
  Siwon Kim, Jihun Yi, Eunji Kim, Sungroh Yoon
Abstract: To demystify the "black box" property of deep neural networks for natural language processing (NLP), several methods have been proposed to interpret their predictions by measuring the change in prediction probability after erasing each token of an input. Since existing methods replace each token with a predefined value (i.e., zero), the resulting sentence lies out of the training data distribution, yielding misleading interpretations. In this study, we raise the out-of-distribution problem induced by the existing interpretation methods and present a remedy; we propose to marginalize each token out. We interpret various NLP models trained for sentiment analysis and natural language inference using the proposed method.
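The contrast this abstract draws between erasing a token and marginalizing it out can be made concrete with a toy sketch. Everything in it is an assumption for illustration: `marginalized_prediction`, `toy_predict`, and the hand-written candidate distribution are hypothetical stand-ins, and the abstract does not say how the distribution over replacement tokens is obtained or how the final attribution score is computed.

```python
def marginalized_prediction(predict, candidate_probs, tokens, position):
    """Average the model's prediction over plausible replacements for one token.

    predict:         function mapping a token list to a class probability
    candidate_probs: dict {candidate_token: probability}, assumed to sum to 1
    tokens:          the input sentence as a list of tokens
    position:        index of the token to marginalize out
    """
    total = 0.0
    for candidate, weight in candidate_probs.items():
        replaced = tokens[:position] + [candidate] + tokens[position + 1:]
        total += weight * predict(replaced)
    return total


def toy_predict(tokens):
    """Hypothetical sentiment classifier: probability of the positive class."""
    return 0.9 if "great" in tokens else 0.2


sentence = ["a", "great", "film"]
# Hypothetical distribution over tokens that could fill position 1.
candidates = {"great": 0.5, "bad": 0.5}

marginal = marginalized_prediction(toy_predict, candidates, sentence, 1)
# One simple way to express the token's contribution: the drop in
# prediction probability once the token is marginalized out.
contribution = toy_predict(sentence) - marginal
```

Each replacement here is an ordinary sentence rather than an input with a zeroed-out token, which is how marginalization avoids the out-of-distribution problem the abstract raises.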

19. Predict and Use Latent Patterns for Short-Text Conversation [PDF] Back to contents
  Hung-Ting Chen, Yu-Chieh Chao, Ta-Hsuan Chao, Wei-Yun Ma
Abstract: Many neural network models nowadays have achieved promising performances in Chit-chat settings. The majority of them rely on an encoder for understanding the post and a decoder for generating the response. Without given assigned semantics, the models lack the fine-grained control over responses as the semantic mapping between posts and responses is hidden on the fly within the end-to-end manners. Some previous works utilize sampled latent words as a controllable semantic form to drive the generated response around the work, but few works attempt to use more complex semantic forms to guide the generation. In this paper, we propose to use more detailed semantic forms, including latent responses and part-of-speech sequences sampled from the corresponding distributions, as the controllable semantics to guide the generation. Our experimental results show that the richer semantics are not only able to provide informative and diverse responses, but also increase the overall performance of response quality, including fluency and coherence.

20. Reading Between the Lines: Exploring Infilling in Visual Narratives [PDF] Back to contents
  Khyathi Raghavi Chandu, Ruo-Ping Dong, Alan Black
Abstract: Generating long form narratives such as stories and procedures from multiple modalities has been a long standing dream for artificial intelligence. In this regard, there is often crucial subtext that is derived from the surrounding contexts. The general seq2seq training methods render the models shorthanded while attempting to bridge the gap between these neighbouring contexts. In this paper, we tackle this problem by using infilling techniques involving prediction of missing steps in a narrative while generating textual descriptions from a sequence of images. We also present a new large scale visual procedure telling (ViPT) dataset with a total of 46,200 procedures and around 340k pairwise images and textual descriptions that is rich in such contextual dependencies. Generating steps using infilling technique demonstrates the effectiveness in visual procedures with more coherent texts. We conclusively show a METEOR score of 27.51 on procedures which is higher than the state-of-the-art on visual storytelling. We also demonstrate the effects of interposing new text with missing images during inference. The code and the dataset will be publicly available at this https URL.

21. Improving Limited Labeled Dialogue State Tracking with Self-Supervision [PDF]
  Chien-Sheng Wu, Steven Hoi, Caiming Xiong
Abstract: Existing dialogue state tracking (DST) models require plenty of labeled data. However, collecting high-quality labels is costly, especially when the number of domains increases. In this paper, we address a practical DST problem that is rarely discussed, i.e., learning efficiently with limited labeled data. We present and investigate two self-supervised objectives: preserving latent consistency and modeling conversational behavior. We encourage a DST model to have consistent latent distributions given a perturbed input, making it more robust to an unseen scenario. We also add an auxiliary utterance generation task, modeling a potential correlation between conversational behavior and dialogue states. The experimental results show that our proposed self-supervised signals can improve joint goal accuracy by 8.95% when only 1% labeled data is used on the MultiWOZ dataset. We can achieve an additional 1.76% improvement if some unlabeled data is jointly trained as semi-supervised learning. We analyze and visualize how our proposed self-supervised signals help the DST task and hope to stimulate future data-efficient DST research.
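The abstract above reports its gains in joint goal accuracy. As a minimal sketch of how that metric is commonly computed (a turn counts as correct only if the whole predicted state equals the gold state), assuming each dialogue state is a plain slot-to-value dictionary; the function name and the slot names in the usage note are hypothetical and do not come from the paper:

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted state matches the gold state.

    Each state is assumed to be a dict mapping slot name -> value
    (an illustrative representation, not the paper's data format).
    """
    if len(predicted_states) != len(gold_states):
        raise ValueError("need one predicted state per gold state")
    if not gold_states:
        return 0.0
    # A turn is correct only when every slot-value pair agrees.
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)
```

For example, with two turns where the second prediction gets one slot value wrong (`{"hotel-area": "north", "hotel-stars": "3"}` against a gold value of `"4"`), the function returns 0.5, since partial matches earn no credit.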

22. Probing Task-Oriented Dialogue Representation from Language Models [PDF]
  Chien-Sheng Wu, Caiming Xiong
Abstract: This paper investigates pre-trained language models to find out which model intrinsically carries the most informative representation for task-oriented dialogue tasks. We approach the problem from two aspects: supervised classifier probe and unsupervised mutual information probe. We fine-tune a feed-forward layer as the classifier probe on top of a fixed pre-trained language model with annotated labels in a supervised way. Meanwhile, we propose an unsupervised mutual information probe to evaluate the mutual dependence between a real clustering and a representation clustering. The goals of this empirical paper are to 1) investigate probing techniques, especially from the unsupervised mutual information aspect, 2) provide guidelines of pre-trained language model selection for the dialogue research community, 3) find insights of pre-training factors for dialogue application that may be the key to success.
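The unsupervised probe described above compares a real (annotated) clustering with a clustering of the representations. A rough sketch of the underlying quantity, the mutual information between two label assignments over the same items, is given below. The normalization by the mean entropy is one common choice and is an assumption here; the abstract does not say which variant the authors use, and the function names are illustrative.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mi(labels_a, labels_b):
    """MI divided by the mean entropy of the two clusterings (assumed variant)."""
    mean_entropy = 0.5 * (entropy(labels_a) + entropy(labels_b))
    if mean_entropy == 0.0:
        return 0.0  # both clusterings are constant, so nothing can be shared
    return mutual_information(labels_a, labels_b) / mean_entropy
```

Two clusterings that agree up to a renaming of cluster ids, such as `[0, 0, 1, 1]` and `[1, 1, 0, 0]`, score a normalized value of 1, while statistically independent ones such as `[0, 0, 1, 1]` and `[0, 1, 0, 1]` score 0.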

23. Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer [PDF]
  Suyoun Kim, Yuan Shangguan, Jay Mahadeokar, Antoine Bruguier, Christian Fuegen, Michael L. Seltzer, Duc Le
Abstract: Recurrent Neural Network Transducer (RNN-T), like most end-to-end speech recognition model architectures, has an implicit neural network language model (NNLM) and cannot easily leverage unpaired text data during training. Previous work has proposed various fusion methods to incorporate external NNLMs into end-to-end ASR to address this weakness. In this paper, we propose extensions to these techniques that allow RNN-T to exploit external NNLMs during both training and inference time, resulting in 13-18% relative Word Error Rate improvement on Librispeech compared to strong baselines. Furthermore, our methods do not incur extra algorithmic latency and allow for flexible plug-and-play of different NNLMs without re-training. We also share in-depth analysis to better understand the benefits of the different NNLM fusion methods. Our work provides a reliable technique for leveraging unpaired text data to significantly improve RNN-T while keeping the system streamable, flexible, and lightweight.
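The abstract refers to earlier fusion methods without naming them. One widely used method of that kind is shallow fusion, which adds a weighted external language model score to the recognizer score at inference time. The sketch below illustrates only that baseline idea, not the extensions proposed in the paper; the interpolation weight, function names, and hypothesis scores are made up for illustration.

```python
def shallow_fusion_score(asr_log_prob, lm_log_prob, lm_weight=0.3):
    """Combined score: ASR log-probability plus a weighted external LM log-probability."""
    return asr_log_prob + lm_weight * lm_log_prob


def rerank(hypotheses, lm_weight=0.3):
    """Pick the best hypothesis text.

    `hypotheses` is a list of (text, asr_log_prob, lm_log_prob) tuples;
    the values are assumed to come from some recognizer and some external LM.
    """
    best = max(hypotheses, key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight))
    return best[0]
```

With the toy hypotheses `("recognize speech", -4.0, -2.0)` and `("wreck a nice beach", -3.8, -9.0)`, the recognizer alone (`lm_weight=0.0`) prefers the second string, while a weight of 0.3 lets the language model tip the decision to the first.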

24. Word Frequency Does Not Predict Grammatical Knowledge in Language Models [PDF]
  Charles Yu, Ryan Sie, Nico Tedeschi, Leon Bergen
Abstract: Neural language models learn, to varying degrees of accuracy, the grammatical properties of natural languages. In this work, we investigate whether there are systematic sources of variation in the language models' accuracy. Focusing on subject-verb agreement and reflexive anaphora, we find that certain nouns are systematically understood better than others, an effect which is robust across grammatical tasks and different language models. Surprisingly, we find that across four orders of magnitude, corpus frequency is unrelated to a noun's performance on grammatical tasks. Finally, we find that a novel noun's grammatical properties can be few-shot learned from various types of training data. The results present a paradox: there should be less variation in grammatical performance than is actually observed.

25. Data Troubles in Sentence Level Confidence Estimation for Machine Translation [PDF]
  Ciprian Chelba, Junpei Zhou, Yuezhang, Hideto Kazawa, Jeff Klingner, Mengmeng Niu
Abstract: The paper investigates the feasibility of confidence estimation for neural machine translation models operating at the high end of the performance spectrum. As a side product of the data annotation process necessary for building such models we propose sentence level accuracy SACC as a simple, self-explanatory evaluation metric for quality of translation. Experiments on two different annotator pools, one comprised of non-expert (crowd-sourced) and one of expert (professional) translators show that SACC can vary greatly depending on the translation proficiency of the annotators, despite the fact that both pools are about equally reliable according to Krippendorff's alpha metric; the relatively low values of inter-annotator agreement confirm the expectation that sentence-level binary labeling "good" / "needs work" for translation out of context is very hard. For an English-Spanish translation model operating at SACC = 0.89 according to a non-expert annotator pool we can derive a confidence estimate that labels 0.5-0.6 of the "good" translations in an "in-domain" test set with 0.95 Precision. Switching to an expert annotator pool decreases SACC dramatically: 0.61 for English-Spanish, measured on the exact same data as above. This forces us to lower the CE model operating point to 0.9 Precision while labeling correctly about 0.20-0.25 of the "good" translations in the data. We find surprising the extent to which CE depends on the level of proficiency of the annotator pool used for labeling the data. This leads to an important recommendation we wish to make when tackling CE modeling in practice: it is critical to match the end-user expectation for translation quality in the desired domain with the demands of annotators assigning binary quality labels to CE training data.
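The operating points quoted above pair a precision level with the fraction of "good" translations that the confidence estimator labels correctly. The sketch below shows one way such numbers could be computed from confidence scores and binary annotations. It assumes that SACC is the share of sentences annotated as "good" and that a translation is accepted when its score reaches a threshold; both are readings of the abstract rather than definitions taken from the paper, and all names are illustrative.

```python
def sentence_accuracy(is_good):
    """Share of sentences annotated as "good" (assumed reading of SACC)."""
    return sum(is_good) / len(is_good)


def precision_and_coverage(scores, is_good, threshold):
    """Precision of the accepted set and fraction of "good" translations recovered.

    A translation is accepted when its confidence score is >= threshold.
    Returns (0.0, 0.0) when nothing is accepted.
    """
    accepted = [good for score, good in zip(scores, is_good) if score >= threshold]
    if not accepted:
        return 0.0, 0.0
    true_positives = sum(accepted)
    total_good = sum(is_good)
    precision = true_positives / len(accepted)
    coverage = true_positives / total_good if total_good else 0.0
    return precision, coverage
```

Raising the threshold typically trades coverage for precision, which matches the pattern in the abstract: the 0.95-precision operating point is reported with 0.5-0.6 of the "good" translations labeled under the non-expert annotations, while under the expert annotations the operating point is lowered to 0.9 precision and only about 0.20-0.25 are labeled.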

26. Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [PDF]
  Cheng-I Lai, Yung-Sung Chuang, Hung-Yi Lee, Shang-Wen Li, James Glass
Abstract: Much recent work on Spoken Language Understanding (SLU) is limited in at least one of three ways: models were trained on oracle text input and neglected ASR errors, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. In this paper, we propose a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech to address these issues. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU data. We study two semi-supervised settings for the ASR component: supervised pretraining on transcribed speech, and unsupervised pretraining by replacing the ASR encoder with self-supervised speech representations, such as wav2vec. In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation. Experiments on ATIS show that our SLU framework with speech as input can perform on par with those using oracle text as input in semantics understanding, even though environmental noise is present and a limited amount of labeled semantics data is available for training.

27. PowerTransformer: Unsupervised Controllable Revision for Biased Language Correction [PDF]
  Xinyao Ma, Maarten Sap, Hannah Rashkin, Yejin Choi
Abstract: Unconscious biases continue to be prevalent in modern text and media, calling for algorithms that can assist writers with bias correction. For example, a female character in a story is often portrayed as passive and powerless ("She daydreams about being a doctor") while a man is portrayed as more proactive and powerful ("He pursues his dream of being a doctor"). We formulate *Controllable Debiasing*, a new revision task that aims to rewrite a given text to correct the implicit and potentially undesirable bias in character portrayals. We then introduce PowerTransformer as an approach that debiases text through the lens of connotation frames (Sap et al., 2017), which encode pragmatic knowledge of implied power dynamics with respect to verb predicates. One key challenge of our task is the lack of parallel corpora. To address this challenge, we adopt an unsupervised approach using auxiliary supervision with related tasks such as paraphrasing and self-supervision based on a reconstruction loss, building on pretrained language models. Through comprehensive experiments based on automatic and human evaluations, we demonstrate that our approach outperforms ablations and existing methods from related tasks. Furthermore, we demonstrate the use of PowerTransformer as a step toward mitigating the well-documented gender bias in character portrayal in movie scripts.

28. Is it Great or Terrible? Preserving Sentiment in Neural Machine Translation of Arabic Reviews [PDF]
  Hadeel Saadany, Constantin Orasan
Abstract: Since the advent of Neural Machine Translation (NMT) approaches, there has been a tremendous improvement in the quality of automatic translation. However, NMT output still lacks accuracy in some low-resource languages and sometimes makes major errors that need extensive post-editing. This is particularly noticeable with texts that do not follow common lexico-grammatical standards, such as user generated content (UGC). In this paper we investigate the challenges involved in translating book reviews from Arabic into English, with particular focus on the errors that lead to incorrect translation of sentiment polarity. Our study points to the special characteristics of Arabic UGC, examines the sentiment transfer errors made by Google Translate when translating Arabic UGC into English, analyzes why the problem occurs, and proposes an error typology specific to the translation of Arabic UGC. Our analysis shows that the output of online translation tools for Arabic UGC can either fail to transfer the sentiment at all by producing a neutral target text, or completely flip the sentiment polarity of the target word or phrase and hence deliver a wrong affect message. We address this problem by fine-tuning an NMT model with respect to sentiment polarity, showing that this approach can significantly help with correcting sentiment errors detected in the online translation of Arabic UGC.

29. Dynamic Boundary Time Warping for Sub-sequence Matching with Few Examples [PDF]
  Łukasz Borchmann, Dawid Jurkiewicz, Filip Graliński, Tomasz Górecki
Abstract: The paper presents a novel method of finding a fragment in a long temporal sequence similar to the set of shorter sequences. We are the first to propose an algorithm for such a search that does not rely on computing the average sequence from query examples. Instead, we use query examples as is, utilizing all of them simultaneously. The introduced method based on the Dynamic Time Warping (DTW) technique is suited explicitly for few-shot query-by-example retrieval tasks. We evaluate it on two different few-shot problems from the field of Natural Language Processing. The results show it either outperforms baselines and previous approaches or achieves comparable results when a low number of examples is available.
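The method summarized above builds on Dynamic Time Warping. For orientation, here is a sketch of the classic building block only: subsequence DTW with a single query, which finds the fragment of a long sequence that best matches one short example. It is not the paper's Dynamic Boundary Time Warping, which uses several query examples at once. The absolute-difference local cost and the function name are assumptions made for illustration.

```python
def subsequence_dtw(query, sequence):
    """Find the fragment of `sequence` that best matches `query` under DTW.

    Returns (cost, start, end) with 0-based, inclusive fragment boundaries.
    Uses |x - y| as the local cost between two numeric elements.
    """
    n, m = len(query), len(sequence)
    inf = float("inf")
    # cost[i][j]: best cost of aligning query[:i] with a fragment ending at sequence[j-1]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: index in `sequence` where that best fragment begins
    start = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[0][j] = 0.0  # the match may begin anywhere in the long sequence
        start[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(query[i - 1] - sequence[j - 1])
            # Cheapest predecessor; ties are broken toward the earlier start index.
            prev_cost, prev_start = min(
                (cost[i - 1][j - 1], start[i - 1][j - 1]),
                (cost[i - 1][j], start[i - 1][j]),
                (cost[i][j - 1], start[i][j - 1]),
            )
            cost[i][j] = local + prev_cost
            start[i][j] = prev_start
    # The match may also end anywhere: take the cheapest cell in the last row.
    end = min(range(1, m + 1), key=lambda j: cost[n][j])
    return cost[n][end], start[n][end], end - 1
```

Calling `subsequence_dtw([1, 2, 3], [0, 0, 1, 2, 3, 0, 0])` locates the fragment at positions 2 to 4 with zero cost. A naive way to handle several examples would be to run this once per query and keep the cheapest result, which is a simple baseline rather than what the paper proposes.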

30. Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment [PDF]
  Ethan A. Chi, Julian Salazar, Katrin Kirchhoff
Abstract: Non-autoregressive models greatly improve decoding speed over typical sequence-to-sequence models, but suffer from degraded performance. Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model, but are constrained in the edits that they can make. We propose iterative realignment, where refinements occur over latent alignments rather than output sequence space. We demonstrate this in speech recognition with Align-Refine, an end-to-end Transformer-based model which refines connectionist temporal classification (CTC) alignments to allow length-changing insertions and deletions. Align-Refine outperforms Imputer and Mask-CTC, matching an autoregressive baseline on WSJ at 1/14th the real-time factor and attaining a LibriSpeech test-other WER of 9.0% without an LM. Our model is strong even in one iteration with a shallower decoder.
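To see why refining CTC alignments permits length-changing edits, it helps to recall how an alignment maps to an output: consecutive repeated symbols are merged, then blanks are dropped. The sketch below shows only this standard collapse rule, not the Align-Refine model; the blank symbol `_` and the function name are arbitrary choices for illustration.

```python
from itertools import groupby

BLANK = "_"  # illustrative blank symbol; real systems use a reserved token id


def ctc_collapse(alignment, blank=BLANK):
    """Map a frame-level CTC alignment to its output sequence.

    Consecutive repeats are merged first, then blank symbols are removed.
    """
    return [token for token, _ in groupby(alignment) if token != blank]
```

The alignment `hh_e_ll_lo` collapses to `hello`. Changing a single frame, the blank that separates the two `l` runs, gives `hh_e_llllo`, which collapses to `helo`: an edit that keeps the alignment length fixed has deleted a character from the output.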

31. Co-attentional Transformers for Story-Based Video Understanding [PDF]
  Björn Bebensee, Byoung-Tak Zhang
Abstract: Inspired by recent trends in vision and language learning, we explore applications of attention mechanisms for visio-lingual fusion within an application to story-based video understanding. Like other video-based QA tasks, video story understanding requires agents to grasp complex temporal dependencies. However, as it focuses on the narrative aspect of video it also requires understanding of the interactions between different characters, as well as their actions and their motivations. We propose a novel co-attentional transformer model to better capture long-term dependencies seen in visual stories such as dramas and measure its performance on the video question answering task. We evaluate our approach on the recently introduced DramaQA dataset which features character-centered video story understanding questions. Our model outperforms the baseline model by 8 percentage points overall, at least 4.95 and up to 12.8 percentage points on all difficulty levels and manages to beat the winner of the DramaQA challenge.

32. VisualHints: A Visual-Lingual Environment for Multimodal Reinforcement Learning [PDF]
  Thomas Carta, Subhajit Chaudhury, Kartik Talamadupula, Michiaki Tatsubori
Abstract: We present VisualHints, a novel environment for multimodal reinforcement learning (RL) involving text-based interactions along with visual hints (obtained from the environment). Real-life problems often demand that agents interact with the environment using both natural language information and visual perception towards solving a goal. However, most traditional RL environments either solve pure vision-based tasks like Atari games or video-based robotic manipulation; or entirely use natural language as a mode of interaction, like Text-based games and dialog systems. In this work, we aim to bridge this gap and unify these two approaches in a single environment for multimodal RL. We introduce an extension of the TextWorld cooking environment with the addition of visual clues interspersed throughout the environment. The goal is to force an RL agent to use both text and visual features to predict natural language action commands for solving the final task of cooking a meal. We enable variations and difficulties in our environment to emulate various interactive real-world scenarios. We present a baseline multimodal agent for solving such problems using CNN-based feature extraction from visual hints and LSTMs for textual feature extraction. We believe that our proposed visual-lingual environment will facilitate novel problem settings for the RL community.

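Note: The Chinese abstracts in the original post are machine translations. The cover image is a word cloud of the paper titles.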