
[arXiv papers] Computation and Language 2020-08-11

Contents

1. SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets [PDF] Abstract
2. A Bootstrapped Model to Detect Abuse and Intent in White Supremacist Corpora [PDF] Abstract
3. FireBERT: Hardening BERT-based classifiers against adversarial attack [PDF] Abstract
4. KR-BERT: A Small-Scale Korean-Specific Language Model [PDF] Abstract
5. DQI: A Guide to Benchmark Evaluation [PDF] Abstract
6. A Large-Scale Chinese Short-Text Conversation Dataset [PDF] Abstract
7. Does BERT Solve Commonsense Task via Commonsense Knowledge? [PDF] Abstract
8. Knowledge Distillation and Data Selection for Semi-Supervised Learning in CTC Acoustic Models [PDF] Abstract
9. Question Identification in Arabic Language Using Emotional Based Features [PDF] Abstract
10. Distilling the Knowledge of BERT for Sequence-to-Sequence ASR [PDF] Abstract
11. Fast and Accurate Neural CRF Constituency Parsing [PDF] Abstract
12. Fast Gradient Projection Method for Text Adversary Generation and Adversarial Training [PDF] Abstract
13. Point or Generate Dialogue State Tracker [PDF] Abstract
14. Assessing Demographic Bias in Named Entity Recognition [PDF] Abstract
15. Diversifying Task-oriented Dialogue Response Generation with Prototype Guided Paraphrasing [PDF] Abstract
16. Retrofitting Vector Representations of Adverse Event Reporting Data to Structured Knowledge to Improve Pharmacovigilance Signal Detection [PDF] Abstract
17. Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach [PDF] Abstract
18. Navigating Language Models with Synthetic Agents [PDF] Abstract
19. The Chess Transformer: Mastering Play using Generative Language Models [PDF] Abstract
20. VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [PDF] Abstract
21. SpeedySpeech: Efficient Neural Speech Synthesis [PDF] Abstract
22. Analysing the Effect of Clarifying Questions on Document Ranking in Conversational Search [PDF] Abstract
23. LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [PDF] Abstract
24. Deep F-measure Maximization for End-to-End Speech Understanding [PDF] Abstract
25. Learning to Detect Bipolar Disorder and Borderline Personality Disorder with Language and Speech in Non-Clinical Interviews [PDF] Abstract
26. Word Error Rate Estimation Without ASR Output: e-WER2 [PDF] Abstract
27. A New Approach to Accent Recognition and Conversion for Mandarin Chinese [PDF] Abstract

Abstracts

1. SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets [PDF] Back to contents
  Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, Amitava Das
Abstract: In this paper, we present the results of the SemEval-2020 Task 9 on Sentiment Analysis of Code-Mixed Tweets (SentiMix 2020). We also release and describe our Hinglish (Hindi-English) and Spanglish (Spanish-English) corpora annotated with word-level language identification and sentence-level sentiment labels. These corpora are comprised of 20K and 19K examples, respectively. The sentiment labels are - Positive, Negative, and Neutral. SentiMix attracted 89 submissions in total including 61 teams that participated in the Hinglish contest and 28 submitted systems to the Spanglish competition. The best performance achieved was 75.0% F1 score for Hinglish and 80.6% F1 for Spanglish. We observe that BERT-like models and ensemble methods are the most common and successful approaches among the participants.

2. A Bootstrapped Model to Detect Abuse and Intent in White Supremacist Corpora [PDF] Back to contents
  B. Simons, D.B. Skillicorn
Abstract: Intelligence analysts face a difficult problem: distinguishing extremist rhetoric from potential extremist violence. Many are content to express abuse against some target group, but only a few indicate a willingness to engage in violence. We address this problem by building a predictive model for intent, bootstrapping from a seed set of intent words, and language templates expressing intent. We design both an n-gram and attention-based deep learner for intent and use them as colearners to improve both the basis for prediction and the predictions themselves. They converge to stable predictions in a few rounds. We merge predictions of intent with predictions of abusive language to detect posts that indicate a desire for violent action. We validate the predictions by comparing them to crowd-sourced labelling. The methodology can be applied to other linguistic properties for which a plausible starting point can be defined.

3. FireBERT: Hardening BERT-based classifiers against adversarial attack [PDF] Back to contents
  Gunnar Mein, Kevin Hartman, Andrew Morris
Abstract: We present FireBERT, a set of three proof-of-concept NLP classifiers hardened against TextFooler-style word-perturbation by producing diverse alternatives to original samples. In one approach, we co-tune BERT against the training data and synthetic adversarial samples. In a second approach, we generate the synthetic samples at evaluation time through substitution of words and perturbation of embedding vectors. The diversified evaluation results are then combined by voting. A third approach replaces evaluation-time word substitution with perturbation of embedding vectors. We evaluate FireBERT for MNLI and IMDB Movie Review datasets, in the original and on adversarial examples generated by TextFooler. We also test whether TextFooler is less successful in creating new adversarial samples when manipulating FireBERT, compared to working on unhardened classifiers. We show that it is possible to improve the accuracy of BERT-based models in the face of adversarial attacks without significantly reducing the accuracy for regular benchmark samples. We present co-tuning with a synthetic data generator as a highly effective method to protect against 95% of pre-manufactured adversarial samples while maintaining 98% of original benchmark performance. We also demonstrate evaluation-time perturbation as a promising direction for further research, restoring accuracy up to 75% of benchmark performance for pre-made adversarials, and up to 65% (from a baseline of 75% orig. / 12% attack) under active attack by TextFooler.
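The evaluation-time idea described above (diversify the input, classify every variant, combine by voting) can be sketched roughly as follows. This is only an illustration under stated assumptions: `classify`, the `substitutes` table, `n_samples`, and `swap_prob` are hypothetical stand-ins, not FireBERT's actual components or settings, and the sketch covers word substitution only, not perturbation of embedding vectors.

```python
import random
from collections import Counter


def vote_over_perturbations(text, classify, substitutes,
                            n_samples=8, swap_prob=0.3, seed=0):
    """Classify the original text plus randomly perturbed variants, then vote.

    classify:    hypothetical callable mapping a string to a label.
    substitutes: hypothetical dict mapping a word to a list of replacement words.
    """
    rng = random.Random(seed)
    words = text.split()
    votes = [classify(text)]  # the unperturbed sample always gets one vote
    for _ in range(n_samples):
        variant = [
            rng.choice(substitutes[w])
            if w in substitutes and rng.random() < swap_prob
            else w
            for w in words
        ]
        votes.append(classify(" ".join(variant)))
    # Majority vote over all predictions.
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy keyword classifier, purely for demonstration.
    toy = lambda t: "pos" if ("good" in t or "great" in t) else "neg"
    print(vote_over_perturbations("a good film", toy, {"good": ["great"]}))
```

The intent is that a single adversarial word swap has less influence once the prediction is averaged over several nearby variants of the same input.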

4. KR-BERT: A Small-Scale Korean-Specific Language Model [PDF] Back to contents
  Sangah Lee, Hansol Jang, Yunmee Baik, Hyopil Shin
Abstract: Since the appearance of BERT, recent works including XLNet and RoBERTa utilize sentence embedding models pre-trained by large corpora and a large number of parameters. Because such models have large hardware and a huge amount of data, they take a long time to pre-train. Therefore it is important to attempt to make smaller models that perform comparatively. In this paper, we trained a Korean-specific model KR-BERT, utilizing a smaller vocabulary and dataset. Since Korean is one of the morphologically rich languages with poor resources using non-Latin alphabets, it is also important to capture language-specific linguistic phenomena that the Multilingual BERT model missed. We tested several tokenizers including our BidirectionalWordPiece Tokenizer and adjusted the minimal span of tokens for tokenization ranging from sub-character level to character-level to construct a better vocabulary for our model. With those adjustments, our KR-BERT model performed comparably and even better than other existing pre-trained models using a corpus about 1/10 of the size.

5. DQI: A Guide to Benchmark Evaluation [PDF] Back to contents
  Swaroop Mishra, Anjana Arunkumar, Bhavdeep Sachdeva, Chris Bryan, Chitta Baral
Abstract: A `state of the art' model A surpasses humans in a benchmark B, but fails on similar benchmarks C, D, and E. What does B have that the other benchmarks do not? Recent research provides the answer: spurious bias. However, developing A to solve benchmarks B through E does not guarantee that it will solve future benchmarks. To progress towards a model that `truly learns' an underlying task, we need to quantify the differences between successive benchmarks, as opposed to existing binary and black-box approaches. We propose a novel approach to solve this underexplored task of quantifying benchmark quality by debuting a data quality metric: DQI.

6. A Large-Scale Chinese Short-Text Conversation Dataset [PDF] Back to contents
  Yida Wang, Pei Ke, Yinhe Zheng, Kaili Huang, Yong Jiang, Xiaoyan Zhu, Minlie Huang
Abstract: The advancements of neural dialogue generation models show promising results on modeling short-text conversations. However, training such models usually needs a large-scale high-quality dialogue corpus, which is hard to access. In this paper, we present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built based on a set of rules and a classifier that is trained on 110K manually annotated dialogue pairs. We also release pre-training dialogue models which are trained on LCCC-base and LCCC-large respectively. The cleaned dataset and the pre-training models will facilitate the research of short-text conversation modeling. All the models and datasets are available at this https URL.

7. Does BERT Solve Commonsense Task via Commonsense Knowledge? [PDF] Back to contents
  Leyang Cui, Sijie Cheng, Yu Wu, Yue Zhang
Abstract: The success of pre-trained contextualized language models such as BERT motivates a line of work that investigates linguistic knowledge inside such models in order to explain the huge improvement in downstream tasks. While previous work shows syntactic, semantic and word sense knowledge in BERT, little work has been done on investigating how BERT solves CommonsenseQA tasks. In particular, it is an interesting research question whether BERT relies on shallow syntactic patterns or deeper commonsense knowledge for disambiguation. We propose two attention-based methods to analyze commonsense knowledge inside BERT, and the contribution of such knowledge for the model prediction. We find that attention heads successfully capture the structured commonsense knowledge encoded in ConceptNet, which helps BERT solve commonsense tasks directly. Fine-tuning further makes BERT learn to use the commonsense knowledge on higher layers.

8. Knowledge Distillation and Data Selection for Semi-Supervised Learning in CTC Acoustic Models [PDF] Back to contents
  Prakhar Swarup, Debmalya Chakrabarty, Ashtosh Sapru, Hitesh Tulsiani, Harish Arsikere, Sri Garimella
Abstract: Semi-supervised learning (SSL) is an active area of research which aims to utilize unlabelled data in order to improve the accuracy of speech recognition systems. The current study proposes a methodology for integration of two key ideas: 1) SSL using connectionist temporal classification (CTC) objective and teacher-student based learning 2) Designing effective data-selection mechanisms for leveraging unlabelled data to boost performance of student models. Our aim is to establish the importance of good criteria in selecting samples from a large pool of unlabelled data based on attributes like confidence measure, speaker and content variability. The question we try to answer is: Is it possible to design a data selection mechanism which reduces dependence on a large set of randomly selected unlabelled samples without compromising on Word Error Rate (WER)? We perform empirical investigations of different data selection methods to answer this question and quantify the effect of different sampling strategies. On a semi-supervised ASR setting with 40000 hours of carefully selected unlabelled data, our CTC-SSL approach gives 17% relative WER improvement over a baseline CTC system trained with labelled data. It also achieves on-par performance with CTC-SSL system trained on order of magnitude larger unlabeled data based on random sampling.
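For readers unfamiliar with the metric, the sketch below shows how Word Error Rate and a relative WER improvement such as the 17% quoted above are usually computed. The function names are hypothetical and the numbers in the demo are illustrative, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard Levenshtein dynamic programme, kept to two rows.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the new system reduces the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Illustrative values only: a drop from 10.0% to 8.3% absolute WER.
    print(relative_wer_improvement(0.100, 0.083))
```

Note that a relative improvement is measured against the baseline's own error rate, so 17% relative is a much smaller change in absolute WER points.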

9. Question Identification in Arabic Language Using Emotional Based Features [PDF] Back to contents
  Ahmed Ramzy, Ahmed Elazab
Abstract: With the growth of content on social media networks, enterprises and service providers have become interested in identifying the questions of their customers. Tracking these questions becomes very challenging as the volume of text grows in direct proportion to the number of Arabic users, making manual tracking very difficult. By automatically identifying the questions seeking answers on social media networks and defining their category, we can automatically answer them by finding an existing answer or by routing them to those responsible for answering such questions in customer service. This saves time and effort, enhances customer feedback, and improves the business. In this paper, we have implemented a binary classifier that classifies Arabic text as either a question seeking an answer or not. We have added emotion-based features to the state-of-the-art features. Experimental evaluation showed that these emotional features improve the accuracy of the classifier.

10. Distilling the Knowledge of BERT for Sequence-to-Sequence ASR [PDF] Back to contents
  Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara
Abstract: Attention-based sequence-to-sequence (seq2seq) models have achieved promising results in automatic speech recognition (ASR). However, as these models decode in a left-to-right way, they do not have access to context on the right. We leverage both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation. In our proposed method, BERT generates soft labels to guide the training of seq2seq ASR. Furthermore, we leverage context beyond the current utterance as input to BERT. Experimental evaluations show that our method significantly improves the ASR performance from the seq2seq baseline on the Corpus of Spontaneous Japanese (CSJ). Knowledge distillation from BERT outperforms that from a transformer LM that only looks at left context. We also show the effectiveness of leveraging context beyond the current utterance. Our method outperforms other LM application approaches such as n-best rescoring and shallow fusion, while it does not require extra inference cost.
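Training a student on a teacher's soft labels, as described above, is commonly written as a cross-entropy against the teacher distribution mixed with the usual hard-label loss. The sketch below shows that generic form for a single output position. The interpolation weight `alpha` and all names are assumptions for illustration; the abstract does not state the paper's exact loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def distillation_loss(student_logits, teacher_probs, target_index, alpha=0.5):
    """Mix soft-label and hard-label cross-entropy for one token position.

    teacher_probs: soft labels, e.g. a teacher LM's distribution over the vocabulary.
    alpha:         hypothetical weight on the soft-label term (0 = hard labels only).
    """
    log_probs = log_softmax(student_logits)
    soft_loss = -sum(t * lp for t, lp in zip(teacher_probs, log_probs))
    hard_loss = -log_probs[target_index]
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]   # toy logits over a 3-word vocabulary
    teacher = [0.7, 0.2, 0.1]    # toy soft labels
    print(distillation_loss(student, teacher, target_index=0))
```

Because the teacher is only consulted while computing the training loss, a scheme of this kind adds nothing to decoding, which is consistent with the abstract's remark about no extra inference cost.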

11. Fast and Accurate Neural CRF Constituency Parsing [PDF] Back to contents
  Yu Zhang, Houquan Zhou, Zhenghua Li
Abstract: Estimating probability distribution is one of the core issues in the NLP field. However, in both deep learning (DL) and pre-DL eras, unlike the vast applications of linear-chain CRF in sequence labeling tasks, very few works have applied tree-structure CRF to constituency parsing, mainly due to the complexity and inefficiency of the inside-outside algorithm. This work presents a fast and accurate neural CRF constituency parser. The key idea is to batchify the inside algorithm for loss computation by direct large tensor operations on GPU, and meanwhile avoid the outside algorithm for gradient computation via efficient back-propagation. We also propose a simple two-stage bracketing-then-labeling parsing approach to improve efficiency further. To improve the parsing performance, inspired by recent progress in dependency parsing, we introduce a new scoring architecture based on boundary representation and biaffine attention, and a beneficial dropout strategy. Experiments on PTB, CTB5.1, and CTB7 show that our two-stage CRF parser achieves new state-of-the-art performance on both settings of w/o and w/ BERT, and can parse over 1,000 sentences per second. We release our code at this https URL.
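The inside algorithm mentioned above computes the log-partition function of a tree-structured CRF by summing over all binary bracketings of a sentence. The following is a plain, unbatched sketch over span scores. It does not reproduce the paper's batched GPU formulation, and the `span_scores` dictionary layout is an assumption made for illustration.

```python
import math


def logsumexp(values):
    """Stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def inside_log_partition(span_scores, n):
    """Log-partition over all binary trees spanning words 0..n-1.

    span_scores[(i, j)] is the score of the span covering words i..j-1,
    with 0 <= i < j <= n; missing spans are treated as score 0.
    """
    inside = {}
    for i in range(n):
        inside[(i, i + 1)] = span_scores.get((i, i + 1), 0.0)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            # Sum (in log space) over every split point k of the span (i, j).
            splits = [inside[(i, k)] + inside[(k, j)] for k in range(i + 1, j)]
            inside[(i, j)] = span_scores.get((i, j), 0.0) + logsumexp(splits)
    return inside[(0, n)]


if __name__ == "__main__":
    # With all scores zero, exp(log Z) counts the binary trees over 4 words.
    print(math.exp(inside_log_partition({}, 4)))
```

In an autodiff framework, differentiating this log-partition with respect to the span scores yields the span marginals, which is why back-propagation can stand in for the outside algorithm as the abstract states.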

12. Fast Gradient Projection Method for Text Adversary Generation and Adversarial Training [PDF] Back to contents
  Xiaosen Wang, Yichen Yang, Yihe Deng, Kun He
Abstract: Adversarial training has shown effectiveness and efficiency in improving the robustness of deep neural networks for image classification. For text classification, however, the discrete property of the text input space makes it hard to adapt the gradient-based adversarial methods from the image domain. Existing text attack methods, moreover, are effective but not efficient enough to be incorporated into practical text adversarial training. In this work, we propose a Fast Gradient Projection Method (FGPM) to generate text adversarial examples based on synonym substitution, where each substitution is scored by the product of gradient magnitude and the projected distance between the original word and the candidate word in the gradient direction. Empirical evaluations demonstrate that FGPM achieves similar attack performance and transferability when compared with competitive attack baselines, at the same time it is about 20 times faster than the current fastest text attack method. Such performance enables us to incorporate FGPM with adversarial training as an effective defense method, and scale to large neural networks and datasets. Experiments show that the adversarial training with FGPM (ATF) significantly improves the model robustness, and blocks the transferability of adversarial examples without any decay on the model generalization.
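One reading of the scoring rule in the abstract is sketched below: each candidate synonym is scored by the gradient magnitude times the distance from the original word to the candidate, projected onto the gradient direction. The vectors, the synonym table, and the function names are hypothetical; how the gradient is obtained and how substitutions are iterated are left out.

```python
import math


def fgpm_score(original_vec, candidate_vec, gradient):
    """Gradient magnitude times projected distance along the gradient direction.

    gradient: assumed to be the loss gradient w.r.t. the original word's embedding.
    """
    grad_norm = math.sqrt(sum(g * g for g in gradient))
    if grad_norm == 0.0:
        return 0.0
    displacement = [c - o for c, o in zip(candidate_vec, original_vec)]
    projected = sum(d * g for d, g in zip(displacement, gradient)) / grad_norm
    return grad_norm * projected


def best_substitution(original_vec, synonyms, gradient):
    """Pick the synonym (dict: word -> embedding) with the highest score."""
    return max(synonyms,
               key=lambda w: fgpm_score(original_vec, synonyms[w], gradient))


if __name__ == "__main__":
    original = [0.0, 0.0]
    grad = [1.0, 0.0]
    candidates = {"a": [2.0, 0.0], "b": [0.0, 5.0], "c": [-1.0, 0.0]}
    print(best_substitution(original, candidates, grad))
```

Algebraically the product reduces to a dot product between the displacement and the gradient, so every candidate can be scored from a single gradient computation, which would be in line with the speed-up the abstract reports.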

13. Point or Generate Dialogue State Tracker [PDF] Back to contents
  Song Xiaohui, Hu Songlin
Abstract: Dialogue state tracking is a key part of a task-oriented dialogue system, which estimates the user's goal at each turn of the dialogue. In this paper, we propose the Point-Or-Generate Dialogue State Tracker (POGD). POGD solves the dialogue state tracking task in two perspectives: 1) point out explicitly expressed slot values from the user's utterance, and 2) generate implicitly expressed ones based on slot-specific contexts. It also shares parameters across all slots, which achieves knowledge sharing and gains scalability to large-scale across-domain dialogues. Moreover, the training process of its submodules is formulated as a multi-task learning procedure to further promote its capability of generalization. Experiments show that POGD not only obtains state-of-the-art results on both WoZ 2.0 and MultiWoZ 2.0 datasets but also has good generalization on unseen values and new slots.

14. Assessing Demographic Bias in Named Entity Recognition [PDF] Back to contents
  Shubhanshu Mishra, Sijun He, Luca Belli
Abstract: Named Entity Recognition (NER) is often the first step towards automated Knowledge Base (KB) generation from raw text. In this work, we assess the bias in various NER systems for English across different demographic groups with synthetically generated corpora. Our analysis reveals that models perform better at identifying names from specific demographic groups across two datasets. We also identify that debiased embeddings do not help in resolving this issue. Finally, we observe that character-based contextualized word representation models such as ELMo result in the least bias across demographics. Our work can shed light on potential biases in automated KB generation due to systematic exclusion of named entities belonging to certain demographics.

15. Diversifying Task-oriented Dialogue Response Generation with Prototype Guided Paraphrasing [PDF] Back to contents
  Phillip Lippe, Pengjie Ren, Hinda Haned, Bart Voorn, Maarten de Rijke
Abstract: Existing methods for Dialogue Response Generation (DRG) in Task-oriented Dialogue Systems (TDSs) can be grouped into two categories: template-based and corpus-based. The former prepare a collection of response templates in advance and fill the slots with system actions to produce system responses at runtime. The latter generate system responses token by token by taking system actions into account. While template-based DRG provides high precision and highly predictable responses, they usually lack in terms of generating diverse and natural responses when compared to (neural) corpus-based approaches. Conversely, while corpus-based DRG methods are able to generate natural responses, we cannot guarantee their precision or predictability. Moreover, the diversity of responses produced by today's corpus-based DRG methods is still limited. We propose to combine the merits of template-based and corpus-based DRGs by introducing a prototype-based, paraphrasing neural network, called P2-Net, which aims to enhance quality of the responses in terms of both precision and diversity. Instead of generating a response from scratch, P2-Net generates system responses by paraphrasing template-based responses. To guarantee the precision of responses, P2-Net learns to separate a response into its semantics, context influence, and paraphrasing noise, and to keep the semantics unchanged during paraphrasing. To introduce diversity, P2-Net randomly samples previous conversational utterances as prototypes, from which the model can then extract speaking style information. We conduct extensive experiments on the MultiWOZ dataset with both automatic and human evaluations. The results show that P2-Net achieves a significant improvement in diversity while preserving the semantics of responses.

16. Retrofitting Vector Representations of Adverse Event Reporting Data to Structured Knowledge to Improve Pharmacovigilance Signal Detection [PDF] Back to contents
  Xiruo Ding, Trevor Cohen
Abstract: Adverse drug events (ADE) are prevalent and costly. Clinical trials are constrained in their ability to identify potential ADEs, motivating the development of spontaneous reporting systems for post-market surveillance. Statistical methods provide a convenient way to detect signals from these reports but have limitations in leveraging relationships between drugs and ADEs given their discrete count-based nature. A previously proposed method, aer2vec, generates distributed vector representations of ADE report entities that capture patterns of similarity but cannot utilize lexical knowledge. We address this limitation by retrofitting aer2vec drug embeddings to knowledge from RxNorm and developing a novel retrofitting variant using vector rescaling to preserve magnitude. When evaluated in the context of a pharmacovigilance signal detection task, aer2vec with retrofitting consistently outperforms disproportionality metrics when trained on minimally preprocessed data. Retrofitting with rescaling results in further improvements in the larger and more challenging of two pharmacovigilance reference sets used for evaluation.
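Retrofitting in general pulls each vector toward the vectors of its neighbours in a knowledge graph while keeping it close to its original value. The sketch below uses the standard iterative update (weighted average of the original vector and the current neighbour vectors) and adds an optional rescaling step that restores each vector's original norm. That rescaling is only one plausible interpretation of "vector rescaling to preserve magnitude"; the paper's actual variant is not specified in the abstract, and all names here are hypothetical.

```python
import math


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def retrofit(embeddings, neighbors, iterations=10, alpha=1.0, rescale=False):
    """Iteratively retrofit embeddings (dict: key -> list of floats).

    neighbors: dict mapping a key to the keys it is linked to in the knowledge
               source (e.g. related drugs); keys without links are left as-is.
    rescale:   assumed variant that resets each updated vector to its original norm.
    """
    original = {k: list(v) for k, v in embeddings.items()}
    current = {k: list(v) for k, v in embeddings.items()}
    for _ in range(iterations):
        for key, orig_vec in original.items():
            linked = [n for n in neighbors.get(key, []) if n in current]
            if not linked:
                continue
            beta = 1.0 / len(linked)  # equal weight per neighbour, summing to 1
            updated = [
                (alpha * orig_vec[d] + sum(beta * current[n][d] for n in linked))
                / (alpha + beta * len(linked))
                for d in range(len(orig_vec))
            ]
            if rescale:
                new_norm = norm(updated)
                if new_norm > 0.0:
                    scale = norm(orig_vec) / new_norm
                    updated = [x * scale for x in updated]
            current[key] = updated
    return current


if __name__ == "__main__":
    vecs = {"drug_a": [1.0, 0.0], "drug_b": [0.0, 1.0]}
    links = {"drug_a": ["drug_b"]}
    print(retrofit(vecs, links))
    print(retrofit(vecs, links, rescale=True))
```

Without rescaling, averaging with neighbours tends to shrink vector norms; restoring the norm keeps only the change in direction.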

17. Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach [PDF] Back to contents
  Yahui Liu, Marco De Nadai, Deng Cai, Huayang Li, Xavier Alameda-Pineda, Nicu Sebe, Bruno Lepri
Abstract: Manipulating visual attributes of images through human-written text is a very challenging task. On the one hand, models have to learn the manipulation without the ground truth of the desired output. On the other hand, models have to deal with the inherent ambiguity of natural language. Previous research usually requires either the user to describe all the characteristics of the desired image or to use richly-annotated image captioning datasets. In this work, we propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence such as "change the hair color to black". Contrarily to state-of-the-art approaches, our model does not require a human-annotated dataset nor a textual description of all the attributes of the desired image, but only those that have to be modified. Our proposed model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description, before generating a new image from the content and the modified attribute representation. Because text might be inherently ambiguous (blond hair may refer to different shadows of blond, e.g. golden, icy, sandy), our method generates multiple stochastic versions of the same translation. Experiments show that the proposed model achieves promising performances on two large-scale public datasets: CelebA and CUB. We believe our approach will pave the way to new avenues of research combining textual and speech commands with visual attributes.

18. Navigating Language Models with Synthetic Agents [PDF] Back to contents
  Philip Feldman
Abstract: Modern natural language models such as GPT-2/GPT-3 contain tremendous amounts of information about human belief in a consistently interrogatable form. If these models could be shown to accurately reflect the underlying beliefs of the human beings that produced the data used to train these models, then such models become a powerful sociological tool in ways that are distinct from traditional methods, such as interviews and surveys. In this study, we train a version of GPT-2 on a corpus of historical chess games, and then compare the learned relationships of words in the model to the known ground truth of the chess board, move legality, and historical patterns of play. We find that the percentages of moves by piece using the model are substantially similar to human patterns. We further find that the model creates an accurate latent representation of the chessboard, and that it is possible to plot trajectories of legal moves across the board using this knowledge.

19. The Chess Transformer: Mastering Play using Generative Language Models [PDF] Back to contents
  David Noever, Matt Ciolino, Josh Kalin
Abstract: This work demonstrates that natural language transformers can support more generic strategic modeling, particularly for text-archived games. In addition to learning natural language skills, the abstract transformer architecture can generate meaningful moves on a chess board. With further fine-tuning, the transformer learns complex game play by training on 2.8 million chess games in Portable Game Notation. After 30000 training steps, the large transformer called OpenAI's Generative Pre-trained Transformer (GPT-2) optimizes weights for 774 million parameters. The chess playing transformer achieves acceptable cross-entropy log loss values (0.2-0.7). This fine-tuned Chess Transformer generates plausible strategies and displays game formations identifiable as classic openings, such as English or the Slav Exchange. Finally, in live play, the novel model demonstrates a human-to-transformer interface that correctly filters illegal moves and provides a method to challenge the transformer's chess strategies. We anticipate future work will build on this transformer's promise, particularly in other strategy games where features can capture the underlying complex rule syntax from simple but expressive player annotations.
摘要:这项工作表明,自然语言变压器可以支持更多的通用战略模型,特别是对文本存档的游戏。除了学习自然语言的能力,抽象变压器架构可以生成一个棋盘有意义的举动。随着进一步的微调,变压器学习通过在便携式游戏符号280万棋牌类游戏的训练复杂的游戏。经过30000项的培训措施,对大型变压器称为OpenAI的剖成预先训练变压器(GPT-2)优化了7.74亿参数的权重。的下棋变压器达到可接受的交叉熵日志损耗值(0.2-0.7)。这种微调象棋变压器产生合理的策略,并显示游戏地层识别为经典的开口,如英语或斯拉夫上市。最后,在现场播放,新颖的模型表明一个人对变压器接口,正确地过滤非法移动,并提供了一种方法来挑战变压器的棋策略。我们预计未来的工作将建立在该变压器的承诺,特别是在其他战略游戏中的功能可以从简单的,但表现的球员注释捕捉潜在的复杂规则语法。

20. VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [PDF] Back to contents
  Junchen Lu, Kun Zhou, Berrak Sisman, Haizhou Li
Abstract: Singing voice conversion aims to convert a singer's voice from source to target without changing the singing content. Parallel training data is typically required to train a singing voice conversion system, which is, however, not practical in real-life applications. Recent encoder-decoder structures, such as the variational autoencoding Wasserstein generative adversarial network (VAW-GAN), provide an effective way to learn a mapping from non-parallel training data. In this paper, we propose a singing voice conversion framework that is based on VAW-GAN. We train an encoder to disentangle singer identity and singing prosody (F0 contour) from phonetic content. By conditioning on singer identity and F0, the decoder generates output spectral features with unseen target singer identity, and improves the F0 rendering. Experimental results show that the proposed framework achieves better performance than the baseline frameworks.
摘要:歌声转换旨在从源头转换歌手的声音,目标不改变歌唱的内容。并行训练数据通常所需的歌声转换系统的训练,也就是然而,在现实生活中的应用不实用。最近编码器 - 解码器的结构,如变autoencoding瓦瑟斯坦生成对抗网络(VAW-GAN),通过提供非平行训练数据来学习的映射的有效方式。在本文中,我们提出了基于VAW-GaN歌唱声音的转换框架。我们培养的编码器中挣脱出来的歌手身份,并从语音内容演唱韵律(F0轮廓)。通过在歌手的身份和F0调理,解码器产生与看不见目标歌手身份输出光谱特征,提高了F0渲染。实验结果表明,所提出的框架,实现了比基准框架更好的性能。

21. SpeedySpeech: Efficient Neural Speech Synthesis [PDF] Back to contents
  Jan Vainer, Ondřej Dušek
Abstract: While recent neural sequence-to-sequence models have greatly improved the quality of speech synthesis, there has not been a system capable of fast training, fast inference and high-quality audio synthesis at the same time. We propose a student-teacher network capable of high-quality faster-than-real-time spectrogram synthesis, with low requirements on computational resources and fast training time. We show that self-attention layers are not necessary for generation of high-quality audio. We utilize simple convolutional blocks with residual connections in both student and teacher networks and use only a single attention layer in the teacher model. Coupled with a MelGAN vocoder, our model's voice quality was rated significantly higher than Tacotron 2. Our model can be efficiently trained on a single GPU and can run in real time even on a CPU. We provide both our source code and audio samples in our GitHub repository.
摘要:尽管最近的神经序列到序列模型已经大大提高了语音合成的质量,还没有能够快速培训,快速推理和高品质的音频合成在同一时间的系统。我们提出了一个师生的网络能够高品质快于实时频谱合成,对计算资源和快速的培训时间要求较低。我们发现,自我关注层是没有必要的新一代高品质音频。我们利用与学生和教师网络连接残留简单卷积块,并在老师的模型只使用一个单一的注意力层。再加上MelGAN声码器,我们的模型的语音质量被评为显著高于Tacotron 2.我们的模型可以有效地训练了单GPU,可以在实时即使在CPU上运行。我们在GitHub的仓库同时提供了源代码和音频样本。

22. Analysing the Effect of Clarifying Questions on Document Ranking in Conversational Search [PDF] Back to contents
  Antonios Minas Krasakis, Mohammad Aliannejadi, Nikos Voskarides, Evangelos Kanoulas
Abstract: Recent research on conversational search highlights the importance of mixed-initiative in conversations. To enable mixed-initiative, the system should be able to ask the user clarifying questions. However, the ability of the underlying ranking models (which support conversational search) to account for these clarifying questions and answers when ranking documents has not been analysed at large. To this end, we analyse the performance of a lexical ranking model on a conversational search dataset with clarifying questions. We investigate, both quantitatively and qualitatively, how different aspects of clarifying questions and user answers affect the quality of ranking. We argue that there needs to be some fine-grained treatment of the entire conversational round of clarification, based on the explicit feedback which is present in such mixed-initiative settings. Informed by our findings, we introduce a simple heuristic-based lexical baseline that significantly outperforms the existing naive baselines. Our work aims to enhance our understanding of the challenges present in this particular task and inform the design of more appropriate conversational ranking models.
摘要:对话搜索亮点最近的研究混合行动的对话的重要性。为了使混合举措,该系统应能向澄清的问题给用户。然而,底层的排名模型(支持会话搜索)考虑这些澄清的问题和答案的能力还没有被排名的文档时,在大的分析。为此,我们分析一个词汇排名模型与澄清的问题谈话的搜索数据集的性能。我们调查,定量和定性,澄清的问题和用户回答的不同方面如何影响排名的质量。我们认为,需要有一些细粒度处理整个对话轮澄清的基础上,明确的反馈是存在于这样的混合主动设置。通过我们的研究结果告知,我们介绍一个简单的启发式词汇基线,是显著优于现有的天真基线。我们的工作目标,以提高我们目前在这个特殊的任务所面临的挑战的认识,并告知更合适的对话排名模型的设计。

23. LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [PDF] Back to contents
  Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, Tie-Yan Liu
Abstract: Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR) are important speech tasks, and require a large amount of text and speech pairs for model training. However, there are more than 6,000 languages in the world and most languages lack speech training data, which poses significant challenges when building TTS and ASR systems for extremely low-resource languages. In this paper, we develop LRSpeech, a TTS and ASR system under the extremely low-resource setting, which can support rare languages with low data cost. LRSpeech consists of three key techniques: 1) pre-training on rich-resource languages and fine-tuning on low-resource languages; 2) dual transformation between TTS and ASR to iteratively boost the accuracy of each other; 3) knowledge distillation to customize the TTS model on a high-quality target-speaker voice and improve the ASR model on multiple voices. We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech. Experimental results show that LRSpeech 1) achieves high quality for TTS in terms of both intelligibility (more than 98% intelligibility rate) and naturalness (above 3.5 mean opinion score (MOS)) of the synthesized speech, which satisfies the requirements for industrial deployment, 2) achieves promising recognition accuracy for ASR, and 3) last but not least, uses extremely low-resource training data. We also conduct comprehensive analyses on LRSpeech with different amounts of data resources, and provide valuable insights and guidance for industrial deployment. We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS for more rare languages.
摘要:语音合成(文本到语音,TTS)和识别(自动语音识别,ASR)的重要讲话精神的任务,需要模特培训了大量的文字和语音对。然而,有超过6000种语言在世界上大多数语言缺乏语音训练数据,构建TTS和ASR系统时,极低的资源语言这对显著的挑战。在本文中,我们开发LRSpeech,极低的资源设置,它可以支持低数据成本稀有语种下,TTS和ASR系统。 LRSpeech包括三个关键技术:1)丰富的资源语言和低资源语言微调前的培训; 2)迭代地TTS和ASR之间偶变换相互促进的准确性; 3)知识蒸馏定制的高品质的目标语语音TTS模式和完善的多种声音的ASR模式。我们在试验语言(英语)和一个真正的低资源语言(立陶宛),以验证LRSpeech的有效性进行实验。实验结果表明,LRSpeech 1)实现为TTS高品质在两个可懂的术语(98%以上可懂率)和自然性(超过3.5平均意见得分(MOS)的合成语音的),其满足工业部署的要求, 2)实现承诺的识别精度ASR,和3)最后但并非最不重要的,非常使用低资源培训数据。我们还对LRSpeech进行全面的分析与数据量大小不同的资源,并提供工业部署有价值的见解和指南。目前,我们正在部署LRSpeech成商业化的云语音业务,以支持更多的稀有语种TTS。

24. Deep F-measure Maximization for End-to-End Speech Understanding [PDF] Back to contents
  Leda Sarı, Mark Hasegawa-Johnson
Abstract: Spoken language understanding (SLU) datasets, like many other machine learning datasets, usually suffer from the label imbalance problem. Label imbalance usually causes the learned model to replicate similar biases at the output, which raises the issue of unfairness to the minority classes in the dataset. In this work, we approach the fairness problem by maximizing the F-measure instead of accuracy in neural network model training. We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation. We perform experiments on two standard fairness datasets, Adult, and Communities and Crime, and also on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset. In all four of these tasks, F-measure maximization results in improved micro-F1 scores, with absolute improvements of up to 8%, as compared to models trained with the cross-entropy loss function. In the two multi-class SLU tasks, the proposed approach significantly improves class coverage, i.e., the number of classes with positive recall.
摘要:口语理解(SLU)的数据集,像许多其他的机器学习数据集,一般从标签失衡问题的困扰。标签不平衡通常会导致学习的模型复制类似的偏见,在这增加的不公平问题,以数据集中少数类的输出。在这项工作中,我们通过在神经网络模型的训练最大化F值,而不是精确度接近公平性问题。我们提出了一个微逼近F-措施,这一目标使用标准的反向传播训练网络。我们,也取决于语音到意图的ATIS数据集和语音到图像概念分类上的讲话,COCO数据集检测执行两种标准的公平性数据集,成人和社区和犯罪实验。在这些任务中的所有四个,在改进的微分值F1,具有高达绝对8%的绝对改进,F值最大化的结果相比,用交叉熵损失函数训练的模型。在两个多类SLU任务,所提出的方法显著改善类覆盖范围,即具有正召回类的数量。
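To make the idea of a differentiable F-measure concrete, here is a minimal sketch of one common relaxation: the hard true-positive, false-positive and false-negative counts are replaced by sums over predicted probabilities. This is an illustrative assumption, not necessarily the approximation proposed in the paper. The function name `soft_f1` and the `eps` smoothing term are hypothetical.

```python
def soft_f1(probs, labels, eps=1e-8):
    """Soft F1 for binary labels, using probabilities in place of hard 0/1 decisions."""
    # Expected counts: each prediction contributes fractionally according to its probability.
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    # F1 = 2TP / (2TP + FP + FN); eps avoids division by zero when all counts are zero.
    return 2 * tp / (2 * tp + fp + fn + eps)


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    print(soft_f1([0.8, 0.6, 0.3, 0.1], labels))  # soft predictions
    print(soft_f1([1, 0, 1, 0], labels))          # hard predictions recover the usual F1
```

Because every term is a smooth function of the probabilities, `1 - soft_f1(...)` could serve as a training loss in an automatic-differentiation framework; with hard 0/1 predictions the expression reduces to the ordinary F1 score.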

25. Learning to Detect Bipolar Disorder and Borderline Personality Disorder with Language and Speech in Non-Clinical Interviews [PDF] Back to contents
  Bo Wang, Yue Wu, Niall Taylor, Terry Lyons, Maria Liakata, Alejo J Nevado-Holgado, Kate E A Saunders
Abstract: Bipolar disorder (BD) and borderline personality disorder (BPD) are both chronic psychiatric disorders. However, their overlapping symptoms and common comorbidity make it challenging for clinicians to distinguish the two conditions on the basis of a clinical interview. In this work, we first present a new multi-modal dataset containing interviews in which individuals with BD or BPD are interviewed about a non-clinical topic. We investigate the automatic detection of the two conditions, and demonstrate that a good linear classifier can be learnt using a down-selected set of features from the different aspects of the interviews and a novel approach of summarising these features. Finally, we find that different sets of features characterise BD and BPD, thus providing insights into the difference between the automatic screening of the two conditions.
摘要:双相情感障碍(BD)和边缘型人格障碍(BPD)都是慢性精神疾病。然而,他们的重叠症状,常见的合并症,使其具有挑战性的临床医师区分临床访谈的基础上,这两个条件。在这项工作中,我们首先提出采访关于非临床话题包含涉及与BD或BPD个人面试一个新的多模态数据集。我们调查的两个条件自动检测,并表现出良好的线性分类,可以利用下组选定从访谈的不同方面,总结这些特性的一种新方法的特点来学习。最后,我们发现,不同的功能集表征BD和BPD,从而提供见解的两个条件自动筛选之间的差异。

26. Word Error Rate Estimation Without ASR Output: e-WER2 [PDF] Back to contents
  Ahmed Ali, Steve Renals
Abstract: Measuring the performance of automatic speech recognition (ASR) systems requires manually transcribed data in order to compute the word error rate (WER), which is often time-consuming and expensive. In this paper, we continue our effort in estimating WER using acoustic, lexical and phonotactic features. Our novel approach to estimating the WER uses a multistream end-to-end architecture. We report results for systems using internal speech decoder features (glass-box), systems without speech decoder features (black-box), and for systems without access to the ASR system (no-box). The no-box system learns a joint acoustic-lexical representation from phoneme recognition results along with MFCC acoustic features to estimate WER. Considering WER per sentence, our no-box system achieves 0.56 Pearson correlation with the reference evaluation and 0.24 root mean square error (RMSE) across 1,400 sentences. The overall WER estimated by e-WER2 is 30.9% for a three-hour test set, while the WER computed using the reference transcriptions was 28.5%.
摘要:测量自动语音识别(ASR)系统的性能,以便计算字差错率(WER),这通常是耗时且昂贵的需要手动转录数据。在本文中,我们将继续我们在使用声波,词汇和音位功能估计WER努力。我们估计WER新颖方法使用多流的端至端的体系结构。我们报告使用内部语音解码器功能(玻璃盒),系统没有语音解码器功能(黑盒),系统和系统的结果,而无需访问ASR系统(无盒)。无盒系统学习从音素识别结果与MFCC声学特征来估计WER沿联合声词汇表示。考虑到每一句WER,我们没有现成系统达到0.56跨越1400句基准估计和0.24均方根误差(RMSE)Pearson相关。通过e-WER2估计总体WER是用于三小时的测试组30.9%,而WER使用参考转录物28.5%计算的。
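For context, the reference WER that e-WER2 tries to estimate is conventionally computed from a manual transcript as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The sketch below shows that standard computation only, not the estimation model from the paper. The function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (mat) over 5 reference words.
    print(word_error_rate("the cat sat on mat", "the cat sit on"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason per-sentence WER estimates are usually evaluated with correlation and RMSE, as in the abstract.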

27. A New Approach to Accent Recognition and Conversion for Mandarin Chinese [PDF] Back to contents
  Lin Ai, Shih-Ying Jeng, Homayoon Beigi
Abstract: Two new approaches to accent classification and conversion are presented and explored, respectively. The first topic is Chinese accent classification/recognition. The second topic is the use of encoder-decoder models for end-to-end Chinese accent conversion, where the classifier in the first topic is used for the training of the accent converter encoder-decoder model. Experiments using different features and models are performed for accent recognition. These features include MFCCs and spectrograms. The classifier models were TDNN and 1D-CNN. On the MAGICDATA dataset with 5 classes of accents, the TDNN classifier trained on MFCC features achieved a test accuracy of 54% and a test F1 score of 0.54, while the 1D-CNN classifier trained on spectrograms achieved a test accuracy of 62% and a test F1 score of 0.62. A prototype of an end-to-end accent converter model is also presented. The converter model comprises an encoder and a decoder. The encoder model converts an accented input into an accent-neutral form. The decoder model converts an accent-neutral form to an accented form with the specified accent assigned by the input accent label. The converter prototype preserves the tone and forgoes the details in the output audio. An encoder-decoder structure demonstrates the potential of being an effective accent converter. A proposal for future improvements is also presented to address the issue of lost details in the decoder output.
摘要:两个新的方法来调分类和转换介绍和探讨,分别。第一个主题是中国口音的分类/识别。第二个话题是终端到终端的中国口音的转换,其中第一个主题中的分类器用于口音转换器编码器,解码器模型的训练使用编码器,解码器模块。使用不同的功能和模型实验的口音识别进行。这些功能包括的MFCC和频谱。该分类模型是TDNN和1D-CNN。用5类修饰的MAGICDATA数据集,训练上MFCC的TDNN分类特征实现54%的测试的准确性和0.54的测试F1得分而上训练声谱图的1D-CNN分类器实现62%的测试的准确性和测试F1得分0.62。的端至端口音转换器模型的原型还提出。该转换器模型包括编码器和解码器的。编码器模型重音输入转换为重音中性形式。解码器模型中的重音中性形式转换为重音形式与由输入重音标签分配指定的口音。该转换器原型保留的语气和foregoes在输出音频中的细节。的编码器 - 解码器结构表明了作为一个有效口音转换器的电势。一种用于未来改进的建议也被提出来解决在解码器输出细节丢失的问题。

Note: The Chinese text is machine-translated. The cover image is a word cloud of paper titles.