
[arXiv Papers] Computation and Language 2020-07-14

Contents

1. An Enhanced Text Classification to Explore Health based Indian Government Policy Tweets [PDF] Abstract
2. HSD Shared Task in VLSP Campaign 2019: Hate Speech Detection for Social Good [PDF] Abstract
3. GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines [PDF] Abstract
4. A Feature Analysis for Multimodal News Retrieval [PDF] Abstract
5. A Label Attention Model for ICD Coding from Clinical Text [PDF] Abstract
6. Transformer with Depth-Wise LSTM [PDF] Abstract
7. Generating Fluent Adversarial Examples for Natural Languages [PDF] Abstract
8. Do You Have the Right Scissors? Tailoring Pre-trained Language Models via Monte-Carlo Methods [PDF] Abstract
9. Neural disambiguation of lemma and part of speech in morphologically rich languages [PDF] Abstract
10. Stance Detection in Web and Social Media: A Comparative Study [PDF] Abstract
11. HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections [PDF] Abstract
12. Is Machine Learning Speaking my Language? A Critical Look at the NLP-Pipeline Across 8 Human Languages [PDF] Abstract
13. I3rab: A New Arabic Dependency Treebank Based on Arabic Grammatical Theory [PDF] Abstract
14. Feature Selection on Noisy Twitter Short Text Messages for Language Identification [PDF] Abstract
15. Deep or Simple Models for Semantic Tagging? It Depends on your Data [Experiments] [PDF] Abstract
16. GloVeInit at SemEval-2020 Task 1: Using GloVe Vector Initialization for Unsupervised Lexical Semantic Change Detection [PDF] Abstract
17. Multi-Dialect Arabic BERT for Country-Level Dialect Identification [PDF] Abstract
18. Class LM and word mapping for contextual biasing in End-to-End ASR [PDF] Abstract
19. Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention [PDF] Abstract
20. Learning Reasoning Strategies in End-to-End Differentiable Proving [PDF] Abstract
21. Paranoid Transformer: Reading Narrative of Madness as Computational Approach to Creativity [PDF] Abstract
22. RNA-2QCFA: Evolving Two-way Quantum Finite Automata with Classical States for RNA Secondary Structures [PDF] Abstract
23. A theory of interaction semantics [PDF] Abstract
24. ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing [PDF] Abstract
25. Fine-grained Language Identification with Multilingual CapsNet Model [PDF] Abstract
26. Sparse Graph to Sequence Learning for Vision Conditioned Long Textual Sequence Generation [PDF] Abstract
27. TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [PDF] Abstract
28. Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation [PDF] Abstract
29. The ASRU 2019 Mandarin-English Code-Switching Speech Recognition Challenge: Open Datasets, Tracks, Methods and Results [PDF] Abstract

Abstracts

1. An Enhanced Text Classification to Explore Health based Indian Government Policy Tweets [PDF] Back to Contents
  Aarzoo Dhiman, Durga Toshniwal
Abstract: Government-sponsored policy-making and scheme generation are among the means of protecting and promoting the social, economic, and personal development of citizens. The government's evaluation of the effectiveness of these schemes provides only statistical information in terms of facts and figures, which does not capture in-depth knowledge of public perceptions, experiences, and views on the topic. In this research work, we propose an improved text classification framework that classifies Twitter data about different health-based government schemes. The proposed framework leverages the language representation models (LR models) BERT, ELMo, and USE. However, these LR models have limited real-time applicability due to the scarcity of ample annotated data. To handle this, we propose a novel text augmentation approach based on GloVe word embeddings and class-specific sentiments (named Mod-EDA), which boosts the performance of the text classification task by increasing the size of the labeled data. Furthermore, the trained model is leveraged to identify the level of engagement of citizens towards these policies in different communities such as middle-income and low-income groups.
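
The abstract does not give implementation details for Mod-EDA, so as a rough sketch of the embedding half of the idea, the following plain-Python snippet replaces random tokens with GloVe nearest neighbours; `glove` (a word-to-vector dict) and `vocab` are assumed inputs, and Mod-EDA's class-specific sentiment filtering is omitted.

    # Illustrative sketch only, not the paper's exact Mod-EDA.
    import random
    import numpy as np

    def nearest_neighbors(word, glove, vocab, k=5):
        """Words whose GloVe vectors are closest to `word` by cosine similarity."""
        v = glove[word] / np.linalg.norm(glove[word])
        sims = {w: float(np.dot(v, glove[w]) / np.linalg.norm(glove[w]))
                for w in vocab if w != word and w in glove}
        return sorted(sims, key=sims.get, reverse=True)[:k]

    def augment(tokens, glove, vocab, p=0.1):
        """Randomly swap a fraction p of tokens for a GloVe nearest neighbour."""
        return [random.choice(nearest_neighbors(t, glove, vocab))
                if t in glove and random.random() < p else t
                for t in tokens]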

2. HSD Shared Task in VLSP Campaign 2019: Hate Speech Detection for Social Good [PDF] Back to Contents
  Xuan-Son Vu, Thanh Vu, Mai-Vu Tran, Thanh Le-Cong, Huyen T M. Nguyen
Abstract: The paper describes the organisation of the "HateSpeech Detection" (HSD) task at the VLSP workshop 2019 on detecting the fine-grained presence of hate speech in Vietnamese textual items (i.e., messages) extracted from Facebook, which is the most popular social network site (SNS) in Vietnam. The task is organised as a multi-class classification task based on a large-scale dataset containing 25,431 Vietnamese textual items from Facebook. The task participants were challenged to build a classification model capable of classifying an item into one of 3 classes, i.e., "HATE", "OFFENSIVE" and "CLEAN". HSD attracted a large number of participants and was a popular task at VLSP 2019. In particular, 71 teams signed up for the task, and 14 of them submitted results, with 380 valid submissions from 20th September 2019 to 4th October 2019.

3. GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines [PDF] Back to Contents
  Florian Borchert, Christina Lohr, Luise Modersohn, Thomas Langer, Markus Follmann, Jan Philipp Sachs, Udo Hahn, Matthieu-P. Schapranow
Abstract: The lack of publicly available text corpora is a major obstacle for progress in clinical natural language processing, for non-English speaking countries in particular. In this work, we present GGPONC (German Guideline Program in Oncology NLP Corpus), a freely distributable German language corpus based on clinical practice guidelines in the field of oncology. The corpus is one of the largest corpora of German medical text to date. It does not contain any patient-related data and can therefore be used without data protection restrictions. Moreover, it is the first corpus for the German language covering diverse conditions in a large medical subfield. In addition to the textual sources, we provide a large variety of metadata, such as literature references and evidence levels. By applying and evaluating existing medical information extraction pipelines for German text, we are able to draw comparisons for the use of medical language to other medical text corpora.

4. A Feature Analysis for Multimodal News Retrieval [PDF] Back to Contents
  Golsa Tahmasebzadeh, Sherzod Hakimov, Eric Müller-Budack, Ralph Ewerth
Abstract: Content-based information retrieval is based on the information contained in documents rather than on metadata such as keywords. Most information retrieval methods are based on either text or images. In this paper, we investigate the usefulness of multimodal features for cross-lingual news search in various domains: politics, health, environment, sport, and finance. To this end, we consider five feature types for image and text and compare the performance of the retrieval system using different combinations. Experimental results show that retrieval results can be improved when both visual and textual information are considered. In addition, we observe that among textual features, entity overlap outperforms word embeddings, while geolocation embeddings achieve better performance among visual features in the retrieval task.

5. A Label Attention Model for ICD Coding from Clinical Text [PDF] Back to Contents
  Thanh Vu, Dat Quoc Nguyen, Anthony Nguyen
Abstract: ICD coding is the process of assigning International Classification of Disease diagnosis codes to clinical/medical notes documented by health professionals (e.g. clinicians). This process requires significant human resources, and is thus costly and prone to error. To handle the problem, machine learning has been utilized for automatic ICD coding. Previous state-of-the-art models were based on convolutional neural networks, using a single or several fixed window sizes. However, the lengths of and interdependence between text fragments related to ICD codes in clinical text vary significantly, making it difficult to decide what the best window sizes are. In this paper, we propose a new label attention model for automatic ICD coding, which can handle both the varying lengths and the interdependence of ICD code-related text fragments. Furthermore, as the majority of ICD codes are not frequently used, leading to an extremely imbalanced data issue, we additionally propose a hierarchical joint learning mechanism that extends our label attention model to handle this issue, using the hierarchical relationships among the codes. Our label attention model achieves new state-of-the-art results on three benchmark MIMIC datasets, and the joint learning mechanism helps improve the performance on infrequent codes.
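
For intuition, here is a minimal PyTorch sketch of a per-label attention layer in the spirit of the abstract: each ICD code gets its own attention distribution over token representations, so each code can attend to text fragments of whatever length is relevant. Dimensions and layer choices are assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class LabelAttention(nn.Module):
        """One attention distribution per label over the token representations."""
        def __init__(self, hidden_dim, num_labels):
            super().__init__()
            self.queries = nn.Linear(hidden_dim, num_labels, bias=False)  # one query per ICD code
            self.scorer = nn.Linear(hidden_dim, 1)                        # per-label logit

        def forward(self, h):                                  # h: (batch, seq_len, hidden_dim)
            attn = torch.softmax(self.queries(h), dim=1)       # normalise over tokens
            label_repr = torch.einsum('bsl,bsh->blh', attn, h) # label-specific text summaries
            return self.scorer(label_repr).squeeze(-1)         # (batch, num_labels) logits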

6. Transformer with Depth-Wise LSTM [PDF] Back to Contents
  Hongfei Xu, Qiuhui Liu, Deyi Xiong, Josef van Genabith
Abstract: Increasing the depth of models allows neural models to model complicated functions but may also lead to optimization issues. The Transformer translation model employs residual connections to ensure its convergence. In this paper, we suggest that residual connections have their drawbacks, and propose to train Transformers with a depth-wise LSTM which regards the outputs of layers as steps in a time series instead of using residual connections. The motivation is that the vanishing gradient problem suffered by deep networks is the same as that of recurrent networks applied to long sequences, while the LSTM (Hochreiter and Schmidhuber, 1997) has proven capable of capturing long-distance relationships, and its design may alleviate some drawbacks of residual connections while ensuring convergence. We integrate the computation of multi-head attention networks and feed-forward networks with the depth-wise LSTM for the Transformer, showing how to utilize the depth-wise LSTM like a residual connection. Our experiments with the 6-layer Transformer show that our approach can bring about significant BLEU improvements on both the WMT 14 English-German and English-French tasks, and our deep Transformer experiments demonstrate the effectiveness of the depth-wise LSTM for the convergence of deep Transformers. Additionally, we propose to measure the impact of a layer's non-linearity on performance by distilling the analyzed layer of the trained model into a linear transformation and observing the performance degradation caused by the replacement. Our analysis results support the more efficient use of per-layer non-linearity with the depth-wise LSTM than with residual connections.
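
A minimal PyTorch sketch of the core idea, under the assumption that the depth axis is fed to an LSTM cell as if it were time; the actual model integrates this with the attention and feed-forward sub-layers rather than wrapping them generically as here.

    import torch
    import torch.nn as nn

    class DepthWiseLSTMStack(nn.Module):
        """Sketch: layer outputs become successive LSTM time steps,
        replacing residual connections across the depth of the stack."""
        def __init__(self, d_model, num_layers, make_layer):
            super().__init__()
            self.layers = nn.ModuleList([make_layer() for _ in range(num_layers)])
            self.cell = nn.LSTMCell(d_model, d_model)

        def forward(self, x):                  # x: (num_tokens, d_model), tokens flattened
            h = torch.zeros_like(x)
            c = torch.zeros_like(x)
            for layer in self.layers:
                out = layer(x)                 # sub-layer computation (attention/FFN in the paper)
                h, c = self.cell(out, (h, c))  # depth treated as the LSTM's time axis
                x = h                          # no residual addition: next layer reads the LSTM state
            return x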

7. Generating Fluent Adversarial Examples for Natural Languages [PDF] Back to Contents
  Huangzhao Zhang, Hao Zhou, Ning Miao, Lei Li
Abstract: Efficiently building an adversarial attacker for natural language processing (NLP) tasks is a real challenge. Firstly, as the sentence space is discrete, it is difficult to make small perturbations along the direction of gradients. Secondly, the fluency of the generated examples cannot be guaranteed. In this paper, we propose MHA, which addresses both problems by performing Metropolis-Hastings sampling, whose proposal is designed with the guidance of gradients. Experiments on IMDB and SNLI show that our proposed MHA outperforms the baseline model on attacking capability. Adversarial training with MHA also leads to better robustness and performance.
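
The paper's contribution is the gradient-guided proposal; the generic Metropolis-Hastings accept/reject skeleton it plugs into looks like the sketch below, where `propose` and `score` are assumed placeholders (e.g. a target combining LM fluency with the attack objective).

    import random

    def metropolis_hastings(x, propose, score, steps=100):
        """Generic MH loop. `propose(x)` returns (x_new, q_fwd, q_bwd): the proposal
        plus forward/backward proposal probabilities; `score(x)` returns an
        unnormalised target density for sentence x."""
        for _ in range(steps):
            x_new, q_fwd, q_bwd = propose(x)
            # accept with prob min(1, target(x') q(x|x') / (target(x) q(x'|x)))
            alpha = min(1.0, (score(x_new) * q_bwd) / max(score(x) * q_fwd, 1e-12))
            if random.random() < alpha:
                x = x_new
        return x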

8. Do You Have the Right Scissors? Tailoring Pre-trained Language Models via Monte-Carlo Methods [PDF] Back to Contents
  Ning Miao, Yuxuan Song, Hao Zhou, Lei Li
Abstract: It has been a common approach to pre-train a language model on a large corpus and fine-tune it on task-specific data. In practice, we observe that fine-tuning a pre-trained model on a small dataset may lead to over- and/or under-estimation problems. In this paper, we propose MC-Tailor, a novel method to alleviate the above issue in text generation tasks by truncating and transferring the probability mass from over-estimated regions to under-estimated ones. Experiments on a variety of text generation datasets show that MC-Tailor consistently and significantly outperforms the fine-tuning approach. Our code is available at this url.
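
As a toy illustration of the truncate-and-transfer idea over a discrete distribution (not the paper's MC-Tailor, whose estimator and sampling scheme are more involved): zero out mass where the model over-estimates relative to the data, then renormalise, which implicitly re-allocates the removed mass to under-estimated outcomes.

    import numpy as np

    def tailor(p_model, density_ratio, threshold=1.0):
        """p_model: model probabilities over outcomes; density_ratio: estimates
        of p_data / p_model per outcome (assumed given). Truncate over-estimated
        outcomes and renormalise."""
        kept = np.where(density_ratio >= threshold, p_model, 0.0)
        return kept / kept.sum()   # removed mass is transferred to the rest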

9. Neural disambiguation of lemma and part of speech in morphologically rich languages [PDF] Back to Contents
  José María Hoya Quecedo, Maximilian W. Koppatz, Giacomo Furlan, Roman Yangarber
Abstract: We consider the problem of disambiguating the lemma and part of speech of ambiguous words in morphologically rich languages. We propose a method for disambiguating ambiguous words in context, using a large un-annotated corpus of text, and a morphological analyser -- with no manual disambiguation or data annotation. We assume that the morphological analyser produces multiple analyses for ambiguous words. The idea is to train recurrent neural networks on the output that the morphological analyser produces for unambiguous words. We present performance on POS and lemma disambiguation that reaches or surpasses the state of the art -- including supervised models -- using no manually annotated data. We evaluate the method on several morphologically rich languages.
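
The key trick in the abstract, training only on words the analyser already resolves unambiguously, can be sketched as a data-construction step; `analyse` here is an assumed stand-in for the morphological analyser.

    def build_silver_data(corpus, analyse):
        """Collect (token, analysis) pairs for tokens with exactly one analysis;
        the RNN trained on these is later applied to the ambiguous tokens."""
        data = []
        for sentence in corpus:
            for token in sentence:
                analyses = analyse(token)   # list of candidate (lemma, pos) pairs
                if len(analyses) == 1:      # unambiguous: usable as a silver label
                    data.append((token, analyses[0]))
        return data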

10. Stance Detection in Web and Social Media: A Comparative Study [PDF] Back to Contents
  Shalmoli Ghosh, Prajwal Singhania, Siddharth Singh, Koustav Rudra, Saptarshi Ghosh
Abstract: Online forums and social media platforms are increasingly being used to discuss topics of varying polarities where different people take different stances. Several methodologies for automatic stance detection from text have been proposed in the literature. To our knowledge, there has not been any systematic investigation of their reproducibility and their comparative performance. In this work, we explore the reproducibility of several existing stance detection models, including both neural models and classical classifier-based models. Through experiments on two datasets -- (i) the popular SemEval microblog dataset, and (ii) a set of health-related online news articles -- we also perform a detailed comparative analysis of various methods and explore their shortcomings. Implementations of all algorithms discussed in this paper are available at this https URL.

11. HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections [PDF] Back to Contents
  Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, Da-Cheng Juan
Abstract: Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well on all the tasks has been a challenging and yet attractive proposition. In this paper, we propose HyperGrid, a new approach for highly effective multi-task learning. The proposed approach is based on a decomposable hypernetwork that learns grid-wise projections which help to specialize regions in weight matrices for different tasks. In order to construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We apply our proposed HyperGrid to the current state-of-the-art T5 model, demonstrating strong performance across the GLUE and SuperGLUE benchmarks when using only a single multi-task model. Our method helps bridge the gap between fine-tuning and multi-task learning approaches.
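
A minimal sketch of the grid-wise gating idea, with assumed shapes: the outer product of a global vector and a task-specific vector forms a coarse grid that is upsampled to gate blocks of a shared weight matrix. This illustrates the mechanism only; the paper's decomposition and composition details differ.

    import torch
    import torch.nn as nn

    class HyperGridLinear(nn.Module):
        """Sketch: a task-conditioned gate grid specializes blocks of one shared
        weight matrix. Block count and gating function are assumptions."""
        def __init__(self, d_in, d_out, num_tasks, blocks=8):
            super().__init__()
            assert d_in % blocks == 0 and d_out % blocks == 0
            self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
            self.global_vec = nn.Parameter(torch.randn(blocks))  # shared across tasks
            self.local_vecs = nn.Embedding(num_tasks, blocks)    # one vector per task
            self.d_in, self.d_out, self.blocks = d_in, d_out, blocks

        def forward(self, x, task_id):                           # task_id: scalar LongTensor
            local = self.local_vecs(task_id)                     # (blocks,)
            grid = torch.sigmoid(torch.outer(self.global_vec, local))   # (blocks, blocks)
            gate = grid.repeat_interleave(self.d_out // self.blocks, dim=0)
            gate = gate.repeat_interleave(self.d_in // self.blocks, dim=1)
            return x @ (self.weight * gate).t()                  # gated linear projection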

12. Is Machine Learning Speaking my Language? A Critical Look at the NLP-Pipeline Across 8 Human Languages [PDF] Back to Contents
  Esma Wali, Yan Chen, Christopher Mahoney, Thomas Middleton, Marzieh Babaeianjelodar, Mariama Njie, Jeanna Neefe Matthews
Abstract: Natural Language Processing (NLP) is increasingly used as a key ingredient in critical decision-making systems such as resume parsers used in sorting a list of job candidates. NLP systems often ingest large corpora of human text, attempting to learn from past human behavior and decisions in order to produce systems that will make recommendations about our future world. Over 7000 human languages are being spoken today and the typical NLP pipeline underrepresents speakers of most of them while amplifying the voices of speakers of other languages. In this paper, a team including speakers of 8 languages - English, Chinese, Urdu, Farsi, Arabic, French, Spanish, and Wolof - takes a critical look at the typical NLP pipeline and how even when a language is technically supported, substantial caveats remain to prevent full participation. Despite huge and admirable investments in multilingual support in many tools and resources, we are still making NLP-guided decisions that systematically and dramatically underrepresent the voices of much of the world.

13. I3rab: A New Arabic Dependency Treebank Based on Arabic Grammatical Theory [PDF] Back to Contents
  Dana Halabi, Ebaa Fayyoumi, Arafat Awajan
Abstract: Treebanks are valuable linguistic resources that include the syntactic structure of a language's sentences in addition to POS-tags and morphological features. They are mainly utilized in modeling statistical parsers. Although statistical natural language parsers have recently become more accurate for languages such as English, those for the Arabic language still have low accuracy. The purpose of this paper is to construct a new Arabic dependency treebank based on traditional Arabic grammatical theory and the characteristics of the Arabic language, and to investigate its effects on the accuracy of statistical parsers. The proposed Arabic dependency treebank, called I3rab, contrasts with existing Arabic dependency treebanks in two main concepts. The first concept is the approach to determining the main word of the sentence, and the second concept is the representation of joined and covert pronouns. To evaluate I3rab, we compared its performance against a subset of the Prague Arabic Dependency Treebank that shares a comparable level of detail. The conducted experiments show that the percentage improvement reached up to 7.5% in UAS and 18.8% in LAS.

14. Feature Selection on Noisy Twitter Short Text Messages for Language Identification [PDF] Back to Contents
  Mohd Zeeshan Ansari, Tanvir Ahmad, Ana Fatima
Abstract: The task of written language identification typically involves detecting the languages present in a sample of text. Moreover, a sequence of text may not belong to a single inherent language but may be a mixture of text written in multiple languages. This kind of text is generated in large volumes on social media platforms due to their flexible and user-friendly environments. Such text contains a very large number of features, which are essential for the development of statistical, probabilistic as well as other kinds of language models. This large feature set includes informative as well as irrelevant and redundant features, which have diverse effects on the performance of the learning model. Therefore, feature selection methods are significant in choosing the features that are most relevant for an efficient model. In this article, we consider the Hindi-English language identification task, as Hindi and English are two of the most widely spoken languages of India. We apply different feature selection algorithms across various learning algorithms in order to analyze the effect of the algorithm as well as the number of features on the performance of the task. The methodology focuses on word-level language identification using a novel dataset of 6903 tweets extracted from Twitter. Various n-gram profiles are examined with different feature selection algorithms over many classifiers. Finally, an exhaustive comparative analysis is put forward with respect to the overall experiments conducted for the task.
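
As an illustration of the overall recipe (n-gram features, a feature selector, a classifier), a scikit-learn pipeline for word-level language ID might look like the sketch below; the paper's exact feature sets, selection algorithms, and classifiers vary across its experiments.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # Illustrative pipeline; feature and model choices here are assumptions.
    pipeline = Pipeline([
        ("ngrams", CountVectorizer(analyzer="char", ngram_range=(1, 3))),  # n-gram profiles
        ("select", SelectKBest(chi2, k=5000)),       # keep the most relevant features
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # pipeline.fit(tokens, labels)  # tokens: words from tweets; labels: per-word language tags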

15. Deep or Simple Models for Semantic Tagging? It Depends on your Data [Experiments] [PDF] Back to Contents
  Jinfeng Li, Yuliang Li, Xiaolan Wang, Wang-Chiew Tan
Abstract: Semantic tagging, which has extensive applications in text mining, predicts whether a given piece of text conveys the meaning of a given semantic tag. The problem of semantic tagging is largely solved with supervised learning, and today deep learning models are widely perceived to be better for semantic tagging. However, there is no comprehensive study supporting this popular belief. Practitioners often have to train different types of models for each semantic tagging task to identify the best model. This process is both expensive and inefficient. We embark on a systematic study to investigate the following question: Are deep models the best-performing models for all semantic tagging tasks? To answer this question, we compare deep models against "simple models" over datasets with varying characteristics. Specifically, we select three prevalent deep models (i.e. CNN, LSTM, and BERT) and two simple models (i.e. LR and SVM), and compare their performance on the semantic tagging task over 21 datasets. Results show that the size, the label ratio, and the label cleanliness of a dataset significantly impact the quality of semantic tagging. Simple models achieve similar tagging quality to deep models on large datasets, but the runtime of simple models is much shorter. Moreover, simple models can achieve better tagging quality than deep models when the targeted datasets show worse label cleanliness and/or more severe imbalance. Based on these findings, our study can systematically guide practitioners in selecting the right learning model for their semantic tagging task.

16. GloVeInit at SemEval-2020 Task 1: Using GloVe Vector Initialization for Unsupervised Lexical Semantic Change Detection [PDF] Back to Contents
  Vaibhav Jain
Abstract: This paper presents a vector initialization approach for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. Given two corpora belonging to different time periods and a set of target words, this task requires us to classify whether a word gained or lost a sense over time (subtask 1) and to rank the words on the basis of the changes in their word senses (subtask 2). The proposed approach is based on using the vector initialization method to align GloVe embeddings. The idea is to consecutively train GloVe embeddings for both corpora, using the first model to initialize the second one. This paper is based on the hypothesis that GloVe embeddings are more suited to the vector initialization method than SGNS embeddings. It presents intuitive reasoning behind this hypothesis and also discusses the impact of various factors and hyperparameters on the performance of the proposed approach. Our model ranks 13th and 10th among 33 teams in the two subtasks. The implementation has been shared publicly.
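
Because the second embedding space is initialized from the first, the two spaces come out aligned and change can be scored directly by cosine distance. A sketch of the ranking step (subtask 2), assuming `emb_t1` and `emb_t2` map words to numpy vectors:

    import numpy as np

    def semantic_change_scores(emb_t1, emb_t2, targets):
        """Cosine distance between a target word's vectors from the two time
        periods; a larger distance suggests more semantic change."""
        scores = {}
        for w in targets:
            v1, v2 = emb_t1[w], emb_t2[w]
            cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
            scores[w] = 1.0 - cos
        return scores   # rank targets by descending score for subtask 2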

17. Multi-Dialect Arabic BERT for Country-Level Dialect Identification [PDF] Back to Contents
  Bashar Talafha, Mohammad Ali, Muhy Eddin Za'ter, Haitham Seelawi, Ibraheem Tuffaha, Mostafa Samir, Wael Farhan, Hussein T. Al-Natsheh
Abstract: Arabic dialect identification is a complex problem for a number of inherent properties of the language itself. In this paper, we present the experiments conducted, and the models developed by our competing team, Mawdoo3 AI, along the way to achieving our winning solution to subtask 1 of the Nuanced Arabic Dialect Identification (NADI) shared task. The dialect identification subtask provides 21,000 country-level labeled tweets covering all 21 Arab countries. An unlabeled corpus of 10M tweets from the same domain is also presented by the competition organizers for optional use. Our winning solution itself came in the form of an ensemble of different training iterations of our pre-trained BERT model, which achieved a micro-averaged F1-score of 26.78% on the subtask at hand. We publicly release the pre-trained language model component of our winning solution under the name of Multi-dialect-Arabic-BERT model, for any interested researcher out there.
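
The winning system is described as an ensemble over different training iterations of the pre-trained BERT model; one common way to realize such an ensemble is to average predicted class probabilities across checkpoints, as in this sketch (`predict_proba` is an assumed helper, and the abstract does not specify the exact combination rule).

    import numpy as np

    def ensemble_predict(checkpoints, predict_proba, tweets):
        """Average class probabilities across model checkpoints, then argmax."""
        probs = np.mean([predict_proba(m, tweets) for m in checkpoints], axis=0)
        return probs.argmax(axis=1)   # index of one of the 21 country-level dialects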

18. Class LM and word mapping for contextual biasing in End-to-End ASR [PDF] Back to Contents
  Rongqing Huang, Ossama Abdel-hamid, Xinwei Li, Gunnar Evermann
Abstract: In recent years, all-neural, end-to-end (E2E) ASR systems have gained rapid interest in the speech recognition community. They convert speech input to text units in a single trainable neural network model. In ASR, many utterances contain rich named entities. Such named entities may be user- or location-specific, and they are not seen during training. A single model makes it inflexible to utilize dynamic contextual information during inference. In this paper, we propose to train a context-aware E2E model and allow the beam search to traverse into the context FST during inference. We also propose a simple method to adjust the cost discrepancy between the context FST and the base model. This algorithm is able to reduce the named entity utterance WER by 57% with little accuracy degradation on regular utterances. Although an E2E model does not need a pronunciation dictionary, it is interesting to make use of existing pronunciation knowledge to improve accuracy. In this paper, we propose an algorithm to map rare entity words to common words via pronunciation and treat the mapped words as an alternative form of the original word during recognition. This algorithm further reduces the WER on the named entity utterances by another 31%.

19. Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention [PDF] Back to Contents
  Emir Demirel, Sven Ahlback, Simon Dixon
Abstract: Speech recognition is a well-developed research field, and current state-of-the-art systems are used in many applications in the software industry; yet as of today, there still does not exist a comparably robust system for the recognition of words and sentences from singing voice. This paper proposes a complete pipeline for this task, commonly referred to as automatic lyrics transcription (ALT). We have trained convolutional time-delay neural networks with self-attention on monophonic karaoke recordings, using a sequence classification objective for building the acoustic model. The dataset used in this study, DAMP - Sing! 300x30x2 [1], is filtered to contain songs with only English lyrics. Different language models are tested, including MaxEnt and recurrent neural network based methods trained on the lyrics of pop songs in English. An in-depth analysis of the self-attention mechanism is conducted while tuning its context width and the number of attention heads. Using the best settings, our system achieves notable improvement over the state of the art in ALT and provides a new baseline for the task.

20. Learning Reasoning Strategies in End-to-End Differentiable Proving [PDF] Back to Contents
  Pasquale Minervini, Sebastian Riedel, Pontus Stenetorp, Edward Grefenstette, Tim Rocktäschel
Abstract: Attempts to render deep learning models interpretable, data-efficient, and robust have seen some success through hybridisation with rule-based systems, for example, in Neural Theorem Provers (NTPs). These neuro-symbolic models can induce interpretable rules and learn representations from data via back-propagation, while providing logical explanations for their predictions. However, they are restricted by their computational complexity, as they need to consider all possible proof paths for explaining a goal, thus rendering them unfit for large-scale applications. We present Conditional Theorem Provers (CTPs), an extension to NTPs that learns an optimal rule selection strategy via gradient-based optimisation. We show that CTPs are scalable and yield state-of-the-art results on the CLUTRR dataset, which tests systematic generalisation of neural models by learning to reason over smaller graphs and evaluating on larger ones. Finally, CTPs show better link prediction results on standard benchmarks in comparison with other neural-symbolic models, while being explainable. All source code and datasets are available online, at this https URL.

21. Paranoid Transformer: Reading Narrative of Madness as Computational Approach to Creativity [PDF] Back to Contents
  Yana Agafonova, Alexey Tikhonov, Ivan P. Yamshchikov
Abstract: This paper revisits the receptive theory in the context of computational creativity. It presents a case study of a Paranoid Transformer - a fully autonomous text generation engine with raw output that could be read as the narrative of a mad digital persona without any additional human post-filtering. We describe technical details of the generative system, provide examples of output and discuss the impact of receptive theory, chance discovery and simulation of a fringe mental state on the understanding of computational creativity.

22. RNA-2QCFA: Evolving Two-way Quantum Finite Automata with Classical States for RNA Secondary Structures [PDF] Back to Contents
  Amandeep Singh Bhatia, Shenggen Zheng
Abstract: Recently, the use of mathematical methods and computer science applications has received a significant response among biochemists and biologists for modeling biological systems. Computational and mathematical methods have enormous potential for modeling deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) structures. The modeling of DNA and RNA secondary structures using automata theory has had a significant impact in the field of computer science. It is a natural goal to model RNA secondary biomolecular structures using quantum computational models. Two-way quantum finite automata with classical states are more dominant than two-way probabilistic finite automata in language recognition. The main objective of this paper is to use two-way quantum finite automata with classical states to simulate, model and analyze ribonucleic acid (RNA) sequences.

23. A theory of interaction semantics [PDF] Back to Contents
  Johannes Reich
Abstract: The aim of this article is to delineate a theory of interaction semantics and thereby provide a proper understanding of the "meaning" of the exchanged characters within an interaction. The idea is to describe the interaction (between discrete systems) by a mechanism that depends on information exchange, that is, on the identical naming of the "exchanged" characters -- by a protocol. Complementing a nondeterministic protocol with decisions yields a game in its interactive form (GIF), making it interpretable in the sense of an execution. The consistency of such a protocol depends on the particular choice of its sets of characters. Thus, assigning a protocol its sets of characters makes it consistent or not, creating a fulfillment relation. The interpretation of the characters during GIF execution results in their meaning. The proposed theory of interaction semantics is consistent with the model of information transport and processing, it has a clear relation to models of formal semantics, it accounts for the fact that the meaning of a character is invariant against renaming, and it locates the concept of meaning in the technical description of interactions. It defines when two different characters have the same meaning, what an "interpretation" and an "interpretation context" are, and under which conditions meaning is compositional.

24. ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing [PDF] Back to Contents
  Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Debsindhu Bhowmik, Burkhard Rost
Abstract: Motivation: NLP continues improving substantially through auto-regressive and auto-encoding Language Models. These LMs require expensive computing resources for self-supervised or unsupervised learning from huge unlabelled text corpora. The information learned is transferred through so-called embeddings to downstream prediction tasks. Bioinformatics provides vast gold mines of structured and sequentially ordered text data, leading to extraordinarily successful protein sequence LMs that promise new frontiers for generative and predictive tasks at low inference cost. Here, we addressed two questions: (1) To which extent can HPC up-scale protein LMs to larger databases and larger models? (2) To which extent can LMs extract features from single proteins to get closer to the performance of methods using evolutionary information? Methodology: Here, we trained two auto-regressive language models (Transformer-XL and XLNet) and two auto-encoder models (BERT and Albert) using 80 billion amino acids from 200 million protein sequences (UniRef100) and 393 billion amino acids from 2.1 billion protein sequences (BFD). The LMs were trained on the Summit supercomputer, using 5616 GPUs and one TPU Pod with V3-512 cores. Results: The results of training these LMs on proteins were assessed by predicting secondary structure in three and eight states (Q3=75-83, Q8=63-72), localization for 10 cellular compartments (Q10=74), and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM embeddings from unlabelled data (only protein sequences) captured important biophysical properties of the protein alphabet, namely the amino acids, and their well-orchestrated interplay in governing the shape of proteins. In the analogy of NLP, this implies having learned some of the grammar of the language of life realized in protein sequences.

25. Fine-grained Language Identification with Multilingual CapsNet Model [PDF] Back to Contents
  Mudit Verma, Arun Balaji Buduru
Abstract: Due to a drastic improvement in the quality of internet services worldwide, there is an explosion of multilingual content generation and consumption. This is especially prevalent in countries with large multilingual audiences, who are increasingly consuming media outside their linguistic familiarity/preference. Hence, there is an increasing need for real-time and fine-grained content analysis services, including language identification, content transcription, and analysis. Accurate and fine-grained spoken language detection is an essential first step for all the subsequent content analysis algorithms. Current techniques in spoken language detection may fall short on one of these fronts: accuracy, fine-grained detection, data requirements, or manual effort in data collection and pre-processing. Hence in this work, a real-time language detection approach that detects spoken language from 5-second audio clips with an accuracy of 91.8% is presented, with exiguous data requirements and minimal pre-processing. A novel Capsule Network architecture is proposed which operates on spectrogram images of the provided audio snippets. We present results alongside previous approaches based on Recurrent Neural Networks and iVectors. Finally we show a "Non-Class" analysis to further stress why the CapsNet architecture works for the LID task.
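
A sketch of the input preprocessing implied by the abstract, turning a 5-second clip into a spectrogram image for the CapsNet; the sample rate and mel settings here are assumptions, not the paper's values.

    import librosa
    import numpy as np

    def spectrogram_image(path, sr=16000, duration=5.0):
        """Load a 5-second clip and return a log-mel spectrogram 'image'."""
        y, _ = librosa.load(path, sr=sr, duration=duration)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
        return librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, n_frames)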

26. Sparse Graph to Sequence Learning for Vision Conditioned Long Textual Sequence Generation [PDF] Back to Contents
  Aditya Mogadala, Marius Mosbach, Dietrich Klakow
Abstract: Generating longer textual sequences conditioned on visual information is an interesting problem to explore. The challenge goes beyond standard vision-conditioned sentence-level generation (e.g., image or video captioning), as it requires producing a brief and coherent story describing the visual content. In this paper, we cast this Vision-to-Sequence task as a Graph-to-Sequence learning problem and approach it with the Transformer architecture. To be specific, we introduce the Sparse Graph-to-Sequence Transformer (SGST) for encoding the graph and decoding a sequence. The encoder aims to directly encode graph-level semantics, while the decoder is used to generate longer sequences. Experiments conducted with the benchmark image paragraph dataset show that our proposed approach achieves a 13.3% improvement on the CIDEr evaluation measure when compared to the previous state-of-the-art approach.

27. TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [PDF] Back to Contents
  Andy T. Liu, Shang-Wen Li, Hung-yi Lee
Abstract: We introduce a self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration. Recent approaches often learn through the formulation of a single auxiliary task like contrastive prediction, autoregressive prediction, or masked reconstruction. Unlike previous approaches, we use a multi-target auxiliary task to pre-train Transformer Encoders on a large amount of unlabeled speech. The model learns through the reconstruction of acoustic frames from its altered counterpart, where we use a stochastic policy to alter along three dimensions: temporal, channel, and magnitude. TERA can be used to extract speech representations or fine-tune with downstream models. We evaluate TERA on several downstream tasks, including phoneme classification, speaker recognition, and speech recognition. TERA achieved strong performance on these tasks by improving upon surface features and outperforming previous methods. In our experiments, we show that through alteration along different dimensions, the model learns to encode distinct aspects of speech. We explore different knowledge transfer methods to incorporate the pre-trained model with downstream models. Furthermore, we show that the proposed method can be easily transferred to another dataset not used in pre-training.
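
A small sketch of the three-way stochastic alteration described in the abstract (temporal, channel, and magnitude), applied to a matrix of acoustic frames; the span sizes and noise scale are illustrative assumptions, and the model is trained to reconstruct the original frames from the altered input.

    import torch

    def alter(frames, time_span=7, channel_span=8, noise_std=0.2):
        """frames: (seq_len, feat_dim) acoustic features. Returns an altered copy."""
        x = frames.clone()
        seq_len, feat_dim = x.shape
        assert seq_len > time_span and feat_dim > channel_span
        # temporal alteration: zero a random contiguous span of frames
        t0 = torch.randint(0, seq_len - time_span, (1,)).item()
        x[t0:t0 + time_span] = 0.0
        # channel alteration: zero a random block of feature channels
        c0 = torch.randint(0, feat_dim - channel_span, (1,)).item()
        x[:, c0:c0 + channel_span] = 0.0
        # magnitude alteration: add Gaussian noise everywhere
        return x + noise_std * torch.randn_like(x)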

28. Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation [PDF] Back to Contents
  Yuxuan Song, Ning Miao, Hao Zhou, Lantao Yu, Mingxuan Wang, Lei Li
Abstract: Auto-regressive sequence generative models trained by Maximum Likelihood Estimation suffer the exposure bias problem in practical finite sample scenarios. The crux is that the number of training samples for Maximum Likelihood Estimation is usually limited and the input data distributions are different at training and inference stages. Many methods have been proposed to solve the above problem (Yu et al., 2017; Lu et al., 2018), which rely on sampling from the non-stationary model distribution and suffer from high variance or biased estimations. In this paper, we propose ψ-MLE, a new training scheme for auto-regressive sequence generative models, which is effective and stable when operating at the large sample spaces encountered in text generation. We derive our algorithm from a new perspective of self-augmentation and introduce bias correction with density ratio estimation. Extensive experimental results on synthetic data and real-world text generation tasks demonstrate that our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
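
The abstract leans on density ratio estimation; a standard way to estimate the ratio p_data(x)/p_model(x) is the classifier-based trick sketched below, where a discriminator is trained to separate real samples from model samples (this is the general technique, not necessarily the paper's exact estimator).

    import torch

    def density_ratio(discriminator, x):
        """If D outputs high logits on data samples and low logits on model
        samples, then sigmoid(D(x)) / (1 - sigmoid(D(x))) estimates
        p_data(x) / p_model(x), usable as a correction weight."""
        d = torch.sigmoid(discriminator(x))
        return d / (1.0 - d).clamp_min(1e-6)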

29. The ASRU 2019 Mandarin-English Code-Switching Speech Recognition Challenge: Open Datasets, Tracks, Methods and Results [PDF] Back to Contents
  Xian Shi, Qiangze Feng, Lei Xie
Abstract: Code-switching (CS) is a common phenomenon, and recognizing CS speech is challenging. But CS speech data is scarce and there is no common testbed in relevant research. This paper describes the design and main outcomes of the ASRU 2019 Mandarin-English code-switching speech recognition challenge, which aims to improve ASR performance in Mandarin-English code-switching situations. 500 hours of Mandarin speech data and 240 hours of Mandarin-English intra-sentential CS data were released to the participants. Three tracks were set up for advancing the AM and LM parts of the traditional DNN-HMM ASR system, as well as for exploring the performance of E2E models. The paper then presents an overview of the results and system performance in the three tracks. It turns out that the traditional ASR system benefits from pronunciation lexicons, CS text generation, and data augmentation. In the E2E track, however, the results highlight the importance of using language identification, building up a rational set of modeling units, and SpecAugment. Other details of model training and method comparison are discussed.
