[arXiv papers] Computation and Language 2020-03-23

Contents

1. FedNER: Medical Named Entity Recognition with Federated Learning [PDF] Abstract
2. Language Technology Programme for Icelandic 2019-2023 [PDF] Abstract
3. Parallel Intent and Slot Prediction using MLB Fusion [PDF] Abstract
4. TNT-KID: Transformer-based Neural Tagger for Keyword Identification [PDF] Abstract
5. NSURL-2019 Task 7: Named Entity Recognition (NER) in Farsi [PDF] Abstract
6. Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems [PDF] Abstract
7. Learning to Encode Position for Transformer with Continuous Dynamical Model [PDF] Abstract
8. Detecting Mismatch between Text Script and Voice-over Using Utterance Verification Based on Phoneme Recognition Ranking [PDF] Abstract
9. Automatic Identification of Types of Alterations in Historical Manuscripts [PDF] Abstract
10. The value of text for small business default prediction: A deep learning approach [PDF] Abstract

Abstracts

1. FedNER: Medical Named Entity Recognition with Federated Learning [PDF] Back to contents
  Suyu Ge, Fangzhao Wu, Chuhan Wu, Tao Qi, Yongfeng Huang, Xing Xie
Abstract: Medical named entity recognition (NER) has wide applications in intelligent healthcare. Sufficient labeled data is critical for training accurate medical NER models. However, the labeled data in a single medical platform is usually limited. Although labeled datasets may exist in many different medical platforms, they cannot be directly shared since medical data is highly privacy-sensitive. In this paper, we propose a privacy-preserving medical NER method based on federated learning, which can leverage the labeled data in different platforms to boost the training of medical NER models and remove the need to exchange raw data among platforms. Since the labeled data in different platforms usually differs in entity types and annotation criteria, instead of constraining different platforms to share the same model, we decompose the medical NER model in each platform into a shared module and a private module. The private module captures the characteristics of the local data in each platform and is updated using local labeled data. The shared module is learned across different medical platforms to capture the shared NER knowledge. Its local gradients from different platforms are aggregated to update the global shared module, which is then delivered to each platform to update its local shared module. Experiments on three publicly available datasets validate the effectiveness of our method.
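
The shared/private split described in this abstract can be illustrated with a minimal sketch. It assumes plain gradient averaging and a fixed learning rate; the abstract only says the gradients are "aggregated", so the averaging rule, the names `average_gradients` and `federated_round`, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Vector = List[float]


def average_gradients(local_grads: List[Vector]) -> Vector:
    """Element-wise mean of the shared-module gradients sent by each platform."""
    n = len(local_grads)
    return [sum(values) / n for values in zip(*local_grads)]


def federated_round(shared: Vector, local_grads: List[Vector], lr: float = 0.1) -> Vector:
    """One server-side step: aggregate gradients, then update the global shared module."""
    avg = average_gradients(local_grads)
    return [w - lr * g for w, g in zip(shared, avg)]


# Hypothetical toy setup: two platforms and a 3-parameter shared module.
global_shared: Vector = [0.0, 0.0, 0.0]
platform_grads: List[Vector] = [
    [1.0, 2.0, -1.0],  # shared-module gradient computed on platform A
    [3.0, 0.0, 1.0],   # shared-module gradient computed on platform B
]

# Private modules stay on their own platform and are never sent to the server.
private_modules: Dict[str, Vector] = {"A": [0.5], "B": [-0.2]}

# The updated shared module would then be sent back to every platform.
global_shared = federated_round(global_shared, platform_grads, lr=0.1)
```

Only gradients of the shared module leave a platform in this sketch; raw data and private parameters never do, which is the property the abstract emphasizes.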

2. Language Technology Programme for Icelandic 2019-2023 [PDF] Back to contents
  Anna Björk Nikulásdóttir, Jón Guðnason, Anton Karl Ingason, Hrafn Loftsson, Eiríkur Rögnvaldsson, Einar Freyr Sigurðsson, Steinþór Steingrímsson
Abstract: In this paper, we describe a new national language technology programme for Icelandic. The programme, which spans a period of five years, aims at making Icelandic usable in communication and interactions in the digital world by developing accessible, open-source language resources and software. The research and development work within the programme is carried out by a consortium of universities, institutions, and private companies, with a strong emphasis on cooperation between academia and industry. Five core projects will be the main content of the programme: language resources, speech recognition, speech synthesis, machine translation, and spell and grammar checking. We also describe other national language technology programmes and give an overview of the history of language technology in Iceland.

3. Parallel Intent and Slot Prediction using MLB Fusion [PDF] Back to contents
  Anmol Bhasin, Bharatram Natarajan, Gaurav Mathur, Himanshu Mangla
Abstract: Intent and Slot Identification are two important tasks in Spoken Language Understanding (SLU). For a natural language utterance, there is a high correlation between these two tasks. A lot of work has been done on each of them using Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and attention-based models. Most past work used two separate models for intent and slot prediction. Some of it also used sequence-to-sequence models in which slots are predicted after evaluating the utterance-level intent. In this work, we propose a parallel Intent and Slot Prediction technique in which separate Bidirectional Gated Recurrent Units (GRU) are used for each task. We posit the usage of MLB (Multimodal Low-rank Bilinear Attention Network) fusion for improving the performance of intent and slot learning. To the best of our knowledge, this is the first attempt at using such a technique on text-based problems. Also, our proposed methods outperform the existing state-of-the-art results for both intent and slot prediction on two benchmark datasets.

4. TNT-KID: Transformer-based Neural Tagger for Keyword Identification [PDF] Back to contents
  Matej Martinc, Blaž Škrlj, Senja Pollak
Abstract: With growing amounts of available textual data, the development of algorithms capable of automatic analysis, categorization, and summarization of these data has become a necessity. In this research we present a novel algorithm for keyword identification, i.e., the extraction of single- or multi-word phrases representing key aspects of a given document, called Transformer-based Neural Tagger for Keyword IDentification (TNT-KID). By adapting the transformer architecture to the specific task at hand and leveraging language model pretraining on a small domain-specific corpus, the model is capable of overcoming deficiencies of both supervised and unsupervised state-of-the-art approaches to keyword extraction: it offers competitive and robust performance on a variety of different datasets while requiring only a fraction of the manually labeled data required by the best-performing systems. This study also offers a thorough error analysis with valuable insights into the inner workings of the model, and an ablation study measuring the influence of specific components of the keyword identification workflow on the overall performance.

5. NSURL-2019 Task 7: Named Entity Recognition (NER) in Farsi [PDF] Back to contents
  Nasrin Taghizadeh, Zeinab Borhanifard, Melika GolestaniPour, Heshaam Faili
Abstract: NSURL-2019 Task 7 focuses on Named Entity Recognition (NER) in Farsi. This task was chosen to compare different approaches to finding phrases that specify Named Entities in Farsi texts, and to establish a standard testbed for future research on this task in Farsi. This paper describes the process of creating the training and test data, lists the participating teams (6 teams), and reports the evaluation results of their systems. The best system obtained an F1 score of 85.4% based on phrase-level evaluation on seven classes of NEs: person, organization, location, date, time, money, and percent.
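
A phrase-level F1 score of the kind reported here can be sketched as follows. The sketch assumes that a predicted phrase counts as correct only when its boundaries and class both match a gold phrase exactly; the abstract does not spell out the matching rule, so that criterion, the `Span` representation, and the example spans are assumptions for illustration.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token, entity class)


def phrase_level_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); a phrase counts only on an exact match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one sentence.
gold = {(0, 1, "person"), (4, 4, "location"), (7, 8, "date")}
pred = {(0, 1, "person"), (4, 4, "organization"), (7, 8, "date"), (10, 10, "money")}

# Two exact matches: precision 2/4, recall 2/3, F1 4/7.
precision, recall, f1 = phrase_level_f1(gold, pred)
```

Note that the span `(4, 4, "organization")` earns no credit even though its boundaries are right, which is what separates phrase-level scoring from token-level scoring.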

6. Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems [PDF] Back to contents
  Nikolay Malkovsky, Vladimir Bataev, Dmitrii Sviridkin, Natalia Kizhaeva, Aleksandr Laptev, Ildar Valiev, Oleg Petrov
Abstract: The problem of out-of-vocabulary (OOV) words is typical for any speech recognition system: hybrid systems are usually constructed to recognize a fixed set of words and can rarely include all the words that will be encountered during use of the system. One popular approach to covering OOVs is to use subword units rather than words. Such a system can potentially recognize any previously unseen word if the word can be constructed from the available subword units, but it can also recognize non-existent words. The other popular approach is to modify the HMM part of the system so that it can be easily and effectively expanded with a custom set of words we want to add to the system. In this paper we explore different existing methods for this solution at both the graph construction and the search method levels. We also present novel vocabulary expansion techniques which solve some common internal subroutine problems regarding recognition graph processing.

7. Learning to Encode Position for Transformer with Continuous Dynamical Model [PDF] Back to contents
  Xuanqing Liu, Hsiang-Fu Yu, Inderjit Dhillon, Cho-Jui Hsieh
Abstract: We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models. Unlike RNN and LSTM, which contain inductive bias by loading the input tokens sequentially, non-recurrent models are less sensitive to position. The main reason is that position information among input units is not inherently encoded, i.e., the models are permutation equivalent; this is why all of the existing models are accompanied by a sinusoidal encoding/embedding layer at the input. However, this solution has clear limitations: the sinusoidal encoding is not flexible enough, as it is manually designed and does not contain any learnable parameters, whereas the position embedding restricts the maximum length of input sequences. It is thus desirable to design a new position layer that contains learnable parameters to adjust to different datasets and different architectures. At the same time, we would also like the encodings to extrapolate in accordance with the variable length of inputs. In our proposed solution, we borrow from the recent Neural ODE approach, which may be viewed as a versatile continuous version of a ResNet. This model is capable of modeling many kinds of dynamical systems. We model the evolution of encoded results along the position index by such a dynamical system, thereby overcoming the above limitations of existing methods. We evaluate our new position layers on a variety of neural machine translation and language understanding tasks; the experimental results show consistent improvements over the baselines.
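
For reference, the fixed sinusoidal encoding that this abstract contrasts itself with can be sketched as below. This is the standard formulation used with the original Transformer, not the Neural ODE layer proposed in the paper; the function name `sinusoidal_encoding` and the tiny `d_model` are chosen only for illustration.

```python
import math
from typing import List


def sinusoidal_encoding(position: int, d_model: int) -> List[float]:
    """Fixed encoding: even dimensions use sine, odd dimensions use cosine."""
    encoding = []
    for dim in range(d_model):
        pair_index = dim // 2  # dimensions (2i, 2i + 1) share one frequency
        angle = position / (10000 ** (2 * pair_index / d_model))
        encoding.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return encoding


# Encodings for the first three positions of a toy 4-dimensional model.
table = [sinusoidal_encoding(pos, d_model=4) for pos in range(3)]
```

The function has no trainable parameters, which is the inflexibility the abstract points out; on the other hand it is defined for any position, unlike a learned embedding table with a fixed maximum length.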

8. Detecting Mismatch between Text Script and Voice-over Using Utterance Verification Based on Phoneme Recognition Ranking [PDF] Back to contents
  Yoonjae Jeong, Hoon-Young Cho
Abstract: The purpose of this study is to detect mismatches between a text script and its voice-over. For this, we present a novel utterance verification (UV) method, which calculates the degree of correspondence between a voice-over and the phoneme sequence of a script. We found that the phoneme recognition probabilities of exaggerated voice-overs decrease compared to ordinary utterances, but their rankings do not demonstrate any significant change. The proposed method, therefore, uses the recognition ranking of each phoneme segment corresponding to a phoneme sequence to measure the confidence of a voice-over utterance for its corresponding script. The experimental results show that the proposed UV method outperforms a state-of-the-art approach based on cross-modal attention for detecting mismatch between speech and transcription.
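
The observation that rankings survive where probabilities drop can be illustrated with a toy sketch. It assumes the audio is already aligned to segments, that each segment comes with recognition scores containing the expected phoneme, and that confidence is the mean reciprocal rank; none of these details come from the paper, and `phoneme_rank`, `rank_confidence`, and all the scores are hypothetical.

```python
from typing import Dict, List


def phoneme_rank(scores: Dict[str, float], expected: str) -> int:
    """1-based rank of the expected phoneme among a segment's recognition scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(expected) + 1


def rank_confidence(segments: List[Dict[str, float]], script: List[str]) -> float:
    """Mean reciprocal rank over aligned segments (an assumed confidence score)."""
    ranks = [phoneme_rank(seg, ph) for seg, ph in zip(segments, script)]
    return sum(1.0 / r for r in ranks) / len(ranks)


script = ["k", "ae", "t"]
ordinary = [
    {"k": 0.80, "g": 0.15, "t": 0.05},
    {"ae": 0.70, "eh": 0.20, "ah": 0.10},
    {"t": 0.90, "d": 0.08, "k": 0.02},
]
# Exaggerated delivery: the top probabilities drop, but the ordering is unchanged.
exaggerated = [
    {"k": 0.45, "g": 0.35, "t": 0.20},
    {"ae": 0.40, "eh": 0.35, "ah": 0.25},
    {"t": 0.50, "d": 0.30, "k": 0.20},
]
mismatched_script = ["g", "ah", "d"]

conf_ordinary = rank_confidence(ordinary, script)
conf_exaggerated = rank_confidence(exaggerated, script)
conf_mismatch = rank_confidence(ordinary, mismatched_script)
```

In this toy data the rank-based score is identical for the ordinary and exaggerated readings, while the mismatched script scores clearly lower; a probability-based score would have penalized the exaggerated reading as well.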

9. Automatic Identification of Types of Alterations in Historical Manuscripts [PDF] Back to contents
  David Lassner, Anne Baillot, Sergej Dogadov, Klaus-Robert Müller, Shinichi Nakajima
Abstract: Alterations in historical manuscripts such as letters represent a promising field of research. On the one hand, they help us understand the construction of text. On the other hand, topics that were considered sensitive at the time of the manuscript gain coherence and contextuality when alterations are taken into account, especially in the case of deletions. The analysis of alterations in manuscripts, though, is traditionally very tedious work. In this paper, we present a machine learning-based approach to help categorize alterations in documents. In particular, we present a new probabilistic model (Alteration Latent Dirichlet Allocation, alterLDA in the following) that categorizes content-related alterations. The method proposed here is developed based on experiments carried out on the digital scholarly edition Berlin Intellectuals, for which alterLDA achieves high performance in the recognition of alterations on labelled data. On unlabelled data, applying alterLDA leads to interesting new insights into the alteration behavior of authors, editors, and other manuscript contributors, as well as insights into sensitive topics in the correspondence of Berlin intellectuals around 1800. In addition to the findings based on the digital scholarly edition Berlin Intellectuals, we present a general framework for the analysis of text genesis that can be used in the context of other digital resources representing document variants. To that end, we present in detail the methodological steps to be followed in order to achieve such results, thereby giving a prime example of a Machine Learning application in the Digital Humanities.

10. The value of text for small business default prediction: A deep learning approach [PDF] Back to contents
  Matthew Stevenson, Christophe Mues, Cristián Bravo
Abstract: Compared to consumer lending, Micro, Small and Medium Enterprise (mSME) credit risk modelling is particularly challenging, as the same sources of information are often not available. To mitigate limited data availability, it is standard policy for a loan officer to provide a textual loan assessment. In turn, this statement is analysed by a credit expert alongside any available standard credit data. In our paper, we exploit recent advances from the field of Deep Learning and Natural Language Processing (NLP), including the BERT (Bidirectional Encoder Representations from Transformers) model, to extract information from 60000+ textual assessments. We consider the performance in terms of AUC (Area Under the Curve) and Balanced Accuracy and find that the text alone is surprisingly effective for predicting default. Yet, when combined with traditional data, it yields no additional predictive capability. We do find, however, that deep learning with categorical embeddings is capable of producing a modest performance improvement when compared to alternative machine learning models. We explore how the loan assessments influence predictions, explaining why, despite the text being predictive, no additional performance is gained. This exploration leads us to a series of recommendations on a new strategy for the collection of future mSME loan assessments.
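
The two metrics named in this abstract can be computed from scratch as in the sketch below. It uses the pairwise definition of AUC and a 0.5 decision threshold for balanced accuracy; the labels, scores, and threshold are made up for illustration and have nothing to do with the paper's data.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def balanced_accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Mean of the recall on the positive class and the recall on the negative class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Hypothetical loans: 1 = default, 0 = repaid; scores are predicted default risk.
labels = [1, 0, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.1, 0.4, 0.3]
predictions = [1 if s >= 0.5 else 0 for s in scores]

auc_value = auc(labels, scores)                      # 7 of 8 pairs ordered correctly
bal_acc = balanced_accuracy(labels, predictions)     # (1/2 + 3/4) / 2
```

Both metrics are insensitive to class imbalance in a way that plain accuracy is not, which matters when defaults are rare; both functions assume that each class appears at least once.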

Note: the Chinese text in the original post is machine-translated.