
[arXiv Papers] Computation and Language 2020-08-05

Contents

1. Efficient Urdu Caption Generation using Attention based LSTMs [PDF] Abstract
2. Model Reduction of Shallow CNN Model for Reliable Deployment of Information Extraction from Medical Reports [PDF] Abstract
3. LXPER Index: a curriculum-specific text readability assessment model for EFL students in Korea [PDF] Abstract
4. Tense, aspect and mood based event extraction for situation analysis and crisis management [PDF] Abstract
5. To BERT or Not To BERT: Comparing Speech and Language-based Approaches for Alzheimer's Disease Detection [PDF] Abstract
6. Defining and Evaluating Fair Natural Language Generation [PDF] Abstract
7. TensorCoder: Dimension-Wise Attention via Tensor Representation for Natural Language Modeling [PDF] Abstract
8. Next word prediction based on the N-gram model for Kurdish Sorani and Kurmanji [PDF] Abstract
9. ULD@NUIG at SemEval-2020 Task 9: Generative Morphemes with an Attention Model for Sentiment Analysis in Code-Mixed Text [PDF] Abstract
10. Deep Learning Brasil -- NLP at SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets [PDF] Abstract
11. Text-based classification of interviews for mental health -- juxtaposing the state of the art [PDF] Abstract
12. Weighted Accuracy Algorithmic Approach In Counteracting Fake News And Disinformation [PDF] Abstract
13. Forensic Writer Identification Using Microblogging Texts [PDF] Abstract
14. A Study on Effects of Implicit and Explicit Language Model Information for DBLSTM-CTC Based Handwriting Recognition [PDF] Abstract
15. A System for Worldwide COVID-19 Information Aggregation [PDF] Abstract
16. Predicting Multiple ICD-10 Codes from Brazilian-Portuguese Clinical Notes [PDF] Abstract
17. SARG: A Novel Semi Autoregressive Generator for Multi-turn Incomplete Utterance Restoration [PDF] Abstract
18. Taking Notes on the Fly Helps BERT Pre-training [PDF] Abstract
19. Prompt Agnostic Essay Scorer: A Domain Generalization Approach to Cross-prompt Automated Essay Scoring [PDF] Abstract
20. A Survey of Orthographic Information in Machine Translation [PDF] Abstract
21. Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction [PDF] Abstract
22. NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer [PDF] Abstract
23. An improved Bayesian TRIE based model for SMS text normalization [PDF] Abstract
24. Multi-Perspective Semantic Information Retrieval in the Biomedical Domain [PDF] Abstract
25. "This is Houston. Say again, please". The Behavox system for the Apollo-11 Fearless Steps Challenge (phase II) [PDF] Abstract
26. Weakly Supervised Construction of ASR Systems with Massive Video Data [PDF] Abstract

Abstracts

1. Efficient Urdu Caption Generation using Attention based LSTMs [PDF] Back to Contents
  Inaam Ilahi, Hafiz Muhammad Abdullah Zia, Ahtazaz Ehsan, Rauf Tabassam, Armaghan Ahmed
Abstract: Recent advancements in deep learning have created many opportunities to solve real-world problems that had remained unsolved for more than a decade. Automatic caption generation is a major research field, and the research community has done a lot of work on it for the most common languages, such as English. Urdu is the national language of Pakistan and is also widely spoken and understood across the Pakistan-India subcontinent, yet no work has been done on caption generation for Urdu. Our research aims to fill this gap by developing an attention-based deep learning model using sequence-modelling techniques specialized for Urdu. We have prepared an Urdu-language dataset by translating a subset of the "Flickr8k" dataset containing 700 'man' images. We evaluate our proposed technique on this dataset and show that it achieves a BLEU score of 0.83 for Urdu. We improve on previously proposed techniques by using better CNN architectures and optimization techniques. Furthermore, we also tried adding a grammar loss to the model in order to make the predictions grammatically correct.
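
The BLEU score reported above is a standard n-gram overlap metric. As a hedged illustration (not the authors' evaluation script), sentence-level BLEU can be computed with NLTK as follows; the tokenized Urdu captions are placeholders, not items from the paper's translated Flickr8k subset:

```python
# A minimal sketch of sentence-level BLEU with NLTK; the Urdu tokens
# below are illustrative placeholders only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["ایک", "آدمی", "سائیکل", "چلا", "رہا", "ہے"]]  # gold caption(s), tokenized
candidate = ["ایک", "آدمی", "سائیکل", "چلا", "رہا", "ہے"]    # model output, tokenized

# Smoothing avoids zero scores when a higher-order n-gram has no match.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")
```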

2. Model Reduction of Shallow CNN Model for Reliable Deployment of Information Extraction from Medical Reports [PDF] Back to Contents
  Abhishek K Dubey, Alina Peluso, Jacob Hinkle, Devanshu Agarawal, Zilong Tan
Abstract: The shallow Convolutional Neural Network (CNN) is a time-tested tool for information extraction from cancer pathology reports. On this task, shallow CNNs perform competitively with other deep learning models, including BERT, which holds the state of the art for many NLP tasks. The main insight behind this eccentric phenomenon is that information extraction from cancer pathology reports requires only a small number of domain-specific text segments, making most of the text and context excessive for the task. The shallow CNN model is well suited to identifying these key short text segments from the labeled training set; however, the identified text segments remain obscure to humans. In this study, we fill this gap by developing a model reduction tool that makes a reliable connection between CNN filters and relevant text segments by discarding spurious connections. We reduce the complexity of the shallow CNN representation by approximating it with a linear transformation of an n-gram presence representation, with a non-negativity and sparsity prior on the transformation weights, to obtain an interpretable model. Our approach bridges the conventionally perceived trade-off between accuracy on the one side and explainability on the other through model reduction.
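
The reduction step admits a compact sketch: fit a non-negative, sparse linear map from binary n-gram presence features onto the CNN's output scores, so that large weights expose the text segments the filters respond to. The function below is a hypothetical illustration using projected gradient descent, not the authors' tool; all names and hyperparameters are assumptions:

```python
import numpy as np

def fit_interpretable_proxy(X, y_scores, l1=0.01, lr=0.1, steps=500):
    """Approximate model scores y_scores (n_docs,) with a non-negative,
    sparse linear map over binary n-gram presence features X (n_docs, n_ngrams).
    Projected gradient descent: gradient step on squared error plus an L1
    penalty, then clip to the non-negative orthant."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y_scores) / len(y_scores) + l1 * np.sign(w)
        w = np.maximum(w - lr * grad, 0.0)  # non-negativity projection
    return w  # large weights point to the n-grams the CNN relies on

# Toy usage: 5 documents, 8 candidate n-grams, two of them "relevant".
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 8)).astype(float)
scores = X @ np.array([0.9, 0, 0, 0.4, 0, 0, 0, 0])
print(np.round(fit_interpretable_proxy(X, scores), 2))
```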

3. LXPER Index: a curriculum-specific text readability assessment model for EFL students in Korea [PDF] Back to Contents
  Bruce W. Lee, Jason Hyung-Jong Lee
Abstract: Automatic readability assessment is one of the most important applications of Natural Language Processing (NLP) in education. Since automatic readability assessment allows the fast selection of appropriate reading material for readers at all levels of proficiency, it can be particularly useful for the English education of English as a Foreign Language (EFL) students around the world. Most readability assessment models are developed for native readers of English and have low accuracy for texts in the non-native English Language Training (ELT) curriculum. We introduce the LXPER Index, a readability assessment model for non-native EFL readers in the ELT curriculum of Korea. Our experiments show that our new model, trained with CoKEC-text (Text Corpus of the Korean ELT Curriculum), significantly improves the accuracy of automatic readability assessment for texts in the Korean ELT curriculum.

4. Tense, aspect and mood based event extraction for situation analysis and crisis management [PDF] Back to Contents
  Ali Hürriyetoğlu
Abstract: Nowadays, event extraction systems mainly deal with a relatively small amount of information about the temporal and modal qualifications of situations, primarily processing assertive sentences in the past tense. However, systems with a wider coverage of tense, aspect and mood can provide better analyses and can be used in a wider range of text analysis applications. This thesis develops such a system for the Turkish language. This is accomplished by extending the Open Source Information Mining and Analysis (OPTIMA) research group's event extraction software: by implementing appropriate extensions in the semantic representation format, by adding a partial grammar which improves the TAM (Tense, Aspect and Mood) marker, adverb analysis and matching functions of ExPRESS, and by constructing an appropriate lexicon in the standard of CORLEONE. These extensions are based on the theory of anchoring relations (Temürcü, 2007, 2011), a cross-linguistically applicable semantic framework for analyzing tense, aspect and mood related categories. The result is a system which can, in addition to extracting basic event structures, classify sentences given in news reports according to their temporal, modal and volitional/illocutionary values. Although the focus is on news reports of natural disasters, disease outbreaks and man-made disasters in the Turkish language, the approach can be adapted to other languages, domains and genres. This event extraction and classification system, with further developments, can provide a basis for automated browsing systems for preventing environmental and humanitarian risk.

5. To BERT or Not To BERT: Comparing Speech and Language-based Approaches for Alzheimer's Disease Detection [PDF] Back to Contents
  Aparna Balagopalan, Benjamin Eyre, Frank Rudzicz, Jekaterina Novikova
Abstract: Research related to automatically detecting Alzheimer's disease (AD) is important, given the high prevalence of AD and the high cost of traditional methods. Since AD significantly affects the content and acoustics of spontaneous speech, natural language processing and machine learning provide promising techniques for reliably detecting AD. We compare and contrast the performance of two such approaches for AD detection on the recent ADReSS challenge dataset: 1) using domain knowledge-based hand-crafted features that capture linguistic and acoustic phenomena, and 2) fine-tuning Bidirectional Encoder Representations from Transformer (BERT)-based sequence classification models. We also compare multiple feature-based regression models for a neuropsychological score task in the challenge. We observe that fine-tuned BERT models, given the relative importance of linguistics in cognitive impairment detection, outperform feature-based approaches on the AD detection task.

6. Defining and Evaluating Fair Natural Language Generation [PDF] Back to Contents
  Catherine Yeo, Alyssa Chen
Abstract: Our work focuses on the biases that emerge in the natural language generation (NLG) task of sentence completion. In this paper, we introduce a framework of fairness for NLG followed by an evaluation of gender biases in two state-of-the-art language models. Our analysis provides a theoretical formulation for biases in NLG and empirical evidence that existing language generation models embed gender bias.

7. TensorCoder: Dimension-Wise Attention via Tensor Representation for Natural Language Modeling [PDF] Back to Contents
  Shuai Zhang, Peng Zhang, Xindian Ma, Junqiu Wei, Ningning Wang, Qun Liu
Abstract: The Transformer has been widely used in many Natural Language Processing (NLP) tasks, and the scaled dot-product attention between tokens is a core module of the Transformer. This attention is a token-wise design whose complexity is quadratic in the length of the sequence, limiting its application potential for long-sequence tasks. In this paper, we propose a dimension-wise attention mechanism, based on which a novel language modeling approach (namely TensorCoder) can be developed. The dimension-wise attention reduces the attention complexity from the original $O(N^2d)$ to $O(Nd^2)$, where $N$ is the length of the sequence and $d$ is the dimensionality of the head. We verify TensorCoder on two tasks: masked language modeling and neural machine translation. Compared with the original Transformer, TensorCoder not only greatly reduces the computation of the original model but also obtains improved performance on the masked language modeling task (on the PTB dataset) and comparable performance on machine translation tasks.
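
A shape-level sketch of the complexity claim, assuming the simplest possible reading of dimension-wise attention (scores computed between feature dimensions rather than between tokens); TensorCoder's actual formulation is more involved:

```python
import numpy as np

N, d = 512, 64           # sequence length, head dimension
Q = np.random.randn(N, d)
K = np.random.randn(N, d)
V = np.random.randn(N, d)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Token-wise attention: an N x N score matrix, cost O(N^2 d).
token_scores = softmax(Q @ K.T / np.sqrt(d))   # (N, N)
token_out = token_scores @ V                    # (N, d)

# Dimension-wise attention (sketch): a d x d score matrix over feature
# dimensions, cost O(N d^2), which is cheaper whenever d < N.
dim_scores = softmax(Q.T @ K / np.sqrt(N))      # (d, d)
dim_out = V @ dim_scores.T                      # (N, d)
print(token_out.shape, dim_out.shape)
```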

8. Next word prediction based on the N-gram model for Kurdish Sorani and Kurmanji [PDF] Back to Contents
  Hozan K. Hamarashid, Soran A. Saeed, Tarik A. Rashid
Abstract: Next word prediction is an input technology that simplifies the process of typing by suggesting the next word for a user to select, as typing in a conversation consumes time. A few previous studies have focused on the Kurdish language, including the use of next word prediction. However, the lack of a Kurdish text corpus presents a challenge. Moreover, the lack of a sufficient number of N-grams for the Kurdish language, for instance five-grams, is the reason for the rare use of next Kurdish word prediction. Furthermore, the improper display of several Kurdish letters in the RStudio software is another problem. This paper provides a Kurdish corpus, creates five-grams, and presents a unique research work on next word prediction for Kurdish Sorani and Kurmanji. The N-gram model has been used for next word prediction to reduce the amount of time spent typing in the Kurdish language. In addition, little work has been conducted on next Kurdish word prediction; thus, the N-gram model is utilized to suggest text accurately. To do so, R programming and RStudio are used to build the application. The model is 96.3% accurate.
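
The N-gram predictor itself is simple enough to sketch in a few lines (the paper's implementation is in R; this toy Python version and its corpus are placeholders):

```python
from collections import Counter, defaultdict

def build_ngram_model(tokens, n=3):
    """Count continuations of each (n-1)-token context."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        model[context][nxt] += 1
    return model

def predict_next(model, context, k=3):
    """Return the k most frequent words following the given context."""
    return [w for w, _ in model[tuple(context)].most_common(k)]

# Toy usage with placeholder tokens; a real system would train on a
# Kurdish Sorani or Kurmanji corpus.
corpus = "the cat sat on the mat the cat ran".split()
model = build_ngram_model(corpus, n=3)
print(predict_next(model, ["the", "cat"]))  # ['sat', 'ran']
```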

9. ULD@NUIG at SemEval-2020 Task 9: Generative Morphemes with an Attention Model for Sentiment Analysis in Code-Mixed Text [PDF] Back to Contents
  Koustava Goswami, Priya Rani, Bharathi Raja Chakravarthi, Theodorus Fransen, John P. McCrae
Abstract: Code mixing is a common phenomenon in multilingual societies, where people switch from one language to another for various reasons. Recent advances in public communication over different social media sites have led to an increase in the frequency of code-mixed usage in written language. In this paper, we present the Generative Morphemes with Attention (GenMA) model, a sentiment analysis system contributed to SemEval 2020 Task 9 SentiMix. The system aims to predict the sentiments of given English-Hindi code-mixed tweets without using word-level language tags, instead inferring this information automatically using a morphological model. The system is based on a novel deep neural network (DNN) architecture, which has outperformed the baseline F1-score on the test data-set as well as the validation data-set. Our results can be found under the user name "koustava" on the "Sentimix Hindi English" page.

10. Deep Learning Brasil -- NLP at SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets [PDF] Back to Contents
  Manoel Veríssimo dos Santos Neto, Ayrton Denner da Silva Amaral, Nádia Félix Felipe da Silva, Anderson da Silva Soares
Abstract: In this paper, we describe a methodology to predict sentiment in code-mixed tweets (Hindi-English). Our team, called verissimo.manoel in CodaLab, developed an approach based on an ensemble of four models (MultiFiT, BERT, ALBERT, and XLNet). The final classification algorithm was an ensemble over the softmax values predicted by these four models. This architecture was used and evaluated in the context of the SemEval 2020 challenge (Task 9), and our system achieved an F1 score of 72.7%.
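
"Ensemble over the softmax values" is not fully specified in the abstract; one plausible reading, sketched below with made-up numbers, is averaging the four models' softmax distributions and taking the argmax:

```python
import numpy as np

# Softmax outputs from four models for one tweet over three classes
# (negative, neutral, positive); the values are illustrative only.
probs = np.array([
    [0.2, 0.3, 0.5],   # MultiFiT
    [0.1, 0.2, 0.7],   # BERT
    [0.3, 0.3, 0.4],   # ALBERT
    [0.2, 0.1, 0.7],   # XLNet
])
ensemble = probs.mean(axis=0)        # average the softmax distributions
print(ensemble, ensemble.argmax())   # class 2 ("positive") wins
```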

11. Text-based classification of interviews for mental health -- juxtaposing the state of the art [PDF] Back to Contents
  Joppe Valentijn Wouts
Abstract: Currently, the state of the art for classification of psychiatric illness is audio-based. This thesis aims to design and evaluate a state-of-the-art text classification network on this challenge. The hypothesis is that a well-designed text-based approach poses strong competition to the state-of-the-art audio-based approaches. Dutch natural language models are limited by the scarcity of pre-trained monolingual NLP models; as a result, Dutch natural language models capture long-range semantic dependencies across sentences poorly. To address this issue, this thesis presents belabBERT, a new Dutch language model extending the RoBERTa [15] architecture. belabBERT is trained on a large Dutch corpus (+32 GB) of web-crawled texts. After evaluating the strength of text-based classification, the thesis briefly explores extending the framework to a hybrid text- and audio-based classification. The goal of this hybrid framework is to demonstrate the principle of hybridisation with a very basic audio-classification network. The overall goal is to create the foundations for a hybrid psychiatric illness classification by proving that the new text-based classification is already a strong stand-alone solution.

12. Weighted Accuracy Algorithmic Approach In Counteracting Fake News And Disinformation [PDF] Back to Contents
  Kwadwo Osei Bonsu
Abstract: As the world becomes more dependent on the internet for information exchange, some overzealous journalists, hackers, bloggers, individuals and organizations tend to abuse the gift of a free information environment by polluting it with fake news, disinformation and pretentious content that serves their own agenda. Hence, there is a need to address the issue of fake news and disinformation with the utmost seriousness. This paper proposes a methodology for fake news detection and reporting through a constraint mechanism that utilizes the combined weighted accuracies of four machine learning algorithms.
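
The abstract leaves the combination rule abstract; a minimal sketch of one natural "weighted accuracy" scheme, in which each classifier's binary vote is weighted by its validation accuracy (all names and numbers here are hypothetical), looks like this:

```python
import numpy as np

def weighted_vote(predictions, accuracies):
    """Combine binary fake/real predictions (0 or 1) from several
    classifiers, weighting each vote by that classifier's validation
    accuracy, then threshold the weighted average at 0.5."""
    w = np.asarray(accuracies) / np.sum(accuracies)
    return int(np.dot(w, predictions) >= 0.5)

# Four hypothetical classifiers and their validation accuracies.
preds = [1, 0, 1, 1]                 # 1 = "fake"
accs = [0.91, 0.78, 0.85, 0.88]
print(weighted_vote(preds, accs))    # -> 1
```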

13. Forensic Writer Identification Using Microblogging Texts [PDF] Back to Contents
  Fernando Alonso-Fernandez, Nicole Mariah Sharon Belvisi, Kevin Hernandez-Diaz, Naveed Muhammad, Josef Bigun
Abstract: Establishing the authorship of online texts is a fundamental issue in combating several cybercrimes. Unfortunately, some platforms limit the length of the text, making the challenge harder. Here, we aim at identifying the author of Twitter messages limited to 140 characters. We evaluate popular stylometric features, widely used in traditional literary analysis, which capture the writing style at different levels (character, word, and sentence). We use a public database of 93 users, containing 1142 to 3209 Tweets per user. We also evaluate the influence of the number of Tweets per user for enrolment and testing. If the amount is sufficient (>500), a Rank 1 accuracy of 97-99% is achieved. If data is scarce (e.g. 20 Tweets for testing), the Rank 1 accuracy with the best individual feature method ranges from 54.9% (100 Tweets for enrolment) to 70.6% (1000 Tweets). By combining the available features, a substantial improvement is observed, reaching a Rank 1 accuracy of 70% when using 100 Tweets for enrolment and only 20 for testing. With a bigger hit-list size, the accuracy of the latter case increases to 86.4% (Rank 5) or 95% (Rank 20). This demonstrates the feasibility of identifying writers of digital texts, even with little data available.
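
For a sense of what character-, word-, and sentence-level stylometric features look like in practice, here is a small hypothetical extractor (real systems use far richer feature sets, e.g. character n-grams and function-word frequencies):

```python
import re

def stylometric_features(text):
    """A few classic stylometric measurements at the character, word,
    and sentence level; illustrative, not the paper's feature set."""
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "char_count": len(text),
        "digit_ratio": sum(c.isdigit() for c in text) / max(len(text), 1),
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
    }

print(stylometric_features("Launch at 9am. This is Houston, say again please!"))
```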

14. A Study on Effects of Implicit and Explicit Language Model Information for DBLSTM-CTC Based Handwriting Recognition [PDF] Back to Contents
  Qi Liu, Lijuan Wang, Qiang Huo
Abstract: Deep Bidirectional Long Short-Term Memory (DBLSTM) with a Connectionist Temporal Classification (CTC) output layer has been established as one of the state-of-the-art solutions for handwriting recognition. It is well known that a DBLSTM trained with a CTC objective function will learn both local character-image dependencies for character modeling and long-range contextual dependencies for implicit language modeling. In this paper, we study the effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition by comparing the performance of decoding with or without an explicit language model. It is observed that even when one million lines of training sentences are used to train the DBLSTM, an explicit language model is still helpful. To deal with such a large-scale training problem, a GPU-based training tool has been developed for CTC training of DBLSTM using a mini-batch based epochwise Back Propagation Through Time (BPTT) algorithm.

15. A System for Worldwide COVID-19 Information Aggregation [PDF] Back to Contents
  Akiko Aizawa, Frederic Bergeron, Junjie Chen, Fei Cheng, Katsuhiko Hayashi, Kentaro Inui, Hiroyoshi Ito, Daisuke Kawahara, Masaru Kitsuregawa, Hirokazu Kiyomaru, Masaki Kobayashi, Takashi Kodama, Sadao Kurohashi, Qianying Liu, Masaki Matsubara, Yusuke Miyao, Atsuyuki Morishima, Yugo Murawaki, Kazumasa Omura, Haiyue Song, Eiichiro Sumita, Shinji Suzuki, Ribeka Tanaka, Yu Tanaka, Masashi Toyoda, Nobuhiro Ueda, Honai Ueoka, Masao Utiyama, Ying Zhong
Abstract: The global pandemic of COVID-19 has made the public pay close attention to related news covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 situation is very different among countries (e.g., in policies and the development of the epidemic), and thus citizens are interested in news from foreign countries. We build a system for worldwide COVID-19 information aggregation (this http URL) containing reliable articles from 10 regions in 7 languages, sorted by topic, for Japanese citizens. Our reliable COVID-19-related website dataset, collected through crowdsourcing, ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese. A BERT-based topic classifier trained on an article-topic pair dataset helps users find the information they are interested in efficiently by putting articles into different categories.

16. Predicting Multiple ICD-10 Codes from Brazilian-Portuguese Clinical Notes [PDF] Back to Contents
  Arthur D. Reys, Danilo Silva, Daniel Severo, Saulo Pedro, Marcia M. de Souza e Sá, Guilherme A. C. Salgado
Abstract: ICD coding from electronic clinical records is a manual, time-consuming and expensive process. Code assignment is, however, an important task for billing purposes and database organization. While many works have studied the problem of automated ICD coding from free text using machine learning techniques, most use records in the English language, especially from the MIMIC-III public dataset. This work presents results for a dataset with Brazilian Portuguese clinical notes. We develop and optimize a Logistic Regression model, a Convolutional Neural Network (CNN), a Gated Recurrent Unit Neural Network and a CNN with Attention (CNN-Att) for prediction of diagnosis ICD codes. We also report our results for the MIMIC-III dataset, which outperform previous work among models of the same families, as well as the state of the art. Compared to MIMIC-III, the Brazilian Portuguese dataset contains far fewer words per document, when only discharge summaries are used. We experiment concatenating additional documents available in this dataset, achieving a great boost in performance. The CNN-Att model achieves the best results on both datasets, with micro-averaged F1 score of 0.537 on MIMIC-III and 0.485 on our dataset with additional documents.
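
The reported metric, micro-averaged F1 over multi-label code assignments, pools true and false positives across all (note, code) pairs, which is standard when the label distribution is highly imbalanced. A sketch with scikit-learn on a toy indicator matrix (not the paper's data):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label setup: 4 notes, 5 candidate ICD-10 codes,
# binary indicator matrices (1 = code assigned).
y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 0]])
y_pred = np.array([[1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 1],
                   [1, 1, 0, 1, 0],
                   [0, 0, 0, 1, 0]])

print(f1_score(y_true, y_pred, average="micro"))
```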

17. SARG: A Novel Semi Autoregressive Generator for Multi-turn Incomplete Utterance Restoration [PDF] Back to Contents
  Mengzuo Huang, Feng Li, Wuhe Zou, Weidong Zhang
Abstract: Dialogue systems in the open domain have achieved great success due to large conversation data and the development of deep learning, but multi-turn systems are often restricted by frequent coreference and information omission. In this paper, we investigate incomplete utterance restoration, since it has brought general improvement to multi-turn dialogue systems in different domains. For this task, we propose a novel semi-autoregressive generator (SARG) with high efficiency and flexibility, inspired by autoregression for generation and sequence labeling for overlapped rewriting. Moreover, experiments on Restoration-200k show that our proposed model significantly outperforms the state-of-the-art models with faster inference speed.

18. Taking Notes on the Fly Helps BERT Pre-training [PDF] Back to Contents
  Qiyu Wu, Chen Xing, Yatao Li, Guolin Ke, Di He, Tie-Yan Liu
Abstract: How to make unsupervised language pre-training more efficient and less resource-intensive is an important research direction in NLP. In this paper, we focus on improving the efficiency of language pre-training methods through providing better data utilization. It is well known that in a language data corpus, words follow a heavy-tail distribution. A large proportion of words appear only very few times, and the embeddings of rare words are usually poorly optimized. We argue that such embeddings carry inadequate semantic signals. They could make the data utilization inefficient and slow down the pre-training of the entire model. To solve this problem, we propose Taking Notes on the Fly (TNF). TNF takes notes for rare words on the fly during pre-training to help the model understand them when they occur next time. Specifically, TNF maintains a note dictionary and saves a rare word's context information in it as notes when the rare word occurs in a sentence. When the same rare word occurs again in training, TNF employs the note information saved beforehand to enhance the semantics of the current sentence. By doing so, TNF provides better data utilization, since cross-sentence information is employed to cover the inadequate semantics caused by rare words in the sentences. Experimental results show that TNF significantly expedites BERT pre-training and improves the model's performance on downstream tasks. TNF's training time is $60\%$ less than BERT's when reaching the same performance. When trained with the same number of iterations, TNF significantly outperforms BERT on most downstream tasks and in the average GLUE score.
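
A minimal sketch of the note-dictionary idea, assuming a running mean of context vectors as the note update (the actual TNF update rule and its integration with the model's embeddings are more elaborate; the class and method names here are invented for illustration):

```python
import numpy as np

class NoteDictionary:
    """Keep a running mean of the context vectors seen around each rare
    word, and look the note up the next time the word appears."""
    def __init__(self, dim):
        self.dim = dim
        self.notes, self.counts = {}, {}

    def update(self, rare_word, context_vec):
        n = self.counts.get(rare_word, 0)
        old = self.notes.get(rare_word, np.zeros(self.dim))
        self.notes[rare_word] = (old * n + context_vec) / (n + 1)
        self.counts[rare_word] = n + 1

    def lookup(self, rare_word):
        return self.notes.get(rare_word)  # None if the word is unseen

notes = NoteDictionary(dim=4)
notes.update("tnf", np.array([0.1, 0.2, 0.0, 0.4]))
notes.update("tnf", np.array([0.3, 0.0, 0.2, 0.0]))
print(notes.lookup("tnf"))  # running-mean note for the rare word
```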

19. Prompt Agnostic Essay Scorer: A Domain Generalization Approach to Cross-prompt Automated Essay Scoring [PDF] Back to Contents
  Robert Ridley, Liang He, Xinyu Dai, Shujian Huang, Jiajun Chen
Abstract: Cross-prompt automated essay scoring (AES) requires the system to use non-target-prompt essays to award scores to a target-prompt essay. Since obtaining a large quantity of pre-graded essays for a particular prompt is often difficult and unrealistic, the task of cross-prompt AES is vital for the development of real-world AES systems, yet it remains an under-explored area of research. Models designed for prompt-specific AES rely heavily on prompt-specific knowledge and perform poorly in the cross-prompt setting, whereas current approaches to cross-prompt AES either require a certain quantity of labelled target-prompt essays or require a large quantity of unlabelled target-prompt essays to perform transfer learning in a multi-step manner. To address these issues, we introduce the Prompt Agnostic Essay Scorer (PAES) for cross-prompt AES. Our method requires no access to labelled or unlabelled target-prompt data during training and is a single-stage approach. PAES is easy to apply in practice and achieves state-of-the-art performance on the Automated Student Assessment Prize (ASAP) dataset.

20. A Survey of Orthographic Information in Machine Translation [PDF] Back to Contents
  Bharathi Raja Chakravarthi, Priya Rani, Mihael Arcan, John P. McCrae
Abstract: Machine translation is one of the applications of natural language processing that has been explored in different languages. Recently, researchers have started paying attention to machine translation for resource-poor languages and closely related languages. A widespread and underlying problem for these machine translation systems is the variation in orthographic conventions, which causes many issues for traditional approaches. Two languages written in two different orthographies are not easily comparable, but orthographic information can also be used to improve the machine translation system. This article offers a survey of research regarding orthography's influence on machine translation of under-resourced languages. It introduces under-resourced languages in terms of machine translation and shows how orthographic information can be utilised to improve machine translation. We describe previous work in this area, discussing what underlying assumptions were made, and showing how orthographic knowledge improves the performance of machine translation of under-resourced languages. We discuss different types of machine translation and demonstrate a recent trend that seeks to link orthographic information with well-established machine translation methods. Considerable attention is given to current efforts to use cognate information at different levels of machine translation and the lessons that can be drawn from this. Additionally, multilingual neural machine translation of closely related languages is given a particular focus in this survey. This article ends with a discussion of the way forward in machine translation with orthographic information, focusing on multilingual settings and bilingual lexicon induction.

21. Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction [PDF] Back to Contents
  Stefan Heid, Marcel Wever, Eyke Hüllermeier
Abstract: Syntactic annotation of corpora in the form of part-of-speech (POS) tags is a key requirement for both linguistic research and subsequent automated natural language processing (NLP) tasks. This problem is commonly tackled using machine learning methods, i.e., by training a POS tagger on a sufficiently large corpus of labeled data. While the problem of POS tagging can essentially be considered solved for modern languages, historical corpora turn out to be much more difficult, especially due to the lack of native speakers and the sparsity of training data. Moreover, most texts have no sentences as we know them today, nor a common orthography. These irregularities render the task of automated POS tagging more difficult and error-prone. Under these circumstances, instead of forcing the POS tagger to predict and commit to a single tag, it should be enabled to express its uncertainty. In this paper, we consider POS tagging within the framework of set-valued prediction, which allows the POS tagger to express its uncertainty via predicting a set of candidate POS tags instead of guessing a single one. The goal is to guarantee high confidence that the correct POS tag is included while keeping the number of candidates small. In our experimental study, we find that extending state-of-the-art POS taggers to set-valued prediction yields more precise and robust taggings, especially for unknown words, i.e., words not occurring in the training data.
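
One simple instantiation of set-valued prediction, assumed here purely for illustration, returns the smallest set of tags whose cumulative posterior mass reaches a confidence target; the paper's utility-based criterion may differ:

```python
def set_valued_tags(tag_probs, confidence=0.9):
    """Return the smallest set of POS tags whose cumulative probability
    reaches the confidence target, trading set size for coverage."""
    ranked = sorted(tag_probs.items(), key=lambda kv: -kv[1])
    chosen, total = [], 0.0
    for tag, p in ranked:
        chosen.append(tag)
        total += p
        if total >= confidence:
            break
    return chosen

# A tagger's (made-up) posterior for one ambiguous historical word form.
probs = {"NOUN": 0.55, "VERB": 0.30, "ADJ": 0.10, "ADV": 0.05}
print(set_valued_tags(probs))          # ['NOUN', 'VERB', 'ADJ']
print(set_valued_tags(probs, 0.5))     # ['NOUN']
```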

22. NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer [PDF] Back to Contents
  Hwijeen Ahn, Jimin Sun, Chan Young Park, Jungyun Seo
Abstract: This paper describes our approach to the task of identifying offensive language in a multilingual setting. We investigate two data augmentation strategies: using additional semi-supervised labels with different thresholds, and cross-lingual transfer with data selection. Leveraging the semi-supervised dataset resulted in performance improvements compared to the baseline trained solely on the manually annotated dataset. We propose a new metric, Translation Embedding Distance, to measure the transferability of instances for cross-lingual data selection. We also introduce various preprocessing steps tailored for social media text, along with methods to fine-tune the pre-trained multilingual BERT (mBERT) for offensive language identification. Our multilingual systems achieved competitive results in Greek, Danish, and Turkish at OffensEval 2020.
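
The abstract does not define Translation Embedding Distance precisely; a plausible hedged sketch is the cosine distance between a sentence's embedding and that of its translation, used to rank transfer candidates (the function name and the toy vectors below are assumptions, not the paper's formulation):

```python
import numpy as np

def translation_embedding_distance(src_vec, tgt_vec):
    """Cosine distance between the embedding of a source-language sentence
    and the embedding of its translation; lower = better transfer candidate."""
    cos = np.dot(src_vec, tgt_vec) / (np.linalg.norm(src_vec) * np.linalg.norm(tgt_vec))
    return 1.0 - cos

src = np.array([0.2, 0.7, 0.1])   # embedding of the original sentence
tgt = np.array([0.25, 0.6, 0.2])  # embedding of its translation
print(round(translation_embedding_distance(src, tgt), 3))
```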

23. An improved Bayesian TRIE based model for SMS text normalization [PDF] Back to Contents
  Abhinava Sikdar, Niladri Chatterjee
Abstract: Normalization of SMS text, commonly known as texting language, has been pursued for more than a decade. A probabilistic approach based on the Trie data structure was proposed in the literature and was found to perform better than earlier HMM-based approaches at predicting the correct alternative for an out-of-lexicon word. However, the success of the Trie-based approach depends largely on how correctly the underlying probabilities of word occurrences are estimated. In this work we propose a structural modification to the existing Trie-based model along with a novel training algorithm and probability generation scheme. We prove two theorems on statistical properties of the proposed Trie and use them to claim that it is an unbiased and consistent estimator of the occurrence probabilities of the words. We further fuse our model into the paradigm of noisy-channel-based error correction and provide a heuristic to go beyond a Damerau-Levenshtein distance of one. We also run simulations to support our claims and show the superiority of the proposed scheme over previous works.
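
A bare-bones probabilistic trie of the kind this line of work builds on, storing occurrence counts along character paths so that candidate normalizations of a noisy token can be ranked by estimated probability (the authors' modified structure, training algorithm, and priors are not reproduced here):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0          # completions passing through this node
        self.word_count = 0     # completions ending exactly here

class ProbTrie:
    """Character trie with word-occurrence counts for ranking candidates."""
    def __init__(self):
        self.root = TrieNode()
        self.total = 0

    def insert(self, word, count=1):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.count += count
        node.word_count += count
        self.total += count

    def probability(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return 0.0
            node = node.children[ch]
        return node.word_count / self.total

trie = ProbTrie()
for w, c in [("tomorrow", 8), ("tomato", 2)]:
    trie.insert(w, c)
print(trie.probability("tomorrow"))  # 0.8
```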

24. Multi-Perspective Semantic Information Retrieval in the Biomedical Domain [PDF] Back to Contents
  Samarth Rawal
Abstract: Information Retrieval (IR) is the task of obtaining pieces of data (such as documents) that are relevant to a particular query or need from a large repository of information. IR is a valuable component of several downstream Natural Language Processing (NLP) tasks. Practically, IR is at the heart of many widely-used technologies like search engines. While probabilistic ranking functions like the Okapi BM25 function have been utilized in IR systems since the 1970s, modern neural approaches pose certain advantages compared to their classical counterparts. In particular, the release of BERT (Bidirectional Encoder Representations from Transformers) has had a significant impact on the NLP community by demonstrating how the use of a Masked Language Model trained on a large corpus of data can improve a variety of downstream NLP tasks, including sentence classification and passage re-ranking. IR systems are also important in the biomedical and clinical domains. Given the increasing amount of scientific literature across the biomedical domain, the ability to find answers to specific clinical queries from a repository of millions of articles is a matter of practical value to medical professionals. Moreover, there are domain-specific challenges present, including handling clinical jargon and evaluating the similarity or relatedness of various medical symptoms when determining the relevance between a query and a sentence. This work presents contributions to several aspects of the Biomedical Semantic Information Retrieval domain. First, it introduces Multi-Perspective Sentence Relevance, a novel methodology of utilizing BERT-based models for contextual IR. The system is evaluated using the BioASQ Biomedical IR Challenge. Finally, practical contributions are provided in the form of a live IR system for medics and a proposed challenge on the Living Systematic Review clinical task.

25. "This is Houston. Say again, please". The Behavox system for the Apollo-11 Fearless Steps Challenge (phase II) [PDF] 返回目录
  Arseniy Gorin, Daniil Kulko, Steven Grima, Alex Glasman
Abstract: We describe the speech activity detection (SAD), speaker diarization (SD), and automatic speech recognition (ASR) experiments conducted by the Behavox team for the Interspeech 2020 Fearless Steps Challenge (FSC-2). A relatively small amount of labeled data, a large variety of speakers and channel distortions, and a specific lexicon and speaking style resulted in high error rates for systems that used this data. In addition to approximately 36 hours of annotated NASA mission recordings, the organizers provided a much larger but unlabeled 19k-hour Apollo-11 corpus, which we also explore for semi-supervised training of ASR acoustic and language models, observing more than 17% relative word error rate improvement compared to training on the FSC-2 data only. We also compare several SAD and SD systems to approach the most difficult tracks of the challenge (track 1 for diarization and ASR), where long 30-minute audio recordings are provided for evaluation without segmentation or speaker information. For all systems, we report substantial performance improvements compared to the FSC-2 baseline systems, and we achieved a first-place ranking for SD and ASR and fourth place for SAD in the challenge.

26. Weakly Supervised Construction of ASR Systems with Massive Video Data [PDF] Back to Contents
  Mengli Cheng, Chengyu Wang, Xu Hu, Jun Huang, Xiaobo Wang
Abstract: Building Automatic Speech Recognition (ASR) systems from scratch is significantly challenging, mostly due to the time-consuming and financially expensive process of annotating a large amount of audio data with transcripts. Although several unsupervised pre-training models have been proposed, applying such models directly might still be sub-optimal if more labeled training data could be obtained without a large cost. In this paper, we present a weakly supervised framework for constructing ASR systems from massive video data. As videos often contain human speech aligned with subtitles, we consider videos an important knowledge source and propose an effective approach to extract high-quality audio aligned with transcripts from videos, based on Optical Character Recognition (OCR). The underlying ASR model can be fine-tuned to fit any domain-specific target training dataset after weakly supervised pre-training. Extensive experiments show that our framework can easily produce state-of-the-art results on six public datasets for Mandarin speech recognition.
