
[arXiv Papers] Computation and Language 2020-11-09

Contents

1. An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish [PDF] Abstract
2. Practical and Ethical Considerations in the Effective use of Emotion and Sentiment Lexicons [PDF] Abstract
3. Understanding Pure Character-Based Neural Machine Translation: The Case of Translating Finnish into English [PDF] Abstract
4. Answer Span Correction in Machine Reading Comprehension [PDF] Abstract
5. Fighting an Infodemic: COVID-19 Fake News Dataset [PDF] Abstract
6. Learning to Respond with Your Favorite Stickers: A Framework of Unifying Multi-Modality and User Preference in Multi-Turn Dialog [PDF] Abstract
7. Improving Machine Reading Comprehension with Single-choice Decision and Transfer Learning [PDF] Abstract
8. The ApposCorpus: A new multilingual, multi-domain dataset for factual appositive generation [PDF] Abstract
9. Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation [PDF] Abstract
10. Corpora Compared: The Case of the Swedish Gigaword & Wikipedia Corpora [PDF] Abstract
11. Alquist 3.0: Alexa Prize Bot Using Conversational Knowledge Graph [PDF] Abstract
12. Alquist 2.0: Alexa Prize Socialbot Based on Sub-Dialogue Models [PDF] Abstract
13. OP-IMS @ DIACR-Ita: Back to the Roots: SGNS+OP+CD still rocks Semantic Change Detection [PDF] Abstract
14. From Dataset Recycling to Multi-Property Extraction and Beyond [PDF] Abstract
15. Unleashing the Power of Neural Discourse Parsers -- A Context and Structure Aware Approach Using Large Scale Pretraining [PDF] Abstract
16. What's New? Summarizing Contributions in Scientific Literature [PDF] Abstract
17. Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities [PDF] Abstract
18. Improving RNN Transducer Based ASR with Auxiliary Tasks [PDF] Abstract
19. Explain by Evidence: An Explainable Memory-based Neural Network for Question Answering [PDF] Abstract
20. Machine Generation and Detection of Arabic Manipulated and Fake News [PDF] Abstract
21. HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification [PDF] Abstract
22. EXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answering [PDF] Abstract
23. Alignment Restricted Streaming Recurrent Neural Network Transducer [PDF] Abstract
24. PubSqueezer: A Text-Mining Web Tool to Transform Unstructured Documents into Structured Data [PDF] Abstract
25. Multilingual Bottleneck Features for Improving ASR Performance of Code-Switched Speech in Under-Resourced Languages [PDF] Abstract

Abstracts

1. An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish [PDF] Back to Contents
  Quan Duong, Mika Hämäläinen, Simon Hengchen
Abstract: Historical corpora are known to contain errors introduced by the OCR (optical character recognition) methods used in the digitization process, which are often said to degrade the performance of NLP systems. Correcting these errors manually is a time-consuming process, and a great part of the automatic approaches rely on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.
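
As a rough illustration of this character-level framing (a sketch, not the authors' released code), noisy/clean sentence pairs can be prepared for a standard seq2seq NMT toolkit by treating every character as a token; the Finnish-looking pairs below are invented for illustration.

```python
# Hedged sketch: OCR post-correction as character-level "translation".
# Characters become whitespace-separated tokens; "▁" preserves real spaces.

def to_char_tokens(text: str) -> str:
    """Turn 'mnulla on' into 'm n u l l a ▁ o n'."""
    return " ".join("▁" if ch == " " else ch for ch in text)

# Invented noisy OCR output paired with its corrected form.
pairs = [
    ("talo5sa", "talossa"),
    ("mnulla on", "minulla on"),
]

with open("train.src", "w", encoding="utf-8") as src, \
     open("train.tgt", "w", encoding="utf-8") as tgt:
    for noisy, clean in pairs:
        src.write(to_char_tokens(noisy) + "\n")
        tgt.write(to_char_tokens(clean) + "\n")
```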

2. Practical and Ethical Considerations in the Effective use of Emotion and Sentiment Lexicons [PDF] Back to Contents
  Saif M. Mohammad
Abstract: Lexicons of word-emotion associations are widely used in research and real-world applications. As part of my research, I have created several such lexicons (e.g., the NRC Emotion Lexicon). This paper outlines some practical and ethical considerations involved in the effective use of these lexical resources.

3. Understanding Pure Character-Based Neural Machine Translation: The Case of Translating Finnish into English [PDF] Back to Contents
  Gongbo Tang, Rico Sennrich, Joakim Nivre
Abstract: Recent work has shown that deeper character-based neural machine translation (NMT) models can outperform subword-based models. However, it is still unclear what makes deeper character-based models successful. In this paper, we investigate pure character-based models in the case of translating Finnish into English, exploring their ability to learn word senses and morphological inflections as well as the behaviour of the attention mechanism. We demonstrate that word-level information is distributed over the entire character sequence rather than over a single character, and that characters at different positions play different roles in learning linguistic knowledge. In addition, character-based models need more layers to encode word senses, which explains why only deeper models outperform subword-based models. The attention distribution pattern shows that separators attract a lot of attention, and we explore a sparse word-level attention to force character hidden states to capture the full word-level information. Experimental results show that word-level attention with a single head results in a drop of 1.2 BLEU points.

4. Answer Span Correction in Machine Reading Comprehension [PDF] Back to Contents
  Revanth Gangi Reddy, Md Arafat Sultan, Efsun Sarioglu Kayi, Rong Zhang, Vittorio Castelli, Avirup Sil
Abstract: Answer validation in machine reading comprehension (MRC) consists of verifying an extracted answer against an input context and question pair. Previous work has looked at re-assessing the "answerability" of the question given the extracted answer. Here we address a different problem: the tendency of existing MRC systems to produce partially correct answers when presented with answerable questions. We explore the nature of such errors and propose a post-processing correction method that yields statistically significant performance improvements over state-of-the-art MRC systems in both monolingual and multilingual evaluation.

5. Fighting an Infodemic: COVID-19 Fake News Dataset [PDF] Back to Contents
  Parth Patwa, Shivam Sharma, Srinivas PYKL, Vineeth Guptha, Gitanjali Kumari, Md Shad Akhtar, Asif Ekbal, Amitava Das, Tanmoy Chakraborty
Abstract: Along with the COVID-19 pandemic we are also fighting an 'infodemic'. Fake news and rumors are rampant on social media. Believing in rumors can cause significant harm. This is further exacerbated at the time of a pandemic. To tackle this, we curate and release a manually annotated dataset of 10,700 social media posts and articles of real and fake news on COVID-19. We benchmark the annotated dataset with four machine learning baselines - Decision Tree, Logistic Regression, Gradient Boost, and Support Vector Machine (SVM). We obtain the best performance of 93.46% F1-score with SVM. The data and code are available at: this https URL
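
A minimal scikit-learn sketch of the SVM baseline follows; the TF-IDF features and toy examples are assumptions, since the abstract does not specify the exact feature set.

```python
# Sketch of a TF-IDF + linear SVM fake-news baseline (features assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

# Toy stand-ins for the 10,700 annotated posts and articles.
train_texts = ["masks reduce the spread of covid", "drinking bleach cures covid"]
train_labels = ["real", "fake"]
test_texts = ["bleach cures covid in one day"]
test_labels = ["fake"]

vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LinearSVC().fit(vec.fit_transform(train_texts), train_labels)
pred = clf.predict(vec.transform(test_texts))
print(f1_score(test_labels, pred, average="weighted"))
```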

6. Learning to Respond with Your Favorite Stickers: A Framework of Unifying Multi-Modality and User Preference in Multi-Turn Dialog [PDF] Back to Contents
  Shen Gao, Xiuying Chen, Li Liu, Dongyan Zhao, Rui Yan
Abstract: Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps, and some works are dedicated to automatically selecting a sticker response by matching the sticker image with previous utterances. However, existing methods usually focus on measuring the matching degree between the dialog context and the sticker image, which ignores the user's preferences in using stickers. Hence, in this paper, we propose to recommend an appropriate sticker to the user based on the multi-turn dialog context and the user's sticker-usage history. Two main challenges are confronted in this task. One is to model the user's sticker preference based on their previous sticker selection history. The other is to jointly fuse the user preference and the matching between dialog context and candidate sticker into the final prediction. To tackle these challenges, we propose a Preference Enhanced Sticker Response Selector (PESRS) model. Specifically, PESRS first employs a convolutional sticker image encoder and a self-attention based multi-turn dialog encoder to obtain representations of stickers and utterances. Next, a deep interaction network is proposed to conduct deep matching between the sticker and each utterance. Then, we model the user preference by using the recently selected stickers as input, and use a key-value memory network to store the preference representation. PESRS then learns the short-term and long-term dependency between all interaction results by a fusion network, and dynamically fuses the user preference representation into the final sticker selection prediction. Extensive experiments conducted on a large-scale real-world dialog dataset show that our model achieves state-of-the-art performance on all commonly-used metrics. Experiments also verify the effectiveness of each component of PESRS.
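
The key-value memory read behind the preference modelling can be sketched in a few lines of numpy; the dimensions and variable names are illustrative assumptions, not the paper's.

```python
# Toy key-value memory read: attend over stored sticker-preference slots.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8                              # hidden size (assumed)
query = np.random.randn(d)         # encoding of the current dialog context
keys = np.random.randn(5, d)       # one key per recently used sticker
values = np.random.randn(5, d)     # stored preference representations

attn = softmax(keys @ query)       # attention over memory slots
preference = attn @ values         # preference vector fused downstream
print(preference.shape)            # (8,)
```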

7. Improving Machine Reading Comprehension with Single-choice Decision and Transfer Learning [PDF] Back to Contents
  Yufan Jiang, Shuangzhi Wu, Jing Gong, Yahui Cheng, Peng Meng, Weiliang Lin, Zhibo Chen, Mu li
Abstract: Multi-choice Machine Reading Comprehension (MMRC) aims to select the correct answer from a set of options based on a given passage and question. Due to the task-specific nature of MMRC, it is non-trivial to transfer knowledge from other MRC tasks such as SQuAD and DREAM. In this paper, we simply reconstruct multi-choice as single-choice by training a binary classifier to distinguish whether a certain answer is correct, and then select the option with the highest confidence score. We build our model upon the ALBERT-xxlarge model and evaluate it on the RACE dataset. During training, we adopt an AutoML strategy to tune the parameters. Experimental results show that single-choice is better than multi-choice. In addition, by transferring knowledge from other kinds of MRC tasks, our model achieves new state-of-the-art results in both single-model and ensemble settings.
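
A minimal sketch of the single-choice reformulation: score each option independently with a binary classifier and keep the most confident one. Here `bert-base-uncased` is a stand-in with an untrained classification head (the paper builds on ALBERT-xxlarge).

```python
# Sketch: multi-choice MRC recast as per-option binary classification.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # placeholder, untrained head

passage, question = "Tom fed his cat before leaving.", "Who was fed?"
options = ["the cat", "the dog", "Tom", "nobody"]

scores = []
for opt in options:
    inputs = tok(passage + " " + question, opt,
                 return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    scores.append(torch.softmax(logits, dim=-1)[0, 1].item())  # P(correct)

print(options[max(range(len(options)), key=scores.__getitem__)])
```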

8. The ApposCorpus: A new multilingual, multi-domain dataset for factual appositive generation [PDF] Back to Contents
  Yova Kementchedjhieva, Di Lu, Joel Tetreault
Abstract: News articles, image captions, product reviews and many other texts mention people and organizations whose name recognition could vary for different audiences. In such cases, background information about the named entities could be provided in the form of an appositive noun phrase, either written by a human or generated automatically. We expand on the previous work in appositive generation with a new, more realistic, end-to-end definition of the task, instantiated by a dataset that spans four languages (English, Spanish, German and Polish), two entity types (person and organization) and two domains (Wikipedia and News). We carry out an extensive analysis of the data and the task, pointing to the various modeling challenges it poses. The results we obtain with standard language generation methods show that the task is indeed non-trivial, and leaves plenty of room for improvement.

9. Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation [PDF] Back to Contents
  Haryo Akbarianto Wibowo, Tatag Aziz Prawiro, Muhammad Ihsan, Alham Fikri Aji, Radityo Eko Prasojo, Rahmad Mahendra
Abstract: In its daily use, the Indonesian language is riddled with informality, that is, deviations from the standard in terms of vocabulary, spelling, and word order. On the other hand, currently available Indonesian NLP models are typically developed with the standard Indonesian in mind. In this work, we address style transfer from informal to formal Indonesian as a low-resource machine translation problem. We build a new dataset of parallel sentences of informal Indonesian and its formal counterpart. We benchmark several strategies to perform style transfer from informal to formal Indonesian. We also explore augmenting the training set with artificial forward-translated data. Since we are dealing with an extremely low-resource setting, we find that a phrase-based machine translation approach outperforms the Transformer-based approach. Alternatively, a pre-trained GPT-2 fine-tuned to this task performed equally well but costs more computational resources. Our findings show a promising step towards leveraging machine translation models for style transfer. Our code and data are available in this https URL
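
The iterative forward-translation loop can be sketched schematically as follows; `train` and `translate` are hypothetical stand-ins for the actual MT components.

```python
# Schematic iterative forward-translation for data augmentation.
def train(parallel):              # hypothetical: fit an MT model on pairs
    return {"data": list(parallel)}

def translate(model, sentences):  # hypothetical: informal -> formal (stub)
    return sentences              # identity here, purely illustrative

parallel = [("gimana kabarnya?", "bagaimana kabarnya?")]  # informal, formal
monolingual_informal = ["udah makan belum?"]

model = train(parallel)
for _ in range(3):                # a few refinement rounds
    synthetic = list(zip(monolingual_informal,
                         translate(model, monolingual_informal)))
    model = train(parallel + synthetic)  # retrain on real + synthetic pairs
```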

10. Corpora Compared: The Case of the Swedish Gigaword & Wikipedia Corpora [PDF] Back to Contents
  Tosin P. Adewumi, Foteini Liwicki, Marcus Liwicki
Abstract: In this work, we show that the difference in performance of embeddings from differently sourced data for a given language can be due to other factors besides data size. Natural language processing (NLP) tasks usually perform better with embeddings from bigger corpora. However, broadness of covered domain and noise can play important roles. We evaluate embeddings based on two Swedish corpora: The Gigaword and Wikipedia, in analogy (intrinsic) tests and discover that the embeddings from the Wikipedia corpus generally outperform those from the Gigaword corpus, which is a bigger corpus. Downstream tests will be required to have a definite evaluation.
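
An intrinsic analogy evaluation of this kind can be run with gensim, for example; the file names below are placeholders for the Swedish embeddings and analogy set.

```python
# Sketch: intrinsic analogy test over pre-trained word vectors.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("swedish_vectors.vec")  # placeholder
score, sections = kv.evaluate_word_analogies("swedish_analogies.txt")
print(f"analogy accuracy: {score:.3f}")
```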

11. Alquist 3.0: Alexa Prize Bot Using Conversational Knowledge Graph [PDF] Back to Contents
  Jan Pichl, Petr Marek, Jakub Konrád, Petr Lorenc, Van Duy Ta, Jan Šedivý
Abstract: The third version of the open-domain dialogue system Alquist, developed within the Alexa Prize 2020 competition, is designed to conduct coherent and engaging conversations on popular topics. The main novel contribution is the introduction of a system leveraging an innovative approach based on a conversational knowledge graph and adjacency pairs. The conversational knowledge graph allows the system to utilize knowledge expressed during the dialogue in subsequent turns and across conversations. Dialogue adjacency pairs divide the conversation into small conversational structures, which can be combined and allow the system to react flexibly to a wide range of user inputs. We discuss and describe Alquist's pipeline, data acquisition and processing, dialogue manager, NLG, knowledge aggregation, and a hierarchy of adjacency pairs. We present the experimental results of the individual parts of the system.

12. Alquist 2.0: Alexa Prize Socialbot Based on Sub-Dialogue Models [PDF] Back to Contents
  Jan Pichl, Petr Marek, Jakub Konrád, Martin Matulík, Jan Šedivý
Abstract: This paper presents the second version of the dialogue system named Alquist competing in Amazon Alexa Prize 2018. We introduce a system leveraging ontology-based topic structure called topic nodes. Each of the nodes consists of several sub-dialogues, and each sub-dialogue has its own LSTM-based model for dialogue management. The sub-dialogues can be triggered according to the topic hierarchy or a user intent which allows the bot to create a unique experience during each session.

13. OP-IMS @ DIACR-Ita: Back to the Roots: SGNS+OP+CD still rocks Semantic Change Detection [PDF] Back to Contents
  Jens Kaiser, Dominik Schlechtweg, Sabine Schulte im Walde
Abstract: We present the results of our participation in the DIACR-Ita shared task on lexical semantic change detection for Italian. We exploit one of the earliest and most influential semantic change detection models, based on Skip-Gram with Negative Sampling, Orthogonal Procrustes alignment and Cosine Distance, and obtain the winning submission of the shared task with near-perfect accuracy (.94). Our results once more indicate that, within the present task setup in lexical semantic change detection, the traditional type-based approaches yield excellent performance.
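
The OP+CD part of the pipeline is compact enough to sketch directly: align the two SGNS spaces with an orthogonal Procrustes rotation, then score each word by the cosine distance between its aligned vectors. The shapes below are assumptions.

```python
# Sketch: orthogonal Procrustes alignment + cosine-distance change scores.
import numpy as np

def procrustes_rotation(X, Y):
    """Orthogonal R minimizing ||X R - Y||_F (rows = shared vocabulary)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def cosine_distance(a, b):
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

X = np.random.randn(1000, 100)      # SGNS vectors from time period 1
Y = np.random.randn(1000, 100)      # SGNS vectors from time period 2
Xa = X @ procrustes_rotation(X, Y)  # map space 1 into space 2

change = [cosine_distance(Xa[i], Y[i]) for i in range(len(X))]
print(max(change))                  # larger distance = more semantic change
```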

14. From Dataset Recycling to Multi-Property Extraction and Beyond [PDF] Back to Contents
  Tomasz Dwojak, Michał Pietruszka, Łukasz Borchmann, Jakub Chłędowski, Filip Graliński
Abstract: This paper investigates various Transformer architectures on the WikiReading Information Extraction and Machine Reading Comprehension dataset. The proposed dual-source model outperforms the current state-of-the-art by a large margin. Next, we introduce WikiReading Recycled, a newly developed public dataset, and the task of multiple property extraction. It uses the same data as WikiReading but does not inherit its predecessor's identified disadvantages. In addition, we provide a human-annotated test set with diagnostic subsets for a detailed analysis of model performance.

15. Unleashing the Power of Neural Discourse Parsers -- A Context and Structure Aware Approach Using Large Scale Pretraining [PDF] Back to Contents
  Grigorii Guz, Patrick Huber, Giuseppe Carenini
Abstract: RST-based discourse parsing is an important NLP task with numerous downstream applications, such as summarization, machine translation and opinion mining. In this paper, we demonstrate a simple, yet highly accurate discourse parser, incorporating recent contextual language models. Our parser establishes the new state-of-the-art (SOTA) performance for predicting structure and nuclearity on two key RST datasets, RST-DT and Instr-DT. We further demonstrate that pretraining our parser on the recently available large-scale "silver-standard" discourse treebank MEGA-DT provides even larger performance benefits, suggesting a novel and promising research direction in the field of discourse analysis.

16. What's New? Summarizing Contributions in Scientific Literature [PDF] Back to Contents
  Hiroaki Hayashi, Wojciech Kryściński, Bryan McCann, Nazneen Rajani, Caiming Xiong
Abstract: With thousands of academic articles shared on a daily basis, it has become increasingly difficult to keep up with the latest scientific findings. To overcome this problem, we introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work, making it easier to identify the key findings shared in articles. For this purpose, we extend the S2ORC corpus of academic articles, which spans a diverse set of domains ranging from economics to psychology, by adding disentangled "contribution" and "context" reference labels. Together with the dataset, we introduce and analyze three baseline approaches: 1) a unified model controlled by input code prefixes, 2) a model with separate generation heads specialized in generating the disentangled outputs, and 3) a training strategy that guides the model using additional supervision coming from inbound and outbound citations. We also propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs. Through a human study involving expert annotators, we show that in 79% of cases our new task is considered more helpful than traditional scientific paper summarization.
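
The control-code idea behind the first baseline can be illustrated with a small data-preparation sketch; the tag strings are assumptions, as the abstract does not spell out the exact codes.

```python
# Sketch: one seq2seq model, steered by a code prefixed to the source.
paper_text = "..."  # full paper body (placeholder)

examples = [
    ("<|contribution|> " + paper_text, "gold summary of the contributions"),
    ("<|context|> " + paper_text, "gold summary of the background context"),
]
for source, target in examples:
    print(source[:20], "->", target)
```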

17. Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities [PDF] Back to Contents
  Hao Zhang, Jae Ro, Richard Sproat
Abstract: Breaking domain names such as openresearch into component words open and research is important for applications like Text-to-Speech synthesis and web search. We link this problem to the classic problem of Chinese word segmentation and show the effectiveness of a tagging model based on Recurrent Neural Networks (RNNs) using characters as input. To compensate for the lack of training data, we propose a pre-training method on concatenated entity names in a large knowledge database. Pre-training improves the model by 33% and brings the sequence accuracy to 85%.
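
A small sketch of how segmented names, whether split URLs or the knowledge-graph entity names used for pre-training, can be turned into character-level tagging data:

```python
# Sketch: characters as inputs, binary "word ends here" labels as targets.
def make_example(words):
    chars = list("".join(words))
    labels = []
    for w in words:
        labels += [0] * (len(w) - 1) + [1]  # 1 = boundary after this char
    return chars, labels

print(make_example(["open", "research"]))
# (['o','p','e','n','r','e','s','e','a','r','c','h'],
#  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])
```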

18. Improving RNN Transducer Based ASR with Auxiliary Tasks [PDF] Back to Contents
  Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, Geoffrey Zweig
Abstract: End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results compared to conventional hybrid speech recognizers. Specifically, recurrent neural network transducer (RNN-T) has shown competitive ASR performance on various benchmarks. In this work, we examine ways in which RNN-T can achieve better ASR accuracy via performing auxiliary tasks. We propose (i) using the same auxiliary task as primary RNN-T ASR task, and (ii) performing context-dependent graphemic state prediction as in conventional hybrid modeling. In transcribing social media videos with varying training data size, we first evaluate the streaming ASR performance on three languages: Romanian, Turkish and German. We find that both proposed methods provide consistent improvements. Next, we observe that both auxiliary tasks demonstrate efficacy in learning deep transformer encoders for RNN-T criterion, thus achieving competitive results - 2.0%/4.2% WER on LibriSpeech test-clean/other - as compared to prior top performing models.

19. Explain by Evidence: An Explainable Memory-based Neural Network for Question Answering [PDF] Back to Contents
  Quan Tran, Nhan Dam, Tuan Lai, Franck Dernoncourt, Trung Le, Nham Le, Dinh Phung
Abstract: Interpretability and explainability of deep neural networks are challenging due to their scale, complexity, and the agreeable notions on which the explaining process rests. Previous work, in particular, has focused on representing internal components of neural networks through human-friendly visuals and concepts. On the other hand, in real life, when making a decision, humans tend to rely on similar situations and/or associations from the past. Hence arguably, a promising approach to make the model transparent is to design it in a way such that the model explicitly connects the current sample with seen ones, and bases its decision on these samples. Grounded in that principle, we propose in this paper an explainable, evidence-based memory network architecture, which learns to summarize the dataset and extract supporting evidence to make its decision. Our model achieves state-of-the-art performance on two popular question answering datasets (i.e., TrecQA and WikiQA). Via further analysis, we show that this model can reliably trace the errors it has made in the validation step to the training instances that might have caused these errors. We believe that this error-tracing capability provides significant benefits in improving dataset quality in many applications.
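
As a generic illustration of prediction backed by retrieved evidence (a sketch, not the paper's memory architecture), returning the training instances nearest to a query as its supporting evidence might look like this:

```python
# Sketch: cosine-similarity retrieval of training instances as "evidence".
import numpy as np

train_vecs = np.random.randn(100, 64)          # encoded training QA pairs
train_ids = [f"train-{i}" for i in range(100)]
query = np.random.randn(64)                    # encoded test question

sims = train_vecs @ query / (
    np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query))
evidence = [train_ids[i] for i in np.argsort(-sims)[:3]]
print("supporting evidence:", evidence)
```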

20. Machine Generation and Detection of Arabic Manipulated and Fake News [PDF] Back to Contents
  El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Tariq Alhindi, Hasan Cavusoglu
Abstract: Fake news and deceptive machine-generated text are serious problems threatening modern societies, including in the Arab world. This motivates work on detecting false and manipulated stories online. However, a bottleneck for this research is the lack of sufficient data to train detection models. We present a novel method for automatically generating Arabic manipulated (and potentially fake) news stories. Our method is simple and only depends on the availability of true stories, which are abundant online, and a part-of-speech (POS) tagger. To facilitate future work, we dispense with both of these requirements altogether by providing AraNews, a novel and large POS-tagged news dataset that can be used off-the-shelf. Using stories generated based on AraNews, we carry out a human annotation study that casts light on the effects of machine manipulation on text veracity. The study also measures human ability to detect Arabic machine-manipulated text generated by our method. Finally, we develop the first models for detecting manipulated Arabic news and achieve state-of-the-art results on Arabic fake news detection (macro F1 = 70.06). Our models and data are publicly available.
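
A heavily simplified illustration of POS-guided manipulation, shown on English with NLTK because the abstract does not give the Arabic tagger or substitution strategy; the replacement lexicon is a toy assumption.

```python
# Sketch: replace a word with another of the same POS to "manipulate" text.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The minister visited the northern region yesterday"
tags = nltk.pos_tag(nltk.word_tokenize(sentence))

swaps = {"JJ": "southern"}  # toy same-POS replacement lexicon (assumed)
manipulated = [swaps.get(tag, tok) for tok, tag in tags]
print(" ".join(manipulated))
```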

21. HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification [PDF] Back to Contents
  Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, Mohit Bansal
Abstract: We introduce HoVer (HOppy VERification), a dataset for many-hop evidence extraction and fact verification. It challenges models to extract facts from several Wikipedia articles that are relevant to a claim and classify whether the claim is Supported or Not-Supported by the facts. In HoVer, the claims require evidence to be extracted from as many as four English Wikipedia articles and embody reasoning graphs of diverse shapes. Moreover, most of the 3/4-hop claims are written in multiple sentences, which adds to the complexity of understanding long-range dependency relations such as coreference. We show that the performance of an existing state-of-the-art semantic-matching model degrades significantly on our dataset as the number of reasoning hops increases, hence demonstrating the necessity of many-hop reasoning to achieve strong results. We hope that the introduction of this challenging dataset and the accompanying evaluation task will encourage research in many-hop fact retrieval and information verification. We make the HoVer dataset publicly available at this https URL

22. EXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answering [PDF] Back to Contents
  Momchil Hardalov, Todor Mihaylov, Dimitrina Zlatkova, Yoan Dinkov, Ivan Koychev, Preslav Nakov
Abstract: We propose EXAMS -- a new benchmark dataset for cross-lingual and multilingual question answering for high school examinations. We collected more than 24,000 high-quality high school exam questions in 16 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences, among others. EXAMS offers a fine-grained evaluation framework across multiple languages and subjects, which allows precise analysis and comparison of various models. We perform various experiments with existing top-performing multilingual pre-trained models, and we show that EXAMS offers multiple challenges that require multilingual knowledge and reasoning in multiple domains. We hope that EXAMS will enable researchers to explore challenging reasoning and knowledge transfer methods and pre-trained models for school question answering in various languages, which was not possible before. The data, code, pre-trained models, and evaluation are available at this https URL.

23. Alignment Restricted Streaming Recurrent Neural Network Transducer [PDF] Back to Contents
  Jay Mahadeokar, Yuan Shangguan, Duc Le, Gil Keren, Hang Su, Thong Le, Ching-Feng Yeh, Christian Fuegen, Michael L. Seltzer
Abstract: There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment of the training transcripts and audio. As a result, RNN-T models built with uni-directional long short term memory (LSTM) encoders tend to wait for longer spans of input audio, before streaming already decoded ASR tokens. In this work, we propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models, which utilize audio-text alignment information to guide the loss computation. We compare the proposed method with existing works, such as monotonic RNN-T, on LibriSpeech and in-house datasets. We show that the Ar-RNN-T loss provides a refined control to navigate the trade-offs between the token emission delays and the Word Error Rate (WER). The Ar-RNN-T models also improve downstream applications such as the ASR End-pointing by guaranteeing token emissions within any given range of latency. Moreover, the Ar-RNN-T loss allows for bigger batch sizes and 4 times higher throughput for our LSTM model architecture, enabling faster training and convergence on GPUs.

24. PubSqueezer: A Text-Mining Web Tool to Transform Unstructured Documents into Structured Data [PDF] Back to Contents
  Alberto Calderone
Abstract: The number of scientific papers published every day is daunting and constantly increasing. Keeping up with the literature represents a challenge. If one wants to start exploring new topics, it is hard to get the big picture without reading lots of articles. Furthermore, as one reads through the literature, making mental connections is crucial to asking new questions which might lead to discoveries. In this work, I present a web tool which uses a Text Mining strategy to transform large collections of unstructured biomedical articles into structured data. Generated results give a quick overview of complex topics, which can possibly suggest information that is not explicitly reported. In particular, I show two Data Science analyses. First, I present a literature-based rare-disease network built using this tool, in the hope that it will help clarify some aspects of these less popular pathologies. Secondly, I show how a literature-based analysis conducted with PubSqueezer results allows the description of known facts about SARS-CoV-2. In one sentence, data generated with PubSqueezer make it easy to use scientific literature in any computational analysis such as machine learning, natural language processing, etc. Availability: this http URL

25. Multilingual Bottleneck Features for Improving ASR Performance of Code-Switched Speech in Under-Resourced Languages [PDF] Back to Contents
  Trideba Padhi, Astik Biswas, Febe De Wet, Ewald van der Westhuizen, Thomas Niesler
Abstract: In this work, we explore the benefits of using multilingual bottleneck features (mBNF) in acoustic modelling for the automatic speech recognition of code-switched (CS) speech in African languages. The unavailability of annotated corpora in the languages of interest has always been a primary challenge when developing speech recognition systems for this severely under-resourced type of speech. Hence, it is worthwhile to investigate the potential of using speech corpora available for other better-resourced languages to improve speech recognition performance. To achieve this, we train an mBNF extractor using nine Southern Bantu languages that form part of the freely available multilingual NCHLT corpus. We append these mBNFs to the existing MFCCs, pitch features and i-vectors to train acoustic models for automatic speech recognition (ASR) in the target code-switched languages. Our results show that the inclusion of the mBNF features leads to clear performance improvements over a baseline trained without the mBNFs for code-switched English-isiZulu, English-isiXhosa, English-Sesotho and English-Setswana speech.
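
Appending the mBNFs to the existing per-frame features is a straightforward concatenation; the dimensions below are illustrative, not those of the actual recipe.

```python
# Sketch: stack MFCCs, pitch, i-vector and mBNFs per frame.
import numpy as np

frames = 500
mfcc = np.random.randn(frames, 13)                 # MFCCs
pitch = np.random.randn(frames, 3)                 # pitch features
ivec = np.tile(np.random.randn(100), (frames, 1))  # utterance i-vector
mbnf = np.random.randn(frames, 40)                 # bottleneck features

features = np.hstack([mfcc, pitch, ivec, mbnf])
print(features.shape)                              # (500, 156)
```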
