
[arXiv papers] Computation and Language 2020-06-22

Contents

1. New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles [PDF] Abstract
2. Exploring Processing of Nested Dependencies in Neural-Network Language Models and Humans [PDF] Abstract
3. Dataset for Automatic Summarization of Russian News [PDF] Abstract
4. A Survey of Syntactic-Semantic Parsing Based on Constituent and Dependency Structures [PDF] Abstract
5. Sentiment Frames for Attitude Extraction in Russian [PDF] Abstract
6. Neural Topic Modeling with Continual Lifelong Learning [PDF] Abstract
7. Graphs with Multiple Sources per Vertex [PDF] Abstract
8. A Qualitative Evaluation of Language Models on Automatic Question-Answering for COVID-19 [PDF] Abstract
9. Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [PDF] Abstract

Abstracts

1. New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles [PDF] Back to contents
  Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen
Abstract: Although over 95 million people in the world speak the Vietnamese language, there are not any large and qualified datasets for automatic reading comprehension. In addition, machine reading comprehension for the health domain offers great potential for practical applications; however, there is still very little machine reading comprehension research in this domain. In this study, we present ViNewsQA as a new corpus for the low-resource Vietnamese language to evaluate models of machine reading comprehension. The corpus comprises 10,138 human-generated question-answer pairs. Crowdworkers created the questions and answers based on a set of over 2,030 online Vietnamese news articles from the VnExpress news website, where the answers comprised spans extracted from the corresponding articles. In particular, we developed a process of creating a corpus for the Vietnamese language. Comprehensive evaluations demonstrated that our corpus requires abilities beyond simple reasoning such as word matching, as well as demanding difficult reasoning similar to inferences based on single-or-multiple-sentence information. We conducted experiments using state-of-the-art methods for machine reading comprehension to obtain the first baseline performance measures, which will be compared with further models' performances. We measured human performance based on the corpus and compared it with several strong neural models. Our experiments showed that the best model was BERT, which achieved an exact match score of 57.57% and F1-score of 76.90% on our corpus. The significant difference between humans and the best model (F1-score of 15.93%) on the test set of our corpus indicates that improvements in ViNewsQA can be explored in future research. Our corpus is freely available on our website in order to encourage the research community to make these improvements.
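The exact-match and F1 figures reported above follow the standard scoring scheme for extractive reading comprehension, where a predicted answer span is compared token-by-token against a gold span. The sketch below is an illustration of that scheme under simplifying assumptions (whitespace tokenization, lowercasing only), not the authors' evaluation script.

```python
# SQuAD-style EM and token-level F1 for extractive MRC, simplified.

def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a simplification of real normalization)."""
    return text.lower().split()

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

def f1_score(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    # count overlapping tokens with multiplicity
    common, ref_left = 0, list(ref)
    for tok in pred:
        if tok in ref_left:
            common += 1
            ref_left.remove(tok)
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With multiple gold answers per question, the convention is to take the maximum score over the references.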

2. Exploring Processing of Nested Dependencies in Neural-Network Language Models and Humans [PDF] Back to contents
  Yair Lakretz, Dieuwke Hupkes, Alessandra Vergallito, Marco Marelli, Marco Baroni, Stanislas Dehaene
Abstract: Recursive processing in sentence comprehension is considered a hallmark of human linguistic abilities. However, its underlying neural mechanisms remain largely unknown. We studied whether a recurrent neural network with Long Short-Term Memory units can mimic a central aspect of human sentence processing, namely the handling of long-distance agreement dependencies. Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of a small set of specialized units that successfully handled local and long-distance syntactic agreement for grammatical number. However, simulations showed that this mechanism does not support full recursion and fails with some long-range embedded dependencies. We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns, with or without embedding. Human and model error patterns were remarkably similar, showing that the model echoes various effects observed in human data. However, a key difference was that, with embedded long-range dependencies, humans remained above chance level, while the model's systematic errors brought it below chance. Overall, our study shows that exploring the ways in which modern artificial neural networks process sentences leads to precise and testable hypotheses about human linguistic performance.
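The behavioral experiment described above crosses the singular/plural status of multiple nouns with a matching or violating verb. A toy sketch of such stimulus construction (with invented lexical items, not the paper's materials):

```python
# Generate all singular/plural combinations of two nouns plus a verb;
# the item is grammatical when the verb agrees with the outer subject noun.
from itertools import product

NOUNS = {"sg": ("key", "cabinet"), "pl": ("keys", "cabinets")}
VERBS = {"sg": "is", "pl": "are"}

def stimuli():
    items = []
    for n1, n2, v in product(("sg", "pl"), repeat=3):
        sent = f"The {NOUNS[n1][0]} near the {NOUNS[n2][1]} {VERBS[v]} rusty"
        items.append({"sentence": sent, "grammatical": n1 == v})
    return items
```

A language model's agreement accuracy is then measured by whether it assigns higher probability to the grammatical verb form in each item.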

3. Dataset for Automatic Summarization of Russian News [PDF] Back to contents
  Ilya Gusev
Abstract: Automatic text summarization has been studied in a variety of domains and languages. However, this does not hold for the Russian language. To overcome this issue, we present Gazeta, the first dataset for summarization of Russian news. We describe the properties of this dataset and benchmark several extractive and abstractive models. We demonstrate that the dataset is a valid task for methods of text summarization for Russian. Additionally, we prove the pretrained mBART model to be useful for Russian text summarization.
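A standard point of comparison when benchmarking extractive summarizers on news, as this paper does, is the lead baseline: take the first k sentences of the article. A minimal sketch (naive sentence splitting; the paper's stronger baselines include a pretrained mBART):

```python
# Lead-k extractive baseline for news summarization.
import re

def lead_k(article: str, k: int = 3) -> str:
    # naive split after sentence-final ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:k])
```

Lead baselines are hard to beat on news because journalistic style front-loads the key facts.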

4. A Survey of Syntactic-Semantic Parsing Based on Constituent and Dependency Structures [PDF] Back to contents
  Meishan Zhang
Abstract: Syntactic and semantic parsing has been investigated for decades, which is one primary topic in the natural language processing community. This article aims for a brief survey on this topic. The parsing community includes many tasks, which are difficult to be covered fully. Here we focus on two of the most popular formalizations of parsing: constituent parsing and dependency parsing. Constituent parsing is majorly targeted to syntactic analysis, and dependency parsing can handle both syntactic and semantic analysis. This article briefly reviews the representative models of constituent parsing and dependency parsing, and also dependency graph parsing with rich semantics. Besides, we also review the closely-related topics such as cross-domain, cross-lingual and joint parsing models, parser application as well as corpus development of parsing in the article.
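The dependency formalism the survey covers represents a sentence as a set of head-dependent arcs. A minimal CoNLL-U-style sketch of that data structure (illustrative only, not tied to any parser in the survey):

```python
# Each token records its 1-based index, surface form, head index
# (0 for the root), and dependency relation label.
from dataclasses import dataclass

@dataclass
class Token:
    idx: int
    form: str
    head: int
    deprel: str

def root_of(tree: list[Token]) -> Token:
    """Return the token attached to the artificial root (head == 0)."""
    return next(t for t in tree if t.head == 0)

# "She reads books": 'reads' is the root, 'She' its nsubj, 'books' its obj.
sent = [Token(1, "She", 2, "nsubj"),
        Token(2, "reads", 0, "root"),
        Token(3, "books", 2, "obj")]
```

Dependency graph parsing with richer semantics, also reviewed in the article, relaxes the single-head constraint so a token may have several incoming arcs.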

5. Sentiment Frames for Attitude Extraction in Russian [PDF] Back to contents
  Natalia Loukachevitch, Nicolay Rusnachenko
Abstract: Texts can convey several types of inter-related information concerning opinions and attitudes. Such information includes the author's attitude towards mentioned entities, attitudes of the entities towards each other, positive and negative effects on the entities in the described situations. In this paper, we described the lexicon RuSentiFrames for Russian, where predicate words and expressions are collected and linked to so-called sentiment frames conveying several types of presupposed information on attitudes and effects. We applied the created frames in the task of extracting attitudes from a large news collection.
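The core idea of a sentiment-frame lexicon can be sketched as a mapping from predicate words to the presupposed attitudes between their roles. The frame entries below are invented for illustration and are not taken from RuSentiFrames:

```python
# Hypothetical frame entries: each predicate presupposes polarities
# for attitudes such as author-toward-object and subject-toward-object.
FRAMES = {
    "praise": {"author_to_object": "pos", "subject_to_object": "pos"},
    "accuse": {"author_to_object": "neg", "subject_to_object": "neg"},
}

def attitude(predicate, relation):
    """Look up the presupposed polarity, or None if the pair is not covered."""
    return FRAMES.get(predicate, {}).get(relation)
```

Attitude extraction then reduces to spotting a frame-bearing predicate in text, identifying its role fillers, and reading the polarities off the frame.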

6. Neural Topic Modeling with Continual Lifelong Learning [PDF] Back to contents
  Pankaj Gupta, Yatin Chaudhary, Thomas Runkler, Hinrich Schütze
Abstract: Lifelong learning has recently attracted attention in building machine learning systems that continually accumulate and transfer knowledge to help future learning. Unsupervised topic modeling has been popularly used to discover topics from document collections. However, the application of topic modeling is challenging due to data sparsity, e.g., in a small collection of (short) documents and thus, generate incoherent topics and sub-optimal document representations. To address the problem, we propose a lifelong learning framework for neural topic modeling that can continuously process streams of document collections, accumulate topics and guide future topic modeling tasks by knowledge transfer from several sources to better deal with the sparse data. In the lifelong process, we particularly investigate jointly: (1) sharing generative homologies (latent topics) over lifetime to transfer prior knowledge, and (2) minimizing catastrophic forgetting to retain the past learning via novel selective data augmentation, co-training and topic regularization approaches. Given a stream of document collections, we apply the proposed Lifelong Neural Topic Modeling (LNTM) framework in modeling three sparse document collections as future tasks and demonstrate improved performance quantified by perplexity, topic coherence and information retrieval task.
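One ingredient of the lifelong setup above, countering catastrophic forgetting via selective data augmentation, can be illustrated with a simple replay buffer: keep a few exemplar documents from each finished collection and mix them into training on the current one. A toy sketch only, not the LNTM framework itself:

```python
# Replay buffer for continual training over a stream of document collections.
import random

class ReplayBuffer:
    def __init__(self, per_task: int = 2, seed: int = 0):
        self.per_task = per_task
        self.rng = random.Random(seed)
        self.exemplars = []

    def add_task(self, docs):
        """After finishing a task, retain a small random sample of its documents."""
        k = min(self.per_task, len(docs))
        self.exemplars.extend(self.rng.sample(list(docs), k))

    def augmented(self, current_docs):
        """Training stream for the current task: its documents plus past exemplars."""
        return list(current_docs) + list(self.exemplars)
```

The paper combines such replay with topic regularization and co-training so that latent topics learned earlier also constrain later tasks directly.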

7. Graphs with Multiple Sources per Vertex [PDF] Back to contents
  Martin van Harmelen, Jonas Groschwitz
Abstract: Several attempts have been made at constructing Abstract Meaning Representations (AMRs) compositionally, and recently the idea of using s-graphs with the HR-algebra (Koller, 2015) has been simplified to reduce the number of options when parsing (Groschwitz et al., 2017). This apply-modify algebra (AM-algebra) is a linguistically plausible graph algebra with two classes of operations, both of rank two: the apply operation is used to combine a predicate with its argument; the modify operation is used to modify a predicate. While the AM-algebra correctly handles relative clauses and complex cases of coordination, it cannot parse reflexive sentences like: "The raven washes herself." To facilitate processing of such reflexive sentences, this paper proposes to change the definition of s-graphs underlying the AM-algebra to allow vertices with multiple sources, and additionally proposes an adaption to the type system of the algebra to correctly handle such vertices.
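The paper's proposal, allowing a single s-graph vertex to carry several source names, can be sketched as a data structure (the AM-algebra operations themselves are not implemented here):

```python
# An s-graph whose source assignment maps source names to vertices,
# so several names may point at the same vertex.
class SGraph:
    def __init__(self):
        self.edges = []    # (from_vertex, label, to_vertex)
        self.sources = {}  # source name -> vertex

    def add_edge(self, u, label, v):
        self.edges.append((u, label, v))

    def mark(self, vertex, *names):
        """Attach one or more source names to the same vertex."""
        for name in names:
            self.sources[name] = vertex

    def sources_of(self, vertex):
        return {n for n, v in self.sources.items() if v == vertex}

# "The raven washes herself": the subject and object sources of the
# predicate both resolve to the raven vertex.
g = SGraph()
g.add_edge("wash", "ARG0", "raven")
g.add_edge("wash", "ARG1", "raven")
g.mark("raven", "s", "o")
```

Under the single-source restriction this reflexive reading is unrepresentable, which is exactly the gap the adapted type system is meant to close.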

8. A Qualitative Evaluation of Language Models on Automatic Question-Answering for COVID-19 [PDF] Back to contents
  David Oniani, Yanshan Wang
Abstract: COVID-19 has resulted in an ongoing pandemic and as of 12 June 2020, has caused more than 7.4 million cases and over 418,000 deaths. The highly dynamic and rapidly evolving situation with COVID-19 has made it difficult to access accurate, on-demand information regarding the disease. Online communities, forums, and social media provide potential venues to search for relevant questions and answers, or post questions and seek answers from other members. However, due to the nature of such sites, there are always a limited number of relevant questions and responses to search from, and posted questions are rarely answered immediately. With the advancements in the field of natural language processing, particularly in the domain of language models, it has become possible to design chatbots that can automatically answer consumer questions. However, such models are rarely applied and evaluated in the healthcare domain, to meet the information needs with accurate and up-to-date healthcare data. In this paper, we propose to apply a language model for automatically answering questions related to COVID-19 and qualitatively evaluate the generated responses. We utilized the GPT-2 language model and applied transfer learning to retrain it on the COVID-19 Open Research Dataset (CORD-19) corpus. In order to improve the quality of the generated responses, we applied 4 different approaches, namely tf-idf, BERT, BioBERT, and USE to filter and retain relevant sentences in the responses. In the performance evaluation step, we asked two medical experts to rate the responses. We found that BERT and BioBERT, on average, outperform both tf-idf and USE in relevance-based sentence filtering tasks. Additionally, based on the chatbot, we created a user-friendly interactive web application to be hosted online.
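The tf-idf variant of the sentence-filtering step above amounts to ranking candidate answer sentences by cosine similarity to the question and keeping the top ones. A pure-Python sketch under simplifying assumptions (whitespace tokenization, raw term frequency); the paper compares this against BERT, BioBERT, and USE embeddings for the same step:

```python
# tf-idf sentence filtering: keep the sentences most similar to the question.
import math
from collections import Counter

def tfidf_vectors(texts):
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(w for d in docs for w in d)          # document frequency
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # smoothed idf
    return [{w: c * idf[w] for w, c in d.items()} for d in docs]

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_sentences(question, sentences, top_k=1):
    vecs = tfidf_vectors([question] + sentences)
    q, rest = vecs[0], vecs[1:]
    ranked = sorted(zip(sentences, rest),
                    key=lambda p: cosine(q, p[1]), reverse=True)
    return [s for s, _ in ranked[:top_k]]
```

Embedding-based filters replace the sparse tf-idf vectors with dense sentence encodings but keep the same rank-and-truncate structure.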

9. Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [PDF] Back to contents
  Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka
Abstract: In this paper, we propose a joint model for simultaneous speaker counting, speech recognition, and speaker identification on monaural overlapped speech. Our model is built on serialized output training (SOT) with attention-based encoder-decoder, a recently proposed method for recognizing overlapped speech comprising an arbitrary number of speakers. We extend the SOT model by introducing a speaker inventory as an auxiliary input to produce speaker labels as well as multi-speaker transcriptions. All model parameters are optimized by speaker-attributed maximum mutual information criterion, which represents a joint probability for overlapped speech recognition and speaker identification. Experiments on LibriSpeech corpus show that our proposed method achieves significantly better speaker-attributed word error rate than the baseline that separately performs overlapped speech recognition and speaker identification.
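The serialized output training (SOT) scheme the model builds on turns overlapping utterances into a single reference token stream, separated by a speaker-change symbol and ordered by utterance start time. A sketch of that label construction (the token name is illustrative):

```python
# Build an SOT reference transcription for a mixture of overlapping utterances.
SC = "<sc>"  # speaker-change token

def sot_reference(utterances):
    """utterances: list of (start_time, words) pairs for one audio mixture."""
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {SC} ".join(words for _, words in ordered)

mix = [(1.2, "how are you"), (0.4, "hello there")]
```

Because the serialization fixes an order, one attention-based decoder can emit any number of speakers' transcriptions, which is what lets the joint model also count speakers and attach speaker labels.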
