Abstracts

1. SimulEval: An Evaluation Toolkit for Simultaneous Translation
Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, Juan Pino
Abstract: Simultaneous translation on both text and speech focuses on a real-time and low-latency scenario where the model starts translating before reading the complete source input. Evaluating simultaneous translation models is more complex than offline models because the latency is another factor to consider in addition to translation quality. The research community, despite its growing focus on novel modeling approaches to simultaneous translation, currently lacks a universal evaluation procedure. Therefore, we present SimulEval, an easy-to-use and general evaluation toolkit for both simultaneous text and speech translation. A server-client scheme is introduced to create a simultaneous translation scenario, where the server sends source input and receives predictions for evaluation and the client executes customized policies. Given a policy, it automatically performs simultaneous decoding and collectively reports several popular latency metrics. We also adapt latency metrics from text simultaneous translation to the speech task. Additionally, SimulEval is equipped with a visualization interface to provide better understanding of the simultaneous decoding process of a system. SimulEval has already been extensively used for the IWSLT 2020 shared task on simultaneous speech translation. Code will be released upon publication.
摘要：同声传译，对于文本和语音侧重于实时和低延迟场景模式开始读取完整的源输入之前转换。评估同声翻译模型比离线模式更加复杂，因为延迟是除了翻译质量要考虑的另一个因素。研究界，尽管它越来越注重造型新颖接近同声翻译，目前缺乏一个通用的评估程序。因此，我们提出SimulEval，一个易于使用和一般的评估工具包都同时进行文字和语音翻译。服务器 - 客户端方案出台创造同声翻译的情况下，在服务器发送源输入和接收进行评估预测和自定义策略的客户端执行。给定一个政策，它会自动执行同时解码和集体报道几种流行的延迟指标。我们也适应从文本同声传译延迟指标的语音任务。此外，SimulEval配备了可视化界面，以提供更好理解的系统的同时解码过程。 SimulEval已被广泛用于上同步语音翻译的IWSLT 2020共享任务。代码将在出版发行。

2. Paying Per-label Attention for Multi-label Extraction from Radiology Reports
Patrick Schrempf, Hannah Watson, Shadia Mikhael, Maciej Pajak, Matúš Falis, Aneta Lisowska, Keith W. Muir, David Harris-Birtill, Alison Q. O'Neil
Abstract: Training medical image analysis models requires large amounts of expertly annotated data which is time-consuming and expensive to obtain. Images are often accompanied by free-text radiology reports which are a rich source of information. In this paper, we tackle the automated extraction of structured labels from head CT reports for imaging of suspected stroke patients, using deep learning. Firstly, we propose a set of 31 labels which correspond to radiographic findings (e.g. hyperdensity) and clinical impressions (e.g. haemorrhage) related to neurological abnormalities. Secondly, inspired by previous work, we extend existing state-of-the-art neural network models with a label-dependent attention mechanism. Using this mechanism and simple synthetic data augmentation, we are able to robustly extract many labels with a single model, classified according to the radiologist's reporting (positive, uncertain, negative). This approach can be used in further research to effectively extract many labels from medical text.
摘要：培养医学图像分析模型需要大量熟练注释数据是费时和昂贵的获得。图片往往伴随着自由文本放射科报告，其中是一个丰富的信息来源。在本文中，我们将处理从头部CT报告结构化标签的自动提取的，疑似中风患者使用成像深度学习。首先，我们提出一套31层的标签，其对应于影像学结果（例如hyperdensity）和临床印象（例如出血）相关的神经异常。其次，以前的工作的启发，我们扩展现有的国家的最先进的神经网络模型与标签相关的注意机制。使用这种机制和简单的合成数据增强，我们能够用单一的模式稳健提取许多标签，根据放射科医师的报告（积极，明朗，负面）分类。这种方法可以在进一步的研究来有效地提取从医书许多标签。

3. Robust Benchmarking for Machine Learning of Clinical Entity Extraction
Monica Agrawal, Chloe O'Connell, Yasmin Fatemi, Ariel Levy, David Sontag
Abstract: Clinical studies often require understanding elements of a patient's narrative that exist only in free text clinical notes. To transform notes into structured data for downstream use, these elements are commonly extracted and normalized to medical vocabularies. In this work, we audit the performance of and indicate areas of improvement for state-of-the-art systems. We find that high task accuracies for clinical entity normalization systems on the 2019 n2c2 Shared Task are misleading, and underlying performance is still brittle. Normalization accuracy is high for common concepts (95.3%), but much lower for concepts unseen in training data (69.3%). We demonstrate that current approaches are hindered in part by inconsistencies in medical vocabularies, limitations of existing labeling schemas, and narrow evaluation techniques. We reformulate the annotation framework for clinical entity extraction to factor in these issues to allow for robust end-to-end system benchmarking. We evaluate concordance of annotations from our new framework between two annotators and achieve a Jaccard similarity of 0.73 for entity recognition and an agreement of 0.83 for entity normalization. We propose a path forward to address the demonstrated need for the creation of a reference standard to spur method development in entity recognition and normalization.
摘要：临床研究往往要求只存在于自由文本临床记录病人的叙述理解的要点。为了转化成笔记供下游使用结构化数据，这些元件通常提取和归一化到医疗词汇表。在这项工作中，我们审计的性能和显示的改进国家的最先进的系统等领域。我们发现了在2019 n2c2共享任务的临床实体正常化系统高精确度的任务是误导，和基本的性能仍然很脆弱。归一化精度高为共同概念（95.3％），但低得多的对概念的训练数据（69.3％）看不见。我们表明，目前的做法是，部分由医疗词汇，现有的标签模式的局限性，以及窄的分析方法的不一致阻碍。我们重新制订临床实体提取注释框架的因素这些问题，允许强大的端至端系统的标杆。我们从两个注释之间我们的新框架评估注释的一致性和达到0.73实体识别Jaccard相似和0.83实体正常化的协议。我们提出了一个前进的道路来解决证明需要建立一个参考标准的刺激方法，发展实体识别和规范化。
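The inter-annotator agreement quoted in entry 3 is a Jaccard similarity, i.e. the size of the intersection of two sets divided by the size of their union. A minimal sketch of that computation follows. It assumes each annotation is a `(start, end, type)` span, and the two annotation sets are hypothetical, not data from the paper.

```python
def jaccard(a: set, b: set) -> float:
    """Return |A ∩ B| / |A ∪ B|; defined here as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical entity spans from two annotators: (start_offset, end_offset, entity_type).
annotator_1 = {(0, 8, "problem"), (15, 24, "treatment"), (30, 41, "test")}
annotator_2 = {(0, 8, "problem"), (15, 24, "treatment"), (45, 52, "problem")}

# The annotators share 2 spans out of 4 distinct spans overall.
print(jaccard(annotator_1, annotator_2))
```

Exact span matching is only one possible design choice. The abstract does not say how the authors matched spans, so the score here is illustrative only.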

4. Neural Composition: Learning to Generate from Multiple Models
Denis Filimonov, Ravi Teja Gadde, Ariya Rastrow
Abstract: Decomposing models into multiple components is critically important in many applications such as language modeling (LM), as it enables adapting individual components separately and biasing some components toward the user's personal preferences. Conventionally, contextual and personalized adaptation of language models is achieved through class-based factorization, which requires class-annotated data, or through biasing toward individual phrases, which is limited in scale. In this paper, we propose a system that combines model-defined components by learning, directly from unlabeled text data, when to activate the generation process from each individual component and how to combine the probability distributions from each component.
摘要：模型分解成多个组件是非常重要在许多应用，如语言模型（LM），因为它能够单独适应各个组件和一些部件到用户的个人喜好的偏置。以往，语言模型的上下文和个性化的适配，通过基于类的分解，这需要类注释的数据，或通过偏置到个别短语其规模有限实现。在本文中，我们提出了一个系统，结合模型定义的组件，通过学习时，从各个组件激活的生成过程，以及如何概率分布从每个部件无标签的文本数据相结合，直接。

5. Neural Machine Translation model for University Email Application
Sandhya Aneja, Siti Nur Afikah Bte Abdul Mazid, Nagender Aneja
Abstract: Machine translation has many applications, such as news translation, email translation, and official letter translation. Commercial translators, e.g. Google Translate, lag in regional vocabulary and are unable to learn the bilingual text in the source and target languages within the input. In this paper, a regional-vocabulary-based, application-oriented Neural Machine Translation (NMT) model is proposed using a data set of emails used at the University for communication over a period of three years. A state-of-the-art sequence-to-sequence neural network for ML -> EN and EN -> ML translation, a Gated Recurrent Unit recurrent neural network with an attention decoder, is compared with Google Translate. The lower BLEU score of Google Translate in comparison to our model indicates that application-based regional models are better. The low BLEU score for EN -> ML of both our model and Google Translate indicates that the Malay language has complex language features relative to English.
摘要：机器翻译有很多应用，如新闻翻译，电子邮件翻译，公函等翻译翻译商业，例如谷歌翻译滞后于区域词汇，都无法学会在输入内的源语言和目标语言的双语文本。在本文中，一个地区的词汇为基础的面向应用的神经机器翻译（NMT）模型，提出了在大学在三年内用于通信的电子邮件的数据集。一个国家的最先进的顺序对序列神经网络的ML - > EN和 - > ML翻译与谷歌相比，翻译使用封闭式重复单元递归神经网络机器翻译模型，注重解码器。相较于我们的模型谷歌翻译低BLEU得分表明应用程序基于区域模型更好。 EN的低BLEU得分 - >我们的模型的ML和谷歌翻译表示马来语具有复杂的语言特征对应的英文。

6. Exclusion and Inclusion -- A model agnostic approach to feature importance in DNNs
Subhadip Maji, Arijit Ghosh Chowdhury, Raghav Bali, Vamsi M Bhandaru
Abstract: Deep Neural Networks in NLP have enabled systems to learn complex non-linear relationships. One of the major bottlenecks towards being able to use DNNs for real world applications is their characterization as black boxes. To solve this problem, we introduce a model agnostic algorithm which calculates phrase-wise importance of input features. We contend that our method is generalizable to a diverse set of tasks, by carrying out experiments for both Regression and Classification. We also observe that our approach is robust to outliers, implying that it only captures the essential aspects of the input.
摘要：深层神经网络在NLP已经启用了系统学习复杂的非线性关系。一对能够使用DNNs为实际应用的主要瓶颈是其作为黑盒表征。为了解决这个问题，我们引入一个模型无关算法计算的输入功能句话明智的重要性。我们主张，我们的方法可以推广到一组不同的任务，通过试验两种回归和分类。我们也观察到，我们的做法是稳健的异常值，这意味着它只能捕获输入的重要方面。

7. Toward Givenness Hierarchy Theoretic Natural Language Generation
Poulomi Pal, Tom Williams
Abstract: Language-capable interactive robots participating in dialogues with human interlocutors must be able to communicate naturally and efficiently about the entities in their environment. A key aspect of such communication is the use of anaphoric language. The linguistic theory of the Givenness Hierarchy (GH) suggests that humans use anaphora based on the cognitive statuses their referents have in the minds of their interlocutors. In previous work, researchers presented GH-theoretic approaches to robot anaphora understanding. In this paper we describe how the GH might need to be used quite differently to facilitate robot anaphora generation.
摘要：参加与人对话的对话语言功能的交互式机器人必须能够自然地约在自己的环境中的实体有效的沟通。这种通信的一个重要方面是使用照应语言。所予层次（GH）的语言学理论认为，人类使用照应基于认知状态其指示在他们的对话者的头脑。在以往的工作中，研究人员提出了GH-理论方法机器人照应理解。在本文中，我们描述了GH可能需要如何完全不同的用于促进机器人照应的生成。

8. Multi-task learning for natural language processing in the 2020s: where are we going?
Joseph Worsham, Jugal Kalita
Abstract: Multi-task learning (MTL) significantly pre-dates the deep learning era, and it has seen a resurgence in the past few years as researchers have been applying MTL to deep learning solutions for natural language tasks. While steady MTL research has always been present, there is a growing interest driven by the impressive successes published in the related fields of transfer learning and pre-training, such as BERT, and the release of new challenge problems, such as GLUE and the NLP Decathlon (decaNLP). These efforts place more focus on how weights are shared across networks, evaluate the re-usability of network components and identify use cases where MTL can significantly outperform single-task solutions. This paper strives to provide a comprehensive survey of the numerous recent MTL contributions to the field of natural language processing and provide a forum to focus efforts on the hardest unsolved problems in the next decade. While novel models that improve performance on NLP benchmarks are continually produced, lasting MTL challenges remain unsolved which could hold the key to better language understanding, knowledge discovery and natural language interfaces.
摘要：多任务学习（MTL）显著日期提前深度学习的时代，它已经出现了复苏在过去的几年里研究人员一直在向MTL为自然语言任务深度学习解决方案。虽然稳定MTL的研究一直存在，有越来越多的兴趣发表转移学习和预培训的相关领域，如BERT令人印象深刻的成就驱动，新的挑战性问题，如胶水和NLP释放迪卡侬（decaNLP）。这些努力将更多的精力放在权重如何通过网络共享，评估网络组件的重用性和标识的使用情况下，MTL可以显著优于单任务的解决方案。本文力求提供给自然语言处理领域的众多近期MTL贡献了全面的调查，并提供一个论坛，专注于在未来十年中最难解决的问题的努力。虽然这提高NLP基准性能的新车型正在不断产生，持久MTL挑战仍然没有得到解决，其可容纳的关键，以更好地语言理解，知识发现和自然语言界面。

9. Exploring Swedish & English fastText Embeddings with the Transformer
Tosin P. Adewumi, Foteini Liwicki, Marcus Liwicki
Abstract: In this paper, our main contributions are showing that embeddings from relatively small corpora can outperform ones from far larger corpora, and presenting a new Swedish analogy test set. To achieve good network performance in natural language processing (NLP) downstream tasks, several factors play important roles: dataset size, the right hyper-parameters, and well-trained embeddings. We show that, with the right set of hyper-parameters, good network performance can be reached even on smaller datasets. We evaluate the embeddings at the intrinsic and extrinsic levels, deploying them on the Transformer in a named entity recognition (NER) task and conducting significance tests. This is done for both Swedish and English. We obtain better performance in both languages on the downstream task with far smaller training data, compared to recently released Common Crawl versions, and character n-grams appear useful for Swedish, a morphologically rich language.
摘要：在本文中，我们的主要贡献是由相对较小的嵌入语料库可以从大得多语料库优于那些和我们提出的新的瑞典类比测试集。要实现自然语言处理（NLP）下游任务的良好的网络性能，有几个因素起着重要的作用：数据集大小，权超参数，和训练有素的嵌入。我们表明，一套正确的超参数，良好的网络性能甚至可以在更小的数据集来达到。我们评估的内在层面和外在层面的嵌入，通过在命名实体识别（NER）任务和行为意义tests.This变压器部署它们的瑞典文和英文完成。我们获得这两种语言上远远小于训练数据的下游任务更好的性能，相比最近发布的，常见的爬行版本和字符正克出现瑞典语，一个形态丰富的语言很有用。

10. Word Embeddings: Stability and Semantic Change
Lucas Rettenmeier
Abstract: Word embeddings are computed by a class of techniques within natural language processing (NLP) that create continuous vector representations of words in a language from a large text corpus. The stochastic nature of the training process of most embedding techniques can lead to surprisingly strong instability, i.e. applying the same technique to the same data twice can produce entirely different results. In this work, we present an experimental study on the instability of the training process of three of the most influential embedding techniques of the last decade: word2vec, GloVe and fastText. Based on the experimental results, we propose a statistical model to describe the instability of embedding techniques and introduce a novel metric to measure the instability of the representation of an individual word. Finally, we propose a method to minimize the instability, by computing a modified average over multiple runs, and apply it to a specific linguistic problem: the detection and quantification of semantic change, i.e. measuring changes in the meaning and usage of words over time.
摘要：字的嵌入是由一类的自然语言处理（NLP）内的技术，在从大文本语料库语言创建单词连续向量表示来计算。的大部分嵌入技术训练过程的随机性质可导致令人惊奇的强不稳定，即随后将相同的技术相同的数据两次，可以产生完全不同的结果。在这项工作中，我们提出三个的过去十年中最有影响力的嵌入技术的训练过程中的不稳定性的实验研究：word2vec，手套和fastText。根据实验结果，我们提出了一个统计模型来描述的嵌入技术的不稳定和引入新的指标来衡量一个人字表示的不稳定性。最后，我们提出了一个方法，以尽量减少不稳定性 - 通过在多次运行计算修正平均值 - 并将其应用到一个特定的语言问题：检测和语义变化的定量，即测量随时间推移的含义和词语的使用改变。

11. On Learning Universal Representations Across Languages
Xiangpeng Wei, Yue Hu, Rongxiang Weng, Luxi Xing, Heng Yu, Weihua Luo
Abstract: Recent studies have demonstrated the overwhelming advantage of cross-lingual pre-trained models (PTMs), such as multilingual BERT and XLM, on cross-lingual NLP tasks. However, existing approaches essentially capture the co-occurrence among tokens through the masked language model (MLM) objective with token-level cross entropy. In this work, we extend these approaches to learn sentence-level representations, and show their effectiveness on cross-lingual understanding and generation. We propose Hierarchical Contrastive Learning (HiCTL) to (1) learn universal representations for parallel sentences distributed in one or multiple languages and (2) distinguish the semantically-related words from a shared cross-lingual vocabulary for each sentence. We conduct evaluations on three benchmarks: language understanding tasks (QQP, QNLI, SST-2, MRPC, STS-B and MNLI) in the GLUE benchmark, cross-lingual natural language inference (XNLI) and machine translation. Experimental results show that HiCTL obtains an absolute gain of 1.0%/2.2% accuracy on GLUE/XNLI and achieves substantial improvements of +1.7 to +3.6 BLEU on both high-resource and low-resource English-to-X translation tasks over strong baselines. We will release the source code as soon as possible.
摘要：最近的研究表明跨语种预训练模型（翻译后修饰），例如多语言BERT和XLM，跨语言的NLP任务的压倒性优势。但是，现有的方法通过涉及掩蔽的语言模型（MLM）目标与标记级别交叉熵基本上捕获令牌之间的共现。在这项工作中，我们扩展这些方法学句子层面交涉，并显示在跨语言的理解和产生的效果。我们提出了分层对比学习（HiCTL）（1）学会了分布在一个或多个语言和（2）平行的句子普遍表示从一个共享的跨语言词汇的每一个句子区分语义相关词。我们对三个基准进行评价：语言理解任务（QQP，QNLI，SST-2，MRPC，STS-B和MNLI）在胶基准，跨语言的自然语言推理（XNLI）和机器翻译。实验结果表明，该HiCTL获得1.0％/ 2.2％的精度的上胶/ XNLI绝对增益以及实现在两个高资源和低资源英语到X平移+ 1.7- + 3.6 BLEU的实质性的改进在强大的基线任务。我们将尽快释放源代码。

12. Evaluating Automatically Generated Phoneme Captions for Images
Justin van der Hout, Zoltán D'Haese, Mark Hasegawa-Johnson, Odette Scharenborg
Abstract: Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions were converted into sentences of words. The captions were rated by human evaluators for their goodness of describing the image. Finally, several objective metric scores of the results were correlated with these human ratings. Although BLEU4 does not perfectly correlate with human ratings, it obtained the highest correlation among the investigated metrics, and is the best currently existing metric for the Image2Speech task. Current metrics are limited by the fact that they assume their input to be words. A more appropriate metric for the Image2Speech task should assume its input to be parts of words, i.e. phonemes, instead.
摘要：Image2Speech是生成图像的语音描述的相对较新的任务。本文介绍了调查这项工作的评价。为此，首先一个Image2Speech系统的实施，其生成由音素序列的图像标题。该系统跑赢上Flickr8k语料库原Image2Speech系统。随后，这些音素字幕被转换成单词的句子。该字幕被评为由人工评估其描述图像的善良。最后，结果几个客观度量得分与这些人相关的收视率。虽然BLEU4不与人评级完全相关，它获得的调查指标中最高的相关性，是最好的现有度量的Image2Speech任务。目前的指标是由他们承担起自己的输入是单词的事实的限制。为Image2Speech任务的更合适的度量应该承担其输入为词语，即音素的部分，来代替。

13. Improving NER's Performance with Massive financial corpus
Han Zhang
Abstract: Training large deep neural networks requires massive amounts of high-quality annotated data, but the time and labor costs are too high for small businesses. We start a company-name recognition task with small-scale, low-quality training data, then use several techniques to improve model training speed and prediction performance at minimal labor cost. The methods we use include pre-training a lite language model such as ALBERT-small or ELECTRA-small on a financial corpus, knowledge distillation, and multi-stage learning. As a result, we raised the recall rate by nearly 20 points and run 4 times as fast as the BERT-CRF model.
摘要：培训大深层神经网络需要大量优质的注释数据，但时间和劳动力成本的小企业过于昂贵。我们用小规模，低质量的训练数据开始一个公司知名度的任务，然后使用技能来提高模型的训练速度和预测具有最低的劳动力成本性能。这些方法我们使用包括前训练一个精简版的语言模型，如金融主体，蒸馏和多级学习的知识阿尔伯特小或恋父小。其结果是，我们提出了近20个点的召回率，并获得4倍的速度BERT-CRF模型。

14. An Empirical Study on Explainable Prediction of Text Complexity: Preliminaries for Text Simplification
Cristina Garbacea, Mengtian Guo, Samuel Carton, Qiaozhu Mei
Abstract: Text simplification is concerned with reducing the language complexity and improving the readability of professional content so that the text is accessible to readers at different ages and educational levels. As a promising practice to improve the fairness and transparency of text information systems, the notion of text simplification has been mixed in existing literature, ranging all the way from assessing the complexity of single words to automatically generating simplified documents. We show that the general problem of text simplification can be formally decomposed into a compact pipeline of tasks to ensure the transparency and explainability of the process. In this paper, we present a systematic analysis of the first two steps in this pipeline: 1) predicting the complexity of a given piece of text, and 2) identifying complex components from the text considered to be complex. We show that these two tasks can be solved separately, using either lexical approaches or state-of-the-art deep learning methods, or they can be solved jointly through an end-to-end, explainable machine learning predictor. We propose formal evaluation metrics for both tasks, through which we are able to compare the performance of the candidate approaches using multiple datasets from a diversity of domains.
摘要：文本简化涉及降低语言的复杂性和提高专业内容的可读性，从而使文本是在不同的年龄段和教育水平的读者访问。作为一个有前途的做法，以提高文本信息系统的公正性和透明度，简化文本的概念已经混合在现有的文献，包括所有的方式，通过评估的单个单词的复杂性，自动生成简化文档。我们发现，文字的简化的一般问题可以正式分解成任务的小型管道，以确保过程的透明度和explanability。在本文中，我们提出了前两个步骤的系统分析在这条管线：1）预测所述给定的一段文字的复杂性，和2）识别从被认为是复杂的文本复杂的组件。我们发现，这两个任务可以单独解决的，即使用词汇的方法或者国家的最先进的深学习方法，也可以通过共同的端至端，可以解释机器学习的预测来解决。我们提出了两个任务，通过它，我们能够比较候选的性能使用多个数据集从办法域的多样性正式评估指标。

15. Neural Language Generation: Formulation, Methods, and Evaluation
Cristina Garbacea, Qiaozhu Mei
Abstract: Recent advances in neural network-based generative modeling have reignited the hopes of having computer systems capable of seamlessly conversing with humans and able to understand natural language. Neural architectures have been employed to generate text excerpts with various degrees of success, in a multitude of contexts and tasks that fulfil various user needs. Notably, high capacity deep learning models trained on large scale datasets demonstrate unparalleled abilities to learn patterns in the data even in the absence of explicit supervision signals, opening up a plethora of new possibilities for producing realistic and coherent texts. While the field of natural language generation is evolving rapidly, there are still many open challenges to address. In this survey we formally define and categorize the problem of natural language generation. We review particular application tasks that are instantiations of these general formulations, in which generating natural language is of practical importance. Next we include a comprehensive outline of methods and neural architectures employed for generating diverse texts. Nevertheless, there is no standard way to assess the quality of text produced by these generative models, which constitutes a serious bottleneck to the progress of the field. To this end, we also review current approaches to evaluating natural language generation systems. We hope this survey will provide an informative overview of formulations, methods, and assessments of neural natural language generation.
摘要：基于神经网络的生成模型的最新进展，在具有能与人类交谈无缝并能理解自然语言的计算机系统重新燃起希望。神经架构已经被用来生成文本摘录到不同程度的成功，在环境和满足各种用户的需求多项任务。值得注意的是，大容量深学习培训的大规模数据集模型彰显无比的能力，学习数据中的模式，即使在缺乏明确的监管信号，开辟新的可能性就产生现实的和连贯的文字太多了。虽然自然语言生成的领域发展迅速，仍然有许多地址开放的挑战。在本次调查中，我们正式定义和分类，自然语言生成的问题。我们回顾是这些一般的配方，其中产生的自然语言是具有实际意义的实例特定的应用任务。接下来我们有方法和产生不同的文本采用神经结构的全面概述。尽管如此，以评估这些生成模型，这构成了对现场的进度严重的瓶颈产生的文本的质量没有标准的方式。为此，我们也检讨目前的方法来评估自然语言生成系统。我们希望这次调查将提供一种制剂，方法和神经自然语言生成的评估的信息概述。

16. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, Hoifung Poon
Abstract: Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at this https URL.
摘要：训练前大神经语言模型，如BERT，导致了对许多自然语言处理（NLP）任务可观的收益。然而，大多数训练前的努力集中在通用领域语料库，如新闻专线和网络。一种流行的假设是，即使特定领域的训练前可以通过从一般的域语言模型开始受益。在本文中，我们挑战这个假设通过展示与丰富的未标记文本域，如生物医药，来自于一般域语言模型的持续训练前可观的收益从无到有结果训练前的语言模型。为了便于调查，我们编译从公开可用的数据集的综合性生物医学NLP基准。我们的实验表明，特定领域的训练前充当了广泛的生物医学NLP任务的坚实基础，从而导致全线新的国家的最先进的成果。此外，在进行模拟的选择进行全面的评估，无论是训练前和具体任务的微调，我们发现，一些常见的做法是不必要的与BERT模式，如使用命名实体识别（NER）复杂的标记方案。为了加快生物医学NLP的研究，我们已经发布了我们国家的最先进的预训练和任务的具体型号为界，并创建了一个排行榜反映了我们Blurb的基准（以下简称生物医学语言理解和推理基准），在此HTTPS URL。

17. The Unreasonable Effectiveness of Machine Learning in Moldavian versus Romanian Dialect Identification
Mihaela Găman, Radu Tudor Ionescu
Abstract: In this work, we provide a follow-up on the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task of the VarDial 2019 Evaluation Campaign. The shared task included two sub-task types: one that consisted in discriminating between the Moldavian and the Romanian dialects and one that consisted in classifying documents by topic across the two dialects of Romanian. Participants achieved impressive scores, e.g. the top model for Moldavian versus Romanian dialect identification obtained a macro F1 score of 0.895. We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates compared to machine learning (ML) models. Hence, it remains unclear why the methods proposed by participants attain such high accuracy rates. Our goal is to understand (i) why the proposed methods work so well (by visualizing the discriminative features) and (ii) to what extent these methods can keep their high accuracy levels, e.g. when we shorten the text samples to single sentences or when we use tweets at inference time. A secondary goal of our work is to propose an improved ML model using ensemble learning. Our experiments show that ML models can accurately identify the dialects, even at the sentence level and across different domains (news articles versus tweets). We also analyze the most discriminative features of the best performing models, providing some explanations behind the decisions taken by these models. Interestingly, we learn new dialectal patterns previously unknown to us or to our human annotators. Furthermore, we conduct experiments showing that the machine learning performance on the MRC shared task can be improved through an ensemble based on classifier stacking.
摘要：在这项工作中，我们提供了后续的摩尔多瓦与罗马尼亚跨方言主题标识（MRC）的VarDial 2019评价活动的共同任务。共享任务包括两个子任务类型：一个是，在摩尔多瓦和在按主题横跨罗马尼亚的两种方言分类文件由罗马尼亚方言和一个区分组成。参与者取得了令人瞩目的分数，例如对于摩与罗马尼亚方言辨识顶端模型获得的0.895宏F1得分。我们通过人工注释进行主观评价，这表明人类达到低得多的准确率相比，机器学习（ML）的模型。因此，为什么与会者提出的方法达到如此高的准确率尚不清楚。我们的目标是要了解（我）为什么所提出的方法的工作这么好（通过可视化的判别特征）和（ii）在何种程度上，这些方法可以保持他们的高精确度的水平，例如，当我们缩短文本样本，以单句或当推理时间使用推特。我们工作的第二个目标是提出使用集成学习改进的ML模型。我们的实验表明，ML模型能准确识别方言，甚至在句子层面和不同领域（新闻与鸣叫）。我们也分析了表现最好的车型区别最大的特点，提供由这些模型作出的决定背后的一些解释。有趣的是，我们学习新的方言模式以前未知的我们或我们的人工注释。此外，我们进行实验显示，在MRC机器学习表现共享任务可以通过集成基于分类堆放得到改善。
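Entry 17 reports a macro F1 score, which is the unweighted mean of the per-class F1 scores, so each dialect counts equally regardless of how many samples it has. The sketch below shows that definition in plain Python. The labels `"MD"` and `"RO"` and the predictions are hypothetical, chosen only to illustrate the arithmetic.

```python
def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1, where F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold labels and predictions for five text samples.
y_true = ["MD", "MD", "MD", "RO", "RO"]
y_pred = ["MD", "MD", "RO", "RO", "RO"]

# Each class has F1 = 4/5 here, so the macro average is also 4/5.
print(macro_f1(y_true, y_pred))
```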

18. COVID-19 therapy target discovery with context-aware literature mining
Matej Martinc, Blaž Škrlj, Sergej Pirkmajer, Nada Lavrač, Bojan Cestnik, Martin Marzidovšek, Senja Pollak
Abstract: The abundance of literature related to the widespread COVID-19 pandemic is beyond the manual inspection capacity of a single expert. Developing systems capable of automatically processing tens of thousands of scientific publications, with the aim of enriching existing empirical evidence with literature-based associations, is challenging and relevant. We propose a system for contextualization of empirical expression data by approximating relations between entities, for which representations were learned from one of the largest COVID-19-related literature corpora. In order to exploit a larger scientific context through transfer learning, we propose a novel embedding generation technique that leverages the SciBERT language model, pretrained on a large multi-domain corpus of scientific publications and fine-tuned for domain adaptation on the CORD-19 dataset. The manual evaluation conducted by a medical expert and the quantitative evaluation based on therapy targets identified in related work suggest that the proposed method can be successfully employed for COVID-19 therapy target discovery and that it outperforms the baseline FastText method by a large margin.
摘要：涉及广泛COVID-19大流行文学的丰富超出了一个专家的人工检查。系统开发，能够自动处理科学出版物数万为宗旨，以丰富现有与基于文献的关联经验证据是富有挑战性和相关的。我们通过近似实体，其表示从最大COVID-19相关的文献语料库的一个教训之间的关系提出了经验公式数据的情境的系统。为了利用通过转移学习科学的大背景下，我们提出了一种新的嵌入生成技术，它利用SciBERT语言模型预训练的科学出版物的大型多领域的语料库和微调有关CORD-19数据集的领域适应性。由医学专家和基于相关工作确定治疗目标的定量评价进行人工评估表明，该方法可以成功地用于COVID-19治疗靶点的发现，它优于大幅度基线FastText方法。

19. BERT Learns (and Teaches) Chemistry
Josh Payne, Mario Srouji, Dian Ang Yap, Vineet Kosaraju
Abstract: Modern computational organic chemistry is becoming increasingly data-driven. There remain a large number of important unsolved problems in this area such as product prediction given reactants, drug discovery, and metric-optimized molecule synthesis, but efforts to solve these problems using machine learning have also increased in recent years. In this work, we propose the use of attention to study functional groups and other property-impacting molecular substructures from a data-driven perspective, using a transformer-based model (BERT) on datasets of string representations of molecules and analyzing the behavior of its attention heads. We then apply the representations of functional groups and atoms learned by the model to tackle problems of toxicity, solubility, drug-likeness, and synthesis accessibility on smaller datasets using the learned representations as features for graph convolution and attention models on the graph structure of molecules, as well as fine-tuning of BERT. Finally, we propose the use of attention visualization as a helpful tool for chemistry practitioners and students to quickly identify important substructures in various chemical properties.
摘要：现代计算有机化学日益数据驱动。仍然有大量的诸如产品预测给出的反应物，药物发现和指标优化的分子合成，而是努力解决使用机器学习在最近几年也增加了这些问题，在这方面重要的未解决的问题。在这项工作中，我们提出使用注意学习官能团和其他财产，影响分子子从一个数据驱动的角度来看，使用的分子的字符串表示的数据集基于变压器的模型（BERT）和分析的行为，其注意头。然后，我们应用该模型解决毒性，溶解性，药物相似，并且在使用学习表示作为特征的图形卷积和关注车型上的分子的图形结构较小的数据集合成的可访问性问题，学到官能团和原子表示，以及BERT的微调。最后，我们建议使用注意可视化作为一个有用的工具，化学工作者和学生能够快速识别各种化学性质的重要子。

20. Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones
Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu
Abstract: A novel framework for meeting transcription using asynchronous microphones is proposed in this paper. It consists of audio synchronization, speaker diarization, utterance-wise speech enhancement using guided source separation, automatic speech recognition, and duplication reduction. Doing speaker diarization before speech enhancement enables the system to deal with overlapped speech without considering sampling frequency mismatch between microphones. Evaluation on our real meeting datasets showed that our framework achieved a character error rate (CER) of 28.7% by using 11 distributed microphones, while a monaural microphone placed at the center of the table had a CER of 38.2%. We also showed that our framework achieved a CER of 21.8%, which is only 2.1 percentage points higher than the CER in headset microphone-based transcription.
摘要：使用异步麦克风，本文提出了满足转录的新框架。它由音频同步，扬声器diarization，使用引导源分离，自动语音识别，并重复还原发声明智语音增强的。否则扬声器diarization之前语音增强使系统能够处理重叠的讲话，不考虑麦克风之间的抽样频率失配。评价我们的真实数据集会议表明，我们的框架实现了28.7％的字符错误率（CER）使用11米分发的麦克风，而摆放在桌子中央的单声道麦克风有38.2％与CER。我们还发现，我们的框架实现了21.8％，这比基于耳机麦克风转录的CER仅高出2.1个百分点CER。
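The character error rate (CER) quoted in entry 20 is conventionally the character-level edit distance between the recognized text and the reference, divided by the reference length. The following sketch uses that conventional definition. It is not taken from the paper, and the example strings are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to a non-empty reference transcript."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical reference and recognizer output: one substituted character in four.
print(cer("abcd", "abxd"))
```

Because insertions count as errors, the CER under this definition can exceed 100% when the hypothesis is much longer than the reference.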

21. Language Modelling for Source Code with Transformer-XL
Thomas Dowdell, Hongyu Zhang
Abstract: It has been found that software, like natural language texts, exhibits "naturalness", which can be captured by statistical language models. In recent years, neural language models have been proposed to represent the naturalness of software through deep learning. In this paper, we conduct an experimental evaluation of state-of-the-art neural language models for source code, including RNN-based models and Transformer-XL based models. Through experiments on a large-scale Python code corpus, we find that the Transformer-XL model outperforms RNN-based models (including LSTM and GRU models) in capturing the naturalness of software, with far less computational cost.
摘要：已经发现的软件，如自然语言文本，显示为“自然”，这可以通过统计语言模型被捕获。近年来，神经语言模型已经提出通过深入学习代表的软件自然。在本文中，我们进行的国家的最先进的神经语言模型的实验评估的源代码，包括基于RNN的模型和基于变压器的XL车型。通过对大规模的Python代码语料的实验中，我们发现变压器-XL模型优于基于RNN的模型（包括LSTM和GRU型号）捕获软件的自然，少得多的计算成本。

22. A Pyramid Recurrent Network for Predicting Crowdsourced Speech-Quality Ratings of Real-World Signals
Xuan Dong, Donald S. Williamson
Abstract: The real-world capabilities of objective speech quality measures are limited since current measures (1) are developed from simulated data that does not adequately model real environments; or they (2) predict objective scores that are not always strongly correlated with subjective ratings. Additionally, a large dataset of real-world signals with listener quality ratings does not currently exist, which would help facilitate real-world assessment. In this paper, we collect and predict the perceptual quality of real-world speech signals that are evaluated by human listeners. We first collect a large quality rating dataset by conducting crowdsourced listening studies on two real-world corpora. We further develop a novel approach that predicts human quality ratings using a pyramid bidirectional long short term memory (pBLSTM) network with an attention mechanism. The results show that the proposed model achieves statistically lower estimation errors than prior assessment approaches, where the predicted scores strongly correlate with human judgments.
摘要：因为目前的措施（1）不充分模拟真实环境的模拟数据开发的语音质量客观措施的真实世界的能力是有限的;或者（2）预测并不总是强烈的主观评价相关目标分数。此外，大型数据集真实世界的信号与听众质量评级目前并不存在，这将有助于促进现实世界的评估。在本文中，我们收集和预测由人类听众评价真实世界的语音信号的感知质量。我们首先收集通过在两个现实世界的语料库进行众包监听研究大质量等级的数据集。我们进一步发展预测采用金字塔双向长短期记忆（pBLSTM）网络与注意机制人体质量等级的新方法。结果表明，比现有的评估方法，该模型达到统计学上较低的估计误差在预测分数与人工判断很强的相关性。

Note: The Chinese abstracts are machine translations. The cover image is a word cloud of the paper titles.


[arXiv papers] Computation and Language 2020-08-03
