Abstracts

1. Comparative Study of Machine Learning Models and BERT on SQuAD
Devshree Patel, Param Raval, Ratnam Parikh, Yesha Shastri
Abstract: This study aims to provide a comparative analysis of the performance of certain models popular in machine learning and the BERT model on the Stanford Question Answering Dataset (SQuAD). The analysis shows that the BERT model, which was once state-of-the-art on SQuAD, gives higher accuracy in comparison to other models. However, BERT requires a greater execution time even when only 100 samples are used. This shows that higher accuracy comes at the cost of more training time. In contrast, for the preliminary machine learning models, execution time on the full data is lower but accuracy is compromised.

2. Character-level Transformer-based Neural Machine Translation
Nikolay Banar, Walter Daelemans, Mike Kestemont
Abstract: Neural machine translation (NMT) is nowadays commonly applied at the subword level, using byte-pair encoding. A promising alternative approach focuses on character-level translation, which simplifies processing pipelines in NMT considerably. This approach, however, must consider relatively longer sequences, rendering the training process prohibitively expensive. In this paper, we discuss a novel Transformer-based approach that we compare, both in speed and in quality, to the Transformer at subword and character levels, as well as to previously developed character-level models. We evaluate our models on 4 language pairs from WMT'15: DE-EN, CS-EN, FI-EN and RU-EN. The proposed novel architecture can be trained on a single GPU and is 34% faster than the character-level Transformer; still, the obtained results are at least on par with it. In addition, our proposed model outperforms the subword-level model in FI-EN and shows close results in CS-EN. To stimulate further research in this area and close the gap with subword-level NMT, we make all our code and models publicly available.

3. A Generative Approach to Titling and Clustering Wikipedia Sections
Anjalie Field, Sascha Rothe, Simon Baumgartner, Cong Yu, Abe Ittycheriah
Abstract: We evaluate the performance of transformer encoders with various decoders for information organization through a new task: generation of section headings for Wikipedia articles. Our analysis shows that decoders containing attention mechanisms over the encoder output achieve high-scoring results by generating extractive text. In contrast, a decoder without attention better facilitates semantic encoding and can be used to generate section embeddings. We additionally introduce a new loss function, which further encourages the decoder to generate high-quality embeddings.

4. Simplify-then-Translate: Automatic Preprocessing for Black-Box Machine Translation
Sneha Mehta, Bahareh Azarnoush, Boris Chen, Avneesh Saluja, Vinith Misra, Ballav Bihani, Ritwik Kumar
Abstract: Black-box machine translation systems have proven incredibly useful for a variety of applications yet by design are hard to adapt, tune to a specific domain, or build on top of. In this work, we introduce a method to improve such systems via automatic pre-processing (APP) using sentence simplification. We first propose a method to automatically generate a large in-domain paraphrase corpus through back-translation with a black-box MT system, which is used to train a paraphrase model that "simplifies" the original sentence to be more conducive for translation. The model is used to preprocess source sentences of multiple low-resource language pairs. We show that this preprocessing leads to better translation performance as compared to non-preprocessed source sentences. We further perform side-by-side human evaluation to verify that translations of the simplified sentences are better than the original ones. Finally, we provide some guidance on recommended language pairs for generating the simplification model corpora by investigating the relationship between ease of translation of a language pair (as measured by BLEU) and quality of the resulting simplification model from back-translations of this language pair (as measured by SARI), and tie this into the downstream task of low-resource translation.

5. Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Danni Liu, Gerasimos Spanakis, Jan Niehues
Abstract: Encoder-decoder models provide a generic architecture for sequence-to-sequence tasks such as speech recognition and translation. While offline systems are often evaluated on quality metrics like word error rates (WER) and BLEU, latency is also a crucial factor in many practical use-cases. We propose three latency reduction techniques for chunk-based incremental inference and evaluate their efficiency in terms of accuracy-latency trade-off. On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 second by sacrificing 1% WER (6% rel.) compared to offline transcription. Although our experiments use the Transformer, the hypothesis selection strategies are applicable to other encoder-decoder models. To avoid expensive re-computation, we use a unidirectionally-attending encoder. After an adaptation procedure to partial sequences, the unidirectional model performs on-par with the original model. We further show that our approach is also applicable to low-latency speech translation. On How2 English-Portuguese speech translation, we reduce latency to 0.7 second (-84% rel.) while incurring a loss of 2.4 BLEU points (5% rel.) compared to the offline system.

6. End-to-end Named Entity Recognition from English Speech
Hemant Yadav, Sreyan Ghosh, Yi Yu, Rajiv Ratn Shah
Abstract: Named entity recognition (NER) from text has been a widely studied problem and usually extracts semantic information from text. Until now, NER from speech is mostly studied in a two-step pipeline process that includes first applying an automatic speech recognition (ASR) system on an audio sample and then passing the predicted transcript to a NER tagger. In such cases, the error does not propagate from one step to another as both the tasks are not optimized in an end-to-end (E2E) fashion. Recent studies confirm that integrated approaches (e.g., E2E ASR) outperform sequential ones (e.g., phoneme based ASR). In this paper, we introduce a first publicly available NER annotated dataset for English speech and present an E2E approach, which jointly optimizes the ASR and NER tagger components. Experimental results show that the proposed E2E approach outperforms the classical two-step approach. We also discuss how NER from speech can be used to handle out of vocabulary (OOV) words in an ASR system.

7. RUSSE'2020: Findings of the First Taxonomy Enrichment Task for the Russian language
Irina Nikishina, Varvara Logacheva, Alexander Panchenko, Natalia Loukachevitch
Abstract: This paper describes the results of the first shared task on taxonomy enrichment for the Russian language. The participants were asked to extend an existing taxonomy with previously unseen words: for each new word their systems should provide a ranked list of possible (candidate) hypernyms. In comparison to the previous tasks for other languages, our competition has a more realistic task setting: new words were provided without definitions. Instead, we provided a textual corpus where these new terms occurred. For this evaluation campaign, we developed a new evaluation dataset based on unpublished RuWordNet data. The shared task features two tracks: "nouns" and "verbs". 16 teams participated in the task demonstrating high results with more than half of them outperforming the provided baseline.

8. Prototypical Q Networks for Automatic Conversational Diagnosis and Few-Shot New Disease Adaption
Hongyin Luo, Shang-Wen Li, James Glass
Abstract: Spoken dialog systems have seen applications in many domains, including medical for automatic conversational diagnosis. State-of-the-art dialog managers are usually driven by deep reinforcement learning models, such as deep Q networks (DQNs), which learn by interacting with a simulator to explore the entire action space since real conversations are limited. However, the DQN-based automatic diagnosis models do not achieve satisfying performances when adapted to new, unseen diseases with only a few training samples. In this work, we propose the Prototypical Q Networks (ProtoQN) as the dialog manager for the automatic diagnosis systems. The model calculates prototype embeddings with real conversations between doctors and patients, learning from them and simulator-augmented dialogs more efficiently. We create both supervised and few-shot learning tasks with the Muzhi corpus. Experiments showed that the ProtoQN significantly outperformed the baseline DQN model in both supervised and few-shot learning scenarios, and achieves state-of-the-art few-shot learning performances.

9. Living Machines: A study of atypical animacy
Mariona Coll Ardanuy, Federico Nanni, Kaspar Beelen, Kasra Hosseini, Ruth Ahnert, Jon Lawrence, Katherine McDonough, Giorgia Tolfo, Daniel CS Wilson, Barbara McGillivray
Abstract: This paper proposes a new approach to animacy detection, the task of determining whether an entity is represented as animate in a text. In particular, this work is focused on atypical animacy and examines the scenario in which typically inanimate objects, specifically machines, are given animate attributes. To address it, we have created the first dataset for atypical animacy detection, based on nineteenth-century sentences in English, with machines represented as either animate or inanimate. Our method builds upon recent innovations in language modeling, specifically BERT contextualized word embeddings, to better capture fine-grained contextual properties of words. We present a fully unsupervised pipeline, which can be easily adapted to different contexts, and report its performance on an established animacy dataset and our newly introduced resource. We show that our method provides a substantially more accurate characterization of atypical animacy, especially when applied to highly complex forms of language use.

10. Bootstrapping Named Entity Recognition in E-Commerce with Positive Unlabeled Learning
Hanchu Zhang, Leonhard Hennig, Christoph Alt, Changjian Hu, Yao Meng, Chao Wang
Abstract: Named Entity Recognition (NER) in domains like e-commerce is an understudied problem due to the lack of annotated datasets. Recognizing novel entity types in this domain, such as products, components, and attributes, is challenging because of their linguistic complexity and the low coverage of existing knowledge resources. To address this problem, we present a bootstrapped positive-unlabeled learning algorithm that integrates domain-specific linguistic features to quickly and efficiently expand the seed dictionary. The model achieves an average F1 score of 72.02% on a novel dataset of product descriptions, an improvement of 3.63% over a baseline BiLSTM classifier, and in particular exhibits better recall (4.96% on average).

11. T-RECS: a Transformer-based Recommender Generating Textual Explanations and Integrating Unsupervised Language-based Critiquing
Diego Antognini, Claudiu Musat, Boi Faltings
Abstract: Supporting recommendations with personalized and relevant explanations increases trust and perceived quality, and helps users make better decisions. Prior work attempted to generate a synthetic review or review segment as an explanation, but they were not judged convincing in evaluations by human users. We propose T-RECS, a multi-task learning Transformer-based model that jointly performs recommendation with textual explanations using a novel multi-aspect masking technique. We show that human users significantly prefer the justifications generated by T-RECS to those generated by state-of-the-art techniques. At the same time, experiments on two datasets show that T-RECS slightly improves on the recommendation performance of strong state-of-the-art baselines. Another feature of T-RECS is that it allows users to react to a recommendation by critiquing the textual explanation. The system updates its user model and the resulting recommendations according to the critique. This is based on a novel unsupervised critiquing method for single- and multi-step critiquing with textual explanations. Experiments on two real-world datasets show that T-RECS is the first to obtain good performance in adapting to the preferences expressed in multi-step critiquing.

12. Improving Segmentation for Technical Support Problems
Kushal Chauhan, Abhirut Gupta
Abstract: Technical support problems are often long and complex. They typically contain user descriptions of the problem, the setup, and steps for attempted resolution. Often they also contain various non-natural language text elements like outputs of commands, snippets of code, error messages or stack traces. These elements contain potentially crucial information for problem resolution. However, they cannot be correctly parsed by tools designed for natural language. In this paper, we address the problem of segmentation for technical support questions. We formulate the problem as a sequence labelling task, and study the performance of state of the art approaches. We compare this against an intuitive contextual sentence-level classification baseline, and a state of the art supervised text-segmentation approach. We also introduce a novel component of combining contextual embeddings from multiple language models pre-trained on different data sources, which achieves a marked improvement over using embeddings from a single pre-trained language model. Finally, we also demonstrate the usefulness of such segmentation with improvements on the downstream task of answer retrieval.

13. Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models
Mengxi Wei, Yifan He, Qiong Zhang
Abstract: Many business documents processed in modern NLP and IR pipelines are visually rich: in addition to text, their semantics can also be captured by visual traits such as layout, format, and fonts. We study the problem of information extraction from visually rich documents (VRDs) and present a model that combines the power of large pre-trained language models and graph neural networks to efficiently encode both textual and visual information in business documents. We further introduce new fine-tuning objectives to improve in-domain unsupervised fine-tuning to better utilize large amount of unlabeled in-domain data. We experiment on real world invoice and resume data sets and show that the proposed method outperforms strong text-based RoBERTa baselines by 6.3% absolute F1 on invoices and 4.7% absolute F1 on resumes. When evaluated in a few-shot setting, our method requires up to 30x less annotation data than the baseline to achieve the same level of performance at ~90% F1.

14. Intent Mining from past conversations for Conversational Agent
Ajay Chatterjee, Shubhashis Sengupta
Abstract: Conversational systems are of primary interest in the AI community. Chatbots are increasingly being deployed to provide round-the-clock support and to increase customer engagement. Many of the commercial bot building frameworks follow a standard approach that requires one to build and train an intent model to recognize a user input. Intent models are trained in a supervised setting with a collection of textual utterance and intent label pairs. Gathering substantial training data with wide coverage of different intents is a bottleneck in the bot building process. Moreover, labeling a hundred to thousands of conversations with intents is a time-consuming and laborious job. In this paper, we present an intent discovery framework that generates intent training data from raw conversations in 4 primary steps: extraction of textual utterances from a conversation using a pre-trained domain-agnostic Dialog Act Classifier (Data Extraction), automatic clustering of similar user utterances (Clustering), manual annotation of clusters with an intent label (Labeling), and propagation of intent labels to the utterances from the previous step that are not mapped to any cluster (Label Propagation). We introduce a novel density-based clustering algorithm, ITER-DBSCAN, for unbalanced data clustering. Subject matter experts (annotators with domain expertise) manually look into the clustered user utterances and provide an intent label for discovery. We conducted user studies to validate the effectiveness of the trained intent model in terms of coverage of intents, accuracy, and time saved relative to manual annotation. Although the system is developed for building an intent model for a conversational system, this framework can also be used for short text clustering or as a labeling framework.

15. Investigating Label Bias in Beam Search for Open-ended Text Generation
Liang Wang, Jinlong Liu, Jingming Liu
Abstract: Beam search is an effective and widely used decoding algorithm in many sequence-to-sequence (seq2seq) text generation tasks. However, in open-ended text generation, beam search is often found to produce repetitive and generic texts, so sampling-based decoding algorithms like top-k sampling and nucleus sampling are preferred. Standard seq2seq models suffer from label bias due to their locally normalized probability formulation. This paper provides a series of empirical evidence that label bias is a major reason for such degenerate behaviors of beam search. By combining locally normalized maximum likelihood estimation and globally normalized sequence-level training, label bias can be reduced with almost no sacrifice in perplexity. To quantitatively measure label bias, we test the model's ability to discriminate the groundtruth text from a set of context-agnostic distractors. We conduct experiments on large-scale response generation datasets. Results show that beam search can produce more diverse and meaningful texts with our approach, in terms of both automatic and human evaluation metrics. Our analysis also suggests several future working directions towards the grand challenge of open-ended text generation.

16. Extracting Daily Dosage from Medication Instructions in EHRs: An Automated Approach and Lessons Learned
Diwakar Mahajan, Jennifer J. Liang, Ching-Huei Tsou
Abstract: Understanding a patient's medication history is essential for physicians to provide appropriate treatment recommendations. A medication's prescribed daily dosage is a key element of the medication history; however, it is generally not provided as a discrete quantity and needs to be derived from free text medication instructions (Sigs) in the structured electronic health record (EHR). Existing works in daily dosage extraction are narrow in scope, dealing with dosage extraction for a single drug from clinical notes. Here, we present an automated approach to calculate daily dosage for all medications in EHR structured data. We describe and characterize the variable language used in Sigs, and present our hybrid system for calculating daily dosage combining deep learning-based named entity extractor with lexicon dictionaries and regular expressions. Our system achieves 0.98 precision and 0.95 recall on an expert-generated dataset of 1000 Sigs, demonstrating its effectiveness on the general purpose daily dosage calculation task.

17. Evaluating Neural Morphological Taggers for Sanskrit
Ashim Gupta, Amrith Krishna, Pawan Goyal, Oliver Hellwig
Abstract: Neural sequence labelling approaches have achieved state of the art results in morphological tagging. We evaluate the efficacy of four standard sequence labelling models on Sanskrit, a morphologically rich, fusional Indian language. As its label space can theoretically contain more than 40,000 labels, systems that explicitly model the internal structure of a label are more suited for the task, because of their ability to generalise to labels not seen during training. We find that although some neural models perform better than others, one of the common causes for error for all of these models is mispredictions due to syncretism.

18. Givenness Hierarchy Theoretic Cognitive Status Filtering
Poulomi Pal, Lixiao Zhu, Andrea Golden-Lasher, Akshay Swaminathan, Tom Williams
Abstract: For language-capable interactive robots to be effectively introduced into human society, they must be able to naturally and efficiently communicate about the objects, locations, and people found in human environments. An important aspect of natural language communication is the use of pronouns. According to the linguistic theory of the Givenness Hierarchy (GH), humans use pronouns due to implicit assumptions about the cognitive statuses their referents have in the minds of their conversational partners. In previous work, Williams et al. presented the first computational implementation of the full GH for the purpose of robot language understanding, leveraging a set of rules informed by the GH literature. However, that approach was designed specifically for language understanding, oriented around GH-inspired memory structures used to assess what entities are candidate referents given a particular cognitive status. In contrast, language generation requires a model in which cognitive status can be assessed for a given entity. We present and compare two such models of cognitive status: a rule-based Finite State Machine model directly informed by the GH literature and a Cognitive Status Filter designed to more flexibly handle uncertainty. The models are demonstrated and evaluated using a silver-standard English subset of the OFAI Multimodal Task Description Corpus.

19. L2R2: Leveraging Ranking for Abductive Reasoning
Yunchang Zhu, Liang Pang, Yanyan Lan, Xueqi Cheng
Abstract: The abductive natural language inference task ($\alpha$NLI) is proposed to evaluate the abductive reasoning ability of a learning system. In the $\alpha$NLI task, two observations are given and the most plausible hypothesis must be picked out from the candidates. Existing methods simply formulate it as a classification problem, thus a cross-entropy log-loss objective is used during training. However, discriminating true from false does not measure the plausibility of a hypothesis, since all the hypotheses have a chance to happen and only their probabilities differ. To fill this gap, we switch to a ranking perspective that sorts the hypotheses in order of their plausibilities. With this new perspective, a novel $L2R^2$ approach is proposed under the learning-to-rank framework. Firstly, training samples are reorganized into a ranking form, where two observations and their hypotheses are treated as the query and a set of candidate documents respectively. Then, an ESIM model or pre-trained language model, e.g. BERT or RoBERTa, is obtained as the scoring function. Finally, the loss functions for the ranking task can be either pair-wise or list-wise for training. The experimental results on the ART dataset reach the state-of-the-art in the public leaderboard.

20. GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information
Umair Qazi, Muhammad Imran, Ferda Ofli
Abstract: The past several years have witnessed a huge surge in the use of social media platforms during mass convergence events such as health emergencies, natural or human-induced disasters. These non-traditional data sources are becoming vital for disease forecasts and surveillance when preparing for epidemic and pandemic outbreaks. In this paper, we present GeoCoV19, a large-scale Twitter dataset containing more than 524 million multilingual tweets posted over a period of 90 days since February 1, 2020. Moreover, we employ a gazetteer-based approach to infer the geolocation of tweets. We postulate that this large-scale, multilingual, geolocated social media data can empower the research communities to evaluate how societies are collectively coping with this unprecedented global crisis as well as to develop computational methods to address challenges such as identifying fake news, understanding communities' knowledge gaps, building disease forecast and surveillance models, among others.

21. Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language
Philipp Scharpf, Moritz Schubotz, Abdou Youssef, Felix Hamborg, Norman Meuschke, Bela Gipp
Abstract: In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to $82.8\%$ and cluster purities up to $69.4\%$ (number of clusters equals number of classes), and $99.9\%$ (unspecified number of clusters) respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.
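The abstract above reports cluster purity for two settings. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: each cluster is credited with its most frequent class, and purity is the credited share of all items. The function name `cluster_purity` and the toy labels are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def cluster_purity(cluster_ids, class_labels):
    """Share of items that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")
    # Count class labels separately inside each cluster.
    per_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cid, Counter())[label] += 1
    # Each cluster contributes the size of its largest class.
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(cluster_ids)


# Toy data: six documents, two clusters, three subject classes.
labels = ["math", "math", "cs", "cs", "cs", "physics"]
two_clusters = cluster_purity([0, 0, 0, 1, 1, 1], labels)
singletons = cluster_purity([0, 1, 2, 3, 4, 5], labels)
```

With every item in its own cluster, purity is trivially 1.0, so purity generally rises as the number of clusters grows. That is worth keeping in mind when comparing a figure obtained with as many clusters as classes against one obtained with an unconstrained number of clusters.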

22. NAUTILUS: a Versatile Voice Cloning System [PDF] Back to contents
Hieu-Thi Luong, Junichi Yamagishi
Abstract: We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker. By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm. Moreover, depending on the data circumstance of the target speaker, the cloning strategy can be adjusted to take advantage of additional data and modify the behaviors of text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the situation. We test the performance of the proposed framework by using deep convolution layers to model the encoders, decoders and WaveNet vocoder. Evaluations show that it achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech. Moreover, it is demonstrated that the proposed framework has the ability to switch between TTS and VC with high speaker consistency, which will be useful for many applications.

23. Team Neuro at SemEval-2020 Task 8: Multi-Modal Fine Grain Emotion Classification of Memes using Multitask Learning [PDF] Back to contents
Sourya Dipta Das, Soumil Mandal
Abstract: In this article, we describe the system that we used for the memotion analysis challenge, which is Task 8 of SemEval-2020. This challenge had three subtasks where affect-based sentiment classification of the memes was required along with intensities. The system we proposed combines the three tasks into a single one by representing it as a multi-label hierarchical classification problem. Here, multi-task learning, or joint learning, is used to train our model. We have used dual channels to extract text- and image-based features from separate deep neural network backbones and aggregate them to create task-specific features. These task-specific aggregated feature vectors are then passed on to smaller networks with dense layers, each one assigned to predicting one type of fine-grained sentiment label. Our proposed method shows the superiority of this system in a few tasks over other best models from the challenge.

24. Trialstreamer: Mapping and Browsing Medical Evidence in Real-Time [PDF] Back to contents
Benjamin E. Nye, Ani Nenkova, Iain J. Marshall, Byron C. Wallace
Abstract: We introduce Trialstreamer, a living database of clinical trial reports. Here we mainly describe the evidence extraction component; this extracts from biomedical abstracts key pieces of information that clinicians need when appraising the literature, and also the relations between these. Specifically, the system extracts descriptions of trial participants, the treatments compared in each arm (the interventions), and which outcomes were measured. The system then attempts to infer which interventions were reported to work best by determining their relationship with identified trial outcome measures. In addition to summarizing individual trials, these extracted data elements allow automatic synthesis of results across many trials on the same topic. We apply the system at scale to all reports of randomized controlled trials indexed in MEDLINE, powering the automatic generation of evidence maps, which provide a global view of the efficacy of different interventions combining data from all relevant clinical trials on a topic. We make all code and models freely available alongside a demonstration of the web interface.

Note: The Chinese text is machine-translated!


[arXiv papers] Computation and Language 2020-05-25

Contents

Abstracts