Contents
1. Rethinking Batch Normalization in Transformers [PDF] Abstract
2. A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs [PDF] Abstract
3. PO-EMO: Conceptualization, Annotation, and Modeling of Aesthetic Emotions in German and English Poetry [PDF] Abstract
4. Adapting Deep Learning Methods for Mental Health Prediction on Social Media [PDF] Abstract
5. XPersona: Evaluating Multilingual Personalized Chatbot [PDF] Abstract
6. Multi-label natural language processing to identify diagnosis and procedure codes from MIMIC-III inpatient notes [PDF] Abstract
7. Recent Advances and Challenges in Task-oriented Dialog System [PDF] Abstract
8. Offensive Language Identification in Greek [PDF] Abstract
9. HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual Morpheme Alignment [PDF] Abstract
10. A Label Proportions Estimation technique for Adversarial Domain Adaptation in Text Classification [PDF] Abstract
11. LAXARY: A Trustworthy Explainable Twitter Analysis Model for Post-Traumatic Stress Disorder Assessment [PDF] Abstract
12. Developing a Multilingual Annotated Corpus of Misogyny and Aggression [PDF] Abstract
13. Parallel sequence tagging for concept recognition [PDF] Abstract
14. A Formal Analysis of Multimodal Referring Strategies Under Common Ground [PDF] Abstract
15. Overview of the TREC 2019 deep learning track [PDF] Abstract
16. Multi-modal Dense Video Captioning [PDF] Abstract
17. Hybrid Autoregressive Transducer (hat) [PDF] Abstract
18. Who Wins the Game of Thrones? How Sentiments Improve the Prediction of Candidate Choice [PDF] Abstract
19. High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model [PDF] Abstract
20. Harnessing Explanations to Bridge AI and Humans [PDF] Abstract
Abstracts
1. Rethinking Batch Normalization in Transformers [PDF] Back to contents
Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, Kurt Keutzer
Abstract: The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN). This is different than batch normalization (BN), which is widely-adopted in Computer Vision. The preferred use of LN in NLP is principally due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks; however, a thorough understanding of the underlying reasons for this is not always evident. In this paper, we perform a systematic study of NLP transformer models to understand why BN has a poor performance, as compared to LN. We find that the statistics of NLP data across the batch dimension exhibit large fluctuations throughout training. This results in instability, if BN is naively implemented. To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass. We show theoretically, under mild assumptions, that PN leads to a smaller Lipschitz constant for the loss, compared with BN. Furthermore, we prove that the approximate backpropagation scheme leads to bounded gradients. We extensively test PN for transformers on a range of NLP tasks, and we show that it significantly outperforms both LN and BN. In particular, PN outperforms LN by 0.4/0.6 BLEU on IWSLT14/WMT14 and 5.6/3.0 PPL on PTB/WikiText-103.
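A minimal NumPy sketch of the Power Normalization forward pass as described in the abstract: the zero-mean shift of BN is dropped and activations are rescaled by a running quadratic mean. The approximate backpropagation through the running statistics is omitted, and the shapes and momentum value are illustrative assumptions rather than the authors' implementation.

    import numpy as np

    class PowerNorm:
        """Simplified Power Normalization (PN) forward pass: divide by a
        running quadratic mean (RMS) instead of per-batch mean/variance."""

        def __init__(self, d_model, momentum=0.99, eps=1e-5):
            self.gamma = np.ones(d_model)        # learnable scale
            self.beta = np.zeros(d_model)        # learnable shift
            self.running_psi = np.ones(d_model)  # running estimate of E[x^2]
            self.momentum = momentum
            self.eps = eps

        def forward(self, x, training=True):
            # x: (batch, seq_len, d_model)
            if training:
                psi = (x ** 2).mean(axis=(0, 1))  # per-feature quadratic mean
                self.running_psi = (self.momentum * self.running_psi
                                    + (1.0 - self.momentum) * psi)
            x_hat = x / np.sqrt(self.running_psi + self.eps)  # no mean subtraction
            return self.gamma * x_hat + self.beta

    # toy usage
    pn = PowerNorm(d_model=8)
    out = pn.forward(np.random.randn(4, 16, 8))
    print(out.shape)  # (4, 16, 8)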
2. A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs [PDF] Back to contents
Zequn Sun, Qingheng Zhang, Wei Hu, Chengming Wang, Muhao Chen, Farahnaz Akrami, Chengkai Li
Abstract: Entity alignment seeks to find entities in different knowledge graphs (KGs) that refer to the same real-world object. Recent advancement in KG embedding impels the advent of embedding-based entity alignment, which encodes entities in a continuous embedding space and measures entity similarities based on the learned embeddings. In this paper, we conduct a comprehensive experimental study of this emerging field. This study surveys 23 recent embedding-based entity alignment approaches and categorizes them based on their techniques and characteristics. We further observe that current approaches use different datasets in evaluation, and the degree distributions of entities in these datasets are inconsistent with real KGs. Hence, we propose a new KG sampling algorithm, with which we generate a set of dedicated benchmark datasets with various heterogeneity and distributions for a realistic evaluation. This study also produces an open-source library, which includes 12 representative embedding-based entity alignment approaches. We extensively evaluate these approaches on the generated datasets, to understand their strengths and limitations. Additionally, for several directions that have not been explored in current approaches, we perform exploratory experiments and report our preliminary findings for future studies. The benchmark datasets, open-source library and experimental results are all accessible online and will be duly maintained.
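To make the embedding-based setting concrete, here is a small illustrative sketch (not code from the paper's open-source library): entities from two KGs are matched by cosine similarity between their learned embeddings, and the result is scored with Hits@k.

    import numpy as np

    def align_by_cosine(src_emb, tgt_emb):
        """Rank target entities for each source entity by cosine similarity."""
        src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
        tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
        sim = src @ tgt.T                  # (n_src, n_tgt) similarity matrix
        return np.argsort(-sim, axis=1)    # ranked candidate indices per source entity

    def hits_at_k(ranks, gold, k=10):
        """Fraction of source entities whose gold counterpart is in the top-k."""
        return float(np.mean([gold[i] in ranks[i, :k] for i in range(len(gold))]))

    # toy data: 5 aligned entity pairs in a 16-dim embedding space
    rng = np.random.default_rng(0)
    src = rng.normal(size=(5, 16))
    tgt = src + 0.05 * rng.normal(size=(5, 16))   # near-identical counterparts
    ranked = align_by_cosine(src, tgt)
    print(hits_at_k(ranked, gold=list(range(5)), k=1))  # expected 1.0 on this toy data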
3. PO-EMO: Conceptualization, Annotation, and Modeling of Aesthetic Emotions in German and English Poetry [PDF] Back to contents
Thomas Haider, Steffen Eger, Evgeny Kim, Roman Klinger, Winfried Menninghaus
Abstract: Most approaches to emotion analysis regarding social media, literature, news, and other domains focus exclusively on basic emotion categories as defined by Ekman or Plutchik. However, art (such as literature) enables engagement in a broader range of more complex and subtle emotions that have been shown to also include mixed emotional responses. We consider emotions as they are elicited in the reader, rather than what is expressed in the text or intended by the author. Thus, we conceptualize a set of aesthetic emotions that are predictive of aesthetic appreciation in the reader, and allow the annotation of multiple labels per line to capture mixed emotions within context. We evaluate this novel setting in an annotation experiment both with carefully trained experts and via crowdsourcing. Our annotation with experts leads to an acceptable agreement of kappa=.70, resulting in a consistent dataset for future large scale analysis. Finally, we conduct first emotion classification experiments based on BERT, showing that identifying aesthetic emotions is challenging in our data, with up to .52 F1-micro on the German subset. Data and resources are available at this https URL
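For reference, the two figures quoted above (kappa = .70 agreement, .52 F1-micro) correspond to standard metrics that can be computed with scikit-learn as sketched below; the label arrays are toy placeholders, not PO-EMO annotations.

    from sklearn.metrics import cohen_kappa_score, f1_score

    # agreement between two annotators on a single-label view of the task
    annotator_a = ["joy", "sadness", "awe", "joy", "awe"]
    annotator_b = ["joy", "sadness", "awe", "sadness", "awe"]
    print("kappa:", cohen_kappa_score(annotator_a, annotator_b))  # ~0.71 on this toy data

    # micro-averaged F1 for multi-label emotion predictions (one column per emotion)
    y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
    y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
    print("F1-micro:", f1_score(y_true, y_pred, average="micro"))  # 0.75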
4. Adapting Deep Learning Methods for Mental Health Prediction on Social Media [PDF] Back to contents
Ivan Sekulić, Michael Strube
Abstract: Mental health poses a significant challenge for an individual's well-being. Text analysis of rich resources, like social media, can contribute to deeper understanding of illnesses and provide means for their early detection. We tackle a challenge of detecting social media users' mental status through deep learning-based models, moving away from traditional approaches to the task. In a binary classification task on predicting if a user suffers from one of nine different disorders, a hierarchical attention network outperforms previously set benchmarks for four of the disorders. Furthermore, we explore the limitations of our model and analyze phrases relevant for classification by inspecting the model's word-level attention weights.
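As a rough illustration of the word-level attention inspection mentioned in the last sentence (a generic sketch, not the authors' hierarchical attention network): token scores against a context vector are softmax-normalized, and the highest-weighted words point to the phrases the classifier relied on.

    import numpy as np

    def word_attention(token_vecs, context):
        """Softmax-normalized attention weights of tokens against a context vector."""
        scores = token_vecs @ context                  # one relevance score per token
        scores = scores - scores.max()                 # numerical stability
        weights = np.exp(scores) / np.exp(scores).sum()
        return weights

    tokens = ["i", "cannot", "sleep", "at", "night", "anymore"]
    rng = np.random.default_rng(1)
    vecs = rng.normal(size=(len(tokens), 8))           # placeholder token embeddings
    ctx = rng.normal(size=8)                            # placeholder context vector
    weights = word_attention(vecs, ctx)
    for tok, w in sorted(zip(tokens, weights), key=lambda p: -p[1])[:3]:
        print(f"{tok}: {w:.3f}")                        # top-weighted words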
5. XPersona: Evaluating Multilingual Personalized Chatbot [PDF] Back to contents
Zhaojiang Lin, Zihan Liu, Genta Indra Winata, Samuel Cahyawijaya, Andrea Madotto, Yejin Bang, Etsuko Ishii, Pascale Fung
Abstract: Personalized dialogue systems are an essential step toward better human-machine interaction. Existing personalized dialogue agents rely on properly designed conversational datasets, which are mostly monolingual (e.g., English), which greatly limits the usage of conversational agents in other languages. In this paper, we propose a multi-lingual extension of Persona-Chat, namely XPersona. Our dataset includes persona conversations in six different languages other than English for building and evaluating multilingual personalized agents. We experiment with both multilingual and cross-lingual trained baselines, and evaluate them against monolingual and translation-pipeline models using both automatic and human evaluation. Experimental results show that the multilingual trained models outperform the translation-pipeline and that they are on par with the monolingual models, with the advantage of having a single model across multiple languages. On the other hand, the state-of-the-art cross-lingual trained models achieve inferior performance to the other models, showing that cross-lingual conversation modeling is a challenging task. We hope that our dataset and baselines (datasets and all the baselines will be released) will accelerate research in multilingual dialogue systems.
6. Multi-label natural language processing to identify diagnosis and procedure codes from MIMIC-III inpatient notes [PDF] Back to contents
A.K. Bhavani Singh, Mounika Guntu, Ananth Reddy Bhimireddy, Judy W. Gichoya, Saptarshi Purkayastha
Abstract: In the United States, 25% or greater than 200 billion dollars of hospital spending accounts for administrative costs that involve services for medical coding and billing. With the increasing number of patient records, manual assignment of the codes performed is overwhelming, time-consuming and error-prone, causing billing errors. Natural language processing can automate the extraction of codes/labels from unstructured clinical notes, which can aid human coders to save time, increase productivity, and verify medical coding errors. Our objective is to identify appropriate diagnosis and procedure codes from clinical notes by performing multi-label classification. We used de-identified data of critical care patients from the MIMIC-III database and subset the data to select the ten (top-10) and fifty (top-50) most common diagnoses and procedures, which covers 47.45% and 74.12% of all admissions respectively. We implemented state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) to fine-tune the language model on 80% of the data and validated on the remaining 20%. The model achieved an overall accuracy of 87.08%, an F1 score of 85.82%, and an AUC of 91.76% for top-10 codes. For the top-50 codes, our model achieved an overall accuracy of 93.76%, an F1 score of 92.24%, and AUC of 91%. When compared to previously published research, our model outperforms in predicting codes from the clinical text. We discuss approaches to generalize the knowledge discovery process of our MIMIC-BERT to other clinical notes. This can help human coders to save time, prevent backlogs, and additional costs due to coding errors.
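The multi-label formulation described above, with one independent sigmoid output per ICD code rather than a softmax over mutually exclusive classes, can be sketched in PyTorch as follows; the pooled input is a stand-in for BERT's sentence representation, and the dimensions are assumptions.

    import torch
    import torch.nn as nn

    class MultiLabelHead(nn.Module):
        """Classification head for predicting several ICD codes per note."""

        def __init__(self, hidden_size=768, num_codes=50):
            super().__init__()
            self.classifier = nn.Linear(hidden_size, num_codes)

        def forward(self, pooled):                # pooled: (batch, hidden_size)
            return self.classifier(pooled)        # raw logits, one per code

    head = MultiLabelHead()
    pooled = torch.randn(4, 768)                  # stand-in for BERT's pooled output
    labels = torch.randint(0, 2, (4, 50)).float() # multi-hot gold codes
    loss = nn.BCEWithLogitsLoss()(head(pooled), labels)  # independent sigmoid per code
    preds = (torch.sigmoid(head(pooled)) > 0.5).int()    # threshold to assign codes
    print(loss.item(), preds.shape)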
7. Recent Advances and Challenges in Task-oriented Dialog System [PDF] Back to contents
Zheng Zhang, Ryuichi Takanobu, Minlie Huang, Xiaoyan Zhu
Abstract: Due to the significance and value in human-computer interaction and natural language processing, task-oriented dialog systems are attracting more and more attention in both academic and industrial communities. In this paper, we survey recent advances and challenges in an issue-specific manner. We discuss three critical topics for task-oriented dialog systems: (1) improving data efficiency to facilitate dialog system modeling in low-resource settings, (2) modeling multi-turn dynamics for dialog policy learning to achieve better task-completion performance, and (3) integrating domain ontology knowledge into the dialog model in both pipeline and end-to-end models. We also review the recent progresses in dialog evaluation and some widely-used corpora. We believe that this survey can shed a light on future research in task-oriented dialog systems.
8. Offensive Language Identification in Greek [PDF] Back to contents
Zeses Pitenis, Marcos Zampieri, Tharindu Ranasinghe
Abstract: As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types: cyberbullying, hate speech, aggression, etc. With a few notable exceptions, most research on this topic so far has dealt with English. This is mostly due to the availability of language resources for English. To address this shortcoming, this paper presents the first Greek annotated dataset for offensive language identification: the Offensive Greek Tweet Dataset (OGTD). OGTD is a manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive. Along with a detailed description of the dataset, we evaluate several computational models trained and tested on this data.
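As an example of the kind of computational model typically evaluated on such a dataset (an assumed baseline, not necessarily one of the models in the paper), a TF-IDF plus logistic-regression classifier in scikit-learn looks like this; the example texts and labels are invented placeholders, not OGTD tweets.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["have a great day friend", "you are worthless trash",
             "lovely weather today", "nobody wants you here"]
    labels = [0, 1, 0, 1]   # 1 = offensive, 0 = not offensive (toy labels)

    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    print(model.predict(["have a lovely day"]))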
9. HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual Morpheme Alignment [PDF] Back to contents
Anssi Yli-Jyrä, Josi Purhonen, Matti Liljeqvist, Arto Antturi, Pekka Nieminen, Kari M. Räntilä, Valtter Luoto
Abstract: Twenty-five years ago, morphologically aligned Hebrew-Finnish and Greek-Finnish bitexts (texts accompanied by a translation) were constructed manually in order to create an analytical concordance (Luoto et al., 1997) for a Finnish Bible translation. The creators of the bitexts recently secured the publisher's permission to release its fine-grained alignment, but the alignment was still dependent on proprietary, third-party resources such as a copyrighted text edition and proprietary morphological analyses of the source texts. In this paper, we describe a nontrivial editorial process starting from the creation of the original one-purpose database and ending with its reconstruction using only freely available text editions and annotations. This process produced an openly available dataset that contains (i) the source texts and their translations, (ii) the morphological analyses, (iii) the cross-lingual morpheme alignments.
10. A Label Proportions Estimation technique for Adversarial Domain Adaptation in Text Classification [PDF] Back to contents
Zhuohao Chen
Abstract: Many text classification tasks are domain-dependent, and various domain adaptation approaches have been proposed to predict unlabeled data in a new domain. Domain-adversarial neural networks (DANN) and their variants have been actively used recently and have achieved state-of-the-art results for this problem. However, most of these approaches assume that the label proportions of the source and target domains are similar, which rarely holds in real-world scenarios. Sometimes the label shift is very large and the DANN fails to learn domain-invariant features. In this study, we focus on unsupervised domain adaptation of text classification with label shift and introduce a domain adversarial network with label proportions estimation (DAN-LPE) framework. The DAN-LPE simultaneously trains a domain adversarial net and processes label proportions estimation by the distributions of the predictions. Experiments show the DAN-LPE achieves a good estimate of the target label distributions and reduces the label shift to improve the classification performance.
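The label proportions estimation step can be illustrated in a few lines: a rough estimate of the target domain's label distribution is obtained by averaging the classifier's predicted class probabilities over unlabeled target documents. This is a simplified stand-in for the DAN-LPE estimation, not the paper's exact procedure.

    import numpy as np

    def estimate_label_proportions(pred_probs):
        """Estimate class proportions by averaging predicted probabilities."""
        return pred_probs.mean(axis=0)

    # toy predictions for 6 unlabeled target-domain documents, 3 classes
    pred_probs = np.array([
        [0.7, 0.2, 0.1],
        [0.6, 0.3, 0.1],
        [0.2, 0.7, 0.1],
        [0.1, 0.8, 0.1],
        [0.5, 0.4, 0.1],
        [0.2, 0.2, 0.6],
    ])
    print(estimate_label_proportions(pred_probs))   # ~[0.38, 0.43, 0.18]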
11. LAXARY: A Trustworthy Explainable Twitter Analysis Model for Post-Traumatic Stress Disorder Assessment [PDF] Back to contents
Mohammad Arif Ul Alam, Dhawal Kapadia
Abstract: Veteran mental health is a significant national problem as a large number of veterans are returning from the recent war in Iraq and continued military presence in Afghanistan. While significant existing works have investigated twitter posts-based Post Traumatic Stress Disorder (PTSD) assessment using blackbox machine learning techniques, these frameworks cannot be trusted by the clinicians due to the lack of clinical explainability. To obtain the trust of clinicians, we explore the big question: can twitter posts provide enough information to fill up clinical PTSD assessment surveys that have been traditionally trusted by clinicians? To answer the above question, we propose the LAXARY (Linguistic Analysis-based Explainable Inquiry) model, a novel Explainable Artificial Intelligence (XAI) model to detect and represent PTSD assessment of twitter users using a modified Linguistic Inquiry and Word Count (LIWC) analysis. First, we employ clinically validated survey tools for collecting clinical PTSD assessment data from real twitter users and develop a PTSD Linguistic Dictionary using the PTSD assessment survey results. Then, we use the PTSD Linguistic Dictionary along with a machine learning model to fill up the survey tools towards detecting PTSD status and its intensity for corresponding twitter users. Our experimental evaluation on 210 clinically validated veteran twitter users provides promising accuracies of both PTSD classification and its intensity estimation. We also evaluate our developed PTSD Linguistic Dictionary's reliability and validity.
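The LIWC-style analysis mentioned above amounts to counting words against category dictionaries; the sketch below illustrates the idea with a small invented two-category dictionary (the real LIWC lexicon and the paper's PTSD Linguistic Dictionary are not reproduced here).

    from collections import Counter

    # hypothetical category dictionary, not the real LIWC or PTSD lexicon
    CATEGORIES = {
        "negative_emotion": {"afraid", "nightmare", "alone", "angry"},
        "sleep": {"sleep", "awake", "insomnia", "night"},
    }

    def category_proportions(text):
        """Share of tokens falling into each category, LIWC-style."""
        tokens = text.lower().split()
        counts = Counter()
        for tok in tokens:
            for cat, words in CATEGORIES.items():
                if tok in words:
                    counts[cat] += 1
        return {cat: counts[cat] / max(len(tokens), 1) for cat in CATEGORIES}

    print(category_proportions("I lie awake at night afraid and alone"))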
12. Developing a Multilingual Annotated Corpus of Misogyny and Aggression [PDF] Back to contents
Shiladitya Bhattacharya, Siddharth Singh, Ritesh Kumar, Akanksha Bansal, Akash Bhagat, Yogesh Dawer, Bornini Lahiri, Atul Kr. Ojha
Abstract: In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments. The comments are annotated at two levels - aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and issues and challenges faced during the process of annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.
13. Parallel sequence tagging for concept recognition [PDF] Back to contents
Lenz Furrer, Joseph Cornelius, Fabio Rinaldi
Abstract: Motivation: Named Entity Recognition (NER) and Normalisation (NEN) are core components of any text-mining system for biomedical texts. In a traditional concept-recognition pipeline, these tasks are combined in a serial way, which is inherently prone to error propagation from NER to NEN. We propose a parallel architecture, where both NER and NEN are modeled as a sequence-labeling task, operating directly on the source text. We examine different harmonisation strategies for merging the predictions of the two classifiers into a single output sequence. Results: We test our approach on the recent Version 4 of the CRAFT corpus. In all 20 annotation sets of the concept-annotation task, our system outperforms the pipeline system reported as a baseline in the CRAFT shared task 2019. Our analysis shows that the strengths of the two classifiers can be combined in a fruitful way. However, prediction harmonisation requires individual calibration on a development set for each annotation set. This allows achieving a good trade-off between established knowledge (training set) and novel information (unseen concepts). Availability and Implementation: Source code freely available for download at this https URL. Supplementary data are available at arXiv online.
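One simple harmonisation strategy of the kind examined in the paper could look like the following sketch (an illustrative assumption, not necessarily one of the strategies the authors test): a concept (NEN) label is kept only where the span (NER) classifier also predicts an entity, otherwise the token is left untagged.

    def harmonise(ner_tags, nen_tags, outside="O"):
        """Merge span predictions (NER) and concept predictions (NEN) per token."""
        merged = []
        for span_tag, concept in zip(ner_tags, nen_tags):
            if span_tag == outside or concept == outside:
                merged.append(outside)          # keep a concept only inside a predicted span
            else:
                merged.append(concept)
        return merged

    ner = ["O", "B-Chemical", "I-Chemical", "O"]
    nen = ["O", "CHEBI:15365", "CHEBI:15365", "CHEBI:16236"]
    print(harmonise(ner, nen))   # ['O', 'CHEBI:15365', 'CHEBI:15365', 'O']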
14. A Formal Analysis of Multimodal Referring Strategies Under Common Ground [PDF] Back to contents
Nikhil Krishnaswamy, James Pustejovsky
Abstract: In this paper, we present an analysis of computationally generated mixed-modality definite referring expressions using combinations of gesture and linguistic descriptions. In doing so, we expose some striking formal semantic properties of the interactions between gesture and language, conditioned on the introduction of content into the common ground between the (computational) speaker and (human) viewer, and demonstrate how these formal features can contribute to training better models to predict viewer judgment of referring expressions, and potentially to the generation of more natural and informative referring expressions.
15. Overview of the TREC 2019 deep learning track [PDF] Back to contents
Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Ellen M. Voorhees
Abstract: The Deep Learning Track is a new track for TREC 2019, with the goal of studying ad hoc ranking in a large data regime. It is the first track with large human-labeled training sets, introducing two sets corresponding to two tasks, each with rigorous TREC-style blind evaluation and reusable test sets. The document retrieval task has a corpus of 3.2 million documents with 367 thousand training queries, for which we generate a reusable test set of 43 queries. The passage retrieval task has a corpus of 8.8 million passages with 503 thousand training queries, for which we generate a reusable test set of 43 queries. This year 15 groups submitted a total of 75 runs, using various combinations of deep learning, transfer learning and traditional IR ranking methods. Deep learning runs significantly outperformed traditional IR runs. Possible explanations for this result are that we introduced large training data and we included deep models trained on such data in our judging pools, whereas some past studies did not have such training data or pooling.
16. Multi-modal Dense Video Captioning [PDF] Back to contents
Vladimir Iashin, Esa Rahtu
Abstract: Dense video captioning is a task of localizing interesting events from an untrimmed video and producing textual description (captions) for each localized event. Most of the previous works in dense video captioning are solely based on visual information and completely ignore the audio track. However, audio, and speech, in particular, are vital cues for a human observer in understanding an environment. In this paper, we present a new dense video captioning approach that is able to utilize any number of modalities for event description. Specifically, we show how audio and speech modalities may improve a dense video captioning model. We apply automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside video frames and the corresponding audio track. We formulate the captioning task as a machine translation problem and utilize recently proposed Transformer architecture to convert multi-modal input data into textual descriptions. We demonstrate the performance of our model on ActivityNet Captions dataset. The ablation studies indicate a considerable contribution from audio and speech components suggesting that these modalities contain substantial complementary information to video frames. Furthermore, we provide an in-depth analysis of the ActivityNet Caption results by leveraging the category tags obtained from original YouTube videos. The program code of our method and evaluations will be made publicly available.
17. Hybrid Autoregressive Transducer (hat) [PDF] Back to contents
Ehsan Variani, David Rybach, Cyril Allauzen, Michael Riley
Abstract: This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model, a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems. The HAT model provides a way to measure the quality of the internal language model that can be used to decide whether inference with an external language model is beneficial or not. This article also presents a finite context version of the HAT model that addresses the exposure bias problem and significantly simplifies the overall training and inference. We evaluate our proposed model on a large-scale voice search task. Our experiments show significant improvements in WER compared to the state-of-the-art approaches.
18. Who Wins the Game of Thrones? How Sentiments Improve the Prediction of Candidate Choice [PDF] Back to contents
Chaehan So
Abstract: This paper analyzes how candidate choice prediction improves by different psychological predictors. To investigate this question, it collected an original survey dataset featuring the popular TV series "Game of Thrones". The respondents answered which character they anticipated to win in the final episode of the series, and explained their choice of the final candidate in free text from which sentiments were extracted. These sentiments were compared to feature sets derived from candidate likeability and candidate personality ratings. In our benchmarking of 10-fold cross-validation in 100 repetitions, all feature sets except the likeability ratings yielded a 10-11% improvement in accuracy on the holdout set over the base model. Treating the class imbalance with synthetic minority oversampling (SMOTE) increased holdout set performance by 20-34% but surprisingly not testing set performance. Taken together, our study provides a quantified estimation of the additional predictive value of psychological predictors. Likeability ratings were clearly outperformed by the feature sets based on personality, emotional valence, and basic emotions.
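The evaluation protocol described above, repeated 10-fold cross-validation with SMOTE applied only to the training folds, can be reproduced in outline with scikit-learn and imbalanced-learn; the features and labels below are random placeholders, and the number of repetitions is reduced to keep the toy run fast.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 12))               # placeholder psychological features
    y = (rng.random(200) < 0.2).astype(int)      # imbalanced candidate-choice labels

    # SMOTE inside the pipeline so oversampling touches only the training folds
    model = Pipeline([("smote", SMOTE(random_state=0)),
                      ("clf", LogisticRegression(max_iter=1000))])

    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)  # 100 repeats in the paper
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(scores.mean(), scores.std())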
19. High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model [PDF] Back to contents
Jinyu Li, Rui Zhao, Eric Sun, Jeremy H. M. Wong, Amit Das, Zhong Meng, Yifan Gong
Abstract: While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved. In this paper, we detail our recent efforts to improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition. To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks, and incorporates future context frames to get more information for accurate acoustic modeling. We further improve the training strategy with sequence-level teacher-student learning. To obtain low latency, we design a two-head cltLSTM, in which one head has zero latency and the other head has a small latency, compared to an LSTM. When trained with Microsoft's 65 thousand hours of anonymized training data and evaluated with test sets with 1.8 million words, the proposed two-head cltLSTM model with the proposed training strategy yields a 28.2\% relative WER reduction over the conventional LSTM acoustic model, with a similar perceived latency.
20. Harnessing Explanations to Bridge AI and Humans [PDF] Back to contents
Vivian Lai, Samuel Carton, Chenhao Tan
Abstract: Machine learning models are increasingly integrated into societally critical applications such as recidivism prediction and medical diagnosis, thanks to their superior predictive power. In these applications, however, full automation is often not desired due to ethical and legal concerns. The research community has thus ventured into developing interpretable methods that explain machine predictions. While these explanations are meant to assist humans in understanding machine predictions and thereby allowing humans to make better decisions, this hypothesis is not supported in many recent studies. To improve human decision-making with AI assistance, we propose future directions for closing the gap between the efficacy of explanations and improvement in human performance.