
[arXiv Papers] Computation and Language 2020-07-30

Contents

1. Mirostat: A Perplexity-Controlled Neural Text Decoding Algorithm [PDF] Abstract
2. #Brexit: Leave or Remain? The Role of User's Community and Diachronic Evolution on Stance Detection [PDF] Abstract
3. Biomedical and Clinical English Model Packages in the Stanza Python NLP Library [PDF] Abstract
4. Object-and-Action Aware Model for Visual Language Navigation [PDF] Abstract
5. Development of POS tagger for English-Bengali Code-Mixed data [PDF] Abstract
6. GUIR at SemEval-2020 Task 12: Domain-Tuned Contextualized Models for Offensive Language Detection [PDF] Abstract
7. Measuring prominence of scientific work in online news as a proxy for impact [PDF] Abstract
8. Towards Ecologically Valid Research on Language User Interfaces [PDF] Abstract
9. Presentation and Analysis of a Multimodal Dataset for Grounded Language Learning [PDF] Abstract
10. Enriching Video Captions With Contextual Text [PDF] Abstract
11. Construction and Usage of a Human Body Common Coordinate Framework Comprising Clinical, Semantic, and Spatial Ontologies [PDF] Abstract

Abstracts

1. Mirostat: A Perplexity-Controlled Neural Text Decoding Algorithm [PDF] Back to Contents
  Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, Lav R. Varshney
Abstract: Neural text decoding is important for generating high-quality texts using language models. To generate high-quality text, popular decoding algorithms like top-k, top-p (nucleus), and temperature-based sampling truncate or distort the unreliable low probability tail of the language model. Though these methods generate high-quality text after parameter tuning, they are ad hoc. Not much is known about the control they provide over the statistics of the output, which is important since recent reports show text quality is highest for a specific range of likelihoods. Here, first we provide a theoretical analysis of perplexity in top-k, top-p, and temperature sampling, finding that cross-entropy behaves approximately linearly as a function of p in top-p sampling whereas it is a nonlinear function of k in top-k sampling, under Zipfian statistics. We use this analysis to design a feedback-based adaptive top-k text decoding algorithm called mirostat that generates text (of any length) with a predetermined value of perplexity, and thereby high-quality text without any tuning. Experiments show that for low values of k and p in top-k and top-p sampling, perplexity drops significantly with generated text length, which is also correlated with excessive repetitions in the text (the boredom trap). On the other hand, for large values of k and p, we find that perplexity increases with generated text length, which is correlated with incoherence in the text (confusion trap). Mirostat avoids both traps: experiments show that cross-entropy has a near-linear relation with repetition in generated text. This relation is almost independent of the sampling method but slightly dependent on the model used. Hence, for a given language model, control over perplexity also gives control over repetitions.
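
Since the abstract describes a feedback rule that steers each sampled token's surprise toward a target value, a minimal Python sketch of one decoding step may help. This follows the simple thresholding variant of the idea rather than the paper's Zipf-law-based estimate of k; the names tau (target surprise), mu (running threshold), and eta (learning rate) and all defaults are illustrative, not the authors' reference implementation.

```python
import torch

def mirostat_step(logits, mu, tau, eta=0.1):
    """One decoding step: truncate to low-surprise tokens, sample, update threshold."""
    probs = torch.softmax(logits, dim=-1)                  # logits: (vocab_size,)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    surprise = -torch.log2(sorted_probs)                   # per-token surprise in bits
    k = max(int((surprise < mu).sum().item()), 1)          # adaptive top-k cutoff
    top = sorted_probs[:k] / sorted_probs[:k].sum()        # renormalize the head
    choice = torch.multinomial(top, 1)
    observed = surprise[choice].item()                     # surprise of sampled token
    mu = mu - eta * (observed - tau)                       # feedback toward target tau
    return sorted_idx[choice].item(), mu
```

Initializing mu at 2 * tau and calling this once per generated token keeps the average surprise, and hence the perplexity, near the chosen target regardless of text length.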

2. #Brexit: Leave or Remain? The Role of User's Community and Diachronic Evolution on Stance Detection [PDF] Back to Contents
  Mirko Lai, Viviana Patti, Giancarlo Ruffo, Paolo Rosso
Abstract: Interest has grown in recent years around the classification of the stance that users assume within online debates. Stance has usually been addressed by considering users' posts in isolation, while social studies highlight that social communities may contribute to influencing users' opinions. Furthermore, stance should be studied from a diachronic perspective, since this could help shed light on the opinion-shift dynamics that can be recorded during a debate. We analyzed the political discussion on Twitter about the BREXIT referendum in the UK, proposing a novel approach and annotation schema for stance detection, with the main aim of investigating the role of features related to social network community and diachronic stance evolution. Classification experiments show that such features provide very useful clues for detecting stance.

3. Biomedical and Clinical English Model Packages in the Stanza Python NLP Library [PDF] Back to Contents
  Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D. Manning, Curtis P. Langlotz
Abstract: We introduce biomedical and clinical English model packages for the Stanza Python NLP library. These packages offer accurate syntactic analysis and named entity recognition capabilities for biomedical and clinical text, by combining Stanza's fully neural architecture with a wide variety of open datasets as well as large-scale unsupervised biomedical and clinical text data. We show via extensive experiments that our packages achieve syntactic analysis and named entity recognition performance that is on par with or surpasses state-of-the-art results. We further show that these models do not compromise speed compared to existing toolkits when GPU acceleration is available, and are made easy to download and use with Stanza's Python interface. A demonstration of our packages is available at: http://stanza.run/bio.
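
For readers who want to try the packages, here is a minimal usage sketch. The calls stanza.download and stanza.Pipeline are Stanza's documented API; the package name 'craft' (a biomedical treebank package) and the example sentence are assumptions on our part, so check the Stanza documentation for the current package list.

```python
import stanza

# Fetch and load the biomedical English package (name assumed; see the docs).
stanza.download('en', package='craft')
nlp = stanza.Pipeline('en', package='craft')

doc = nlp('Mutations in the BRCA1 gene increase breast cancer risk.')
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.upos, word.head)   # syntactic analysis
for ent in doc.ents:
    print(ent.text, ent.type)                    # biomedical named entities
```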

4. Object-and-Action Aware Model for Visual Language Navigation [PDF] Back to Contents
  Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, Qi Wu
Abstract: Vision-and-Language Navigation (VLN) is unique in that it requires turning relatively general natural-language instructions into robot agent actions, on the basis of the visible environment. This requires extracting value from two very different types of natural-language information. The first is object description (e.g., 'table', 'door'), each presented as a tip for the agent to determine the next action by finding the item visible in the environment; the second is action specification (e.g., 'go straight', 'turn left'), which allows the robot to directly predict the next movements without relying on visual perception. However, most existing methods pay little attention to distinguishing these two kinds of information from each other during instruction encoding, and mix together the matching between textual object/action encoding and the visual perception/orientation features of candidate viewpoints. In this paper, we propose an Object-and-Action Aware Model (OAAM) that processes these two different forms of natural-language instruction separately. This enables each process to flexibly match object-centered/action-centered instruction to its own counterpart visual perception/action orientation. However, one side issue caused by the above solution is that an object mentioned in instructions may be observed in the direction of two or more candidate viewpoints, so the OAAM may not predict the viewpoint on the shortest path as the next action. To handle this problem, we design a simple but effective path loss to penalize trajectories deviating from the ground-truth path. Experimental results demonstrate the effectiveness of the proposed model and path loss, and the superiority of their combination, with a 50% SPL score on the R2R dataset and a 40% CLS score on the R4R dataset in unseen environments, outperforming the previous state-of-the-art.
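
To make the two-branch idea concrete, below is a toy sketch of separate attention for object-related and action-related instruction words, each matched against its counterpart viewpoint feature (visual appearance vs. orientation). The module layout, names, and shapes are invented for illustration and are not the authors' code.

```python
import torch
import torch.nn as nn

class ObjectActionScorer(nn.Module):
    """Toy two-branch scorer: object words vs. visual features,
    action words vs. orientation features, fused into viewpoint scores."""
    def __init__(self, d):
        super().__init__()
        self.obj_attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.act_attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.score = nn.Linear(2 * d, 1)

    def forward(self, instr, vis, ori):
        # instr: (B, L, d) instruction tokens;
        # vis/ori: (B, V, d) visual / orientation features per candidate viewpoint
        obj_ctx, _ = self.obj_attn(vis, instr, instr)   # object-centered matching
        act_ctx, _ = self.act_attn(ori, instr, instr)   # action-centered matching
        return self.score(torch.cat([obj_ctx, act_ctx], dim=-1)).squeeze(-1)  # (B, V)
```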

5. Development of POS tagger for English-Bengali Code-Mixed data [PDF] Back to Contents
  Tathagata Raha, Sainik Kumar Mahata, Dipankar Das, Sivaji Bandyopadhyay
Abstract: Code-mixed texts are widespread nowadays due to the advent of social media. Since these texts combine two languages to formulate a sentence, they give rise to various research problems related to Natural Language Processing. In this paper, we try to excavate one such problem, namely, Parts-of-Speech tagging of code-mixed texts. We have built a system that can POS tag English-Bengali code-mixed data where the Bengali words were written in Roman script. Our approach initially involves the collection and cleaning of English-Bengali code-mixed tweets. These tweets were used as a development dataset for building our system. The proposed system is a modular approach that starts by tagging individual tokens with their respective languages and then passes them to different POS taggers, designed for different languages (English and Bengali, in our case). Tags given by the two systems are later joined together and the final result is then mapped to a universal POS tag set. Our system was checked using 100 manually POS-tagged code-mixed sentences and it returned an accuracy of 75.29%.
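
The modular pipeline is simple enough to sketch end to end. In the sketch below, lang_id, tag_en, tag_bn, and to_universal are hypothetical stand-ins for the token-level language identifier, the two monolingual POS taggers, and the mapping to the universal tag set; none of them are named in the paper.

```python
def pos_tag_code_mixed(tokens, lang_id, tag_en, tag_bn, to_universal):
    """Tag each token's language, route tokens to per-language POS taggers,
    then map everything onto one universal tag set (all helpers hypothetical)."""
    langs = [lang_id(tok) for tok in tokens]       # assumed to return 'en' or 'bn'
    tags = [None] * len(tokens)
    for lang, tagger in (('en', tag_en), ('bn', tag_bn)):
        idx = [i for i, l in enumerate(langs) if l == lang]
        if idx:
            # Tag each language's tokens as one batch, then scatter tags back.
            for i, tag in zip(idx, tagger([tokens[i] for i in idx])):
                tags[i] = tag
    return [(tok, to_universal(tag)) for tok, tag in zip(tokens, tags)]
```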

6. GUIR at SemEval-2020 Task 12: Domain-Tuned Contextualized Models for Offensive Language Detection [PDF] Back to Contents
  Sajad Sotudeh, Tong Xiang, Hao-Ren Yao, Sean MacAvaney, Eugene Yang, Nazli Goharian, Ophir Frieder
Abstract: Offensive language detection is an important and challenging task in natural language processing. We present our submissions to the OffensEval 2020 shared task, which includes three English sub-tasks: identifying the presence of offensive language (Sub-task A), identifying the presence of target in offensive language (Sub-task B), and identifying the categories of the target (Sub-task C). Our experiments explore using a domain-tuned contextualized language model (namely, BERT) for this task. We also experiment with different components and configurations (e.g., a multi-view SVM) stacked upon BERT models for specific sub-tasks. Our submissions achieve F1 scores of 91.7% in Sub-task A, 66.5% in Sub-task B, and 63.2% in Sub-task C. We perform an ablation study which reveals that domain tuning considerably improves the classification performance. Furthermore, error analysis shows common misclassification errors made by our model and outlines research directions for future.
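
As a rough illustration of the contextualized-model part (not the team's actual configuration, hyperparameters, or their multi-view SVM stacking), a BERT classifier for Sub-task A can be fine-tuned with the Hugging Face transformers API along these lines:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2)        # Sub-task A: offensive vs. not

batch = tok(['have a great day', '@user you are pathetic'],
            padding=True, truncation=True, return_tensors='pt')
labels = torch.tensor([0, 1])                 # toy labels for the two examples
loss = model(**batch, labels=labels).loss     # cross-entropy loss
loss.backward()                               # one step of a normal training loop
```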

7. Measuring prominence of scientific work in online news as a proxy for impact [PDF] Back to Contents
  James Ravenscroft, Amanda Clare, Maria Liakata
Abstract: The impact made by a scientific paper on the work of other academics has many established metrics, including metrics based on citation counts and social media commenting. However, determination of the impact of a scientific paper on the wider society is less well established. For example, is it important for scientific work to be newsworthy? Here we present a new corpus of newspaper articles linked to the scientific papers that they describe. We find that Impact Case studies submitted to the UK Research Excellence Framework (REF) 2014 that refer to scientific papers mentioned in newspaper articles were awarded a higher score in the REF assessment. The papers associated with these case studies also feature prominently in the newspaper articles. We hypothesise that such prominence can be a useful proxy for societal impact. We therefore provide a novel baseline approach for measuring the prominence of scientific papers mentioned within news articles. Our measurement of prominence is based on semantic similarity through a graph-based ranking algorithm. We find that scientific papers with an associated REF case study are more likely to have a stronger prominence score. This supports our hypothesis that linguistic prominence in news can be used to suggest the wider non-academic impact of scientific work.
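
The abstract's baseline combines semantic similarity with a graph-based ranking algorithm; one standard way to realize that combination is PageRank over a similarity graph, sketched below. The similarity callable is a placeholder for any sentence-level semantic similarity, and the construction is our illustration, not necessarily the paper's exact formulation.

```python
import itertools
import networkx as nx

def prominence_scores(sentences, similarity):
    """Rank sentences by PageRank over a weighted semantic-similarity graph."""
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        w = similarity(sentences[i], sentences[j])   # assumed to return [0, 1]
        if w > 0:
            g.add_edge(i, j, weight=w)
    return nx.pagerank(g, weight='weight')           # node -> prominence score
```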

8. Towards Ecologically Valid Research on Language User Interfaces [PDF] Back to Contents
  Harm de Vries, Dzmitry Bahdanau, Christopher Manning
Abstract: Language User Interfaces (LUIs) could improve human-machine interaction for a wide variety of tasks, such as playing music, getting insights from databases, or instructing domestic robots. In contrast to traditional hand-crafted approaches, recent work attempts to build LUIs in a data-driven way using modern deep learning methods. To satisfy the data needs of such learning algorithms, researchers have constructed benchmarks that emphasize the quantity of collected data at the cost of its naturalness and relevance to real-world LUI use cases. As a consequence, research findings on such benchmarks might not be relevant for developing practical LUIs. The goal of this paper is to bootstrap the discussion around this issue, which we refer to as the benchmarks' low ecological validity. To this end, we describe what we deem an ideal methodology for machine learning research on LUIs and categorize five common ways in which recent benchmarks deviate from it. We give concrete examples of the five kinds of deviations and their consequences. Lastly, we offer a number of recommendations as to how to increase the ecological validity of machine learning research on LUIs.

9. Presentation and Analysis of a Multimodal Dataset for Grounded Language Learning [PDF] Back to Contents
  Patrick Jenkins, Rishabh Sachdeva, Gaoussou Youssouf Kebe, Padraig Higgins, Kasra Darvish, Edward Raff, Don Engel, John Winder, Francisco Ferraro, Cynthia Matuszek
Abstract: Grounded language acquisition -- learning how language-based interactions refer to the world around them -- is a major area of research in robotics, NLP, and HCI. In practice, the data used for learning consists almost entirely of textual descriptions, which tend to be cleaner, clearer, and more grammatical than actual human interactions. In this work, we present the Grounded Language Dataset (GoLD), a multimodal dataset of common household objects described by people using either spoken or written language. We analyze the differences and present an experiment showing how the different modalities affect language learning from human input. This will enable researchers studying the intersection of robotics, NLP, and HCI to better investigate how the multiple modalities of image, text, and speech interact, as well as show how differences in the vernacular of these modalities impact results.

10. Enriching Video Captions With Contextual Text [PDF] Back to Contents
  Philipp Rimle, Pelin Dogan, Markus Gross
Abstract: Understanding video content and generating caption with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning by infusing extracted information from relevant text data. We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input, and mines relevant knowledge such as names and locations from contextual text. In contrast to previous approaches, we do not preprocess the text further, and let the model directly learn to attend over it. Guided by the visual input, the model is able to copy words from the contextual text via a pointer-generator network, allowing to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.
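
The copy mechanism referenced here is the standard pointer-generator mixture, P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of the attention weights a_i over source positions i where w_i = w. A short sketch under assumed tensor shapes (this is the generic mechanism, not the authors' released code):

```python
import torch

def copy_distribution(p_gen, vocab_dist, attn, src_ids):
    """Mix generation and copying: p_gen (B, 1), vocab_dist (B, V),
    attn (B, L) over source tokens, src_ids (B, L) vocabulary ids."""
    generated = p_gen * vocab_dist                # probability mass from the decoder
    copied = (1.0 - p_gen) * attn                 # mass copied from contextual text
    # Add the copy mass onto the vocabulary ids of the source tokens.
    return generated.scatter_add(1, src_ids, copied)
```

Words that appear only in the contextual text (e.g., names and locations) receive probability through the copy term, which is what lets the generated captions mention them.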

11. Construction and Usage of a Human Body Common Coordinate Framework Comprising Clinical, Semantic, and Spatial Ontologies [PDF] Back to Contents
  Katy Börner, Ellen M. Quardokus, Bruce W. Herr II, Leonard E. Cross, Elizabeth G. Record, Yingnan Ju, Andreas D. Bueckle, James P. Sluka, Jonathan C. Silverstein, Kristen M. Browne, Sanjay Jain, Clive H. Wasserfall, Marda L. Jorgensen, Jeffrey M. Spraggins, Nathan H. Patterson, Mark A. Musen, Griffin M. Weber
Abstract: The National Institutes of Health's (NIH) Human Biomolecular Atlas Program (HuBMAP) aims to create a comprehensive high-resolution atlas of all the cells in the healthy human body. Multiple laboratories across the United States are collecting tissue specimens from different organs of donors who vary in sex, age, and body size. Integrating and harmonizing the data derived from these samples and 'mapping' them into a common three-dimensional (3D) space is a major challenge. The key to making this possible is a 'Common Coordinate Framework' (CCF), which provides a semantically annotated, 3D reference system for the entire body. The CCF enables contributors to HuBMAP to 'register' specimens and datasets within a common spatial reference system, and it supports a standardized way to query and 'explore' data in a spatially and semantically explicit manner. [...] This paper describes the construction and usage of a CCF for the human body and its reference implementation in HuBMAP. The CCF consists of (1) a CCF Clinical Ontology, which provides metadata about the specimen and donor (the 'who'); (2) a CCF Semantic Ontology, which describes 'what' part of the body a sample came from and details anatomical structures, cell types, and biomarkers (ASCT+B); and (3) a CCF Spatial Ontology, which indicates 'where' a tissue sample is located in a 3D coordinate system. An initial version of all three CCF ontologies has been implemented for the first HuBMAP Portal release. It was successfully used by Tissue Mapping Centers to semantically annotate and spatially register 48 kidney and spleen tissue blocks. The blocks can be queried and explored in their clinical, semantic, and spatial context via the CCF user interface in the HuBMAP Portal.
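
To make the who/what/where split concrete, here is an invented example of what one registered tissue block's metadata could look like across the three ontologies; every field name and value below is illustrative and does not reflect actual HuBMAP identifiers.

```python
# Invented example of a registered tissue block (all names illustrative).
tissue_block = {
    'clinical': {   # 'who': donor and specimen metadata
        'donor_sex': 'female',
        'donor_age': 57,
    },
    'semantic': {   # 'what': anatomical structures, cell types, biomarkers (ASCT+B)
        'organ': 'kidney',
        'anatomical_structure': 'renal cortex',
        'cell_types': ['podocyte'],
        'biomarkers': ['podocin'],
    },
    'spatial': {    # 'where': placement in the 3D reference coordinate system
        'translation_mm': [10.2, 44.1, 7.9],
        'rotation_deg': [0, 90, 0],
        'size_mm': [10, 10, 3],
    },
}
```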
