
[arXiv Papers] Computation and Language 2020-03-30

Contents

1. Semantic Enrichment of Nigerian Pidgin English for Contextual Sentiment Classification [PDF] Abstract
2. Information-Theoretic Probing with Minimum Description Length [PDF] Abstract
3. Comprehensive Named Entity Recognition on CORD-19 with Distant or Weak Supervision [PDF] Abstract
4. Integrating Crowdsourcing and Active Learning for Classification of Work-Life Events from Tweets [PDF] Abstract
5. FFR V1.0: Fon-French Neural Machine Translation [PDF] Abstract
6. TextCaps: a Dataset for Image Captioning with Reading Comprehension [PDF] Abstract
7. End-to-End Entity Classification on Multimodal Knowledge Graphs [PDF] Abstract
8. Can you hear me $\textit{now}$? Sensitive comparisons of human and machine perception [PDF] Abstract
9. Knowledge Graph Alignment using String Edit Distance [PDF] Abstract
10. Machine learning as a model for cultural learning: Teaching an algorithm what it means to be fat [PDF] Abstract

Abstracts

1. Semantic Enrichment of Nigerian Pidgin English for Contextual Sentiment Classification [PDF] Back to contents
  Wuraola Fisayo Oyewusi, Olubayo Adekanmbi, Olalekan Akinsande
Abstract: Nigerian English adaptation, Pidgin, has evolved over the years through multi-language code switching, code mixing and linguistic adaptation. While Pidgin preserves many of the words in the normal English language corpus, both in spelling and pronunciation, the fundamental meanings of these words have changed significantly. For example, 'ginger' is not a plant but an expression of motivation, and 'tank' is not a container but an expression of gratitude. The implication is that the current approach of applying standard English sentiment analysis directly to social media text from Nigeria is sub-optimal, as it cannot capture the semantic variation and contextual evolution in the contemporary meaning of these words. In practice, while many words in the Nigerian Pidgin adaptation are the same as in standard English, sentiment analysis models built for full standard English are not designed to capture the full intent of Nigerian Pidgin when used alone or code-mixed. By augmenting scarce human-labelled code-changed text with ample synthetic code-reformatted text and meaning, we achieve significant improvements in sentiment scoring. Our research explores how to understand sentiment in an intrasentential code-mixing and code-switching context where there has been significant word localization. This work presents 300 VADER-lexicon-compatible Nigerian Pidgin sentiment tokens with their scores, and 14,000 gold-standard Nigerian Pidgin tweets with their sentiment labels.
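
The released tokens are VADER-compatible, so they can extend an off-the-shelf VADER analyzer. Below is a minimal sketch of that idea, assuming the vaderSentiment Python package; the Pidgin entries and their valence scores are hypothetical illustrations, not the lexicon released with the paper.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Hypothetical Pidgin entries; VADER valence scores range from -4 to +4.
pidgin_lexicon = {
    "ginger": 2.1,  # expression of motivation, not the plant
    "tank": 1.8,    # expression of gratitude, not a container
}
analyzer.lexicon.update(pidgin_lexicon)

# The augmented analyzer now scores code-mixed text using the new tokens.
print(analyzer.polarity_scores("this song dey ginger me well well"))
```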

2. Information-Theoretic Probing with Minimum Description Length [PDF] Back to contents
  Elena Voita, Ivan Titov
Abstract: To measure how well pretrained representations encode some linguistic property, it is common to use accuracy of a probe, i.e. a classifier trained to predict the property from the representations. Despite widespread adoption of probes, differences in their accuracy fail to adequately reflect differences in representations. For example, they do not substantially favour pretrained representations over randomly initialized ones. Analogously, their accuracy can be similar when probing for genuine linguistic labels and probing for random synthetic tasks. To see reasonable differences in accuracy with respect to these random baselines, previous work had to constrain either the amount of probe training data or its model size. Instead, we propose an alternative to the standard probes, information-theoretic probing with minimum description length (MDL). With MDL probing, training a probe to predict labels is recast as teaching it to effectively transmit the data. Therefore, the measure of interest changes from probe accuracy to the description length of labels given representations. In addition to probe quality, the description length evaluates "the amount of effort" needed to achieve the quality. This amount of effort characterizes either (i) size of a probing model, or (ii) the amount of data needed to achieve the high quality. We consider two methods for estimating MDL which can be easily implemented on top of the standard probing pipelines: variational coding and online coding. We show that these methods agree in results and are more informative and stable than the standard probes.
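
For intuition, here is a minimal sketch of the online (prequential) coding estimate described above, assuming a scikit-learn logistic-regression probe; the block fractions and probe choice are illustrative, and labels are assumed to be integers 0..K-1 that all appear in the first block.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def online_codelength(X, y, n_classes, fractions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """Bits needed to transmit labels y given representations X."""
    cuts = [int(f * len(y)) for f in fractions]
    # The first block is sent with a uniform code: log2(K) bits per label.
    total_bits = cuts[0] * np.log2(n_classes)
    for start, end in zip(cuts[:-1], cuts[1:]):
        # Train the probe on the prefix already transmitted...
        probe = LogisticRegression(max_iter=1000).fit(X[:start], y[:start])
        proba = probe.predict_proba(X[start:end])
        # ...and pay -log2 p(label) bits for each label in the next block.
        p = np.maximum(proba[np.arange(end - start), y[start:end]], 1e-12)
        total_bits += -np.log2(p).sum()
    return total_bits
```

A shorter codelength means the representations make the labels easier to predict with less "effort", which is exactly what probe accuracy alone fails to capture.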

3. Comprehensive Named Entity Recognition on CORD-19 with Distant or Weak Supervision [PDF] Back to contents
  Xuan Wang, Xiangchen Song, Yingjun Guan, Bangzheng Li, Jiawei Han
Abstract: We created this CORD-19-NER dataset with comprehensive named entity recognition (NER) on the COVID-19 Open Research Dataset Challenge (CORD-19) corpus (2020-03-13). This CORD-19-NER dataset covers 74 fine-grained named entity types. It is automatically generated by combining the annotation results from four sources: (1) pre-trained NER model on 18 general entity types from Spacy, (2) pre-trained NER model on 18 biomedical entity types from SciSpacy, (3) knowledge base (KB)-guided NER model on 127 biomedical entity types with our distantly-supervised NER method, and (4) seed-guided NER model on 8 new entity types (specifically related to the COVID-19 studies) with our weakly-supervised NER method. We hope this dataset can help the text mining community build downstream applications. We also hope this dataset can bring insights for the COVID-19 studies, both on the biomedical side and on the social side.
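
As an illustration of combining annotations from several NER sources, here is a minimal sketch using spaCy and scispacy, two of the four sources named above; the model names are standard spaCy/scispacy identifiers, and the merging rule (biomedical entities take precedence on overlap) is an assumption for the sketch, not the paper's exact procedure.

```python
import spacy

general_nlp = spacy.load("en_core_web_sm")  # general entity types (spaCy)
bio_nlp = spacy.load("en_ner_bc5cdr_md")    # biomedical entity types (scispacy)

def merged_entities(text):
    bio_ents = [(ent.start_char, ent.end_char, ent.label_, "bio")
                for ent in bio_nlp(text).ents]
    merged = list(bio_ents)
    for ent in general_nlp(text).ents:
        overlaps = any(s < ent.end_char and ent.start_char < e
                       for s, e, _, _ in bio_ents)
        if not overlaps:  # biomedical annotations win on overlap
            merged.append((ent.start_char, ent.end_char, ent.label_, "general"))
    return sorted(merged)

print(merged_entities("Remdesivir was tested in Seattle in March 2020."))
```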

4. Integrating Crowdsourcing and Active Learning for Classification of Work-Life Events from Tweets [PDF] Back to contents
  Yunpeng Zhao, Mattia Prosperi, Tianchen Lyu, Yi Guo, Jing Bian
Abstract: Social media, especially Twitter, is increasingly used for research with predictive analytics. In social media studies, natural language processing (NLP) techniques are used in conjunction with expert-based, manual and qualitative analyses. However, social media data are unstructured and must undergo complex manipulation for research use. Manual annotation is the most resource-intensive and time-consuming process, as multiple expert raters have to reach consensus on every item, but it is essential for creating gold-standard datasets to train NLP-based machine learning classifiers. To reduce the burden of manual annotation while maintaining its reliability, we devised a crowdsourcing pipeline combined with active learning strategies. We demonstrated its effectiveness through a case study that identifies job loss events from individual tweets. We used the Amazon Mechanical Turk platform to recruit annotators from the Internet and designed a number of quality control measures to assure annotation accuracy. We evaluated 4 different active learning strategies (i.e., least confident, entropy, vote entropy, and Kullback-Leibler divergence). The active learning strategies aim at reducing the number of tweets needed to reach a desired performance of automated classification. Results show that crowdsourcing is useful for creating high-quality annotations and that active learning helps in reducing the number of required tweets, although there was no substantial difference among the strategies tested.
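
For reference, the four uncertainty measures named above can be written compactly over predicted class probabilities. The sketch below is a generic implementation, not the authors' code; `probs` is assumed to be an (n_samples, n_classes) array for the single-model strategies and `committee_probs` an (n_members, n_samples, n_classes) array for the committee-based ones.

```python
import numpy as np

def least_confident(probs):
    # 1 minus the probability of the top class; higher = more uncertain.
    return 1.0 - probs.max(axis=1)

def entropy(probs, eps=1e-12):
    return -(probs * np.log(probs + eps)).sum(axis=1)

def vote_entropy(committee_probs, eps=1e-12):
    votes = committee_probs.argmax(axis=2)        # (n_members, n_samples)
    n_members, n_samples = votes.shape
    counts = np.zeros((n_samples, committee_probs.shape[2]))
    for member_votes in votes:
        counts[np.arange(n_samples), member_votes] += 1
    freqs = counts / n_members                    # vote distribution per sample
    return -(freqs * np.log(freqs + eps)).sum(axis=1)

def kl_divergence(committee_probs, eps=1e-12):
    consensus = committee_probs.mean(axis=0)      # (n_samples, n_classes)
    kl = (committee_probs *
          np.log((committee_probs + eps) / (consensus + eps))).sum(axis=2)
    return kl.mean(axis=0)                        # mean divergence from consensus
```

In pool-based selection, the unlabeled tweets with the top-k scores under any of these functions would be sent to the crowd for annotation next.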

5. FFR V1.0: Fon-French Neural Machine Translation [PDF] Back to contents
  Bonaventure F. P. Dossou, Chris C. Emezue
Abstract: Africa has the highest linguistic diversity in the world. Given the importance of language to communication, and the importance of reliable, powerful and accurate machine translation models in modern inter-cultural communication, there have been (and still are) efforts to create state-of-the-art translation models for the many African languages. However, the low-resource status and the diacritical and tonal complexities of African languages are major issues facing African NLP today. FFR is a major step towards creating a robust translation model from Fon, a very low-resource and tonal language, to French, for research and public use. In this paper, we describe our pilot project: the creation of a large, growing corpus for Fon-to-French translation and our FFR v1.0 model, trained on this dataset. The dataset and model are made publicly available.

6. TextCaps: a Dataset for Image Captioning with Reading Comprehension [PDF] Back to contents
  Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh
Abstract: Image descriptions can help visually impaired people to quickly understand the image content. While we have made significant progress in automatically describing images and in optical character recognition, current approaches are unable to include written text in their descriptions, although text is omnipresent in human environments and frequently critical to understanding our surroundings. To study how to comprehend text in the context of an image, we collect a novel dataset, TextCaps, with 145k captions for 28k images. Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects. We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension. Our analysis with automatic and human studies shows that our new TextCaps dataset provides many new technical challenges over previous datasets.

7. End-to-End Entity Classification on Multimodal Knowledge Graphs [PDF] Back to contents
  W.X. Wilcke, P. Bloem, V. de Boer, R.H. van t Veer, F.A.H. van Harmelen
Abstract: End-to-end multimodal learning on knowledge graphs has been left largely unaddressed. Instead, most end-to-end models such as message passing networks learn solely from the relational information encoded in graphs' structure: raw values, or literals, are either omitted completely or are stripped from their values and treated as regular nodes. In either case we lose potentially relevant information which could have otherwise been exploited by our learning methods. To avoid this, we must treat literals and non-literals as separate cases. We must also address each modality separately and accordingly: numbers, texts, images, geometries, et cetera. We propose a multimodal message passing network which not only learns end-to-end from the structure of graphs, but also from their possibly diverse set of multimodal node features. Our model uses dedicated (neural) encoders to naturally learn embeddings for node features belonging to five different types of modalities, including images and geometries, which are projected into a joint representation space together with their relational information. We demonstrate our model on a node classification task, and evaluate the effect that each modality has on the overall performance. Our result supports our hypothesis that including information from multiple modalities can help our models obtain a better overall performance.
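
The core idea of dedicated per-modality encoders feeding a joint space can be sketched in a few lines of PyTorch; the encoder architectures, dimensions, and sum-based fusion below are illustrative assumptions, and the message-passing layers that would consume these embeddings (e.g., an R-GCN) are omitted.

```python
import torch
import torch.nn as nn

class MultimodalNodeEncoder(nn.Module):
    """Embeds numeric, textual and image node features into one joint space."""
    def __init__(self, num_dim=4, vocab_size=5000, emb_dim=64):
        super().__init__()
        self.numeric = nn.Linear(num_dim, emb_dim)        # numeric literals
        self.text = nn.EmbeddingBag(vocab_size, emb_dim)  # textual literals
        self.image = nn.Sequential(                       # image literals
            nn.Conv2d(3, 8, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, emb_dim))

    def forward(self, numbers, token_ids, images):
        # Each modality is encoded separately, then fused (here by summation)
        # into a joint representation that message passing can operate on.
        return self.numeric(numbers) + self.text(token_ids) + self.image(images)

enc = MultimodalNodeEncoder()
z = enc(torch.randn(2, 4), torch.randint(0, 5000, (2, 10)), torch.randn(2, 3, 32, 32))
print(z.shape)  # torch.Size([2, 64])
```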

8. Can you hear me $\textit{now}$? Sensitive comparisons of human and machine perception [PDF] Back to contents
  Michael A Lepori, Chaz Firestone
Abstract: The rise of sophisticated machine-recognition systems has brought with it a rise in comparisons between human and machine perception. But such comparisons face an asymmetry: Whereas machine perception of some stimulus can often be probed through direct and explicit measures, much of human perceptual knowledge is latent, incomplete, or embedded in unconscious mental processes that may not be available for explicit report. Here, we show how this asymmetry can cause such comparisons to underestimate the overlap in human and machine perception. As a case study, we consider human perception of $\textit{adversarial speech}$ -- synthetic audio commands that are recognized as valid messages by automated speech-recognition systems but that human listeners reportedly hear as meaningless noise. In five experiments, we adapt task designs from the human psychophysics literature to show that even when subjects cannot freely transcribe adversarial speech (the previous benchmark for human understanding), they nevertheless $\textit{can}$ discriminate adversarial speech from closely matched non-speech (Experiments 1-2), finish common phrases begun in adversarial speech (Experiments 3-4), and solve simple math problems posed in adversarial speech (Experiment 5) -- even for stimuli previously described as "unintelligible to human listeners". We recommend the adoption of $\textit{sensitive tests}$ of human and machine perception, and discuss the broader consequences of this approach for comparing natural and artificial intelligence.

9. Knowledge Graph Alignment using String Edit Distance [PDF] Back to contents
  Navdeep Kaur, Gautam Kunapuli, Sriraam Natarajan
Abstract: In this work, we propose a novel knowledge base alignment technique based upon string edit distance that exploits the type information about entities and can find similarity between relations of any arity.
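
For reference, here is a minimal sketch of the Levenshtein edit distance the technique builds on, applied to relation names; the type-aware entity matching from the abstract is not reproduced in this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

print(edit_distance("hasCapital", "capitalOf"))  # compare two relation names
```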

10. Machine learning as a model for cultural learning: Teaching an algorithm what it means to be fat [PDF] Back to contents
  Alina Arseniev-Koehler, Jacob G. Foster
Abstract: Overweight individuals, and especially women, are disparaged as immoral, unhealthy, and low class. These negative conceptions are not intrinsic to obesity; they are the tainted fruit of cultural learning. Scholars often cite media consumption as a key mechanism for learning cultural biases, but it remains unclear how this public culture becomes private culture. Here we provide a computational account of this learning mechanism, showing that cultural schemata can be learned from news reporting. We extract schemata about obesity from New York Times articles with word2vec, a neural language model inspired by human cognition. We identify several cultural schemata that link obesity to gender, immorality, poor health, and low socioeconomic class. Such schemata may be subtly but pervasively activated by our language; thus, language can chronically reproduce biases (e.g., about weight and health). Our findings also reinforce ongoing concerns that machine learning can encode, and reproduce, harmful human biases.
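
One common way to operationalize such schemata with word embeddings is to project words onto a semantic dimension built from anchor word pairs. The sketch below uses gensim's downloadable Google News word2vec vectors as a stand-in for the authors' NYT-trained model; the anchor pairs and probe words are illustrative choices, not the paper's exact setup.

```python
import numpy as np
import gensim.downloader

wv = gensim.downloader.load("word2vec-google-news-300")  # pretrained word2vec

def dimension(pairs):
    # A cultural dimension as the normalized mean difference of anchor pairs.
    d = np.mean([wv[a] - wv[b] for a, b in pairs], axis=0)
    return d / np.linalg.norm(d)

gender = dimension([("woman", "man"), ("she", "he"), ("female", "male")])

for word in ("overweight", "obese", "thin", "slim"):
    v = wv[word] / np.linalg.norm(wv[word])
    print(word, float(v @ gender))  # > 0 leans toward the first-listed pole
```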
