
[arXiv papers] Computation and Language 2020-03-25

Contents

1. Cross-Lingual Adaptation Using Universal Dependencies [PDF] Abstract
2. Generating Chinese Poetry from Images via Concrete and Abstract Information [PDF] Abstract
3. Towards Neural Machine Translation for Edoid Languages [PDF] Abstract
4. Felix: Flexible Text Editing Through Tagging and Insertion [PDF] Abstract
5. Improving Yorùbá Diacritic Restoration [PDF] Abstract
6. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [PDF] Abstract
7. Learning Compact Reward for Image Captioning [PDF] Abstract
8. Investigating Software Usage in the Social Sciences: A Knowledge Graph Approach [PDF] Abstract
9. Video Object Grounding using Semantic Roles in Language Description [PDF] Abstract
10. ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation [PDF] Abstract
11. Data-driven models and computational tools for neurolinguistics: a language technology perspective [PDF] Abstract

Abstracts

1. Cross-Lingual Adaptation Using Universal Dependencies [PDF] Back to contents
  Nasrin Taghizadeh, Heshaam Faili
Abstract: We describe a cross-lingual adaptation method based on syntactic parse trees obtained from the Universal Dependencies (UD), which are consistent across languages, to develop classifiers in low-resource languages. The idea of UD parsing is to capture similarities as well as idiosyncrasies among typologically different languages. In this paper, we show that models trained using UD parse trees for complex NLP tasks can characterize very different languages. We study two tasks, paraphrase identification and semantic relation extraction, as case studies. Based on UD parse trees, we develop several models using tree kernels and show that these models trained on the English dataset can correctly classify data of other languages, e.g., French, Farsi, and Arabic. The proposed approach opens up avenues for exploiting UD parsing in solving similar cross-lingual tasks, which is very useful for languages for which no labeled data is available.

2. Generating Chinese Poetry from Images via Concrete and Abstract Information [PDF] Back to contents
  Yusen Liu, Dayiheng Liu, Jiancheng Lv, Yongsheng Sang
Abstract: In recent years, the automatic generation of classical Chinese poetry has made great progress. Besides focusing on improving the quality of the generated poetry, there is a new topic about generating poetry from an image. However, the existing methods for this topic still have the problem of topic drift and semantic inconsistency, and the image-poem pairs dataset is hard to build when training these models. In this paper, we extract and integrate the Concrete and Abstract information from images to address those issues. We propose an infilling-based Chinese poetry generation model which can infill the Concrete keywords into each line of poems in an explicit way, and an abstract information embedding to integrate the Abstract information into generated poems. In addition, we use non-parallel data during training and construct separate image datasets and poem datasets to train the different components in our framework. Both automatic and human evaluation results show that our approach can generate poems which have better consistency with images without losing the quality.

3. Towards Neural Machine Translation for Edoid Languages [PDF] Back to contents
  Iroro Orife
Abstract: Many Nigerian languages have relinquished their previous prestige and purpose in modern society to English and Nigerian Pidgin. For the millions of L1 speakers of indigenous languages, there are inequalities that manifest themselves as unequal access to information, communications, health care, security as well as attenuated participation in political and civic life. To minimize exclusion and promote socio-linguistic and economic empowerment, this work explores the feasibility of Neural Machine Translation (NMT) for the Edoid language family of Southern Nigeria. Using the new JW300 public dataset, we trained and evaluated baseline translation models for four widely spoken languages in this group: Èdó, Ésán, Urhobo and Isoko. Trained models, code and datasets have been open-sourced to advance future research efforts on Edoid language technology.

4. Felix: Flexible Text Editing Through Tagging and Insertion [PDF] Back to contents
  Jonathan Mallinson, Aliaksei Severyn, Eric Malmi, Guillermo Garrido
Abstract: We present Felix, a flexible text-editing approach for generation, designed to derive the maximum benefit from the ideas of decoding with bi-directional contexts and self-supervised pre-training. In contrast to conventional sequence-to-sequence (seq2seq) models, Felix is efficient in low-resource settings and fast at inference time, while being capable of modeling flexible input-output transformations. We achieve this by decomposing the text-editing task into two sub-tasks: tagging to decide on the subset of input tokens and their order in the output text, and insertion to in-fill the missing tokens in the output not present in the input. The tagging model employs a novel Pointer mechanism, while the insertion model is based on a Masked Language Model. Both of these models are chosen to be non-autoregressive to guarantee faster inference. Felix performs favourably when compared to recent text-editing methods and strong seq2seq baselines when evaluated on four NLG tasks: Sentence Fusion, Machine Translation Automatic Post-Editing, Summarization, and Text Simplification.
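The two sub-tasks described in the Felix abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `order` list (standing in for the pointer mechanism's output), the `masks_after` mapping, and the hard-coded prediction (standing in for a masked language model) are all assumptions made for illustration.

```python
MASK = "[MASK]"

def tag_and_reorder(tokens, order, masks_after):
    """Tagging step (toy): keep the input tokens listed in `order`, in that
    order, and open MASK slots after selected tokens. Tokens whose index is
    absent from `order` are dropped."""
    template = []
    for idx in order:
        template.append(tokens[idx])
        template.extend([MASK] * masks_after.get(idx, 0))
    return template

def fill_masks(template, predictions):
    """Insertion step (toy): replace each MASK slot, left to right, with the
    next predicted token. A real system would predict all slots with a
    masked language model."""
    preds = iter(predictions)
    return [next(preds) if tok == MASK else tok for tok in template]

# Sentence-fusion style example with hand-written decisions.
source = "Turing was born in 1912 . Turing died in 1954 .".split()
order = [0, 1, 2, 3, 4, 7, 8, 9, 10]   # drops the first "." and the second "Turing"
template = tag_and_reorder(source, order, masks_after={4: 1})
fused = fill_masks(template, ["and"])
print(" ".join(fused))
```

Because both steps operate on the whole sequence at once, neither needs left-to-right decoding, which is the property the abstract ties to faster inference.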

5. Improving Yorùbá Diacritic Restoration [PDF] Back to contents
  Iroro Orife, David I. Adelani, Timi Fasubaa, Victor Williamson, Wuraola Fisayo Oyewusi, Olamilekan Wahab, Kola Tubosun
Abstract: Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics. They provide morphological information, are crucial for lexical disambiguation and pronunciation, and are vital for any computational Speech or Natural Language Processing task. However, diacritic marks are commonly excluded from electronic texts due to limited device and application support as well as general education on proper usage. We report on recent efforts at dataset cultivation. By aggregating and improving disparate texts from the web and various personal libraries, we were able to significantly grow our clean Yorùbá dataset from a majority-Biblical text corpus with three sources to millions of tokens from over a dozen sources. We evaluate updated diacritic restoration models on a new, general purpose, public-domain Yorùbá evaluation dataset of modern journalistic news text, selected to be multi-purpose and reflecting contemporary usage. All pre-trained models, datasets and source-code have been released as an open-source project to advance efforts on Yorùbá language technology.
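To make the restoration task concrete, the sketch below shows one common way to derive the undiacritized side of a training pair from clean text using only the Python standard library. The abstract does not say how the authors build their inputs, so the approach and the helper names `strip_diacritics` and `make_training_pair` are assumptions for illustration.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks (tone marks, under-dots) from the text."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Unicode category "Mn" (nonspacing mark) covers the combining diacritics.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

def make_training_pair(diacritized):
    """Return (model input, target): stripped text and the original text."""
    return strip_diacritics(diacritized), diacritized

print(make_training_pair("Yorùbá"))
```

A restoration model would then be trained to map the first element of each pair back to the second.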

6. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [PDF] Back to contents
  Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
Abstract: Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
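The replaced token detection objective in the ELECTRA abstract can be sketched as a data-construction step. This is a minimal illustration, not the authors' code: `propose` stands in for sampling from the small generator network, the replacement probability is an arbitrary parameter, and labelling a sample that happens to equal the original token as "original" is an assumption of this sketch.

```python
import random

def corrupt_for_rtd(tokens, propose, replace_prob=0.15, seed=0):
    """Build one replaced-token-detection example.

    Returns (corrupted_tokens, labels); label 1 means the token was replaced,
    label 0 means it is the original token.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        # With probability replace_prob, ask the stand-in generator for a token.
        candidate = propose(tok) if rng.random() < replace_prob else tok
        corrupted.append(candidate)
        # Assumption: a sample identical to the original counts as "original".
        labels.append(int(candidate != tok))
    return corrupted, labels

# Hypothetical stand-in for the generator: a fixed table of plausible swaps.
ALTERNATIVES = {"chef": "waiter", "cooked": "ate", "meal": "soup"}

def toy_generator(token):
    return ALTERNATIVES.get(token, token)

sentence = "the chef cooked the meal".split()
corrupted, labels = corrupt_for_rtd(sentence, toy_generator, replace_prob=0.5)
print(corrupted)
print(labels)
```

Every position receives a label, so the discriminator gets a training signal from all input tokens rather than only from a masked subset, which is the efficiency argument the abstract makes.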

7. Learning Compact Reward for Image Captioning [PDF] Back to contents
  Nannan Li, Zhenzhong Chen
Abstract: Adversarial learning has shown its advances in generating natural and diverse descriptions in image captioning. However, the learned reward of existing adversarial methods is vague and ill-defined due to the reward ambiguity problem. In this paper, we propose a refined Adversarial Inverse Reinforcement Learning (rAIRL) method to handle the reward ambiguity problem by disentangling reward for each word in a sentence, as well as achieve stable adversarial training by refining the loss function to shift the generator towards Nash equilibrium. In addition, we introduce a conditional term in the loss function to mitigate mode collapse and to increase the diversity of the generated descriptions. Our experiments on MS COCO and Flickr30K show that our method can learn compact reward for image captioning.

8. Investigating Software Usage in the Social Sciences: A Knowledge Graph Approach [PDF] Back to contents
  David Schindler, Benjamin Zapilko, Frank Krüger
Abstract: Knowledge about the software used in scientific investigations is necessary for different reasons, including provenance of the results, measuring software impact to attribute developers, and bibliometric software citation analysis in general. Additionally, providing information about whether and how the software and the source code are available allows an assessment about the state and role of open source software in science in general. While such analyses can be done manually, large scale analyses require the application of automated methods of information extraction and linking. In this paper, we present SoftwareKG - a knowledge graph that contains information about software mentions from more than 51,000 scientific articles from the social sciences. A silver standard corpus, created by a distant and weak supervision approach, and a gold standard corpus, created by manual annotation, were used to train an LSTM-based neural network to identify software mentions in scientific articles. The model achieves a recognition rate of 0.82 F-score in exact matches. As a result, we identified more than 133,000 software mentions. For entity disambiguation, we used the public domain knowledge base DBpedia. Furthermore, we linked the entities of the knowledge graph to other knowledge bases such as the Microsoft Academic Knowledge Graph, the Software Ontology, and Wikidata. Finally, we illustrate how SoftwareKG can be used to assess the role of software in the social sciences.
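For readers unfamiliar with the metric, the sketch below shows how an exact-match F-score over extracted mentions is typically computed. The representation of a mention as a `(start, end, text)` tuple and the example mentions are assumptions for illustration; the abstract does not describe the evaluation script.

```python
def exact_match_f_score(gold, predicted):
    """F1 over mentions, counting a prediction as correct only if it is
    identical to a gold mention (same span and same text)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical mentions as (start_token, end_token, text).
gold = [(0, 1, "SPSS"), (5, 6, "R")]
predicted = [(0, 1, "SPSS"), (7, 8, "Stata")]
print(exact_match_f_score(gold, predicted))
```

Here one of two predictions is correct and one of two gold mentions is found, so precision and recall are both one half.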

9. Video Object Grounding using Semantic Roles in Language Description [PDF] Back to contents
  Arka Sadhu, Kan Chen, Ram Nevatia
Abstract: We explore the task of Video Object Grounding (VOG), which grounds objects in videos referred to in natural language descriptions. Previous methods apply image grounding based algorithms to address VOG, fail to explore the object relation information and suffer from limited generalization. Here, we investigate the role of object relations in VOG and propose a novel framework VOGNet to encode multi-modal object relations via self-attention with relative position encoding. To evaluate VOGNet, we propose novel contrasting sampling methods to generate more challenging grounding input samples, and construct a new dataset called ActivityNet-SRL (ASRL) based on existing caption and grounding datasets. Experiments on ASRL validate the need of encoding object relations in VOG, and our VOGNet outperforms competitive baselines by a significant margin.

10. ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation [PDF] Back to contents
  Sharon Fogel, Hadar Averbuch-Elor, Sarel Cohen, Shai Mazor, Roee Litman
Abstract: Optical character recognition (OCR) system performance has improved significantly in the deep learning era. This is especially true for handwritten text recognition (HTR), where each author has a unique style, unlike printed text, where the variation is smaller by design. That said, deep learning based HTR is limited, as in every other task, by the number of training examples. Gathering data is a challenging and costly task, and even more so is the labeling task that follows, on which we focus here. One possible approach to reduce the burden of data annotation is semi-supervised learning. Semi-supervised methods use, in addition to labeled data, some unlabeled samples to improve performance, compared to fully supervised ones. Consequently, such methods may adapt to unseen images during test time. We present ScrabbleGAN, a semi-supervised approach to synthesize handwritten text images that are versatile both in style and lexicon. ScrabbleGAN relies on a novel generative model which can generate images of words with an arbitrary length. We show how to operate our approach in a semi-supervised manner, enjoying the aforementioned benefits such as performance boost over state of the art supervised HTR. Furthermore, our generator can manipulate the resulting text style. This allows us to change, for instance, whether the text is cursive, or how thin the pen stroke is.

11. Data-driven models and computational tools for neurolinguistics: a language technology perspective [PDF] Back to contents
  Ekaterina Artemova, Amir Bakarov, Aleksey Artemov, Evgeny Burnaev, Maxim Sharaev
Abstract: In this paper, our focus is the connection and influence of language technologies on the research in neurolinguistics. We present a review of brain imaging-based neurolinguistic studies with a focus on the natural language representations, such as word embeddings and pre-trained language models. Mutual enrichment of neurolinguistics and language technologies leads to development of brain-aware natural language representations. The importance of this research area is emphasized by medical applications.

Note: The Chinese abstracts in the original post are machine translations.