
[arXiv Papers] Computation and Language 2020-07-16

Contents

1. Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook [PDF] Abstract
2. Align then Summarize: Automatic Alignment Methods for Summarization Corpus Creation [PDF] Abstract
3. InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [PDF] Abstract
4. Fine-Tune Longformer for Jointly Predicting Rumor Stance and Veracity [PDF] Abstract
5. AdapterHub: A Framework for Adapting Transformers [PDF] Abstract
6. Multimodal Word Sense Disambiguation in Creative Practice [PDF] Abstract
7. Dual Past and Future for Neural Machine Translation [PDF] Abstract
8. A Multilingual Parallel Corpora Collection Effort for Indian Languages [PDF] Abstract
9. UniTrans: Unifying Model Transfer and Data Transfer for Cross-Lingual Named Entity Recognition with Unlabeled Data [PDF] Abstract
10. Logic Constrained Pointer Networks for Interpretable Textual Similarity [PDF] Abstract
11. Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks [PDF] Abstract
12. Are We There Yet? Evaluating State-of-the-Art Neural Network based Geoparsers Using EUPEG as a Benchmarking Platform [PDF] Abstract
13. Modeling Coherency in Generated Emails by Leveraging Deep Neural Learners [PDF] Abstract
14. Emoji Prediction: Extensions and Benchmarking [PDF] Abstract
15. Deep learning models for representing out-of-vocabulary words [PDF] Abstract
16. Using Holographically Compressed Embeddings in Question Answering [PDF] Abstract
17. Dialect Diversity in Text Summarization on Twitter [PDF] Abstract
18. Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset [PDF] Abstract
19. Intelligent requirements engineering from natural language and their chaining toward CAD models [PDF] Abstract
20. Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization [PDF] Abstract

Abstracts

1. Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook [PDF] Back to Contents
  Yudhanjaya Wijeratne, Nisansa de Silva
Abstract: This paper presents two colloquial Sinhala language corpora from the language efforts of the Data, Analysis and Policy team of LIRNEasia, as well as a list of algorithmically derived stopwords. The larger of the two corpora spans 2010 to 2020 and contains 28,825,820 to 29,549,672 words of multilingual text posted by 533 Sri Lankan Facebook pages, including politics, media, celebrities, and other categories; the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from the larger. Both corpora have markers for their date of creation, page of origin, and content type.
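
The abstract does not spell out how the stopwords were derived, so the sketch below is only a generic illustration of frequency-based stopword extraction; tokenized_posts is a hypothetical stand-in for the tokenized corpus, and the paper's actual algorithm may differ.

```python
from collections import Counter

def derive_stopwords(tokenized_posts, top_k=100):
    """Rank words by document frequency; the most ubiquitous words
    are stopword candidates (illustrative only)."""
    doc_freq = Counter()
    for tokens in tokenized_posts:
        doc_freq.update(set(tokens))  # count each word once per post
    ranked = sorted(doc_freq, key=doc_freq.get, reverse=True)
    return ranked[:top_k]

# Toy posts standing in for the tokenized Facebook corpus:
posts = [["api", "me", "rata", "gana"], ["me", "gana", "kathava"], ["me", "api"]]
print(derive_stopwords(posts, top_k=3))  # ['me', 'api', 'gana']
```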

2. Align then Summarize: Automatic Alignment Methods for Summarization Corpus Creation [PDF] Back to Contents
  Paul Tardy, David Janiszek, Yannick Estève, Vincent Nguyen
Abstract: Summarizing texts is not a straightforward task. Before even considering text summarization, one should determine what kind of summary is expected. How much should the information be compressed? Is it relevant to reformulate, or should the summary stick to the original phrasing? State-of-the-art work on automatic text summarization mostly revolves around news articles. We suggest that considering a wider variety of tasks would lead to an improvement in the field, in terms of generalization and robustness. We explore meeting summarization: generating reports from automatic transcriptions. Our work consists of segmenting and aligning transcriptions with respect to reports, to get a suitable dataset for neural summarization. Using a bootstrapping approach, we provide pre-alignments that are corrected by human annotators, making a validation set against which we evaluate automatic models. This consistently reduces annotators' effort by providing iteratively better pre-alignments and maximizes the corpus size by using annotations from our automatic alignment models. Evaluation is conducted on public_meetings, a novel corpus of aligned public meetings. We report automatic alignment and summarization performance on this corpus and show that automatic alignment is relevant for data annotation, since it leads to a large improvement of almost +4 on all ROUGE scores on the summarization task.
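
A minimal greedy sketch of the segment-to-sentence alignment step; the similarity function and the toy data are our own assumptions, and the paper's trained alignment models go beyond this heuristic.

```python
def align_report_to_transcript(transcript_segments, report_sentences, similarity):
    """Pair each report sentence with its most similar transcript
    segment (greedy illustration; the paper's models are trained)."""
    alignment = {}
    for j, sentence in enumerate(report_sentences):
        scores = [similarity(segment, sentence) for segment in transcript_segments]
        alignment[j] = max(range(len(scores)), key=scores.__getitem__)
    return alignment

# Toy usage with bag-of-words overlap as the similarity function:
overlap = lambda a, b: len(set(a.lower().split()) & set(b.lower().split()))
print(align_report_to_transcript(
    ["we vote on the budget today", "next item is the roads"],
    ["The budget was voted on today.", "Roads were discussed next."],
    overlap))  # {0: 0, 1: 1}
```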

3. InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [PDF] Back to Contents
  Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, Ming Zhou
Abstract: In this work, we formulate cross-lingual language model pre-training as maximizing mutual information between multilingual-multi-granularity texts. The unified view helps us to better understand the existing methods for learning cross-lingual representations. More importantly, the information-theoretic framework inspires us to propose a pre-training task based on contrastive learning. Given a bilingual sentence pair, we regard them as two views of the same meaning, and encourage their encoded representations to be more similar than the negative examples. By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models. Experimental results on several benchmarks show that our approach achieves considerably better performance. The code and pre-trained models are available at this http URL.
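
The contrastive pretext task treats the two sides of a translation pair as two views of the same meaning and pushes their encodings together relative to in-batch negatives. A minimal InfoNCE-style sketch (our illustration, not the authors' code; the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def cross_lingual_contrastive_loss(src_emb, tgt_emb, temperature=0.05):
    """InfoNCE over a batch of translation pairs: row i of src_emb and
    row i of tgt_emb encode the same sentence in two languages; all
    other rows in the batch serve as negatives."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature  # (batch, batch) cosine similarities
    labels = torch.arange(src.size(0))    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: 4 sentence pairs embedded in 8 dimensions.
loss = cross_lingual_contrastive_loss(torch.randn(4, 8), torch.randn(4, 8))
```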

4. Fine-Tune Longformer for Jointly Predicting Rumor Stance and Veracity [PDF] Back to Contents
  Anant Khandelwal
Abstract: Increased usage of social media has popularized news and events that are not even verified, resulting in the spread of rumors all over the web. Widely available social media platforms and their increased usage have made data available in huge amounts, and manual methods to process such large data are costly and time-consuming, so there has been increased attention to processing and verifying such content automatically for the presence of rumors. Many research studies reveal that identifying the stances of posts in the discussion thread of such events and news is an important step preceding the identification of rumor veracity. In this paper, we propose a multi-task learning framework for jointly predicting rumor stance and veracity on the dataset released at SemEval 2019 RumorEval: Determining rumor veracity and support for rumors (SemEval 2019 Task 7), which includes social media rumors stemming from a variety of breaking news stories from Reddit as well as Twitter. Our framework consists of two parts: a) the bottom part of our framework classifies the stance of each post in a conversation thread discussing a rumor by modelling the multi-turn conversation and making each post aware of its neighboring posts; b) the upper part predicts the rumor veracity of the conversation thread using the stance evolution obtained from the bottom part. Experimental results on the SemEval 2019 Task 7 dataset show that our method outperforms previous methods on both rumor stance classification and veracity prediction.
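
A hedged sketch of how the per-post stance objective and the thread-level veracity objective could be combined into one multi-task loss; the weighting scheme and tensor shapes are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def joint_rumor_loss(stance_logits, stance_labels,
                     veracity_logits, veracity_label, alpha=0.5):
    """Multi-task objective: stance classification for every post in the
    thread plus a single veracity prediction for the whole thread.
    alpha is an assumed balancing hyperparameter."""
    stance_loss = F.cross_entropy(stance_logits, stance_labels)       # (n_posts, n_stances)
    veracity_loss = F.cross_entropy(veracity_logits, veracity_label)  # (1, n_veracity)
    return alpha * stance_loss + (1 - alpha) * veracity_loss

# Toy thread: 5 posts with 4 stance classes; 3 veracity classes.
loss = joint_rumor_loss(torch.randn(5, 4), torch.randint(0, 4, (5,)),
                        torch.randn(1, 3), torch.randint(0, 3, (1,)))
```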

5. AdapterHub: A Framework for Adapting Transformers [PDF] Back to Contents
  Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, Iryna Gurevych
Abstract: The current modus operandi in NLP involves downloading and fine-tuning pre-trained models consisting of millions or billions of parameters. Storing and sharing such large trained models is expensive, slow, and time-consuming, which impedes progress towards more general and versatile NLP methods that learn from and for many tasks. Adapters -- small learnt bottleneck layers inserted within each layer of a pre-trained model -- ameliorate this issue by avoiding full fine-tuning of the entire model. However, sharing and integrating adapter layers is not straightforward. We propose AdapterHub, a framework that allows dynamic "stitching-in" of pre-trained adapters for different tasks and languages. The framework, built on top of the popular HuggingFace Transformers library, enables extremely easy and quick adaptations of state-of-the-art pre-trained models (e.g., BERT, RoBERTa, XLM-R) across tasks and languages. Downloading, sharing, and training adapters is as seamless as possible using minimal changes to the training scripts and a specialized infrastructure. Our framework enables scalable and easy access to sharing of task-specific models, particularly in low-resource scenarios. AdapterHub includes all recent adapter architectures and can be found at this https URL.
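
The "stitching-in" workflow looks roughly like the sketch below, which follows AdapterHub's documented quickstart pattern; class and method names have varied across versions of the adapter-transformers library, so treat the exact signatures as approximate.

```python
# Requires the adapter-transformers fork of HuggingFace Transformers.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# Stitch in a pre-trained task adapter from the hub instead of
# fine-tuning all of the base model's parameters:
adapter_name = model.load_adapter("sentiment/sst-2@ukp")  # downloads from AdapterHub
model.set_active_adapters(adapter_name)
# Only the small bottleneck layers are task-specific; the base stays frozen.
```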

6. Multimodal Word Sense Disambiguation in Creative Practice [PDF] Back to Contents
  Manuel Ladron de Guevara, Christopher George, Akshat Gupta, Daragh Byrne, Ramesh Krishnamurti
Abstract: Language is ambiguous; many terms and expressions can convey the same idea. This is especially true in creative practice, where ideas and design intents are highly subjective. We present a dataset, Ambiguous Descriptions of Art Images (ADARI), of contemporary workpieces, which aims to provide a foundational resource for subjective image description and multimodal word disambiguation in the context of creative practice. The dataset contains a total of 240k images labeled with 260k descriptive sentences. It is additionally organized into sub-domains of architecture, art, design, fashion, furniture, product design and technology. In subjective image description, labels are not deterministic: for example, the ambiguous label dynamic might correspond to hundreds of different images. To understand this complexity, we analyze the ambiguity and relevance of text with respect to images using the state-of-the-art pre-trained BERT model for sentence classification. We provide a baseline for multi-label classification tasks and demonstrate the potential of multimodal approaches for understanding ambiguity in design intentions. We hope that ADARI dataset and baselines constitute a first step towards subjective label classification.

7. Dual Past and Future for Neural Machine Translation [PDF] Back to Contents
  Jianhao Yan, Fandong Meng, Jie Zhou
Abstract: Though remarkable successes have been achieved by Neural Machine Translation (NMT) in recent years, it still suffers from the inadequate-translation problem. Previous studies show that explicitly modeling the Past and Future contents of the source sentence is beneficial for translation performance. However, it is not clear whether the commonly used heuristic objective is good enough to guide the Past and Future. In this paper, we present a novel dual framework that leverages both source-to-target and target-to-source NMT models to provide a more direct and accurate supervision signal for the Past and Future modules. Experimental results demonstrate that our proposed method significantly improves the adequacy of NMT predictions and surpasses previous methods in two well-studied translation tasks.

8. A Multilingual Parallel Corpora Collection Effort for Indian Languages [PDF] Back to Contents
  Shashank Siripragada, Jerin Philip, Vinay P. Namboodiri, C V Jawahar
Abstract: We present sentence-aligned parallel corpora across 10 Indian languages (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, and Punjabi) and English; many of these are categorized as low resource. The corpora are compiled from online sources which have content shared across languages. The corpora presented significantly extend existing resources, which are either not large enough or are restricted to a specific domain (such as health). We also provide a separate test corpus compiled from an independent online source that can be used for validating performance in the 10 Indian languages. Alongside, we report on the methods of constructing such corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval based on deep neural networks.

9. UniTrans: Unifying Model Transfer and Data Transfer for Cross-Lingual Named Entity Recognition with Unlabeled Data [PDF] Back to Contents
  Qianhui Wu, Zijia Lin, Börje F. Karlsson, Biqing Huang, Jian-Guang Lou
Abstract: Prior works in cross-lingual named entity recognition (NER) with no/little labeled data fall into two primary categories: model transfer based and data transfer based methods. In this paper we find that both method types can complement each other, in the sense that, the former can exploit context information via language-independent features but sees no task-specific information in the target language; while the latter generally generates pseudo target-language training data via translation but its exploitation of context information is weakened by inaccurate translations. Moreover, prior works rarely leverage unlabeled data in the target language, which can be effortlessly collected and potentially contains valuable information for improved results. To handle both problems, we propose a novel approach termed UniTrans to Unify both model and data Transfer for cross-lingual NER, and furthermore, to leverage the available information from unlabeled target-language data via enhanced knowledge distillation. We evaluate our proposed UniTrans over 4 target languages on benchmark datasets. Our experimental results show that it substantially outperforms the existing state-of-the-art methods.
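
A minimal sketch of the knowledge-distillation ingredient: a student is trained on unlabeled target-language tokens to match a teacher's soft tag distribution. This is generic distillation; the paper's enhanced variant (e.g., how the model-transfer and data-transfer teachers are combined) is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between the student's and teacher's softened
    distributions over NER tags for each unlabeled token."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * t * t

# Toy usage: 6 target-language tokens, 9 NER tag classes.
loss = distillation_loss(torch.randn(6, 9), torch.randn(6, 9))
```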

10. Logic Constrained Pointer Networks for Interpretable Textual Similarity [PDF] Back to Contents
  Subhadeep Maji, Rohan Kumar, Manish Bansal, Kalyani Roy, Pawan Goyal
Abstract: Systematically discovering semantic relationships in text is an important and extensively studied area in Natural Language Processing, with various tasks such as entailment, semantic similarity, etc. Decomposability of sentence-level scores via subsequence alignments has been proposed as a way to make models more interpretable. We study the problem of aligning components of sentences, leading to an interpretable model for semantic textual similarity. In this paper, we introduce a novel pointer network based model with a sentinel gating function to align constituent chunks, which are represented using BERT. We improve this base model with a loss function that equally penalizes misalignments in both sentences, ensuring the alignments are bidirectional. Finally, to guide the network with structured external knowledge, we introduce first-order logic constraints based on ConceptNet and syntactic knowledge. The model achieves F1 scores of 97.73 and 96.32 on the benchmark SemEval datasets for the chunk alignment task, showing large improvements over the existing solutions. Source code is available at this https URL.
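
The bidirectional-alignment idea can be sketched as a symmetric loss over pointer distributions in both directions; the simple averaging and the sentinel handling below are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def bidirectional_alignment_loss(logits_ab, gold_ab, logits_ba, gold_ba):
    """Penalize misalignments equally in both directions: chunks of
    sentence A pointing into B (plus a sentinel for 'no alignment'),
    and chunks of B pointing back into A."""
    loss_ab = F.cross_entropy(logits_ab, gold_ab)  # A -> B pointers
    loss_ba = F.cross_entropy(logits_ba, gold_ba)  # B -> A pointers
    return 0.5 * (loss_ab + loss_ba)

# Toy usage: 4 chunks in A over 5 chunks of B plus a sentinel, and back.
loss = bidirectional_alignment_loss(torch.randn(4, 6), torch.randint(0, 6, (4,)),
                                    torch.randn(5, 5), torch.randint(0, 5, (5,)))
```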

11. Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks [PDF] Back to Contents
  Pavel Blinov, Manvel Avetisian, Vladimir Kokh, Dmitry Umerenkov, Alexander Tuzhilin
Abstract: In this paper we study the problem of predicting clinical diagnoses from textual Electronic Health Record (EHR) data. We show the importance of this problem in the medical community and present a comprehensive historical review of the problem and proposed methods. As the main scientific contributions, we present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification that implements a novel way of composing Fully-Connected (FC) layers, and a BERT model pretrained only on domain data. To empirically validate our model, we use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits. This is the largest such study for the Russian language and one of the largest globally. We performed a number of comparative experiments with other text representation models on the task of multiclass classification for a 265-disease subset of ICD-10. The experiments demonstrate improved performance of our models compared to other baselines, including a fine-tuned Russian BERT (RuBERT) variant. We also show that our model performs comparably to a panel of experienced medical experts. This allows us to hope that implementation of this system will reduce misdiagnosis.

12. Are We There Yet? Evaluating State-of-the-Art Neural Network based Geoparsers Using EUPEG as a Benchmarking Platform [PDF] Back to Contents
  Jimin Wang, Yingjie Hu
Abstract: Geoparsing is an important task in geographic information retrieval. A geoparsing system, known as a geoparser, takes some texts as the input and outputs the recognized place mentions and their location coordinates. In June 2019, a geoparsing competition, Toponym Resolution in Scientific Papers, was held as one of the SemEval 2019 tasks. The winning teams developed neural network based geoparsers that achieved outstanding performances (over 90% precision, recall, and F1 score for toponym recognition). This exciting result raises the question "are we there yet?", namely, have we achieved performances high enough to possibly consider the problem of geoparsing as solved? One limitation of this competition is that the developed geoparsers were tested on only one dataset, which has 45 research articles collected from the particular domain of biomedicine. It is known that the same geoparser can have very different performances on different datasets. Thus, this work performs a systematic evaluation of these state-of-the-art geoparsers using our recently developed benchmarking platform EUPEG, which has eight annotated datasets, nine baseline geoparsers, and eight performance metrics. The evaluation result suggests that these new geoparsers indeed improve the performance of geoparsing on multiple datasets, although some challenges remain.

13. Modeling Coherency in Generated Emails by Leveraging Deep Neural Learners [PDF] Back to Contents
  Avisha Das, Rakesh M. Verma
Abstract: Advanced machine learning and natural language techniques enable attackers to launch sophisticated and targeted social engineering-based attacks. To counter the active attacker issue, researchers have since resorted to proactive methods of detection. Email masquerading using targeted emails to fool the victim is an advanced attack method. However automatic text generation requires controlling the context and coherency of the generated content, which has been identified as an increasingly difficult problem. The method used leverages a hierarchical deep neural model which uses a learned representation of the sentences in the input document to generate structured written emails. We demonstrate the generation of short and targeted text messages using the deep model. The global coherency of the synthesized text is evaluated using a qualitative study as well as multiple quantitative measures.

14. Emoji Prediction: Extensions and Benchmarking [PDF] Back to Contents
  Weicheng Ma, Ruibo Liu, Lili Wang, Soroush Vosoughi
Abstract: Emojis are a succinct form of language which can express concrete meanings, emotions, and intentions. Emojis also carry signals that can be used to better understand communicative intent. They have become a ubiquitous part of our daily lives, making them an important part of understanding user-generated content. The emoji prediction task aims at predicting the proper set of emojis associated with a piece of text. Through emoji prediction, models can learn rich representations of the communicative intent of the written text. While existing research on the emoji prediction task focus on a small subset of emoji types closely related to certain emotions, this setting oversimplifies the task and wastes the expressive power of emojis. In this paper, we extend the existing setting of the emoji prediction task to include a richer set of emojis and to allow multi-label classification on the task. We propose novel models for multi-class and multi-label emoji prediction based on Transformer networks. We also construct multiple emoji prediction datasets from Twitter using heuristics. The BERT models achieve state-of-the-art performances on all our datasets under all the settings, with relative improvements of 27.21% to 236.36% in accuracy, 2.01% to 88.28% in top-5 accuracy and 65.19% to 346.79% in F-1 score, compared to the prior state-of-the-art. Our results demonstrate the efficacy of deep Transformer-based models on the emoji prediction task. We also release our datasets at this https URL for future researchers.
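
Moving from multi-class to multi-label prediction amounts to giving each emoji an independent sigmoid instead of a single softmax over the vocabulary. A generic sketch (not the paper's BERT models):

```python
import torch
import torch.nn.functional as F

def multi_label_emoji_loss(logits, targets):
    """Binary cross-entropy per emoji class, so several emojis can be
    correct for the same text."""
    return F.binary_cross_entropy_with_logits(logits, targets)

def predict_emojis(logits, threshold=0.5):
    """Boolean mask over the emoji vocabulary, one row per text."""
    return torch.sigmoid(logits) > threshold

# Toy usage: a batch of 2 texts over a 10-emoji vocabulary.
logits = torch.randn(2, 10)
targets = torch.randint(0, 2, (2, 10)).float()
loss = multi_label_emoji_loss(logits, targets)
predictions = predict_emojis(logits)
```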

15. Deep learning models for representing out-of-vocabulary words [PDF] Back to Contents
  Johannes V. Lochter, Renato M. Silva, Tiago A. Almeida
Abstract: Communication has become increasingly dynamic with the popularization of social networks and applications that allow people to express themselves and communicate instantly. In this scenario, distributed representation models have their quality impacted by new words that appear frequently or that are derived from spelling errors. These words that are unknown by the models, known as out-of-vocabulary (OOV) words, need to be properly handled to not degrade the quality of the natural language processing (NLP) applications, which depend on the appropriate vector representation of the texts. To better understand this problem and finding the best techniques to handle OOV words, in this study, we present a comprehensive performance evaluation of deep learning models for representing OOV words. We performed an intrinsic evaluation using a benchmark dataset and an extrinsic evaluation using different NLP tasks: text categorization, named entity recognition, and part-of-speech tagging. The results indicated that the best technique for handling OOV words can be different for each task. But, in general, deep learning models that infer the embedding based on the context and the morphological structure of the OOV word obtained promising results.
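
One family of models the study evaluates infers an OOV vector from the word's internal structure. A simplified FastText-style sketch, with a small hypothetical trigram table:

```python
import numpy as np

def oov_embedding(word, ngram_vectors, n=3):
    """Average the embeddings of a word's character n-grams to
    approximate a vector for an out-of-vocabulary word."""
    padded = f"<{word}>"  # boundary markers, as in FastText
    ngrams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    known = [ngram_vectors[g] for g in ngrams if g in ngram_vectors]
    if not known:
        return np.zeros_like(next(iter(ngram_vectors.values())))
    return np.mean(known, axis=0)

# Toy usage with a hypothetical trigram table:
table = {"<mi": np.ones(4), "mis": np.full(4, 2.0), "spe": np.full(4, 3.0)}
vector = oov_embedding("mispeling", table)  # mean of the three known trigrams
```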

16. Using Holographically Compressed Embeddings in Question Answering [PDF] Back to Contents
  Salvador E. Barbosa
Abstract: Word vector representations are central to deep learning natural language processing models. Many forms of these vectors, known as embeddings, exist, including word2vec and GloVe. Embeddings are trained on large corpora and learn the word's usage in context, capturing the semantic relationship between words. However, the semantics from such training are at the level of distinct words (known as word types), and can be ambiguous when, for example, a word type can be either a noun or a verb. In question answering, parts-of-speech and named entity types are important, but encoding these attributes in neural models expands the size of the input. This research employs holographic compression of pre-trained embeddings, to represent a token, its part-of-speech, and named entity type, in the same dimension as representing only the token. The implementation, in a modified question answering recurrent deep learning network, shows that semantic relationships are preserved, and yields strong performance.
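
Holographic compression binds two vectors into a single vector of the same dimensionality via circular convolution, computable with the FFT. A minimal sketch of the binding operation (the paper's exact encoding of token, part-of-speech, and entity type is not reproduced):

```python
import numpy as np

def circular_convolution(a, b):
    """Holographic binding: compress two d-dimensional vectors into one
    d-dimensional vector via circular convolution (FFT-based)."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

# Bind a token embedding with a part-of-speech embedding; the result
# occupies the same number of dimensions as the token vector alone.
d = 8
token_vec, pos_vec = np.random.randn(d), np.random.randn(d)
bound = circular_convolution(token_vec, pos_vec)  # shape (8,)
```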

17. Dialect Diversity in Text Summarization on Twitter [PDF] Back to Contents
  L. Elisa Celis, Vijay Keswani
Abstract: Extractive summarization algorithms can be used on Twitter data to return a set of posts that succinctly capture a topic. However, Twitter datasets have a significant fraction of posts written in different English dialects. We study the dialect bias in the summaries of such datasets generated by common summarization algorithms and observe that, for datasets that have sentences from more than one dialect, most summarization algorithms return summaries that under-represent the minority dialect. To correct for this bias, we propose a framework that takes an existing summarization algorithm as a blackbox and, using a small set of dialect-diverse sentences, returns a summary that is relatively more dialect-diverse. Crucially, our approach does not need the sentences in the dataset to have dialect labels, ensuring that the diversification process is independent of dialect classification and language identification models. We show the efficacy of our approach on Twitter datasets containing posts written in dialects used by different social groups defined by race, region or gender; in all cases, our approach leads to improved dialect diversity compared to the standard summarization approaches.

18. Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset [PDF] Back to Contents
  Edwin Zhang, Nikhil Gupta, Raphael Tang, Xiao Han, Ronak Pradeep, Kuang Lu, Yue Zhang, Rodrigo Nogueira, Kyunghyun Cho, Hui Fang, Jimmy Lin
Abstract: We present Covidex, a search engine that exploits the latest neural ranking models to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI. Our system has been online and serving users since late March 2020. The Covidex is the user application component of our three-pronged strategy to develop technologies for helping domain experts tackle the ongoing global pandemic. In addition, we provide robust and easy-to-use keyword search infrastructure that exploits mature fusion-based methods as well as standalone neural ranking models that can be incorporated into other applications. These techniques have been evaluated in the ongoing TREC-COVID challenge: Our infrastructure and baselines have been adopted by many participants, including some of the highest-scoring runs in rounds 1, 2, and 3. In round 3, we report the highest-scoring run that takes advantage of previous training data and the second-highest fully automatic run.
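
Reciprocal rank fusion is a representative example of the mature fusion-based methods mentioned; whether Covidex uses exactly this variant is an assumption on our part. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids: each document scores
    sum(1 / (k + rank)) over every list that retrieved it."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: fuse a keyword-search run with a neural-ranker run.
fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d3", "d1"]])
print(fused)  # ['d2', 'd1', 'd3']
```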

19. Intelligent requirements engineering from natural language and their chaining toward CAD models [PDF] Back to Contents
  Alain-Jérôme Fougères, Egon Ostrosi
Abstract: This paper assumes that design language plays an important role in how designers design and in the creativity of designers. Designers use and develop models as an aid to thinking, a focus for discussion and decision-making, and a means of evaluating the reliability of proposals. This paper proposes an intelligent method for requirements engineering from natural language and the chaining of requirements toward CAD models. The transition from linguistic analysis to the representation of engineering requirements consists of translating the syntactic structure into a semantic form represented by conceptual graphs. Based on the isomorphism between conceptual graphs and predicate logic, a formal specification language is proposed. The output of this language is chained and translated into Computer Aided Three-Dimensional Interactive Application (CATIA) models. A tool (EGEON: Engineering desiGn sEmantics elabOration and applicatioN) is developed to represent the semantic network of engineering requirements. A case study on the design of a car door hinge is presented to illustrate the proposed method.

20. Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization [PDF] Back to Contents
  Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck
Abstract: In this paper we describe the top-scoring IDLab submission for the text-independent task of the Short-duration Speaker Verification (SdSV) Challenge 2020. The main difficulty of the challenge exists in the large degree of varying phonetic overlap between the optionally cross-lingual trials, along with the limited availability of in-domain DeepMine Farsi training data. We introduce domain-balanced hard prototype mining to fine-tune the state-of-the-art ECAPA-TDNN x-vector based speaker embedding extractor. The sample mining technique efficiently exploits speaker distances between the speaker prototypes of the popular AAM-softmax loss function to construct challenging training batches that are balanced on the domain-level. To enhance the scoring of cross-lingual trials, we propose a language-dependent s-norm score normalization. The imposter cohort only contains data from the Farsi target-domain which simulates the enrollment data always being Farsi. In case a Gaussian-Backend language model detects the test speaker embedding to contain English, a cross-language compensation offset determined on the AAM-softmax speaker prototypes is subtracted from the maximum expected imposter mean score. A fusion of five systems with minor topological tweaks resulted in a final MinDCF and EER of 0.065 and 1.45% respectively on the SdSVC evaluation set.
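
Standard symmetric score normalization (s-norm) underlies the proposed language-dependent variant: a raw trial score is normalized against imposter-cohort statistics from both the enrollment and the test side. A minimal sketch that omits the language-dependent cohort selection:

```python
import numpy as np

def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Average of the z-normalized score w.r.t. the enrollment side's
    imposter cohort and the test side's imposter cohort."""
    mu_e, sigma_e = enroll_cohort_scores.mean(), enroll_cohort_scores.std()
    mu_t, sigma_t = test_cohort_scores.mean(), test_cohort_scores.std()
    return 0.5 * ((raw_score - mu_e) / sigma_e + (raw_score - mu_t) / sigma_t)

# Toy usage with random stand-ins for the cohort scores:
normalized = s_norm(0.7, np.random.randn(200), np.random.randn(200))
```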
