
[arXiv Papers] Computation and Language 2020-03-27

Contents

1. TLDR: Token Loss Dynamic Reweighting for Reducing Repetitive Utterance Generation [PDF] Abstract
2. Rat big, cat eaten! Ideas for a useful deep-agent protolanguage [PDF] Abstract
3. Common-Knowledge Concept Recognition for SEVA [PDF] Abstract
4. Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks [PDF] Abstract
5. Multi-Label Text Classification using Attention-based Graph Neural Network [PDF] Abstract
6. Sentiment Analysis in Drug Reviews using Supervised Machine Learning Algorithms [PDF] Abstract
7. Author2Vec: A Framework for Generating User Embedding [PDF] Abstract
8. Predicting Unplanned Readmissions with Highly Unstructured Data [PDF] Abstract
9. Cost-Sensitive BERT for Generalisable Sentence Classification with Imbalanced Data [PDF] Abstract
10. Finnish Language Modeling with Deep Transformer Models [PDF] Abstract
11. Predicting Legal Proceedings Status: an Approach Based on Sequential Text Data [PDF] Abstract
12. Forensic Authorship Analysis of Microblogging Texts Using N-Grams and Stylometric Features [PDF] Abstract
13. VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [PDF] Abstract
14. Heavy-tailed Representations, Text Polarity Classification & Data Augmentation [PDF] Abstract

Abstracts

1. TLDR: Token Loss Dynamic Reweighting for Reducing Repetitive Utterance Generation [PDF] Back to Contents
  Shaojie Jiang, Thomas Wolf, Christof Monz, Maarten de Rijke
Abstract: Natural Language Generation (NLG) models are prone to generating repetitive utterances. In this work, we study the repetition problem for encoder-decoder models, using both recurrent neural network (RNN) and transformer architectures. To this end, we consider the chit-chat task, where the problem is more prominent than in other tasks that need encoder-decoder architectures. We first study the influence of model architectures. By using pre-attention and highway connections for RNNs, we manage to achieve lower repetition rates. However, this method does not generalize to other models such as transformers. We hypothesize that the deeper reason is that in the training corpora, there are hard tokens that are more difficult for a generative model to learn than others and, once learning has finished, hard tokens are still under-learned, so that repetitive generations are more likely to happen. Based on this hypothesis, we propose token loss dynamic reweighting (TLDR) that applies differentiable weights to individual token losses. By using higher weights for hard tokens and lower weights for easy tokens, NLG models are able to learn individual tokens at different paces. Experiments on chit-chat benchmark datasets show that TLDR is more effective in repetition reduction for both RNN and transformer architectures than baselines using different weighting functions.
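
As an illustration of the idea (not the authors' code), here is a minimal PyTorch sketch of per-token loss reweighting; the softmax-over-losses weighting function, the temperature parameter, and the detached weights are illustrative assumptions (the paper itself uses differentiable weights):

```python
import torch
import torch.nn.functional as F

def tldr_loss(logits, targets, pad_id=0, temperature=1.0):
    # Per-token cross-entropy with no reduction, so each token keeps its own loss.
    # logits: (batch, seq_len, vocab); targets: (batch, seq_len)
    token_loss = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    mask = (targets != pad_id).float()

    # Hypothetical weighting: softmax over per-token losses, so harder
    # (higher-loss) tokens receive larger weights; detaching is a simplification.
    weights = torch.softmax(token_loss.detach() / temperature, dim=-1) * mask
    weights = weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)
    return (weights * token_loss).sum(dim=-1).mean()

# Toy usage: random decoder outputs for a batch of 2 sequences of length 5.
logits = torch.randn(2, 5, 100, requires_grad=True)
targets = torch.randint(1, 100, (2, 5))
tldr_loss(logits, targets).backward()
```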

2. Rat big, cat eaten! Ideas for a useful deep-agent protolanguage [PDF] Back to Contents
  Marco Baroni
Abstract: Deep-agent communities developing their own language-like communication protocol are a hot (or at least warm) topic in AI. Such agents could be very useful in machine-machine and human-machine interaction scenarios long before they have evolved a protocol as complex as human language. Here, I propose a small set of priorities we should focus on, if we want to get as fast as possible to a stage where deep agents speak a useful protolanguage.

3. Common-Knowledge Concept Recognition for SEVA [PDF] Back to Contents
  Jitin Krishnan, Patrick Coronado, Hemant Purohit, Huzefa Rangwala
Abstract: We build a common-knowledge concept recognition system for a Systems Engineer's Virtual Assistant (SEVA) which can be used for downstream tasks such as relation extraction, knowledge graph construction, and question-answering. The problem is formulated as a token classification task similar to named entity extraction. With the help of a domain expert and text processing methods, we construct a dataset annotated at the word-level by carefully defining a labelling scheme to train a sequence model to recognize systems engineering concepts. We use a pre-trained language model and fine-tune it with the labeled dataset of concepts. In addition, we also create some essential datasets for information such as abbreviations and definitions from the systems engineering domain. Finally, we construct a simple knowledge graph using these extracted concepts along with some hyponym relations.
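
A minimal sketch of the kind of token-classification fine-tuning setup described, using Hugging Face Transformers; the base model and the label scheme are placeholders, not the paper's actual choices:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label scheme; the paper's labels are systems-engineering
# concepts carefully defined with a domain expert, not these placeholders.
labels = ["O", "B-CONCEPT", "I-CONCEPT"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

enc = tokenizer("The spacecraft bus provides attitude control.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits          # (1, seq_len, num_labels)

# Before fine-tuning on the labelled dataset these predictions are random;
# training minimizes token-level cross-entropy against the gold labels.
print([labels[i] for i in logits.argmax(-1)[0]])
```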

4. Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks [PDF] Back to Contents
  Tosin P. Adewumi, Foteini Liwicki, Marcus Liwicki
Abstract: Word2Vec is a prominent tool for Natural Language Processing (NLP) tasks. Similar inspiration is found in distributed embeddings for state-of-the-art (SOTA) deep neural networks. However, the wrong combination of hyper-parameters can produce poor-quality vectors. The objective of this work is to show that an optimal combination of hyper-parameters exists, and to evaluate various combinations. We compare them with the original model released by Mikolov. Both intrinsic and extrinsic (downstream) evaluations, including Named Entity Recognition (NER) and Sentiment Analysis (SA), were carried out. The downstream tasks reveal that the best model is task-specific, that high analogy scores do not necessarily correlate positively with F1 scores, and that the same applies to using more data. Increasing the vector dimension size beyond a point leads to poor quality or performance. If saving time, energy, and the environment is an ethical consideration, then reasonably smaller corpora may do just as well, or even better, in some cases. Besides, using a small corpus, we obtain better human-assigned WordSim scores, corresponding Spearman correlation, and better downstream (NER & SA) performance compared to Mikolov's model trained on a 100-billion-word corpus.
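
A small gensim sketch of the hyper-parameters in question (gensim 4.x argument names; the values are illustrative, not the optima the paper reports):

```python
from gensim.models import Word2Vec

# Tiny toy corpus; the paper trains on much larger corpora and compares
# against Mikolov's model trained on roughly 100 billion words.
sentences = [
    ["the", "quick", "brown", "fox", "jumps"],
    ["the", "lazy", "dog", "sleeps"],
    ["a", "brown", "dog", "barks"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # embedding dimension
    window=4,        # context window size
    sg=1,            # 1 = skip-gram, 0 = CBOW
    hs=0,            # 0 = negative sampling, 1 = hierarchical softmax
    negative=5,      # number of negative samples
    min_count=1,
    epochs=20,
)
print(model.wv.most_similar("dog", topn=3))
```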

5. Multi-Label Text Classification using Attention-based Graph Neural Network [PDF] Back to Contents
  Ankit Pal, Muru Selvakumar, Malaikannan Sankarasubbu
Abstract: In Multi-Label Text Classification (MLTC), one sample can belong to more than one class. It is observed that in most MLTC tasks, there are dependencies or correlations among labels. Existing methods tend to ignore the relationship among labels. In this paper, a graph attention network-based model is proposed to capture the attentive dependency structure among the labels. The graph attention network uses a feature matrix and a correlation matrix to capture and explore the crucial dependencies between the labels and generate classifiers for the task. The generated classifiers are applied to sentence feature vectors obtained from the text feature extraction network (BiLSTM) to enable end-to-end training. Attention allows the system to assign different weights to neighbor nodes per label, thus allowing it to learn the dependencies among labels implicitly. The results of the proposed model are validated on five real-world MLTC datasets. The proposed model achieves similar or better performance compared to the previous state-of-the-art models.
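
A toy sketch of the core mechanism, assuming a single graph-attention layer over label nodes whose outputs act as per-label classifiers applied to a sentence vector; the dimensions, the one-layer design, and the fully-connected label graph are simplifications, not the paper's model:

```python
import torch
import torch.nn as nn

n_labels, feat_dim = 5, 64

class LabelGAT(nn.Module):
    """One graph-attention layer over label nodes (toy dimensions)."""
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(n_labels, feat_dim))
        self.proj = nn.Linear(feat_dim, feat_dim)
        self.attn = nn.Linear(2 * feat_dim, 1)

    def forward(self, sent_feat, adj):
        # adj: (n_labels, n_labels) binary label-correlation matrix
        h = self.proj(self.label_emb)                      # (L, D)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(-1, n_labels, -1),
             h.unsqueeze(0).expand(n_labels, -1, -1)],
            dim=-1)                                         # (L, L, 2D)
        e = self.attn(pairs).squeeze(-1)                    # (L, L) attention scores
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)                    # neighbor weights per label
        classifiers = alpha @ h                             # (L, D) per-label classifiers
        return classifiers @ sent_feat                      # (L,) multi-label logits

gat = LabelGAT()
sent_feat = torch.randn(feat_dim)       # stand-in for a BiLSTM sentence vector
adj = torch.ones(n_labels, n_labels)    # fully-connected label graph
print(torch.sigmoid(gat(sent_feat, adj)))
```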

6. Sentiment Analysis in Drug Reviews using Supervised Machine Learning Algorithms [PDF] Back to Contents
  Sairamvinay Vijayaraghavan, Debraj Basu
Abstract: Sentiment Analysis is an important task in Natural Language Processing, used to detect sentiment within text. In our project, we chose to analyze reviews of various drugs, which are written as free text and are also given a rating on a scale from 1-10. We obtained this data set from the UCI machine learning repository, which provides two splits: train and test (a 75-25% split). We binned the numeric drug rating into three classes: positive (7-10), negative (1-4), or neutral (4-7). There are multiple reviews for drugs that treat a similar condition, and we investigated how the words used in reviews for different conditions impact the ratings of the drugs. Our intention was mainly to implement supervised machine learning classification algorithms that predict the rating class from the textual review. We primarily implemented different embeddings such as Term Frequency Inverse Document Frequency (TFIDF) and Count Vectors (CV). We trained models on the most popular conditions in the data set, such as "Birth Control", "Depression", and "Pain", and obtained good results when predicting on the test data sets.
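
A minimal scikit-learn sketch of the described pipeline; the exact rating cutoffs and the logistic-regression classifier are assumptions (the stated 1-4 / 4-7 / 7-10 ranges overlap at the boundaries):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def rating_to_class(r):
    # Three-way binning of the 1-10 rating; boundary handling is assumed.
    if r >= 7:
        return "positive"
    if r <= 4:
        return "negative"
    return "neutral"

# Toy stand-ins for the UCI drug-review texts and ratings.
reviews = [
    "Worked great with no side effects",
    "Made me dizzy and nauseous",
    "It was okay, nothing special",
]
labels = [rating_to_class(r) for r in (9, 2, 5)]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(reviews, labels)
print(clf.predict(["no side effects at all"]))
```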

7. Author2Vec: A Framework for Generating User Embedding [PDF] Back to Contents
  Xiaodong Wu, Weizhe Lin, Zhilin Wang, Elena Rastorgueva
Abstract: Online forums and social media platforms provide noisy but valuable data every day. In this paper, we propose a novel end-to-end neural network-based user embedding system, Author2Vec. The model incorporates sentence representations generated by BERT (Bidirectional Encoder Representations from Transformers) with a novel unsupervised pre-training objective, authorship classification, to produce better user embedding that encodes useful user-intrinsic properties. This user embedding system was pre-trained on post data of 10k Reddit users and was analyzed and evaluated on two user classification benchmarks: depression detection and personality classification, in which the model proved to outperform traditional count-based and prediction-based methods. We substantiate that Author2Vec successfully encoded useful user attributes and the generated user embedding performs well in downstream classification tasks without further finetuning.
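
A minimal sketch of building a user embedding by pooling BERT sentence vectors over a user's posts; Author2Vec additionally pre-trains with an unsupervised authorship-classification objective, which this sketch omits:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def user_embedding(posts):
    # Mean-pool the [CLS] vectors of a user's posts into one user vector.
    batch = tok(posts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**batch).last_hidden_state[:, 0]   # (n_posts, hidden)
    return cls.mean(dim=0)                               # (hidden,)

vec = user_embedding(["Just finished a great hike.", "Trail report posted."])
print(vec.shape)   # torch.Size([768])
```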

8. Predicting Unplanned Readmissions with Highly Unstructured Data [PDF] Back to Contents
  Constanza Fierro, Jorge Pérez, Javier Mora
Abstract: Deep learning techniques have been successfully applied to predict unplanned readmissions of patients in medical centers. The training data for these models is usually based on historical medical records that contain a significant amount of free-text from admission reports, referrals, exam notes, etc. Most of the models proposed so far are tailored to English text data and assume that electronic medical records follow standards common in developed countries. These two characteristics make them difficult to apply in developing countries that do not necessarily follow international standards for registering patient information, or that store text information in languages other than English. In this paper we propose a deep learning architecture for predicting unplanned readmissions that consumes data that is significantly less structured compared with previous models in the literature. We use it to present the first results for this task in a large clinical dataset that mainly contains Spanish text data. The dataset is composed of almost 10 years of records in a Chilean medical center. On this dataset, our model achieves results that are comparable to some of the most recent results obtained in US medical centers for the same task (0.76 AUROC).

9. Cost-Sensitive BERT for Generalisable Sentence Classification with Imbalanced Data [PDF] Back to Contents
  Harish Tayyar Madabushi, Elena Kochkina, Michael Castelle
Abstract: The automatic identification of propaganda has gained significance in recent years due to technological and social changes in the way news is generated and consumed. That this task can be addressed effectively using BERT, a powerful new architecture which can be fine-tuned for text classification tasks, is not surprising. However, propaganda detection, like other tasks that deal with news documents and other forms of decontextualized social communication (e.g. sentiment analysis), inherently deals with data whose categories are simultaneously imbalanced and dissimilar. We show that BERT, while capable of handling imbalanced classes with no additional data augmentation, does not generalise well when the training and test data are sufficiently dissimilar (as is often the case with news sources, whose topics evolve over time). We show how to address this problem by providing a statistical measure of similarity between datasets and a method of incorporating cost-weighting into BERT when the training and test sets are dissimilar. We test these methods on the Propaganda Techniques Corpus (PTC) and achieve the second-highest score on sentence-level propaganda classification.
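
Cost-weighting a cross-entropy loss over BERT outputs can be sketched as follows; the weight values here are hypothetical, whereas the paper derives its weighting from class imbalance and the similarity between training and test sets:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Hypothetical cost weights: up-weight the rare (e.g. propaganda) class.
class_weights = torch.tensor([1.0, 5.0])

batch = tokenizer(["an ordinary news sentence"], return_tensors="pt")
labels = torch.tensor([1])
logits = model(**batch).logits                       # (1, 2)
loss = F.cross_entropy(logits, labels, weight=class_weights)
loss.backward()                                      # one cost-weighted update step
```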

10. Finnish Language Modeling with Deep Transformer Models [PDF] Back to Contents
  Abhilash Jain
Abstract: Transformers have recently taken the center stage in language modeling, after LSTMs were considered the dominant model architecture for a long time. In this project, we investigate the performance of the Transformer architectures BERT and Transformer-XL for the language modeling task. We use a sub-word model setting with the Finnish language and compare it to the previous state-of-the-art (SOTA) LSTM model. BERT achieves a pseudo-perplexity score of 14.5, which is the first such measure achieved as far as we know. Transformer-XL improves upon the perplexity score to 73.58, which is 27% better than the LSTM model.
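
Pseudo-perplexity for a masked LM is typically computed by masking each position in turn and scoring the gold token; below is a sketch under that assumption (the model name is a stand-in for the paper's Finnish sub-word BERT):

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
mlm.eval()

def pseudo_perplexity(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    nll, n = 0.0, 0
    for i in range(1, len(ids) - 1):        # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id       # mask one position at a time
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        nll -= torch.log_softmax(logits, dim=-1)[ids[i]].item()
        n += 1
    return math.exp(nll / n)                # exponentiated mean negative log-prob

print(pseudo_perplexity("Hyvää huomenta, maailma."))
```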

11. Predicting Legal Proceedings Status: an Approach Based on Sequential Text Data [PDF] Back to Contents
  Felipe Maia Polo, Itamar Ciochetti, Emerson Bertolo
Abstract: Machine learning applications in the legal field are numerous and diverse. In order to contribute to both the machine learning community and the legal community, we have made efforts to create a model suited to the classification of text sequences, valuing the interpretability of the results. The purpose of this paper is to classify legal proceedings into three possible status classes: (i) archived proceedings, (ii) active proceedings, and (iii) suspended proceedings. Our approach combines natural language processing with supervised and unsupervised deep learning models, and it performed remarkably well in the classification task. Furthermore, by applying interpretability tools, we gained some insights into the patterns learned by the neural network.

12. Forensic Authorship Analysis of Microblogging Texts Using N-Grams and Stylometric Features [PDF] Back to Contents
  Nicole Mariah Sharon Belvisi, Naveed Muhammad, Fernando Alonso-Fernandez
Abstract: In recent years, messages and text posted on the Internet have been used in criminal investigations. Unfortunately, the authorship of many of them remains unknown. In some channels, the problem of establishing authorship may be even harder, since the length of digital texts is limited to a certain number of characters. In this work, we aim at identifying the authors of tweet messages, which are limited to 280 characters. We evaluate popular features traditionally employed in authorship attribution, which capture properties of the writing style at different levels. We use for our experiments a self-captured database of 40 users, with 120 to 200 tweets per user. Results using this small set are promising, with the different features providing a classification accuracy between 92% and 98.5%. These results are competitive in comparison to existing studies which employ short texts such as tweets or SMS.
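
A minimal sketch of one classic feature family evaluated here, character n-grams, paired with an SVM classifier; the n-gram range and classifier choice are illustrative assumptions, not necessarily the paper's configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy tweets from two "authors"; character n-grams capture low-level
# stylistic habits (punctuation, abbreviations, casing).
tweets = [
    "lol see u there!!", "omg cant wait 4 2nite",            # author a1
    "Indeed; I shall attend.", "A most agreeable proposal.",  # author a2
]
authors = ["a1", "a1", "a2", "a2"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # character 2-4 grams
    LinearSVC(),
)
clf.fit(tweets, authors)
print(clf.predict(["c u there lol"]))
```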

13. VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [PDF] Back to Contents
  Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu
Abstract: We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels. In order to address our new multimodal inference task, a model is required to possess sophisticated reasoning skills, from surface-level grounding (e.g., identifying objects and characters in the video) to in-depth commonsense reasoning (e.g., inferring causal relations of events in the video). We present a detailed analysis of the dataset and an extensive evaluation over many strong baselines, providing valuable insights on the challenges of this new task.

14. Heavy-tailed Representations, Text Polarity Classification & Data Augmentation [PDF] Back to Contents
  Hamid Jalalzai, Pierre Colombo, Chloé Clavel, Eric Gaussier, Giovanna Varni, Emmanuel Vignon, Anne Sabourin
Abstract: The dominant approaches to text representation in natural language rely on learning embeddings on massive corpora which have convenient properties such as compositionality and distance preservation. In this paper, we develop a novel method to learn a heavy-tailed embedding with desirable regularity properties regarding the distributional tails, which makes it possible to analyze points far away from the bulk of the distribution using the framework of multivariate extreme value theory. In particular, a classifier dedicated to the tails of the proposed embedding is obtained, whose performance outperforms the baseline. This classifier exhibits a scale-invariance property which we leverage by introducing a novel text generation method for label-preserving dataset augmentation. Numerical experiments on synthetic and real text data demonstrate the relevance of the proposed framework and confirm that this method generates meaningful sentences with a controllable attribute, e.g. positive or negative sentiment.
