Contents
1. SuperPAL: Supervised Proposition ALignment for Multi-Document Summarization and Derivative Sub-Tasks [PDF] Abstract
2. Extracting Semantic Concepts and Relations from Scientific Publications by Using Deep Learning [PDF] Abstract
3. Semantic Sentiment Analysis Based on Probabilistic Graphical Models and Recurrent Neural Network [PDF] Abstract
4. PNEL: Pointer Network based End-To-End Entity Linking over Knowledge Graphs [PDF] Abstract
5. Twitter Corpus of the #BlackLivesMatter Movement And Counter Protests: 2013 to 2020 [PDF] Abstract
6. Mapping Researchers with PeopleMap [PDF] Abstract
Abstracts
1. SuperPAL: Supervised Proposition ALignment for Multi-Document Summarization and Derivative Sub-Tasks [PDF] back to contents
Ori Ernst, Ori Shapira, Ramakanth Pasunuru, Michael Lepioshkin, Jacob Goldberger, Mohit Bansal, Ido Dagan
Abstract: Multi-document summarization (MDS) is a challenging task, often decomposed into subtasks of salience and redundancy detection, followed by generation. While alignment of spans between reference summaries and source documents has been leveraged for training component tasks, the underlying alignment step was never independently addressed or evaluated. We advocate developing high-quality source-reference alignment algorithms that can be applied to recent large-scale datasets to obtain useful "silver", i.e. approximate, training data. As a first step, we present an annotation methodology by which we create gold-standard development and test sets for summary-source alignment, and suggest its utility for tuning and evaluating effective alignment algorithms, as well as for properly evaluating MDS subtasks. Second, we introduce a new large-scale alignment dataset for training, with which an automatic alignment model was trained. This aligner achieves higher coherency with the reference summary than previous aligners used for summarization, and yields significantly higher ROUGE results when replacing a simpler aligner in a competitive summarization model. Finally, we release three additional datasets (for salience, clustering, and generation), naturally derived from our alignment datasets. Furthermore, these datasets can be derived automatically from any summarization dataset after extracting alignments with our trained aligner. Hence, they can be utilized for training summarization sub-tasks.
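To make the summary-source alignment task concrete, the sketch below greedily aligns each summary sentence to the source sentence with the highest unigram-overlap F1. This is an illustrative lexical baseline of the kind the paper's supervised aligner is meant to outperform, not SuperPAL itself; all function names and the threshold value are hypothetical.

```python
def unigram_f1(a, b):
    """Symmetric ROUGE-1-style overlap F1 between two whitespace-tokenized strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(tb), overlap / len(ta)
    return 2 * p * r / (p + r)

def align_summary_to_source(summary_sents, source_sents, threshold=0.2):
    """For each summary sentence, point at the best-matching source sentence,
    keeping only alignments above a minimum overlap threshold."""
    alignments = []
    for s in summary_sents:
        score, idx = max((unigram_f1(s, d), i) for i, d in enumerate(source_sents))
        if score >= threshold:
            alignments.append((s, idx, score))
    return alignments
```

A trained aligner replaces `unigram_f1` with a learned scoring model over proposition pairs; the greedy selection loop stays structurally the same.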
2. Extracting Semantic Concepts and Relations from Scientific Publications by Using Deep Learning [PDF] back to contents
Fatima N. AL-Aswadi, Huah Yong Chan, Keng Hoon Gan
Abstract: With the large volume of unstructured data that grows constantly on the web, the motivation for representing the knowledge in this data in machine-understandable form is increasing. Ontology is one of the major cornerstones of representing information in a more meaningful way on the Semantic Web. Current ontology repositories are quite limited in either scope or currentness. In addition, current ontology extraction systems have many shortcomings and drawbacks, such as using small datasets, relying on a large number of predefined patterns to extract semantic relations, and extracting only a few types of relations. The aim of this paper is to introduce a proposal for automatically extracting semantic concepts and relations from scientific publications. This paper suggests new types of semantic relations and argues for using deep learning (DL) models for semantic relation extraction.
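For context on the pattern-based extraction the abstract critiques, a minimal Hearst-style extractor for one classic pattern ("X such as Y1 and Y2") can be sketched as follows. This is an illustrative baseline, not the authors' DL system, and the pattern and names are hypothetical.

```python
import re

# One classic Hearst pattern: "X such as Y1, Y2 and Y3" -> (X, Yi) hyponym pairs.
SUCH_AS = re.compile(r"(\w+(?: \w+)?) such as ((?:\w+(?:, | and )?)+)")

def extract_hyponyms(text):
    """Return (hypernym, hyponym) pairs matched by the 'such as' pattern."""
    pairs = []
    for m in SUCH_AS.finditer(text):
        hypernym = m.group(1)
        hyponyms = re.split(r", | and ", m.group(2))
        pairs.extend((hypernym, h) for h in hyponyms if h)
    return pairs
```

Each new relation type requires hand-writing another such pattern, which is exactly the scaling problem that motivates replacing patterns with learned DL extractors.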
3. Semantic Sentiment Analysis Based on Probabilistic Graphical Models and Recurrent Neural Network [PDF] back to contents
Ukachi Osisiogu
Abstract: Sentiment analysis is the task of classifying documents based on the sentiments expressed in textual form; this can be achieved using lexical and semantic methods. The purpose of this study is to investigate the use of semantics to perform sentiment analysis based on probabilistic graphical models and recurrent neural networks. In the empirical evaluation, the classification performance of the graphical models was compared with some traditional machine learning classifiers and a recurrent neural network. The datasets used for the experiments were the IMDB movie reviews, Amazon Consumer Product reviews, and Twitter Review datasets. From this empirical study, we conclude that the inclusion of semantics in sentiment analysis tasks can greatly improve the performance of a classifier, as the semantic feature extraction methods reduce uncertainties in classification, resulting in more accurate predictions.
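As a reference point for the probabilistic-graphical-model side, the simplest such model for text classification is multinomial Naive Bayes, sketched below with add-one smoothing. This is illustrative of the model family, not the study's exact setup or features.

```python
import math
from collections import Counter

class NaiveBayesSentiment:
    """Multinomial Naive Bayes over unigrams with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.priors = Counter(labels)                      # class frequencies
        self.word_counts = {c: Counter() for c in self.priors}
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts for w in self.word_counts[c]}
        return self

    def predict(self, doc):
        n_docs = sum(self.priors.values())

        def log_score(c):
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            s = math.log(self.priors[c] / n_docs)
            for w in doc.lower().split():
                s += math.log((self.word_counts[c][w] + 1) / total)
            return s

        return max(self.priors, key=log_score)
```

Semantic feature extraction, as the abstract argues, would replace the raw unigrams here with concept-level features before the same probabilistic scoring.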
4. PNEL: Pointer Network based End-To-End Entity Linking over Knowledge Graphs [PDF] back to contents
Debayan Banerjee, Debanjan Chaudhuri, Mohnish Dubey, Jens Lehmann
Abstract: Question Answering systems are generally modelled as a pipeline consisting of a sequence of steps. In such a pipeline, Entity Linking (EL) is often the first step. Several EL models first perform span detection and then entity disambiguation. In such models, errors from the span detection phase cascade to later steps and result in a drop in overall accuracy. Moreover, the lack of gold entity spans in training data is a limiting factor for span detector training. Hence the movement began towards end-to-end EL models, in which no separate span detection step is involved. In this work we present a novel approach to end-to-end EL by applying the popular Pointer Network model, which achieves competitive performance. We demonstrate this in our evaluation over three datasets on the Wikidata Knowledge Graph.
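The core mechanism of a pointer network is selecting an output by attending over input positions rather than over a fixed vocabulary. A bare-bones (untrained, illustrative) pointer step over precomputed encoder vectors could look like the sketch below; the actual PNEL architecture is far richer, and all names here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def pointer_step(query, encoder_states):
    """Score each input position by dot-product attention and 'point' at the
    argmax -- how a pointer network picks a candidate (e.g. an entity
    candidate) directly from its input sequence."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

Because the output is an index into the input, the model needs no separate span detection stage: span choice and disambiguation collapse into one pointing decision.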
5. Twitter Corpus of the #BlackLivesMatter Movement And Counter Protests: 2013 to 2020 [PDF] back to contents
Salvatore Giorgi, Sharath Chandra Guntuku, Muhammad Rahman, McKenzie Himelein-Wachowiak, Amy Kwarteng, Brenda Curtis
Abstract: Black Lives Matter (BLM) is a grassroots movement protesting violence towards Black individuals and communities, with a focus on police brutality. The movement has gained significant media and political attention following the killings of Ahmaud Arbery, Breonna Taylor, and George Floyd and the shooting of Jacob Blake in 2020. Due to its decentralized nature, the #BlackLivesMatter social media hashtag has come to both represent the movement and be used as a call to action. Similar hashtags have appeared to counter the BLM movement, such as #AllLivesMatter and #BlueLivesMatter. We introduce a data set of 41.8 million tweets from 10 million users that contain one of the following keywords: BlackLivesMatter, AllLivesMatter, or BlueLivesMatter. This data set contains all currently available tweets from the beginning of the BLM movement in 2013 to June 2020. We summarize the data set and show temporal trends in the use of both the BlackLivesMatter keyword and keywords associated with counter movements. In the past, similarly themed, though much smaller in scope, BLM data sets have been used for studying discourse in protest and counter-protest movements, predicting retweets, examining the role of social media in protest movements, and exploring narrative agency. This paper open-sources a large-scale data set to facilitate research in the areas of computational social science, communications, political science, natural language processing, and machine learning.
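The collection criterion described above is keyword matching; a minimal case-insensitive filter of that kind (illustrative only, not the authors' collection code) is straightforward:

```python
# The three tracked keywords, matched case-insensitively and with or
# without a leading '#'.
KEYWORDS = ("blacklivesmatter", "alllivesmatter", "bluelivesmatter")

def matches_keywords(tweet, keywords=KEYWORDS):
    """True if the tweet text mentions any tracked keyword."""
    text = tweet.lower()
    return any(k in text for k in keywords)

def filter_tweets(tweets):
    """Keep only tweets that match at least one tracked keyword."""
    return [t for t in tweets if matches_keywords(t)]
```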
6. Mapping Researchers with PeopleMap [PDF] back to contents
Jon Saad-Falcon, Omar Shaikh, Zijie J. Wang, Austin P. Wright, Sasha Richardson, Duen Horng Chau
Abstract: Discovering research expertise at universities can be a difficult task. Directories routinely become outdated, and few help in visually summarizing researchers' work or supporting the exploration of shared interests among researchers. This results in lost opportunities for both internal and external entities to discover new connections, nurture research collaboration, and explore the diversity of research. To address this problem, at Georgia Tech, we have been developing PeopleMap, an open-source interactive web-based tool that uses natural language processing (NLP) to create visual maps for researchers based on their research interests and publications. Requiring only the researchers' Google Scholar profiles as input, PeopleMap generates and visualizes embeddings for the researchers, significantly reducing the need for manual curation of publication information. To encourage and facilitate easy adoption and extension of PeopleMap, we have open-sourced it under the permissive MIT license at this https URL. PeopleMap has received positive feedback and enthusiasm for expanding its adoption across Georgia Tech.
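The core idea of embedding researchers from their publication text can be sketched very simply: build a bag-of-words vector per researcher and compare researchers by cosine similarity. This is a hedged illustration of the idea, not PeopleMap's actual pipeline (which uses learned NLP embeddings); the function names are hypothetical.

```python
import math
from collections import Counter

def profile_vector(publication_titles):
    """Bag-of-words 'embedding' of a researcher built from paper titles."""
    vec = Counter()
    for title in publication_titles:
        vec.update(title.lower().split())
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Projecting such vectors down to two dimensions (e.g. with PCA or UMAP) yields the kind of visual researcher map the tool produces.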
Note: the cover image is a word cloud of the paper titles.