Contents
4. Comparative Computational Analysis of Global Structure in Canonical, Non-Canonical and Non-Literary Texts [PDF]
15. Table2Charts: Learning Shared Representations for Recommending Charts on Multi-dimensional Data [PDF]
16. Complicating the Social Networks for Better Storytelling: An Empirical Study of Chinese Historical Text and Novel [PDF]
Abstracts
1. Learning from students' perception on professors through opinion mining [PDF]
Vladimir Vargas-Calderón, Juan S. Flórez, Leonel F. Ardila, Nicolas Parra-A., Jorge E. Camargo, Nelson Vargas
Abstract: Students' perception of classes, measured through their opinions on teaching surveys, makes it possible to identify deficiencies and problems both in the environment and in the learning methodologies. The purpose of this paper is to study those opinions through sentiment analysis, using natural language processing (NLP) and machine learning (ML) techniques, in order to identify topics that are relevant for students and to predict the associated sentiment via polarity analysis. To this end, two algorithms are implemented, trained and tested: one predicts the associated sentiment and the other the relevant topics of such opinions. The combination of both approaches is then useful for identifying specific properties of the students' opinions associated with each sentiment label (positive, negative or neutral) and topic. Furthermore, we explore the possibility of carrying out students' perception surveys without closed questions, relying on the information that students can provide through open questions where they express their opinions about their classes.
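The pairing the abstract describes — a polarity label plus a topic label per opinion, with opinions then grouped by the pair — can be sketched in a few lines of Python. The lexicons below are invented toy stand-ins, not the paper's trained NLP/ML models:

```python
# Toy sketch: attach a (topic, sentiment) pair to each free-text opinion and
# group opinions by that pair. POLARITY and TOPICS are invented toy lexicons,
# not the trained classifiers from the paper.
POLARITY = {"great": 1, "clear": 1, "boring": -1, "confusing": -1}
TOPICS = {"lectures": "methodology", "slides": "materials", "room": "environment"}

def label_opinion(text):
    words = text.lower().split()
    score = sum(POLARITY.get(w, 0) for w in words)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    topic = next((TOPICS[w] for w in words if w in TOPICS), "other")
    return topic, sentiment

def group_opinions(opinions):
    groups = {}
    for text in opinions:
        groups.setdefault(label_opinion(text), []).append(text)
    return groups
```

Grouping by (topic, sentiment) is what lets one inspect, say, every negative opinion about the environment; the paper obtains the two labels with learned models rather than lexicons.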
2. JokeMeter at SemEval-2020 Task 7: Convolutional humor [PDF]
Martin Docekal, Martin Fajcik, Josef Jon, Pavel Smrz
Abstract: This paper describes our system designed for humor evaluation within SemEval-2020 Task 7. The system is based on a convolutional neural network architecture. We investigate the system on the official dataset, and we provide more insight into the model itself to see how its learned inner features look.
3. End-to-End Neural Transformer Based Spoken Language Understanding [PDF]
Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann
Abstract: Spoken language understanding (SLU) refers to the process of inferring the semantic information from audio signals. While neural transformers consistently deliver the best performance among state-of-the-art neural architectures in the field of natural language processing (NLP), their merits in a closely related field, i.e., spoken language understanding (SLU), have not been investigated. In this paper, we introduce an end-to-end neural transformer-based SLU model that can predict the variable-length domain, intent, and slot vectors embedded in an audio signal with no intermediate token prediction architecture. This new architecture leverages the self-attention mechanism, by which the audio signal is transformed into various subspaces, allowing it to extract the semantic context implied by an utterance. Our end-to-end transformer SLU predicts the domains, intents and slots in the Fluent Speech Commands dataset with accuracies of 98.1%, 99.6%, and 99.6%, respectively, and outperforms SLU models that leverage a combination of recurrent and convolutional neural networks by 1.4%, while the size of our model is 25% smaller than that of these architectures. Additionally, due to the independent sub-space projections in the self-attention layer, the model is highly parallelizable, which makes it a good candidate for on-device SLU.
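The self-attention mechanism credited here for the model's parallelism can be illustrated with a minimal single-head scaled dot-product sketch over toy "audio frame" vectors. Using identity Q/K/V projections is a deliberate simplification for brevity; a real transformer learns separate projection matrices per head:

```python
import math

# Minimal single-head scaled dot-product self-attention over a toy sequence of
# "audio frame" vectors. Q = K = V = X (identity projections) in this sketch;
# a trained model would learn distinct projection matrices.

def matmul(A, B):
    # Plain-Python matrix product: rows of A against columns of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X, d):
    X_t = [list(col) for col in zip(*X)]              # X transposed
    scores = matmul(X, X_t)                           # X @ X^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, X)                         # convex mixes of frames
```

Each output frame is a weighted mixture of all input frames, and the rows can be computed independently — which is the source of the parallelism the abstract highlights.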
4. Comparative Computational Analysis of Global Structure in Canonical, Non-Canonical and Non-Literary Texts [PDF]
Mahdi Mohseni, Volker Gast, Christoph Redies
Abstract: This study investigates global properties of literary and non-literary texts. Within the literary texts, a distinction is made between canonical and non-canonical works. The central hypothesis of the study is that the three text types (non-literary, literary/canonical and literary/non-canonical) exhibit systematic differences with respect to structural design features as correlates of aesthetic responses in readers. To investigate these differences, we compiled a corpus containing texts of the three categories of interest, the Jena Textual Aesthetics Corpus. Two aspects of global structure are investigated, variability and self-similar (fractal) patterns, which reflect long-range correlations along texts. We use four types of basic observations, (i) the frequency of POS-tags per sentence, (ii) sentence length, (iii) lexical diversity in chunks of text, and (iv) the distribution of topic probabilities in chunks of texts. These basic observations are grouped into two more general categories, (a) the low-level properties (i) and (ii), which are observed at the level of the sentence (reflecting linguistic decoding), and (b) the high-level properties (iii) and (iv), which are observed at the textual level (reflecting comprehension). The basic observations are transformed into time series, and these time series are subject to multifractal detrended fluctuation analysis (MFDFA). Our results show that low-level properties of texts are better discriminators than high-level properties, for the three text types under analysis. Canonical literary texts differ from non-canonical ones primarily in terms of variability. Fractality seems to be a universal feature of text, more pronounced in non-literary than in literary texts. Beyond the specific results of the study, we intend to open up new perspectives on the experimental study of textual aesthetics.
5. Query Understanding via Intent Description Generation [PDF]
Ruqing Zhang, Jiafeng Guo, Yixing Fan, Yanyan Lan, Xueqi Cheng
Abstract: Query understanding is a fundamental problem in information retrieval (IR) that has attracted continuous attention over the past decades. Many different tasks have been proposed for understanding users' search queries, e.g., query classification or query clustering. However, understanding a search query at the intent class/cluster level is imprecise, since much detailed information is lost. As we may find in many benchmark datasets, e.g., TREC and SemEval, queries are often associated with a detailed description provided by human annotators which clearly describes the intent, to help evaluate the relevance of the documents. If a system could automatically generate a detailed and precise intent description for a search query, like human annotators do, that would indicate that much better query understanding has been achieved. In this paper, we therefore propose a novel Query-to-Intent-Description (Q2ID) task for query understanding. Unlike existing ranking tasks, which leverage the query and its description to compute the relevance of documents, Q2ID is a reverse task that aims to generate a natural language intent description based on both relevant and irrelevant documents of a given query. To address this new task, we propose a novel Contrastive Generation model, CtrsGen for short, which generates the intent description by contrasting the relevant documents with the irrelevant documents of a given query. We demonstrate the effectiveness of our model by comparing it with several state-of-the-art generation models on the Q2ID task. We discuss the potential usage of the Q2ID technique through an example application.
6. ETC-NLG: End-to-end Topic-Conditioned Natural Language Generation [PDF]
Ginevra Carbone, Gabriele Sarti
Abstract: Plug-and-play language models (PPLMs) enable topic-conditioned natural language generation by pairing large pre-trained generators with attribute models used to steer the predicted token distribution towards the selected topic. Despite their computational efficiency, PPLMs require large amounts of labeled texts to effectively balance generation fluency and proper conditioning, making them unsuitable for low-resource settings. We present ETC-NLG, an approach leveraging topic modeling annotations to enable fully-unsupervised End-to-end Topic-Conditioned Natural Language Generation over emergent topics in unlabeled document collections. We first test the effectiveness of our approach in a low-resource setting for Italian, evaluating the conditioning for both topic models and gold annotations. We then perform a comparative evaluation of ETC-NLG for Italian and English using a parallel corpus. Finally, we propose an automatic approach to estimate the effectiveness of conditioning on the generated utterances.
7. Is this sentence valid? An Arabic Dataset for Commonsense Validation [PDF]
Saja Tawalbeh, Mohammad AL-Smadi
Abstract: Commonsense understanding and validation remains a challenging task in the field of natural language understanding. Accordingly, several research papers have been published that study the ability of proposed systems to validate commonsense in text. In this paper, we present a benchmark Arabic dataset for commonsense understanding and validation, as well as baseline research and models trained using the same dataset. To the best of our knowledge, this dataset is the first in the field of Arabic-text commonsense validation. The dataset is distributed under the Creative Commons BY-SA 4.0 license and can be found on GitHub.
8. TabSim: A Siamese Neural Network for Accurate Estimation of Table Similarity [PDF]
Maryam Habibi, Johannes Starlinger, Ulf Leser
Abstract: Tables are a popular and efficient means of presenting structured information. They are used extensively in various kinds of documents including web pages. Tables display information as a two-dimensional matrix, the semantics of which is conveyed by a mixture of structure (rows, columns), headers, caption, and content. Recent research has started to consider tables as first-class objects, not just as an addendum to texts, yielding interesting results for problems like table matching, table completion, or value imputation. All of these problems inherently rely on an accurate measure of the semantic similarity of two tables. We present TabSim, a novel method to compute table similarity scores using deep neural networks. Conceptually, TabSim represents a table as a learned concatenation of embeddings of its caption, its content, and its structure. Given two tables in this representation, a Siamese neural network is trained to compute a score correlating with the tables' semantic similarity. To train and evaluate our method, we created a gold standard corpus consisting of 1500 table pairs extracted from biomedical articles and manually scored regarding their degree of similarity, and adopted two other corpora originally developed for a different yet similar task. Our evaluation shows that TabSim outperforms other table similarity measures on average by approx. 7 percentage points of F1-score in a binary similarity classification setting and by approx. 1.5 percentage points in a ranking scenario.
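The Siamese idea described here can be sketched as follows: each table is a concatenation of caption, content, and structure embeddings, both tables go through the *same* (shared-weight) encoder, and the pair gets one similarity score. The fixed toy vectors and the cosine scorer below stand in for the learned components of TabSim:

```python
import math

# Hypothetical sketch of the TabSim setup. encode() plays the role of the
# shared-weight twin encoder (here just list concatenation), and cosine()
# plays the role of the trained scoring head.

def encode(caption_vec, content_vec, structure_vec):
    # One vector per table: caption ++ content ++ structure embeddings.
    return caption_vec + content_vec + structure_vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tabsim_score(table_a, table_b):
    # Both tables pass through the same encoder -- the Siamese property.
    return cosine(encode(*table_a), encode(*table_b))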
9. Simple Unsupervised Similarity-Based Aspect Extraction [PDF]
Danny Suarez Vargas, Lucas R. C. Pessutto, Viviane Pereira Moreira
Abstract: In the context of sentiment analysis, there has been growing interest in performing a finer granularity analysis focusing on the specific aspects of the entities being evaluated. This is the goal of Aspect-Based Sentiment Analysis (ABSA) which basically involves two tasks: aspect extraction and polarity detection. The first task is responsible for discovering the aspects mentioned in the review text and the second task assigns a sentiment orientation (positive, negative, or neutral) to that aspect. Currently, the state-of-the-art in ABSA consists of the application of deep learning methods such as recurrent, convolutional and attention neural networks. The limitation of these techniques is that they require a lot of training data and are computationally expensive. In this paper, we propose a simple approach called SUAEx for aspect extraction. SUAEx is unsupervised and relies solely on the similarity of word embeddings. Experimental results on datasets from three different domains have shown that SUAEx achieves results that can outperform the state-of-the-art attention-based approach at a fraction of the time.
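Similarity-only aspect extraction in the spirit of SUAEx can be sketched directly: each candidate word is assigned to the closest aspect category purely by cosine similarity of word embeddings, with no training step. The two-dimensional embeddings and aspect names below are invented toy data, not the vectors or categories used in the paper:

```python
import math

# Toy unsupervised aspect assignment: nearest aspect vector by cosine
# similarity. EMB and ASPECTS are hand-made illustrations, not real embeddings.
EMB = {
    "battery": [0.9, 0.1], "charge": [0.8, 0.2],
    "screen": [0.1, 0.9], "display": [0.2, 0.8],
}
ASPECTS = {"power": [1.0, 0.0], "visual": [0.0, 1.0]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def assign_aspect(word):
    vec = EMB[word]
    return max(ASPECTS, key=lambda a: cosine(vec, ASPECTS[a]))
```

Because the only ingredient is the embedding space, the method needs no labeled reviews — which is exactly the contrast the abstract draws against data-hungry neural ABSA models.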
10. Conceptualized Representation Learning for Chinese Biomedical Text Mining [PDF]
Ningyu Zhang, Qianghuai Jia, Kangping Yin, Liang Dong, Feng Gao, Nengwei Hua
Abstract: Biomedical text mining is becoming increasingly important as the number of biomedical documents and the amount of web data rapidly grow. Recently, word representation models such as BERT have gained popularity among researchers. However, it is difficult to estimate their performance on datasets containing biomedical texts, as the word distributions of general and biomedical corpora are quite different. Moreover, the medical domain has long-tail concepts and terminologies that are difficult to learn via language models. For Chinese biomedical text, this is even more difficult due to its complex structure and the variety of phrase combinations. In this paper, we investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora and propose a novel conceptualized representation learning approach. We also release a new Chinese Biomedical Language Understanding Evaluation benchmark (ChineseBLUE). We examine the effectiveness of Chinese pre-trained models: BERT, BERT-wwm, RoBERTa, and our approach. Experimental results on the benchmark show that our approach can bring significant gains. We release the pre-trained model on GitHub: this https URL.
摘要:生物医学文本挖掘正在成为生物医学文献和网络数据快速增长的数量越来越重要。近日,单词表示模型如BERT已经获得了研究人员的欢迎。然而,很难为一般和生物医学语料的字分布有很大的不同,以评估其对生物医学包含文本数据集的性能。此外,在医疗领域具有长尾概念,难以通过语言模型学习术语。对于中国生物医药的文字,是比较困难的,由于其复杂的结构和各种短语组合。在本文中,我们研究了最近推出的预训练的语言模型BERT如何适应中国生物医学语料库,提出了一种新的概念化表示学习方法。我们还推出了新的中国生物医学语言理解评价标准(\ textbf {ChineseBLUE})。我们研究中国的前训练的模型的有效性:BERT,BERT-WWM,罗伯塔,和我们的做法。基准的结果表明,我们的做法可能会带来显著增益实验结果。我们发布在GitHub上预先训练模式:此HTTPS URL。
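The abstract compares BERT with BERT-wwm, which differ in masking granularity: during pre-training, BERT-wwm masks all subword pieces of a word together rather than masking pieces independently. A minimal pure-Python sketch of that grouping step, assuming WordPiece-style "##" continuation markers; this illustrates the idea only and is not the released models' code:

```python
# Sketch of whole-word masking (the idea behind BERT-wwm), assuming
# WordPiece-style tokens where a "##" prefix marks a subword continuation.
# Illustration only -- not the authors' implementation.

def whole_word_spans(tokens):
    """Group token indices so subword pieces stay with their word."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and spans:
            spans[-1].append(i)   # continuation piece joins the previous word
        else:
            spans.append([i])     # a new word starts here
    return spans

def mask_word(tokens, word_index, mask_token="[MASK]"):
    """Mask every piece of one whole word, as whole-word masking does."""
    spans = whole_word_spans(tokens)
    out = list(tokens)
    for i in spans[word_index]:
        out[i] = mask_token
    return out

tokens = ["bio", "##med", "##ical", "text", "mining"]
print(whole_word_spans(tokens))  # [[0, 1, 2], [3], [4]]
print(mask_word(tokens, 0))      # ['[MASK]', '[MASK]', '[MASK]', 'text', 'mining']
```

Plain BERT could mask "##med" alone, leaving "bio" and "##ical" as strong hints; masking the whole span forces the model to learn the full term, which is why whole-word masking is thought to help with long-tail biomedical terminology.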
11. Contextualized moral inference [PDF] 返回目录
Jing Yi Xie, Graeme Hirst, Yang Xu
Abstract: Developing moral awareness in intelligent systems has shifted from a topic of philosophical inquiry to a critical and practical issue in artificial intelligence over the past decades. However, automated inference of everyday moral situations remains an under-explored problem. We present a text-based approach that predicts people's intuitive judgment of moral vignettes. Our methodology builds on recent work in contextualized language models and textual inference of moral sentiment. We show that a contextualized representation offers a substantial advantage over alternative representations based on word embeddings and emotion sentiment in inferring human moral judgment, evaluated and reflected in three independent datasets from moral psychology. We discuss the promise and limitations of our approach toward automated textual moral reasoning.
12. A Baseline Analysis for Podcast Abstractive Summarization [PDF] 返回目录
Chujie Zheng, Harry Jiannan Wang, Kunpeng Zhang, Ling Fan
Abstract: The podcast summary, an important factor affecting end-users' listening decisions, has often been considered a critical feature in podcast recommendation systems, as well as in many downstream applications. Existing abstractive summarization approaches are mainly built on models fine-tuned on professionally edited texts such as CNN and DailyMail news. Unlike news, podcasts are often longer, more colloquial and conversational, and noisier, with content on commercials and sponsorship, which makes automatic podcast summarization extremely challenging. This paper presents a baseline analysis of podcast summarization using the Spotify Podcast Dataset provided by TREC 2020. It aims to help researchers understand current state-of-the-art pre-trained models and hence build a foundation for creating better models.
13. Measuring Pain in Sickle Cell Disease using Clinical Text [PDF] 返回目录
Amanuel Alambo, Ryan Andrew, Sid Gollarahalli, Jacqueline Vaughn, Tanvi Banerjee, Krishnaprasad Thirunarayan, Daniel Abrams, Nirmish Shah
Abstract: Sickle Cell Disease (SCD) is a hereditary disorder of red blood cells in humans. Complications such as pain, stroke, and organ failure occur in SCD as malformed, sickled red blood cells passing through small blood vessels get trapped. In particular, acute pain is known to be the primary symptom of SCD. The insidious and subjective nature of SCD pain leads to challenges in pain assessment among Medical Practitioners (MPs). Thus, accurate identification of markers of pain in patients with SCD is crucial for pain management. Classifying clinical notes of patients with SCD by their pain level enables MPs to give appropriate treatment. We propose a binary classification model to predict the pain relevance of clinical notes and a multiclass classification model to predict pain level. While our four binary machine learning (ML) classifiers are comparable in their performance, Decision Trees had the best performance for the multiclass classification task, achieving 0.70 in F-measure. Our results show the potential that clinical text analysis and machine learning offer for pain management in sickle cell patients.
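The 0.70 F-measure quoted above is the harmonic mean of precision and recall for a class. A minimal sketch of the computation for a single hypothetical pain-level class ("severe"), with invented labels purely for illustration:

```python
# F-measure (F1) for one class: harmonic mean of precision and recall.
# The labels below are made up for illustration, not the paper's data.

def f_measure(y_true, y_pred, positive):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = ["mild", "severe", "severe", "none", "severe"]
y_pred = ["mild", "severe", "none", "none", "severe"]
# precision = 2/2, recall = 2/3, so F = 0.8
print(f_measure(y_true, y_pred, "severe"))  # 0.8
```

For the paper's multiclass setting, per-class scores like this would typically be averaged (macro or weighted) to yield a single figure such as the reported 0.70.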
14. ICE-Talk: an Interface for a Controllable Expressive Talking Machine [PDF] 返回目录
Noé Tits, Kevin El Haddad, Thierry Dutoit
Abstract: ICE-Talk is an open-source web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for controllable TTS. Moreover, it is implemented as a module that can be used as part of a Human-Agent interaction.
15. Table2Charts: Learning Shared Representations for Recommending Charts on Multi-dimensional Data [PDF] 返回目录
Mengyu Zhou, Qingtao Li, Yuejiang Li, Shi Han, Dongmei Zhang
Abstract: It is common for people to create different types of charts to explore a multi-dimensional dataset (table). However, to build an intelligent assistant that recommends commonly composed charts, the fundamental problems of "multi-dialect" unification, imbalanced data, and open vocabulary must be addressed. In this paper, we propose the Table2Charts framework, which learns common patterns from a large corpus of (table, charts) pairs. Based on deep Q-learning with a copying mechanism and heuristic searching, Table2Charts performs table-to-sequence generation, where each sequence follows a chart template. On a large spreadsheet corpus with 196k tables and 306k charts, we show that Table2Charts can learn a shared representation of table fields so that tasks on different chart types mutually enhance each other. Table2Charts achieves >0.61 recall at top-3 and >0.49 recall at top-1 for both single-type and multi-type chart recommendation tasks.
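The recall figures above (>0.61 at top-3, >0.49 at top-1) follow the usual recall-at-k pattern: the fraction of tables for which a ground-truth chart appears among the model's k highest-ranked recommendations. A sketch with invented recommendations, simplified to one ground-truth chart per table (the paper's tables may have several):

```python
# Recall at top-k: fraction of cases where the ground-truth item appears
# in the first k ranked recommendations. Data below is illustrative only.

def recall_at_k(ranked_recommendations, ground_truth, k):
    hits = sum(
        1 for recs, truth in zip(ranked_recommendations, ground_truth)
        if truth in recs[:k]
    )
    return hits / len(ground_truth)

recs = [["bar", "line", "pie"], ["line", "scatter", "bar"], ["pie", "bar", "line"]]
truth = ["bar", "bar", "scatter"]
print(recall_at_k(recs, truth, 1))  # 1/3: only the first table's top pick is right
print(recall_at_k(recs, truth, 3))  # 2/3: the second table's truth is ranked third
```

Widening k can only raise recall, which is consistent with the top-3 number exceeding the top-1 number in the abstract.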
16. Complicating the Social Networks for Better Storytelling: An Empirical Study of Chinese Historical Text and Novel [PDF] 返回目录
Chenhan Zhang
Abstract: Digital humanities is an important subject because it enables developments in history, literature, and film. In this paper, we perform an empirical study of a Chinese historical text, Records of the Three Kingdoms (\textit{Records}), and a historical novel of the same story, Romance of the Three Kingdoms (\textit{Romance}). We employ natural language processing techniques to extract characters and their relationships. Then, we characterize the social networks and sentiments of the main characters in the historical text and the historical novel. We find that the social network in \textit{Romance} is more complex and dynamic than that of \textit{Records}, and the influence of the main characters differs. These findings shed light on the different styles of storytelling in the two literary genres and how the historical novel complicates the social networks of characters to enrich the literariness of the story.
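A common first step for character networks like those described above is to link two characters whenever they co-occur in the same sentence or passage, with edge weights counting co-occurrences. A toy sketch with invented passages; the paper's extraction relies on NLP tooling rather than this naive substring matching:

```python
# Build weighted character-network edges from passage-level co-occurrence.
# The passages and character list are invented for illustration.
from collections import Counter
from itertools import combinations

def cooccurrence_edges(passages, characters):
    """Count how often each pair of characters appears in the same passage."""
    edges = Counter()
    for passage in passages:
        present = sorted(c for c in characters if c in passage)
        for pair in combinations(present, 2):  # pairs come out sorted
            edges[pair] += 1
    return edges

passages = [
    "Liu Bei and Guan Yu swore brotherhood with Zhang Fei.",
    "Cao Cao pursued Liu Bei.",
    "Guan Yu rode alone past five passes.",
]
characters = ["Liu Bei", "Guan Yu", "Zhang Fei", "Cao Cao"]
print(cooccurrence_edges(passages, characters))
```

Network statistics such as density, degree, or dynamics over chapters can then be computed on these edges, which is the kind of comparison the study draws between \textit{Records} and \textit{Romance}.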