
[arXiv Papers] Computation and Language 2020-12-21

Contents

1. Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [PDF] Abstract
2. HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection [PDF] Abstract
3. An Empirical Study of Using Pre-trained BERT Models for Vietnamese Relation Extraction Task at VLSP 2020 [PDF] Abstract
4. Understood in Translation, Transformers for Domain Understanding [PDF] Abstract
5. ReINTEL Challenge 2020: A Multimodal Ensemble Model for Detecting Unreliable Information on Vietnamese SNS [PDF] Abstract
6. A Benchmark Arabic Dataset for Commonsense Explanation [PDF] Abstract
7. AdvExpander: Generating Natural Language Adversarial Examples by Expanding Text [PDF] Abstract
8. Deep Open Intent Classification with Adaptive Decision Boundary [PDF] Abstract
9. Regularized Attentive Capsule Network for Overlapped Relation Extraction [PDF] Abstract
10. Technical Progress Analysis Using a Dynamic Topic Model for Technical Terms to Revise Patent Classification Codes [PDF] Abstract
11. Mention Extraction and Linking for SQL Query Generation [PDF] Abstract
12. Attention-Based LSTM Network for COVID-19 Clinical Trial Parsing [PDF] Abstract
13. Leveraging Event Specific and Chunk Span features to Extract COVID Events from tweets [PDF] Abstract
14. Exploring Fluent Query Reformulations with Text-to-Text Transformers and Reinforcement Learning [PDF] Abstract
15. NeurST: Neural Speech Translation Toolkit [PDF] Abstract
16. Can Transformers Reason About Effects of Actions? [PDF] Abstract
17. Named Entity Recognition in the Legal Domain using a Pointer Generator Network [PDF] Abstract
18. Trying Bilinear Pooling in Video-QA [PDF] Abstract
19. Should I visit this place? Inclusion and Exclusion Phrase Mining from Reviews [PDF] Abstract
20. On Modality Bias in the TVQA Dataset [PDF] Abstract
21. End-to-End Speaker Diarization as Post-Processing [PDF] Abstract
22. Predicting Decisions in Language Based Persuasion Games [PDF] Abstract

Abstracts

1. Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [PDF] Back to contents
  Peng Shi, Patrick Ng, Zhiguo Wang, Henghui Zhu, Alexander Hanbo Li, Jun Wang, Cicero Nogueira dos Santos, Bing Xiang
Abstract: Recently, there has been significant interest in learning contextual representations for various NLP tasks by leveraging large-scale text corpora to train large neural language models with self-supervised learning objectives, such as Masked Language Modeling (MLM). However, based on a pilot study, we observe three issues with existing general-purpose language models when they are applied to text-to-SQL semantic parsers: they fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to produce pre-training data. The GAP model is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage the GAP model as a representation encoder obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.

2. HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection [PDF] Back to contents
  Binny Mathew, Punyajoy Saha, Seid Muhie Yimam, Chris Biemann, Pawan Goyal, Animesh Mukherjee
Abstract: Hate speech is a challenging issue plaguing online social media. While better models for hate speech detection are continuously being developed, there is little research on the bias and interpretability aspects of hate speech. In this paper, we introduce HateXplain, the first benchmark hate speech dataset covering multiple aspects of the issue. Each post in our dataset is annotated from three different perspectives: the basic, commonly used 3-class classification (i.e., hate, offensive, or normal), the target community (i.e., the community that has been the victim of hate speech/offensive speech in the post), and the rationales, i.e., the portions of the post on which the labelling decision (as hate, offensive, or normal) is based. We utilize existing state-of-the-art models and observe that even models that perform very well in classification do not score high on explainability metrics like model plausibility and faithfulness. We also observe that models which utilize the human rationales for training perform better in reducing unintended bias towards target communities. We have made our code and dataset public at this https URL

3. An Empirical Study of Using Pre-trained BERT Models for Vietnamese Relation Extraction Task at VLSP 2020 [PDF] Back to contents
  Pham Quang Nhat Minh
Abstract: In this paper, we present an empirical study of using pre-trained BERT models for the relation extraction task at the VLSP 2020 Evaluation Campaign. We applied two state-of-the-art BERT-based models: R-BERT and the BERT model with entity starts. For each model, we compared two pre-trained BERT models: FPTAI/vibert and NlpHUST/vibert4news. We found that NlpHUST/vibert4news significantly outperforms FPTAI/vibert on the Vietnamese relation extraction task. Finally, we proposed a simple ensemble model which combines R-BERT and BERT with entity starts. Our proposed ensemble model improved slightly over the two single models on the development data provided by the task organizers.

4. Understood in Translation, Transformers for Domain Understanding [PDF] Back to contents
  Dimitrios Christofidellis, Matteo Manica, Leonidas Georgopoulos, Hans Vandierendonck
Abstract: Knowledge acquisition is the essential first step of any Knowledge Graph (KG) application. This knowledge can be extracted from a given corpus (KG generation process) or specified from an existing KG (KG specification process). Focusing on domain-specific solutions, knowledge acquisition is a labor-intensive task usually orchestrated and supervised by subject matter experts. Specifically, the domain of interest is usually manually defined and then the needed generation or extraction tools are utilized to produce the KG. Herein, we propose a supervised machine learning method, based on Transformers, for domain definition of a corpus. We argue why such automated definition of the domain's structure is beneficial both in terms of construction time and quality of the generated graph. The proposed method is extensively validated on three public datasets (WebNLG, NYT and DocRED) by comparing it with two reference methods based on CNN and RNN models. The evaluation shows the efficiency of our model in this task. Focusing on scientific document understanding, we present a new health-domain dataset based on publications extracted from PubMed, and we successfully utilize our method on it. Lastly, we demonstrate how this work lays the foundation for fully automated and unsupervised KG generation.

5. ReINTEL Challenge 2020: A Multimodal Ensemble Model for Detecting Unreliable Information on Vietnamese SNS [PDF] Back to contents
  Nguyen Manh Duc Tuan, Pham Quang Nhat Minh
Abstract: In this paper, we present our methods for the unreliable information identification task at the VLSP 2020 ReINTEL Challenge. The task is to classify a piece of information into the reliable or unreliable category. We propose a novel multimodal ensemble model which combines two multimodal models to solve the task. In each multimodal model, we combined feature representations acquired from three different data types: text, images, and metadata. Multimodal features are derived from three neural networks and fused for classification. Experimental results showed that our proposed multimodal ensemble model improved over single models in terms of ROC AUC score. We obtained a 0.9445 AUC score on the private test set of the challenge.

6. A Benchmark Arabic Dataset for Commonsense Explanation [PDF] Back to contents
  Saja AL-Tawalbeh, Mohammad AL-Smadi
Abstract: Language comprehension and commonsense knowledge validation by machines are challenging tasks that are still under-researched and under-evaluated for Arabic text. In this paper, we present a benchmark Arabic dataset for commonsense explanation. The dataset consists of Arabic sentences that do not make sense, each paired with three choices from which to select the one that explains why the sentence is false. Furthermore, this paper presents baseline results to assist and encourage the future evaluation of research in this field. The dataset is distributed under the Creative Commons CC-BY-SA 4.0 license and can be found on GitHub

7. AdvExpander: Generating Natural Language Adversarial Examples by Expanding Text [PDF] Back to contents
  Zhihong Shao, Zitao Liu, Jiyong Zhang, Zhongqin Wu, Minlie Huang
Abstract: Adversarial examples are vital to expose the vulnerability of machine learning models. Despite the success of the most popular substitution-based methods, which substitute some characters or words in the original examples, substitution alone is insufficient to uncover all robustness issues of models. In this paper, we present AdvExpander, a method that crafts new adversarial examples by expanding text, which is complementary to previous substitution-based methods. We first utilize linguistic rules to determine which constituents to expand and what types of modifiers to expand with. We then expand each constituent by inserting an adversarial modifier searched from a CVAE-based generative model which is pre-trained on a large-scale corpus. To search for adversarial modifiers, we directly search for adversarial latent codes in the latent space without tuning the pre-trained parameters. To ensure that our adversarial examples are label-preserving for text matching, we also constrain the modifications with a heuristic rule. Experiments on three classification tasks verify the effectiveness of AdvExpander and the validity of our adversarial examples. AdvExpander crafts a new type of adversarial example by text expansion, thereby promising to reveal new robustness issues.

8. Deep Open Intent Classification with Adaptive Decision Boundary [PDF] Back to contents
  Hanlei Zhang, Hua Xu, Ting-En Lin
Abstract: Open intent classification is a challenging task in dialogue systems. On the one hand, we should ensure the classification quality of known intents. On the other hand, we need to identify the open (unknown) intent during testing. Current models are limited in finding the appropriate decision boundary to balance the performance on both known and open intents. In this paper, we propose a post-processing method to learn an adaptive decision boundary (ADB) for open intent classification. We first utilize the labeled known intent samples to pre-train the model. Then, we use the well-trained features to automatically learn adaptive spherical decision boundaries for each known intent. Specifically, we propose a new loss function to balance both the empirical risk and the open space risk. Our method needs no unknown samples and does not require modifying the model architecture. We find our approach is surprisingly insensitive to the amount of labeled data and the number of known intents. Extensive experiments on three benchmark datasets show that our method yields significant improvements compared with state-of-the-art methods.
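
To make the boundary-learning idea concrete, here is a minimal PyTorch sketch of spherical decision boundaries (our own illustration under assumed tensor shapes, not the authors' released code): each known intent keeps a centroid and a learnable radius, a test feature falling outside every ball is labeled as open, and the loss trades off samples lying outside their class ball (empirical risk) against balls that are larger than necessary (open space risk).

```python
import torch

def adb_predict(feats, centroids, radii, open_label=-1):
    """Assign each feature to the nearest intent centroid, or to the
    open intent if it falls outside that centroid's learned ball."""
    dists = torch.cdist(feats, centroids)   # (N, K) Euclidean distances
    min_dist, pred = dists.min(dim=1)       # nearest known intent
    pred[min_dist > radii[pred]] = open_label
    return pred

def adb_boundary_loss(feats, labels, centroids, radii):
    """Push a radius out when a known sample of its class falls outside
    the ball, and pull it in when the sample already fits, balancing
    empirical risk against open space risk."""
    d = (feats - centroids[labels]).norm(dim=1)
    r = radii[labels]
    outside = (d > r).float()
    return (outside * (d - r) + (1.0 - outside) * (r - d)).mean()
```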

9. Regularized Attentive Capsule Network for Overlapped Relation Extraction [PDF] Back to contents
  Tianyi Liu, Xiangyu Lin, Weijia Jia, Mingliang Zhou, Wei Zhao
Abstract: Distantly supervised relation extraction has been widely applied in knowledge base construction because it requires little human effort. However, the automatically constructed training datasets in distant supervision contain low-quality instances with noisy words and overlapped relations, posing great challenges to the accurate extraction of relations. To address this problem, we propose a novel Regularized Attentive Capsule Network (RA-CapNet) to better identify highly overlapped relations in each informal sentence. To discover multiple relation features in an instance, we embed multi-head attention into the capsule network as the low-level capsules, where the subtraction of two entity representations acts as a new form of relation query to select salient features regardless of their positions. To further discriminate overlapped relation features, we devise disagreement regularization to explicitly encourage diversity among both the multiple attention heads and the low-level capsules. Extensive experiments conducted on widely used datasets show that our model achieves significant improvements in relation extraction.
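
The disagreement regularizer can be pictured as a pairwise similarity penalty over head representations. A minimal sketch (our illustration, assuming a (num_heads, dim) layout, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def disagreement_penalty(heads):
    """Penalize pairwise cosine similarity between attention heads so
    they capture different relation features. heads: (H, dim)."""
    h = F.normalize(heads, dim=-1)
    sim = h @ h.t()                        # (H, H) cosine similarities
    off_diag = sim - torch.eye(h.size(0))  # drop self-similarity
    return off_diag.abs().sum() / (h.size(0) * (h.size(0) - 1))
```

Adding such a term to the training loss pushes heads (or capsules) apart, which is the diversity effect the abstract describes.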

10. Technical Progress Analysis Using a Dynamic Topic Model for Technical Terms to Revise Patent Classification Codes [PDF] Back to contents
  Mana Iwata, Yoshiro Matsuda, Yoshimasa Utsumi, Yoshitoshi Tanaka, Kazuhide Nakata
Abstract: Japanese patents are assigned a patent classification code, FI (File Index), that is unique to Japan. FI is a subdivision of the IPC, an international patent classification code, that is tailored to Japanese technology. FIs are revised to keep up with technological developments, and these revisions have already established more than 30,000 new FIs since 2006. However, the revisions require a great deal of time and effort, and because they are not automated, they are inefficient. Therefore, using machine learning to assist in the revision of patent classification codes (FI) should improve both accuracy and efficiency. This study analyzes patent documents from this new perspective of assisting the revision of patent classification codes with machine learning. To analyze time-series changes in patents, we used the dynamic topic model (DTM), an extension of latent Dirichlet allocation (LDA). Also, unlike English, Japanese requires morphological analysis. Patents contain many technical terms that are not used in everyday life, so morphological analysis using a common dictionary is not sufficient. Therefore, we used a technique for extracting technical terms from text and applied the extracted terms to the DTM. In this study, we traced the technological progress of the lighting class F21 over 14 years and compared it with the actual revisions of the patent classification codes. In other words, we extracted technical terms from Japanese patents and applied the DTM to determine the progress of Japanese technology. We then analyzed the results from the new perspective of revising patent classification codes with machine learning. As a result, we found that topics that were on the rise corresponded to new technologies.
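
For readers who want to try this kind of pipeline, gensim ships a DTM implementation. A toy sketch (the corpus below is a placeholder, not the study's patent data, and the topic count is arbitrary):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaSeqModel

# Yearly lists of documents, each already reduced to extracted
# technical terms (toy stand-in for the patent corpus).
docs_by_year = [
    [["semiconductor", "laser", "diode"], ["led", "phosphor", "lighting"]],
    [["oled", "panel", "lighting"], ["led", "driver", "circuit"]],
]
docs = [doc for year in docs_by_year for doc in year]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

dtm = LdaSeqModel(corpus=corpus, id2word=dictionary,
                  time_slice=[len(year) for year in docs_by_year],
                  num_topics=2)
print(dtm.print_topics(time=0))  # topic-term distributions in the first slice
```

Comparing a topic's term distributions across time slices is what reveals rising topics, i.e., candidate new technologies.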

11. Mention Extraction and Linking for SQL Query Generation [PDF] Back to contents
  Jianqiang Ma, Zeyu Yan, Shuai Pang, Yang Zhang, Jianping Shen
Abstract: On the WikiSQL benchmark, state-of-the-art text-to-SQL systems typically take a slot-filling approach by building several dedicated models for each type of slot. Such modularized systems are not only complex but also of limited capacity for capturing inter-dependencies among SQL clauses. To solve these problems, this paper proposes a novel extraction-linking approach, where a unified extractor recognizes all types of slot mentions appearing in the question sentence before a linker maps the recognized columns to the table schema to generate executable SQL queries. Trained with automatically generated annotations, the proposed method achieves first place on the WikiSQL benchmark.

12. Attention-Based LSTM Network for COVID-19 Clinical Trial Parsing [PDF] Back to contents
  Xiong Liu, Luca A. Finelli, Greg L. Hersch, Iya Khalil
Abstract: COVID-19 clinical trial design is a critical task in developing therapeutics for the prevention and treatment of COVID-19. In this study, we apply a deep learning approach to extract eligibility criteria variables from COVID-19 trials to enable quantitative analysis of trial design and optimization. Specifically, we train attention-based bidirectional Long Short-Term Memory (Att-BiLSTM) models and use the optimal model to extract entities (i.e., variables) from the eligibility criteria of COVID-19 trials. We compare the performance of Att-BiLSTM with a traditional ontology-based method. The results on a benchmark dataset show that Att-BiLSTM outperforms the ontology model: Att-BiLSTM achieves a precision of 0.942, recall of 0.810, and F1 of 0.871, while the ontology model only achieves a precision of 0.715, recall of 0.659, and F1 of 0.686. Our analyses demonstrate that Att-BiLSTM is an effective approach for characterizing patient populations in COVID-19 clinical trials.
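
As an illustration of what an attention-augmented BiLSTM tagger looks like, here is a generic PyTorch sketch (layer sizes are placeholders; the paper's exact architecture may differ):

```python
import torch
import torch.nn as nn

class AttBiLSTMTagger(nn.Module):
    """BiLSTM token tagger with a sentence-level attention context."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, tokens):                    # tokens: (B, T)
        h, _ = self.lstm(self.emb(tokens))        # (B, T, 2H)
        a = torch.softmax(self.attn(h), dim=1)    # attention over time steps
        ctx = (a * h).sum(dim=1, keepdim=True)    # (B, 1, 2H) context vector
        return self.out(h + ctx)                  # (B, T, num_tags) tag logits
```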

13. Leveraging Event Specific and Chunk Span features to Extract COVID Events from tweets [PDF] Back to contents
  Ayush Kaushal, Tejas Vaidhya
Abstract: Twitter has acted as an important source of information during disasters and pandemics, especially during the times of COVID-19. In this paper, we describe our system entry for WNUT 2020 Shared Task-3. The task was aimed at automating the extraction of a variety of COVID-19 related events from Twitter, such as individuals who recently contracted the virus, people with symptoms who were denied testing, and believed remedies against the infection. The system consists of separate multi-task models for slot-filling subtasks and sentence-classification subtasks, while leveraging the useful sentence-level information for the corresponding event. The system uses COVID-Twitter-BERT with attention-weighted pooling of candidate slot-chunk features to capture the useful information chunks. The system ranks 1st on the leaderboard with an F1 of 0.6598, without using any ensembles or additional datasets. The code and trained models are available at this https URL.
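
The attention-weighted pooling of chunk features can be sketched as follows (our own minimal version, assuming a (T, dim) matrix of token vectors for a candidate chunk):

```python
import torch
import torch.nn as nn

class AttentionWeightedPooling(nn.Module):
    """Pool a candidate chunk's token vectors into one feature vector
    using learned attention weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, chunk):                        # chunk: (T, dim)
        w = torch.softmax(self.score(chunk), dim=0)  # (T, 1) weights
        return (w * chunk).sum(dim=0)                # (dim,) pooled feature
```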

14. Exploring Fluent Query Reformulations with Text-to-Text Transformers and Reinforcement Learning [PDF] Back to contents
  Jerry Zikun Chen, Shi Yu, Haoran Wang
Abstract: Query reformulation aims to alter potentially noisy or ambiguous text sequences into coherent ones closer to natural language questions. In this process, it is also crucial to maintain and even enhance performance in downstream environments such as question answering when rephrased queries are given as input. We explore methods to generate these query reformulations by training reformulators using text-to-text transformers, and we apply policy-based reinforcement learning algorithms to further encourage reward learning. Query fluency is numerically evaluated by the same class of model, fine-tuned on a human-evaluated well-formedness dataset. The reformulator leverages linguistic knowledge obtained from transfer learning and generates more well-formed reformulations than a translation-based model in qualitative and quantitative analysis. During reinforcement learning, it better retains fluency while optimizing the RL objective to acquire question answering rewards, and it can generalize to out-of-sample textual data in qualitative evaluations. Our RL framework is demonstrated to be flexible, allowing reward signals to be sourced from different downstream environments such as intent classification.
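
A policy-based update of the kind described reduces, in its simplest form, to REINFORCE over sampled reformulations. A schematic loss (our sketch, not the paper's exact algorithm or reward):

```python
import torch

def reinforce_loss(log_probs, rewards, baseline=0.0):
    """REINFORCE for a seq2seq reformulator: weight each sampled
    sequence's log-likelihood by its advantage.

    log_probs: (B, T) token log-probabilities of sampled reformulations
    rewards:   (B,) downstream rewards (e.g. QA score, fluency)
    """
    advantage = rewards - baseline  # simple constant baseline
    return -(log_probs.sum(dim=1) * advantage).mean()
```

Swapping the reward function is what makes such a framework portable across downstream environments like QA or intent classification.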

15. NeurST: Neural Speech Translation Toolkit [PDF] Back to contents
  Chengqi Zhao, Mingxuan Wang, Lei Li
Abstract: NeurST is an open-source toolkit for neural speech translation developed by ByteDance AI Lab. The toolkit mainly focuses on end-to-end speech translation, and it is easy to use, modify, and extend for advanced speech translation research and products. NeurST aims at facilitating speech translation research for NLP researchers and provides a complete setup for speech translation benchmarks, including feature extraction, data preprocessing, distributed training, and evaluation. Moreover, the toolkit implements several major architectures for end-to-end speech translation. It reports experimental results for different benchmark datasets, which can be regarded as reliable baselines for future research. The toolkit is publicly available at this https URL.

16. Can Transformers Reason About Effects of Actions? [PDF] Back to contents
  Pratyay Banerjee, Chitta Baral, Man Luo, Arindam Mitra, Kuntal Pal, Tran C. Son, Neeraj Varshney
Abstract: Recent work has shown that transformers are able to "reason" with facts and rules in a limited setting where the rules are natural language expressions of conjunctions of conditions implying a conclusion. Since this suggests that transformers may be used for reasoning with knowledge given in natural language, we do a rigorous evaluation of this with respect to a common form of knowledge and its corresponding reasoning -- reasoning about the effects of actions. Reasoning about action and change has been a top focus in the knowledge representation subfield of AI from the early days of the field, and more recently it has been a highlighted aspect of common sense question answering. We consider four action domains (Blocks World, Logistics, Dock-Worker-Robots and a Generic Domain) in natural language and create QA datasets that involve reasoning about the effects of actions in these domains. We investigate the ability of transformers to (a) learn to reason in these domains and (b) transfer that learning from the generic domain to the other domains.

17. Named Entity Recognition in the Legal Domain using a Pointer Generator Network [PDF] Back to contents
  Stavroula Skylaki, Ali Oskooei, Omar Bari, Nadja Herger, Zac Kriegman
Abstract: Named Entity Recognition (NER) is the task of identifying and classifying named entities in unstructured text. In the legal domain, named entities of interest may include the case parties, judges, names of courts, case numbers, references to laws etc. We study the problem of legal NER with noisy text extracted from PDF files of filed court cases from US courts. The "gold standard" training data for NER systems provide annotation for each token of the text with the corresponding entity or non-entity label. We work with only partially complete training data, which differ from the gold standard NER data in that the exact location of the entities in the text is unknown and the entities may contain typos and/or OCR mistakes. To overcome the challenges of our noisy training data, e.g. text extraction errors and/or typos and unknown label indices, we formulate the NER task as a text-to-text sequence generation task and train a pointer generator network to generate the entities in the document rather than label them. We show that the pointer generator can be effective for NER in the absence of gold standard data and outperforms the common NER neural network architectures in long legal documents.

18. Trying Bilinear Pooling in Video-QA [PDF] Back to contents
  Thomas Winterbottom, Sarah Xiao, Alistair McLean, Noura Al Moubayed
Abstract: Bilinear pooling (BLP) refers to a family of operations recently developed for fusing features from different modalities, predominantly in VQA models. A bilinear (outer-product) expansion is thought to encourage models to learn interactions between two feature spaces and has experimentally outperformed 'simpler' vector operations (concatenation and element-wise addition/multiplication) on VQA benchmarks. Successive BLP techniques have yielded higher performance with lower computational expense and are often implemented alongside attention mechanisms. However, despite significant progress in VQA, BLP methods have not been widely applied to the more recently explored video question answering (video-QA) tasks. In this paper, we begin to bridge this research gap by applying BLP techniques to various video-QA benchmarks, namely: TVQA, TGIF-QA, Ego-VQA and MSVD-QA. We share our results on the TVQA baseline model and the recently proposed heterogeneous-memory-enhanced multimodal attention (HME) model. Our experiments include both simply replacing feature concatenation in the existing models with BLP, and a modified version of the TVQA baseline to accommodate BLP, which we name the 'dual-stream' model. We find that our relatively simple integration of BLP does not increase, and mostly harms, performance on these video-QA benchmarks. Using recently proposed theoretical multimodal fusion taxonomies, we offer insight into why BLP-driven performance gains for video-QA benchmarks may be more difficult to achieve than in earlier VQA models. We suggest a few additional 'best practices' to consider when applying BLP to video-QA. We stress that video-QA models should carefully consider where the complex representational potential of BLP is actually needed to avoid computational expense on 'redundant' fusion.
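
For context, a common way to approximate the bilinear (outer-product) interaction at manageable cost is a low-rank factorization, as in MLB-style fusion. A minimal PyTorch sketch (illustrative only, not the exact operators benchmarked in the paper):

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """Fuse two modality vectors with a low-rank bilinear interaction:
    the element-wise product of two projections approximates the full
    outer product at rank `rank`."""
    def __init__(self, dim_a, dim_b, rank=256, out_dim=512):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, rank)
        self.proj_b = nn.Linear(dim_b, rank)
        self.out = nn.Linear(rank, out_dim)

    def forward(self, a, b):
        fused = torch.tanh(self.proj_a(a)) * torch.tanh(self.proj_b(b))
        return self.out(fused)
```

Replacing this module with plain concatenation plus a linear layer is the kind of swap the paper's experiments compare.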

19. Should I visit this place? Inclusion and Exclusion Phrase Mining from Reviews [PDF] Back to contents
  Omkar Gurjar, Manish Gupta
Abstract: Although several automatic itinerary generation services have made travel planning easy, travellers often find themselves in unique situations where they cannot make the best of their trip. Visitors differ in terms of many factors, such as suffering from a disability, having a particular dietary preference, or travelling with a toddler. While most tourist spots are universal, others may not be inclusive for all. In this paper, we focus on the problem of mining inclusion and exclusion phrases associated with 11 such factors from reviews related to a tourist spot. While existing work on tourism data mining mainly focuses on structured extraction of trip-related information, personalized sentiment analysis, and automatic itinerary generation, to the best of our knowledge this is the first work on inclusion/exclusion phrase mining from tourism reviews. Using a dataset of 2000 reviews related to 1000 tourist spots, our broad-level classifier provides a binary overlap F1 of ~80 and ~82 to classify a phrase as inclusion or exclusion respectively. Further, our inclusion/exclusion classifier provides an F1 of ~98 and ~97 for 11-class inclusion and exclusion classification respectively. We believe that our work can significantly improve the quality of automatic itinerary generation services.

20. On Modality Bias in the TVQA Dataset [PDF] Back to contents
  Thomas Winterbottom, Sarah Xiao, Alistair McLean, Noura Al Moubayed
Abstract: TVQA is a large-scale video question answering (video-QA) dataset based on popular TV shows. The questions were specifically designed to require "both vision and language understanding to answer". In this work, we demonstrate an inherent bias in the dataset towards the textual subtitle modality. We infer said bias both directly and indirectly, notably finding that models trained with subtitles learn, on average, to suppress video feature contribution. Our results demonstrate that models trained on only the visual information can answer ~45% of the questions, while using only the subtitles achieves ~68%. We find that a bilinear-pooling-based joint representation of modalities damages model performance by 9%, implying a reliance on modality-specific information. We also show that TVQA fails to benefit from the RUBi modality bias reduction technique popularised in VQA. By simply improving text processing using BERT embeddings with the simple model first proposed for TVQA, we achieve state-of-the-art results (72.13%) compared to the highly complex STAGE model (70.50%). We recommend a multimodal evaluation framework that can highlight biases in models and isolate visual and textual reliant subsets of data. Using this framework we propose subsets of TVQA that respond exclusively to either or both modalities in order to facilitate multimodal modelling as TVQA originally intended.

21. End-to-End Speaker Diarization as Post-Processing [PDF] Back to contents
  Shota Horiguchi, Paola Garcia, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu
Abstract: This paper investigates the utilization of an end-to-end diarization model as post-processing of conventional clustering-based diarization. Clustering-based diarization methods partition frames into clusters of the number of speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to one speaker. On the other hand, some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification. Although some methods can treat a flexible number of speakers, they do not perform well when the number of speakers is large. To compensate for each other's weakness, we propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method. We iteratively select two speakers from the results and update the results of the two speakers to improve the overlapped region. Experimental results show that the proposed algorithm consistently improved the performance of the state-of-the-art methods across CALLHOME, AMI, and DIHARD II datasets.
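
Schematically, the post-processing loops over speaker pairs from the initial clustering and lets a two-speaker EEND model redraw their segments, including overlaps. A sketch (eend_two_speaker is a hypothetical stand-in for the trained model, and the real method's pair-selection strategy may differ):

```python
from itertools import combinations

def refine_diarization(clusters, eend_two_speaker, audio):
    """Refine a clustering-based diarization result pair by pair.

    clusters: dict mapping speaker id -> list of (start, end) segments
    """
    for spk_a, spk_b in combinations(sorted(clusters), 2):
        # The two-speaker EEND model can mark overlapped frames,
        # which hard clustering cannot.
        seg_a, seg_b = eend_two_speaker(audio, clusters[spk_a],
                                        clusters[spk_b])
        clusters[spk_a], clusters[spk_b] = seg_a, seg_b
    return clusters
```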

22. Predicting Decisions in Language Based Persuasion Games [PDF] Back to contents
  Reut Apel, Ido Erev, Roi Reichart, Moshe Tennenholtz
Abstract: Sender-receiver interactions, and specifically persuasion games, are widely researched in economic modeling and artificial intelligence, and serve as a solid foundation for powerful applications. However, in the classic persuasion games setting, the messages sent from the expert to the decision-maker are abstract or well-structured application-specific signals rather than natural (human) language messages, although natural language is a very common communication signal in real-world persuasion setups. This paper addresses the use of natural language in persuasion games, exploring its impact on the decisions made by the players and aiming to construct effective models for the prediction of these decisions. For this purpose, we conduct an online repeated interaction experiment. At each trial of the interaction, an informed expert aims to sell an uninformed decision-maker a vacation in a hotel, by sending her a review that describes the hotel. While the expert is exposed to several scored reviews, the decision-maker observes only the single review sent by the expert, and her payoff in case she chooses to take the hotel is a random draw from the review score distribution available to the expert only. The expert's payoff, in turn, depends on the number of times the decision-maker chooses the hotel. We consider a number of modeling approaches for this setup, differing from each other in the model type (deep neural network (DNN) vs. linear classifier), the type of features used by the model (textual, behavioral or both) and the source of the textual features (DNN-based vs. hand-crafted). Our results demonstrate that given a prefix of the interaction sequence, our models can predict the future decisions of the decision-maker, particularly when a sequential modeling approach and hand-crafted textual features are applied.
