Contents
1. Advanced Machine Learning Techniques for Fake News (Online Disinformation) Detection: A Systematic Mapping Study [PDF] Abstract
7. Benchmarking Knowledge-Enhanced Commonsense Question Answering via Knowledge-to-Text Transformation [PDF] Abstract
9. Recoding latent sentence representations -- Dynamic gradient-based activation modification in RNNs [PDF] Abstract
24. Superbizarre Is Not Superb: Improving BERT's Interpretations of Complex Words with Derivational Morphology [PDF] Abstract
27. Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering [PDF] Abstract
28. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation [PDF] Abstract
31. What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure [PDF] Abstract
40. Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers [PDF] Abstract
42. On Explaining Your Explanations of BERT: An Empirical Study with Sequence Classification [PDF] Abstract
51. De-identifying Hospital Discharge Summaries: An End-to-End Framework using Ensemble of De-Identifiers [PDF] Abstract
54. MrGCN: Mirror Graph Convolution Network for Relation Extraction with Long-Term Dependencies [PDF] Abstract
67. VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [PDF] Abstract
Abstracts
1. Advanced Machine Learning Techniques for Fake News (Online Disinformation) Detection: A Systematic Mapping Study [PDF] Back to Contents
Michal Choras, Konstantinos Demestichas, Agata Gielczyk, Alvaro Herrero, Pawel Ksieniewicz, Konstantina Remoundou, Daniel Urda, Michal Wozniak
Abstract: Fake news has now grown into a big problem for societies and also a major challenge for people fighting disinformation. This phenomenon plagues democratic elections, reputations of individual persons or organizations, and has negatively impacted citizens, (e.g., during the COVID-19 pandemic in the US or Brazil). Hence, developing effective tools to fight this phenomenon by employing advanced Machine Learning (ML) methods poses a significant challenge. The following paper displays the present body of knowledge on the application of such intelligent tools in the fight against disinformation. It starts by showing the historical perspective and the current role of fake news in the information war. Proposed solutions based solely on the work of experts are analysed and the most important directions of the application of intelligent systems in the detection of misinformation sources are pointed out. Additionally, the paper presents some useful resources (mainly datasets useful when assessing ML solutions for fake news detection) and provides a short overview of the most important R&D projects related to this subject. The main purpose of this work is to analyse the current state of knowledge in detecting fake news; on the one hand to show possible solutions, and on the other hand to identify the main challenges and methodological gaps to motivate future research.
2. CRSLab: An Open-Source Toolkit for Building Conversational Recommender System [PDF] Back to Contents
Kun Zhou, Xiaolei Wang, Yuanhang Zhou, Chenzhan Shang, Yuan Cheng, Wayne Xin Zhao, Yaliang Li, Ji-Rong Wen
Abstract: In recent years, conversational recommender system (CRS) has received much attention in the research community. However, existing studies on CRS vary in scenarios, goals and techniques, lacking unified, standardized implementation or comparison. To tackle this challenge, we propose an open-source CRS toolkit CRSLab, which provides a unified and extensible framework with highly-decoupled modules to develop CRSs. Based on this framework, we collect 6 commonly-used human-annotated CRS datasets and implement 18 models that include recent techniques such as graph neural network and pre-training models. Besides, our toolkit provides a series of automatic evaluation protocols and a human-machine interaction interface to test and compare different CRS methods. The project and documents are released at this https URL.
3. How to Train Your Agent to Read and Write [PDF] Back to Contents
Li Liu, Mengge He, Guanghui Xu, Mingkui Tan, Qi Wu
Abstract: Reading and writing research papers is one of the most privileged abilities that a qualified researcher should master. However, it is difficult for new researchers (e.g., students) to fully grasp this ability. It would be fascinating if we could train an intelligent agent to help people read and summarize papers, and perhaps even discover and exploit the potential knowledge clues to write novel papers. Although there have been existing works focusing on summarizing (i.e., reading) the knowledge in a given text or generating (i.e., writing) a text based on the given knowledge, the ability of simultaneously reading and writing is still under development. Typically, this requires an agent to fully understand the knowledge from the given text materials and generate correct and fluent novel paragraphs, which is very challenging in practice. In this paper, we propose a Deep ReAder-Writer (DRAW) network, which consists of a Reader that can extract knowledge graphs (KGs) from input paragraphs and discover potential knowledge, a graph-to-text Writer that generates a novel paragraph, and a Reviewer that reviews the generated paragraph from three different aspects. Extensive experiments show that our DRAW network outperforms considered baselines and several state-of-the-art methods on AGENDA and M-AGENDA datasets. Our code and supplementary are released at this https URL.
4. Transformer-based Conditional Variational Autoencoder for Controllable Story Generation [PDF] Back to Contents
Le Fang, Tao Zeng, Chaochun Liu, Liefeng Bo, Wen Dong, Changyou Chen
Abstract: We investigate large-scale latent variable models (LVMs) for neural story generation -- an under-explored application for open-domain long text -- with objectives in two threads: generation effectiveness and controllability. LVMs, especially the variational autoencoder (VAE), have achieved both effective and controllable generation through exploiting flexible distributional latent representations. Recently, Transformers and their variants have achieved remarkable effectiveness without explicit latent representation learning, and thus lack satisfactory controllability in generation. In this paper, we advocate reviving latent variable modeling, essentially the power of representation learning, in the era of Transformers to enhance controllability without hurting state-of-the-art generation effectiveness. Specifically, we integrate latent representation vectors with a Transformer-based pre-trained architecture to build a conditional variational autoencoder (CVAE). Model components such as the encoder, decoder and the variational posterior are all built on top of pre-trained language models -- GPT2 specifically in this paper. Experiments demonstrate the state-of-the-art conditional generation ability of our model, as well as its excellent representation learning capability and controllability.
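For readers who want a concrete picture of the latent bottleneck described above, the following minimal sketch shows how a variational posterior and a reparameterized latent code can condition a decoder. It is only an illustration under toy assumptions: small linear layers stand in for the pretrained GPT2 encoder/decoder, and the dimensions, module names and conditioning-by-addition scheme are hypothetical rather than taken from the paper.

```python
# Minimal CVAE latent-bottleneck sketch with toy modules in place of GPT2.
import torch
import torch.nn as nn

class ToyCVAE(nn.Module):
    def __init__(self, hidden=64, latent=16):
        super().__init__()
        self.to_mu = nn.Linear(hidden, latent)       # variational posterior mean
        self.to_logvar = nn.Linear(hidden, latent)   # variational posterior log-variance
        self.latent_to_hidden = nn.Linear(latent, hidden)  # injects z into the decoder

    def forward(self, pooled_encoder_state, decoder_states):
        mu = self.to_mu(pooled_encoder_state)
        logvar = self.to_logvar(pooled_encoder_state)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        # Condition every decoder position on the latent code by simple addition.
        conditioned = decoder_states + self.latent_to_hidden(z).unsqueeze(1)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return conditioned, kl

model = ToyCVAE()
enc = torch.randn(2, 64)       # pooled "encoder" representation of the story
dec = torch.randn(2, 10, 64)   # decoder hidden states for 10 tokens
out, kl = model(enc, dec)
print(out.shape, kl.item())
```

In training, the KL term would typically be weighted and added to the ordinary language-modeling loss.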
5. Outline to Story: Fine-grained Controllable Story Generation from Cascaded Events [PDF] Back to Contents
Le Fang, Tao Zeng, Chaochun Liu, Liefeng Bo, Wen Dong, Changyou Chen
Abstract: Large-scale pretrained language models have shown thrilling generation capabilities, especially when they generate consistent long text in thousands of words with ease. However, users of these models can only control the prefix of sentences or certain global aspects of generated text. It is challenging to simultaneously achieve fine-grained controllability and preserve the state-of-the-art unconditional text generation capability. In this paper, we first propose a new task named "Outline to Story" (O2S) as a test bed for fine-grained controllable generation of long text, which generates a multi-paragraph story from cascaded events, i.e. a sequence of outline events that guide subsequent paragraph generation. We then create dedicate datasets for future benchmarks, built by state-of-the-art keyword extraction techniques. Finally, we propose an extremely simple yet strong baseline method for the O2S task, which fine tunes pre-trained language models on augmented sequences of outline-story pairs with simple language modeling objective. Our method does not introduce any new parameters or perform any architecture modification, except several special tokens as delimiters to build augmented sequences. Extensive experiments on various datasets demonstrate state-of-the-art conditional story generation performance with our model, achieving better fine-grained controllability and user flexibility. Our paper is among the first ones by our knowledge to propose a model and to create datasets for the task of "outline to story". Our work also instantiates research interest of fine-grained controllable generation of open-domain long text, where controlling inputs are represented by short text.
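The augmented sequences mentioned in the abstract are essentially linearized outline-story pairs separated by special tokens. Below is a hedged sketch of how such a training string could be assembled; the delimiter tokens <outline>, <story>, <sep> and <eos> are hypothetical placeholders, not the paper's actual vocabulary.

```python
# Illustrative construction of an augmented outline-story training sequence.
def build_o2s_sequence(outline_events, story_paragraphs):
    outline = " <sep> ".join(outline_events)
    story = " ".join(story_paragraphs)
    return f"<outline> {outline} <story> {story} <eos>"

example = build_o2s_sequence(
    ["a knight leaves the castle", "a storm hits the forest"],
    ["The knight rode out at dawn...", "By nightfall the storm broke..."],
)
print(example)
# The language model is then fine-tuned on such sequences with a plain LM objective.
```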
6. A Joint Training Dual-MRC Framework for Aspect Based Sentiment Analysis [PDF] Back to Contents
Yue Mao, Yi Shen, Chao Yu, Longjun Cai
Abstract: Aspect based sentiment analysis (ABSA) involves three fundamental subtasks: aspect term extraction, opinion term extraction, and aspect-level sentiment classification. Early works only focused on solving one of these subtasks individually. Some recent work focused on solving a combination of two subtasks, e.g., extracting aspect terms along with sentiment polarities or extracting the aspect and opinion terms pair-wisely. More recently, the triple extraction task has been proposed, i.e., extracting the (aspect term, opinion term, sentiment polarity) triples from a sentence. However, previous approaches fail to solve all subtasks in a unified end-to-end framework. In this paper, we propose a complete solution for ABSA. We construct two machine reading comprehension (MRC) problems, and solve all subtasks by joint training two BERT-MRC models with parameters sharing. We conduct experiments on these subtasks and results on several benchmark datasets demonstrate the effectiveness of our proposed framework, which significantly outperforms existing state-of-the-art methods.
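As a rough illustration of how the two MRC problems could be framed, the sketch below builds one query that asks for aspect terms and a second query that, given an aspect, asks for its opinion term and sentiment. The query templates and the [SEP] marker are assumptions made for illustration; the paper's exact formulation may differ.

```python
# Hedged sketch of the two reading-comprehension inputs for ABSA.
def left_mrc_query(sentence):
    # First MRC problem: extract all aspect terms from the sentence.
    return f"Find the aspect terms in the text. [SEP] {sentence}"

def right_mrc_query(sentence, aspect):
    # Second MRC problem: extract the opinion term and sentiment for a given aspect.
    return f"Find the opinion terms and sentiment for the aspect '{aspect}'. [SEP] {sentence}"

sent = "The pasta was delicious but the service was slow."
print(left_mrc_query(sent))
print(right_mrc_query(sent, "service"))
```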
7. Benchmarking Knowledge-Enhanced Commonsense Question Answering via Knowledge-to-Text Transformation [PDF] Back to Contents
Ning Bian, Xianpei Han, Bo Chen, Le Sun
Abstract: A fundamental ability of humans is to utilize commonsense knowledge in language understanding and question answering. In recent years, many knowledge-enhanced Commonsense Question Answering (CQA) approaches have been proposed. However, it remains unclear: (1) How far can we get by exploiting external knowledge for CQA? (2) How much potential of knowledge has been exploited in current CQA models? (3) Which are the most promising directions for future CQA? To answer these questions, we benchmark knowledge-enhanced CQA by conducting extensive experiments on multiple standard CQA datasets using a simple and effective knowledge-to-text transformation framework. Experiments show that: (1) Our knowledge-to-text framework is effective and achieves state-of-the-art performance on CommonsenseQA dataset, providing a simple and strong knowledge-enhanced baseline for CQA; (2) The potential of knowledge is still far from being fully exploited in CQA -- there is a significant performance gap from current models to our models with golden knowledge; and (3) Context-sensitive knowledge selection, heterogeneous knowledge exploitation, and commonsense-rich language models are promising CQA directions.
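A minimal sketch of the knowledge-to-text idea: commonsense triples are verbalized with simple templates and concatenated with the question so that an ordinary text-only QA model can read them. The relation names and templates below are illustrative assumptions, not the paper's actual transformation rules.

```python
# Verbalize knowledge-graph triples into plain text for a QA model.
TEMPLATES = {
    "AtLocation": "{head} can be found at {tail}.",
    "UsedFor": "{head} is used for {tail}.",
    "CapableOf": "{head} is capable of {tail}.",
}

def knowledge_to_text(triples):
    return " ".join(
        TEMPLATES.get(rel, "{head} is related to {tail}.").format(head=h, tail=t)
        for h, rel, t in triples
    )

triples = [("fridge", "UsedFor", "keeping food cold"), ("fridge", "AtLocation", "kitchen")]
question = "Where would you put milk to keep it fresh?"
model_input = knowledge_to_text(triples) + " " + question
print(model_input)
```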
8. Are Eliminated Spans Useless for Coreference Resolution? Not at all [PDF] Back to Contents
Xin Tan, Longyin Zhang, Guodong Zhou
Abstract: Various neural-based methods have been proposed so far for joint mention detection and coreference resolution. However, existing works on coreference resolution are mainly dependent on filtered mention representation, while other spans are largely neglected. In this paper, we aim at increasing the utilization rate of data and investigating whether those eliminated spans are totally useless, or to what extent they can improve the performance of coreference resolution. To achieve this, we propose a mention representation refining strategy where spans highly related to mentions are well leveraged using a pointer network for representation enhancing. Notably, we utilize an additional loss term in this work to encourage the diversity between entity clusters. Experimental results on the document-level CoNLL-2012 Shared Task English dataset show that eliminated spans are indeed much effective and our approach can achieve competitive results when compared with previous state-of-the-art in coreference resolution.
9. Recoding latent sentence representations -- Dynamic gradient-based activation modification in RNNs [PDF] Back to Contents
Dennis Ulmer
Abstract: In Recurrent Neural Networks (RNNs), encoding information in a suboptimal or erroneous way can impact the quality of representations based on later elements in the sequence and subsequently lead to wrong predictions and a worse model performance. In humans, challenging cases like garden path sentences (an instance of this being the infamous "The horse raced past the barn fell") can lead their language understanding astray. However, they are still able to correct their representation accordingly and recover when new information is encountered. Inspired by this, I propose an augmentation to standard RNNs in form of a gradient-based correction mechanism: This way I hope to enable such models to dynamically adapt their inner representation of a sentence, adding a way to correct deviations as soon as they occur. This could therefore lead to more robust models using more flexible representations, even during inference time. I conduct different experiments in the context of language modeling, where the impact of using such a mechanism is examined in detail. To this end, I look at modifications based on different kinds of time-dependent error signals and how they influence the model performance. Furthermore, this work contains a study of the model's confidence in its predictions during training and for challenging test samples and the effect of the manipulation thereof. Lastly, I also study the difference in behavior of these novel models compared to a standard LSTM baseline and investigate error cases in detail to identify points of future research. I show that while the proposed approach comes with promising theoretical guarantees and an appealing intuition, it is only able to produce minor improvements over the baseline due to challenges in its practical application and the efficacy of the tested model variants.
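To make the correction mechanism more concrete, the following toy sketch applies one gradient-based recoding step to an LSTM hidden state: a surrogate error signal (here the entropy of the next-token distribution, just one of several signals one might use) is backpropagated to the hidden state, which is then nudged to reduce it. The step size, signal choice and architecture are assumptions for illustration, not the thesis's actual setup.

```python
# One gradient-based recoding step on an LSTM hidden state.
import torch
import torch.nn as nn

vocab, hidden = 50, 32
cell = nn.LSTMCell(hidden, hidden)
embed = nn.Embedding(vocab, hidden)
out_proj = nn.Linear(hidden, vocab)

def recode_step(token_id, h, c, step_size=0.1):
    x = embed(torch.tensor([token_id]))
    h, c = cell(x, (h, c))
    h_var = h.detach().requires_grad_(True)              # treat h as a free variable
    probs = torch.softmax(out_proj(h_var), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum()  # surrogate error signal
    entropy.backward()
    h_recoded = (h_var - step_size * h_var.grad).detach()   # corrective update of the activation
    return h_recoded, c.detach()

h = torch.zeros(1, hidden)
c = torch.zeros(1, hidden)
h, c = recode_step(token_id=3, h=h, c=c)
print(h.shape)
```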
10. An Efficient Transformer Decoder with Compressed Sub-layers [PDF] Back to Contents
Yanyang Li, Ye Lin, Tong Xiao, Jingbo Zhu
Abstract: The large attention-based encoder-decoder network (Transformer) has become prevailing recently due to its effectiveness. But the high computation complexity of its decoder raises the inefficiency issue. By examining the mathematic formulation of the decoder, we show that under some mild conditions, the architecture could be simplified by compressing its sub-layers, the basic building block of Transformer, and achieves a higher parallelism. We thereby propose Compressed Attention Network, whose decoder layer consists of only one sub-layer instead of three. Extensive experiments on 14 WMT machine translation tasks show that our model is 1.42x faster with performance on par with a strong baseline. This strong baseline is already 2x faster than the widely used standard baseline without loss in performance.
11. Attentive Tree-structured Network for Monotonicity Reasoning [PDF] Back to Contents
Zeming Chen
Abstract: Many state-of-art neural models designed for monotonicity reasoning perform poorly on downward inference. To address this shortcoming, we developed an attentive tree-structured neural network. It consists of a tree-based long-short-term-memory network (Tree-LSTM) with soft attention. It is designed to model the syntactic parse tree information from the sentence pair of a reasoning task. A self-attentive aggregator is used for aligning the representations of the premise and the hypothesis. We present our model and evaluate it using the Monotonicity Entailment Dataset (MED). We show and attempt to explain that our model outperforms existing models on MED.
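A compact sketch of the core composition step, assuming a child-sum Tree-LSTM cell in which the children's hidden states are combined with a learned soft-attention weighting. The exact gating and attention layout of the paper's model may differ; this is only a simplified stand-in.

```python
# Child-sum Tree-LSTM node update with soft attention over children.
import torch
import torch.nn as nn

class AttentiveChildSumTreeLSTMCell(nn.Module):
    def __init__(self, in_dim, mem_dim):
        super().__init__()
        self.iou = nn.Linear(in_dim, 3 * mem_dim)
        self.iou_h = nn.Linear(mem_dim, 3 * mem_dim, bias=False)
        self.f_x = nn.Linear(in_dim, mem_dim)
        self.f_h = nn.Linear(mem_dim, mem_dim, bias=False)
        self.attn = nn.Linear(mem_dim, 1)

    def forward(self, x, child_h, child_c):
        # child_h, child_c: (num_children, mem_dim); x: (in_dim,)
        alpha = torch.softmax(self.attn(child_h), dim=0)        # soft attention over children
        h_tilde = (alpha * child_h).sum(dim=0)                  # attention-weighted child sum
        i, o, u = torch.chunk(self.iou(x) + self.iou_h(h_tilde), 3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.f_x(x).unsqueeze(0) + self.f_h(child_h))  # per-child forget gates
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c

cell = AttentiveChildSumTreeLSTMCell(in_dim=16, mem_dim=16)
h, c = cell(torch.randn(16), torch.randn(2, 16), torch.randn(2, 16))
print(h.shape, c.shape)
```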
12. Few-Shot Question Answering by Pretraining Span Selection [PDF] Back to Contents
Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy
Abstract: In a number of question answering (QA) benchmarks, pretrained models have reached human parity through fine-tuning on the order of 100,000 annotated questions and answers. We explore the more realistic few-shot setting, where only a few hundred training examples are available. We show that standard span selection models perform poorly, highlighting the fact that current pretraining objectives are far removed from question answering. To address this, we propose a new pretraining scheme that is more suitable for extractive question answering. Given a passage with multiple sets of recurring spans, we mask in each set all recurring spans but one, and ask the model to select the correct span in the passage for each masked span. Masked spans are replaced with a special token, viewed as a question representation, that is later used during fine-tuning to select the answer span. The resulting model obtains surprisingly good results on multiple benchmarks, e.g., 72.7 F1 with only 128 examples on SQuAD, while maintaining competitive (and sometimes better) performance in the high-resource setting. Our findings indicate that careful design of pretraining schemes and model architecture can have a dramatic effect on performance in the few-shot settings.
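The span-selection pretraining objective can be illustrated with a small preprocessing sketch: every occurrence of a recurring span except one is replaced by a question token, and the model must later point back to the surviving occurrence. The sketch below simplifies recurring spans to recurring words and uses a hypothetical [QUESTION] token; real preprocessing operates on longer spans.

```python
# Simplified construction of a recurring-span-selection pretraining example.
from collections import Counter

def make_pretraining_example(tokens, question_token="[QUESTION]"):
    counts = Counter(tokens)
    recurring = {t for t, c in counts.items() if c >= 2 and t.isalpha()}
    masked, answers, kept = [], [], set()
    for tok in tokens:
        if tok in recurring and tok not in kept:
            kept.add(tok)                        # keep the first occurrence as the answer span
            masked.append(tok)
        elif tok in recurring:
            answers.append((len(masked), tok))   # this masked position must select the kept span
            masked.append(question_token)
        else:
            masked.append(tok)
    return masked, answers

masked, answers = make_pretraining_example("the cat sat near the dog because the dog barked".split())
print(" ".join(masked))
print(answers)
```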
13. Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval [PDF] Back to Contents
Omar Khattab, Christopher Potts, Matei Zaharia
Abstract: Multi-hop reasoning (i.e., reasoning across two or more documents) at scale is a key step toward NLP models that can exhibit broad world knowledge by leveraging large collections of documents. We propose Baleen, a system that improves the robustness and scalability of multi-hop reasoning over current approaches. Baleen introduces a per-hop condensed retrieval pipeline to mitigate the size of the search space, a focused late interaction retriever (FliBERT) that can model complex multi-hop queries, and a weak supervision strategy, latent hop ordering, to learn from limited signal about which documents to retrieve for a query. We evaluate Baleen on the new many-hop claim verification dataset HoVer, establishing state-of-the-art performance.
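As a toy illustration of per-hop condensed retrieval, the sketch below retrieves one sentence per hop with a crude word-overlap scorer and appends only that condensed fact to the query before the next hop. The scoring function, corpus and two-hop loop are assumptions for illustration and bear no resemblance to Baleen's actual neural retriever.

```python
# Toy per-hop condensed retrieval loop.
def tokens(text):
    return {w.strip(".,?!").lower() for w in text.split()}

def score(query, sentence):
    return len(tokens(query) & tokens(sentence))

def condensed_multihop(query, corpus, hops=2):
    context, retrieved = query, []
    for _ in range(hops):
        candidates = [s for s in corpus if s not in retrieved]
        best = max(candidates, key=lambda sent: score(context, sent))
        retrieved.append(best)
        context = context + " " + best   # only the condensed fact is appended to the query
    return retrieved

corpus = [
    "The Eiffel Tower is located in Paris.",
    "Paris is the capital of France.",
    "Mount Fuji is in Japan.",
]
print(condensed_multihop("Which country is the Eiffel Tower in?", corpus))
```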
14. Coreference Resolution without Span Representations [PDF] Back to Contents
Yuval Kirstain, Ori Ram, Omer Levy
Abstract: Since the introduction of deep pretrained language models, most task-specific NLP models were reduced to simple lightweight layers. An exception to this trend is the challenging task of coreference resolution, where a sophisticated end-to-end model is appended to a pretrained transformer encoder. While highly effective, the model has a very large memory footprint -- primarily due to dynamically-constructed span and span-pair representations -- which hinders the processing of complete documents and the ability to train on multiple instances in a single batch. We introduce a lightweight coreference model that removes the dependency on span representations, handcrafted features, and heuristics. Our model performs competitively with the current end-to-end model, while being simpler and more efficient.
15. Dimensions of Transparency in NLP Applications [PDF] Back to Contents
Michael Saxon, Sharon Levy, Xinyi Wang, Alon Albalak, William Yang Wang
Abstract: Broader transparency in descriptions of and communication regarding AI systems is widely considered desirable. This is particularly the case in discussions of fairness and accountability in systems exposed to the general public. However, previous work has suggested that a trade-off exists between greater system transparency and user confusion, where `too much information' clouds a reader's understanding of what a system description means. Unfortunately, transparency is a nebulous concept, difficult to both define and quantify. In this work we address these two issues by proposing a framework for quantifying transparency in system descriptions and apply it to analyze the trade-off between transparency and end-user confusion using NLP conference abstracts.
16. Assessing Emoji Use in Modern Text Processing Tools [PDF] Back to Contents
Abu Awal Md Shoeb, Gerard de Melo
Abstract: Emojis have become ubiquitous in digital communication, due to their visual appeal as well as their ability to vividly convey human emotion, among other factors. The growing prominence of emojis in social media and other instant messaging also leads to an increased need for systems and tools to operate on text containing emojis. In this study, we assess this support by considering test sets of tweets with emojis, based on which we perform a series of experiments investigating the ability of prominent NLP and text processing tools to adequately process them. In particular, we consider tokenization, part-of-speech tagging, as well as sentiment analysis. Our findings show that many tools still have notable shortcomings when operating on text containing emojis.
17. Decoding Time Lexical Domain Adaptation for Neural Machine Translation [PDF] Back to Contents
Nikolay Bogoychev, Pinzhen Chen
Abstract: Machine translation systems are vulnerable to domain mismatch, especially when the task is low-resource. In this setting, out of domain translations are often of poor quality and prone to hallucinations, due to the translation model preferring to predict common words it has seen during training, as opposed to the more uncommon ones from a different domain. We present two simple methods for improving translation quality in this particular setting: First, we use lexical shortlisting in order to restrict the neural network predictions by IBM model computed alignments. Second, we perform $n$-best list reordering by reranking all translations based on the amount they overlap with each other. Our methods are computationally simpler and faster than alternative approaches, and show a moderate success on low-resource settings with explicit out of domain test sets. However, our methods lose their effectiveness when the domain mismatch is too great, or in high resource setting.
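The n-best reordering step can be sketched with plain token overlap: each hypothesis is rescored by how much it agrees with the other hypotheses in the list, so outliers (often hallucinations) are demoted. The Jaccard-style overlap below is an illustrative stand-in for whatever overlap measure the paper actually uses.

```python
# Rerank an n-best list by average overlap with the other hypotheses.
def overlap(a, b):
    a_set, b_set = set(a.split()), set(b.split())
    return len(a_set & b_set) / max(len(a_set | b_set), 1)

def rerank_nbest(hypotheses):
    scored = []
    for i, hyp in enumerate(hypotheses):
        others = [h for j, h in enumerate(hypotheses) if j != i]
        score = sum(overlap(hyp, o) for o in others) / max(len(others), 1)
        scored.append((score, hyp))
    return [h for _, h in sorted(scored, reverse=True)]

nbest = [
    "the patient was given the medicine",
    "the patient received the medicine",
    "the moon orbits the castle",   # likely hallucination; overlaps little with the rest
]
print(rerank_nbest(nbest))
```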
18. Zero-shot Learning by Generating Task-specific Adapters [PDF] Back to Contents
Qinyuan Ye, Xiang Ren
Abstract: Pre-trained text-to-text transformers achieve impressive performance across a wide range of NLP tasks, and they naturally support zero-shot learning (ZSL) by using the task description as prompt in the input. However, this approach has potential limitations, as it learns from input-output pairs at instance level, instead of learning to solve tasks at task level. Alternatively, applying existing ZSL methods to text-to-text transformers is non-trivial due to their text generation objective and huge size. To address these issues, we introduce Hypter, a framework that improves zero-shot transferability by training a hypernetwork to generate task-specific adapters from task descriptions. This formulation enables learning at task level, and greatly reduces the number of parameters by using light-weight adapters. Experiments on two datasets demonstrate Hypter improves upon fine-tuning baselines.
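A minimal sketch of the hypernetwork idea, assuming a task-description embedding is mapped to the weights of a small residual bottleneck adapter that is applied to the frozen model's hidden states. The dimensions and the single-layer generator are toy choices; Hypter's actual parameterization may differ.

```python
# Hypernetwork generating the weights of a bottleneck adapter from a task embedding.
import torch
import torch.nn as nn

hidden, bottleneck, task_dim = 64, 8, 32

class AdapterHypernet(nn.Module):
    def __init__(self):
        super().__init__()
        n_params = hidden * bottleneck + bottleneck * hidden  # down- and up-projection weights
        self.generator = nn.Linear(task_dim, n_params)

    def forward(self, task_emb, hidden_states):
        flat = self.generator(task_emb)
        w_down = flat[: hidden * bottleneck].view(hidden, bottleneck)
        w_up = flat[hidden * bottleneck:].view(bottleneck, hidden)
        adapter_out = torch.relu(hidden_states @ w_down) @ w_up
        return hidden_states + adapter_out   # residual adapter applied to frozen states

hypernet = AdapterHypernet()
task_emb = torch.randn(task_dim)   # e.g., an encoding of the task description
states = torch.randn(5, hidden)    # hidden states for 5 tokens
print(hypernet(task_emb, states).shape)
```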
19. KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation [PDF] Back to Contents
Yiran Xing, Zai Shi, Zhao Meng, Yunpu Ma, Roger Wattenhofer
Abstract: We present Knowledge Enhanced Multimodal BART (KM-BART), which is a Transformer-based sequence-to-sequence model capable of reasoning about commonsense knowledge from multimodal inputs of images and texts. We extend the popular BART architecture to a multi-modal model. We design a new pretraining task to improve the model performance on Visual Commonsense Generation task. Our pretraining task improves the Visual Commonsense Generation performance by leveraging knowledge from a large language model pretrained on an external knowledge graph. To the best of our knowledge, we are the first to propose a dedicated task for improving model performance on Visual Commonsense Generation. Experimental results show that by pretraining, our model reaches state-of-the-art performance on the Visual Commonsense Generation task.
摘要:我们提出了知识增强的多模态BART(KM-BART),这是一个基于Transformer的序列到序列模型,能够从图像和文本的多模态输入中推理常识知识。我们将流行的BART架构扩展为多模态模型,并设计了一个新的预训练任务,以提高模型在Visual Commonsense Generation任务上的性能。我们的预训练任务通过利用在外部知识图谱上预训练的大型语言模型中的知识来提升Visual Commonsense Generation的性能。据我们所知,我们是第一个为提升Visual Commonsense Generation模型性能而提出专门任务的工作。实验结果表明,通过预训练,我们的模型在Visual Commonsense Generation任务上达到了最先进的性能。
20. Improving Sequence-to-Sequence Pre-training via Sequence Span Rewriting [PDF] 返回目录
Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei
Abstract: In this paper, we generalize text infilling (e.g., masked language models) by proposing Sequence Span Rewriting (SSR) as a self-supervised sequence-to-sequence (seq2seq) pre-training objective. SSR provides more fine-grained learning signals for text representations by supervising the model to rewrite imperfect spans to ground truth, and it is more consistent than text infilling with many downstream seq2seq tasks that rewrite a source sentence into a target sentence. Our experiments with T5 models on various seq2seq tasks show that SSR can substantially improve seq2seq pre-training. Moreover, we observe that SSR is especially helpful for improving the pre-training of a small-size seq2seq model with a powerful imperfect span generator, which indicates a new perspective of transferring knowledge from a large model to a smaller model for seq2seq pre-training.
摘要:在本文中,我们通过提出序列跨度重写(SSR)作为一种自监督的序列到序列(seq2seq)预训练目标,来推广文本填充(例如掩码语言模型)。SSR通过监督模型将不完美的跨度重写为真实文本,为文本表示提供了更细粒度的学习信号;与许多将源句子改写为目标句子的下游seq2seq任务相比,它也比文本填充更为一致。我们在各种seq2seq任务上使用T5模型进行的实验表明,SSR可以大幅改善seq2seq预训练。此外,我们观察到,在使用强大但不完美的跨度生成器时,SSR对改进小型seq2seq模型的预训练特别有帮助,这为在seq2seq预训练中将知识从大模型迁移到小模型提供了新的视角。
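A rough sketch of how an SSR training pair could be constructed, assuming an imperfect span generator whose output the model must rewrite back to the ground-truth span; the toy corruptor, span markers, and function names below are illustrative assumptions, not the paper's pipeline.

```python
# Illustrative construction of one SSR training pair (not the authors' pipeline).
# An "imperfect span generator" (faked here by a toy corruptor) fills a selected span;
# the seq2seq model is then trained to rewrite that imperfect span back to the original.
import random

def toy_imperfect_generator(span):
    # Stand-in for a weaker model's infilling output.
    words = span.split()
    random.shuffle(words)
    return " ".join(words)

def make_ssr_example(sentence, span_len=3, seed=1):
    random.seed(seed)
    words = sentence.split()
    start = random.randrange(0, len(words) - span_len)
    gold_span = " ".join(words[start:start + span_len])
    noisy_span = toy_imperfect_generator(gold_span)
    # [SPAN] ... [/SPAN] markers are illustrative; the actual input format may differ.
    source = words[:start] + ["[SPAN]", noisy_span, "[/SPAN]"] + words[start + span_len:]
    target = gold_span          # the model learns to rewrite the noisy span to the original
    return " ".join(source), target

src, tgt = make_ssr_example("the quick brown fox jumps over the lazy dog")
print("SOURCE:", src)
print("TARGET:", tgt)
```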
21. Substructure Substitution: Structured Data Augmentation for NLP [PDF] 返回目录
Haoyue Shi, Karen Livescu, Kevin Gimpel
Abstract: We study a family of data augmentation methods, substructure substitution (SUB2), for natural language processing (NLP) tasks. SUB2 generates new examples by substituting substructures (e.g., subtrees or subsequences) with ones with the same label, which can be applied to many structured NLP tasks such as part-of-speech tagging and parsing. For more general tasks (e.g., text classification) which do not have explicitly annotated substructures, we present variations of SUB2 based on constituency parse trees, introducing structure-aware data augmentation methods to general NLP tasks. For most cases, training with the augmented dataset by SUB2 achieves better performance than training with the original training set. Further experiments show that SUB2 has more consistent performance than other investigated augmentation methods, across different tasks and sizes of the seed dataset.
摘要:我们研究了一系列用于自然语言处理(NLP)任务的数据增强方法,即子结构替换(SUB2)。 SUB2通过用具有相同标签的子结构替换子结构(例如,子树或子序列)来生成新示例,这些子结构可以应用于许多结构化的NLP任务,例如词性标记和解析。 对于没有显式标注子结构的更一般的任务(例如,文本分类),我们介绍了基于选区分析树的SUB2变体,并向一般的NLP任务引入了结构感知的数据扩充方法。 在大多数情况下,使用SUB2扩充数据集进行训练的效果要优于使用原始训练集进行的训练。 进一步的实验表明,在种子数据集的不同任务和大小上,SUB2具有比其他研究的增强方法更一致的性能。
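The substitution step can be illustrated with a toy sequence-labeling example: subsequences whose label sequences match are swapped between training sentences while the labels stay fixed. The tiny dataset and helper names below are made up for illustration and simplify the paper's tree-based variants.

```python
# Toy illustration of SUB2-style augmentation for sequence labeling (not the authors' code):
# a labeled subsequence in one example is replaced by a subsequence carrying the same
# label sequence drawn from another example.
from collections import defaultdict

train = [
    (["the", "old", "dog", "barked"], ["DET", "ADJ", "NOUN", "VERB"]),
    (["a", "young", "cat", "slept"], ["DET", "ADJ", "NOUN", "VERB"]),
]

def index_spans(data, span_len=2):
    # Map a label-sequence signature to all token spans carrying it.
    spans = defaultdict(list)
    for tokens, labels in data:
        for i in range(len(tokens) - span_len + 1):
            spans[tuple(labels[i:i + span_len])].append(tokens[i:i + span_len])
    return spans

def sub2_augment(tokens, labels, spans, span_len=2):
    augmented = []
    for i in range(len(tokens) - span_len + 1):
        sig = tuple(labels[i:i + span_len])
        for replacement in spans.get(sig, []):
            if replacement != tokens[i:i + span_len]:
                new_tokens = tokens[:i] + replacement + tokens[i + span_len:]
                augmented.append((new_tokens, labels))  # labels are unchanged by construction
    return augmented

spans = index_spans(train)
for example in sub2_augment(*train[0], spans):
    print(example)
```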
22. End-to-End Training of Neural Retrievers for Open-Domain Question Answering [PDF] 返回目录
Devendra Singh Sachan, Mostofa Patwary, Mohammad Shoeybi, Neel Kant, Wei Ping, William L Hamilton, Bryan Catanzaro
Abstract: Recent work on training neural retrievers for open-domain question answering (OpenQA) has employed both supervised and unsupervised approaches. However, it remains unclear how unsupervised and supervised methods can be used most effectively for neural retrievers. In this work, we systematically study retriever pre-training. We first propose an approach of unsupervised pre-training with the Inverse Cloze Task and masked salient spans, followed by supervised finetuning using question-context pairs. This approach leads to absolute gains of 2+ points over the previous best result in the top-20 retrieval accuracy on Natural Questions and TriviaQA datasets. We also explore two approaches for end-to-end supervised training of the reader and retriever components in OpenQA models. In the first approach, the reader considers each retrieved document separately while in the second approach, the reader considers all the retrieved documents together. Our experiments demonstrate the effectiveness of these approaches as we obtain new state-of-the-art results. On the Natural Questions dataset, we obtain a top-20 retrieval accuracy of 84, an improvement of 5 points over the recent DPR model. In addition, we achieve good results on answer extraction, outperforming recent models like REALM and RAG by 3+ points. We further scale up end-to-end training to large models and show consistent gains in performance over smaller models.
摘要:近期关于训练开放域问答(OpenQA)神经检索器的工作同时采用了有监督和无监督方法。然而,如何最有效地将无监督与有监督方法用于神经检索器仍不清楚。在这项工作中,我们系统地研究了检索器的预训练。我们首先提出一种无监督预训练方法,结合逆完形填空任务(Inverse Cloze Task)和显著跨度掩码,然后使用问题-上下文对进行有监督微调。在Natural Questions和TriviaQA数据集上,这种方法使前20名检索准确率比此前的最佳结果绝对提升2个点以上。我们还探索了在OpenQA模型中对阅读器和检索器组件进行端到端监督训练的两种方法:第一种方法中,阅读器分别考虑每个检索到的文档;第二种方法中,阅读器将所有检索到的文档一并考虑。我们的实验证明了这些方法的有效性,并取得了新的最先进结果。在Natural Questions数据集上,我们获得了84的前20名检索准确率,比最近的DPR模型提高了5个点。此外,我们在答案抽取上也取得了良好的结果,比REALM和RAG等最新模型高出3个点以上。我们进一步将端到端训练扩展到大型模型,并展示了相对于较小模型的一致性能提升。
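The Inverse Cloze Task mentioned above is commonly implemented by treating one sentence of a passage as a pseudo-query and the remaining sentences as its positive context. The sketch below follows that common formulation under illustrative settings; it is not necessarily the paper's exact recipe.

```python
# Illustrative construction of Inverse Cloze Task (ICT) pseudo query/context pairs
# (a common formulation; not necessarily the exact recipe used in the paper).
import random

def ict_pairs(passage_sentences, keep_query_prob=0.1, seed=0):
    random.seed(seed)
    pairs = []
    for i, sentence in enumerate(passage_sentences):
        # The removed sentence becomes the pseudo-query ...
        query = sentence
        # ... and the rest of the passage is the positive context. Occasionally the
        # query sentence is kept in the context so the model also learns lexical overlap.
        if random.random() < keep_query_prob:
            context = list(passage_sentences)
        else:
            context = passage_sentences[:i] + passage_sentences[i + 1:]
        pairs.append((query, " ".join(context)))
    return pairs

passage = [
    "The Eiffel Tower is in Paris.",
    "It was completed in 1889.",
    "It remains a popular tourist attraction.",
]
for q, ctx in ict_pairs(passage):
    print("QUERY:", q, "| CONTEXT:", ctx)
```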
23. Cross-Document Language Modeling [PDF] 返回目录
Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E. Peters, Arie Cattan, Ido Dagan
Abstract: We introduce a new pretraining approach for language models that are geared to support multi-document NLP tasks. Our cross-document language model (CD-LM) improves masked language modeling for these tasks with two key ideas. First, we pretrain with multiple related documents in a single input, via cross-document masking, which encourages the model to learn cross-document and long-range relationships. Second, extending the recent Longformer model, we pretrain with long contexts of several thousand tokens and introduce a new attention pattern that uses sequence-level global attention to predict masked tokens, while retaining the familiar local attention elsewhere. We show that our CD-LM sets new state-of-the-art results for several multi-text tasks, including cross-document event and entity coreference resolution, paper citation recommendation, and documents plagiarism detection, while using a significantly reduced number of training parameters relative to prior works.
摘要:我们为面向多文档NLP任务的语言模型引入了一种新的预训练方法。我们的跨文档语言模型(CD-LM)通过两个关键思想改进了此类任务的掩码语言建模。首先,我们通过跨文档掩码,在单个输入中对多个相关文档进行预训练,从而鼓励模型学习跨文档和长程关系。其次,我们扩展了最近的Longformer模型,在数千个词元的长上下文上进行预训练,并引入了一种新的注意力模式,使用序列级全局注意力来预测被掩码的词元,而在其他位置保留常见的局部注意力。我们表明,CD-LM在多项多文本任务上取得了新的最先进结果,包括跨文档事件和实体共指消解,论文引用推荐以及文档抄袭检测,同时相对于先前工作大大减少了训练参数的数量。
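A minimal sketch of the cross-document masking idea: several related documents are packed into one long input with separator tokens, and tokens are masked across the whole sequence so that predicting them can draw on the other documents. The separator and mask symbols and the masking rate below are illustrative assumptions, not the paper's exact preprocessing.

```python
# Illustrative cross-document masking (not the authors' exact preprocessing):
# related documents are packed into a single long input, and a masked token can be
# predicted from information in the *other* documents of the same input.
import random

def build_cross_doc_example(documents, mask_prob=0.15, seed=0):
    random.seed(seed)
    tokens = ["<s>"]
    for doc in documents:
        tokens += doc.split() + ["</doc>"]   # "</doc>" is an illustrative separator token
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in ("<s>", "</doc>") and random.random() < mask_prob:
            targets[i] = tok                 # position -> gold token to predict
            masked.append("<mask>")
        else:
            masked.append(tok)
    return masked, targets

docs = [
    "Company X announced a merger with Company Y on Monday.",
    "The merger between X and Y was approved by regulators.",
]
masked, targets = build_cross_doc_example(docs)
print(" ".join(masked))
print(targets)
```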
24. Superbizarre Is Not Superb: Improving BERT's Interpretations of Complex Words with Derivational Morphology [PDF] 返回目录
Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze
Abstract: How does the input segmentation of pretrained language models (PLMs) affect their generalization capabilities? We present the first study investigating this question, taking BERT as the example PLM and focusing on the semantic representations of derivationally complex words. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which derivational segmentation consistently outperforms BERT's WordPiece segmentation by a large margin. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.
摘要:预训练语言模型(PLM)的输入切分如何影响其泛化能力?我们以BERT作为示例PLM,聚焦于派生复杂词的语义表示,首次研究了这一问题。我们表明,PLM可以被解释为串行双通路模型,即复杂词的含义要么被直接存储,要么需要由子词计算得到,这意味着最有意义的输入词元应当能对新词实现最佳泛化。一系列语义探测任务证实了这一假设:在这些任务上,派生切分始终大幅优于BERT的WordPiece切分。我们的结果表明,如果使用基于形态学信息的输入词元词表,PLM的泛化能力还可以进一步提高。
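As a toy contrast to WordPiece-style splits, the sketch below segments words with a tiny hand-written affix inventory; the affix lists and rules are illustrative only, and both BERT's actual WordPiece output and the paper's derivational segmenter are considerably more sophisticated.

```python
# Toy derivational segmentation with a tiny affix inventory (for illustration only;
# the paper's segmenter and BERT's actual WordPiece splits are more involved).
PREFIXES = ["super", "un", "re", "over"]
SUFFIXES = ["ness", "ize", "able", "ly"]

def derivational_segment(word):
    parts, core = [], word
    for p in PREFIXES:
        if core.startswith(p) and len(core) > len(p) + 2:
            parts.append(p)
            core = core[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if core.endswith(s) and len(core) > len(s) + 2:
            suffix = s
            core = core[: -len(s)]
            break
    parts.append(core)
    if suffix:
        parts.append(suffix)
    return parts

for w in ["superbizarre", "unkindness", "overreachable"]:
    print(w, "->", derivational_segment(w))
# e.g. superbizarre -> ['super', 'bizarre'], unkindness -> ['un', 'kind', 'ness']
```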
25. Lex-BERT: Enhancing BERT based NER with lexicons [PDF] 返回目录
Wei Zhu, Daniel Cheung
Abstract: In this work, we present Lex-BERT, which incorporates lexicon information into Chinese BERT for named entity recognition (NER) tasks in a natural manner. Instead of using word embeddings and a newly designed transformer layer as in FLAT, we identify the boundaries of words in the sentences using special tokens, and the modified sentence is encoded directly by BERT. Our model does not introduce any new parameters and is more efficient than FLAT. In addition, we do not require any word embeddings accompanying the lexicon collection. Experiments on Ontonotes and ZhCrossNER show that our model outperforms FLAT and other baselines.
摘要:在这项工作中,我们提出了Lex-BERT,它以一种自然的方式将词典信息融入中文BERT,用于命名实体识别(NER)任务。与FLAT使用词嵌入和新设计的Transformer层不同,我们使用特殊标记来标识句子中词的边界,修改后的句子将直接由BERT编码。我们的模型没有引入任何新参数,并且比FLAT更高效。此外,我们不需要词典集合附带任何词嵌入。在Ontonotes和ZhCrossNER上的实验表明,我们的模型优于FLAT和其他基线。
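A small sketch of the boundary-marking idea: lexicon matches in the character sequence are wrapped in special marker tokens before the sentence is encoded by BERT. The marker symbols and the greedy matching heuristic below are illustrative assumptions; the paper's exact special tokens may differ.

```python
# Illustrative lexicon-boundary marking (marker tokens are made up for this sketch).
def mark_lexicon_words(chars, lexicon, max_word_len=4):
    out, i = [], 0
    while i < len(chars):
        match = None
        for L in range(min(max_word_len, len(chars) - i), 1, -1):  # longest match first
            if "".join(chars[i:i + L]) in lexicon:
                match = L
                break
        if match:
            out += ["[w]"] + chars[i:i + match] + ["[/w]"]   # wrap the matched word
            i += match
        else:
            out.append(chars[i])
            i += 1
    # The modified character sequence is then encoded directly by BERT.
    return out

lexicon = {"北京", "天安门"}
print(mark_lexicon_words(list("我爱北京天安门"), lexicon))
# ['我', '爱', '[w]', '北', '京', '[/w]', '[w]', '天', '安', '门', '[/w]']
```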
26. End-to-end Semantic Role Labeling with Neural Transition-based Model [PDF] 返回目录
Hao Fei, Meishan Zhang, Bobo Li, Donghong Ji
Abstract: End-to-end semantic role labeling (SRL) has received increasing interest. It jointly performs the two subtasks of SRL: predicate identification and argument role labeling. Recent work is mostly focused on graph-based neural models, while the transition-based framework with neural networks, which has been widely used in a number of closely related tasks, has not yet been studied for the joint task. In this paper, we present the first work on transition-based neural models for end-to-end SRL. Our transition model incrementally discovers all sentential predicates as well as their arguments by a set of transition actions. The actions of the two subtasks are executed mutually for full interactions. Besides, we suggest high-order compositions to extract non-local features, which can further enhance the proposed transition model. Experimental results on CoNLL09 and the Universal Proposition Bank show that our final model achieves state-of-the-art performance while remaining highly efficient in decoding. We also conduct detailed experimental analysis for a deep understanding of our proposed model.
摘要:端到端语义角色标注(SRL)受到越来越多的关注。它联合执行SRL的两个子任务:谓词识别和论元角色标注。最近的工作主要集中在基于图的神经模型上,而在许多密切相关任务中被广泛使用的基于转移的神经网络框架,尚未针对该联合任务进行研究。在本文中,我们提出了首个用于端到端SRL的基于转移的神经模型。我们的转移模型通过一组转移动作,逐步发现所有句子谓词及其论元。两个子任务的动作相互交替执行,以实现充分交互。此外,我们提出了用于提取非局部特征的高阶组合,可以进一步增强所提出的转移模型。在CoNLL09和Universal Proposition Bank上的实验结果表明,我们的最终模型可以达到最先进的性能,同时保持高效的解码。我们还进行了详细的实验分析,以深入理解所提出的模型。
27. Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering [PDF] 返回目录
Najoung Kim, Ellie Pavlick, Burcu Karagol Ayan, Deepak Ramachandran
Abstract: Many Question-Answering (QA) datasets contain unanswerable questions, but their treatment in QA systems remains primitive. Our analysis of the Natural Questions (Kwiatkowski et al. 2019) dataset reveals that a substantial portion of unanswerable questions ($\sim$21%) can be explained based on the presence of unverifiable presuppositions. We discuss the shortcomings of current models in handling such questions, and describe how an improved system could handle them. Through a user preference study, we demonstrate that the oracle behavior of our proposed system that provides responses based on presupposition failure is preferred over the oracle behavior of existing QA systems. Then we discuss how our proposed system could be implemented, presenting a novel framework that breaks down the problem into three steps: presupposition generation, presupposition verification and explanation generation. We report our progress in tackling each subproblem, and present a preliminary approach to integrating these steps into an existing QA system. We find that adding presuppositions and their verifiability to an existing model yields modest gains in downstream performance and unanswerability detection. The biggest bottleneck is the verification component, which needs to be substantially improved for the integrated system to approach ideal behavior -- even transfer from the best entailment models currently falls short.
摘要:许多问答(QA)数据集都包含无法回答的问题,但QA系统对它们的处理仍然很初级。我们对Natural Questions(Kwiatkowski et al. 2019)数据集的分析表明,相当一部分无法回答的问题(约21%)可以由不可验证的预设的存在来解释。我们讨论了当前模型在处理此类问题上的不足,并描述了改进的系统应如何处理它们。通过用户偏好研究,我们证明了所提系统基于预设失败给出回应的oracle行为,比现有QA系统的oracle行为更受偏好。随后我们讨论了所提系统的实现方式,提出了一个将问题分解为三个步骤的新框架:预设生成,预设验证和解释生成。我们报告了在各个子问题上的进展,并给出了将这些步骤集成到现有QA系统中的初步方法。我们发现,将预设及其可验证性加入现有模型,会在下游性能和不可回答性检测上带来适度的收益。最大的瓶颈是验证组件,集成系统要接近理想行为,该组件还需要大幅改进;即使从目前最好的蕴含模型进行迁移,也仍然不够。
28. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation [PDF] 返回目录
Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux
Abstract: We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We provide speech recognition baselines and validate the versatility of VoxPopuli unlabelled data in semi-supervised learning under challenging out-of-domain settings. We will release the corpus at this https URL under an open license.
摘要:我们介绍了VoxPopuli,这是一个大规模多语言语料库,提供23种语言,共10万小时的未标注语音数据。它是迄今为止用于无监督表示学习和半监督学习的最大开放数据集。VoxPopuli还包含16种语言,共1800小时的转录语音,以及与之对齐的译为另外5种语言的口译语音,总计5100小时。我们提供了语音识别基线,并在具有挑战性的域外设置下验证了VoxPopuli未标注数据在半监督学习中的多功能性。我们将在开放许可下通过此https URL发布该语料库。
29. Multitask Learning for Class-Imbalanced Discourse Classification [PDF] 返回目录
Alexander Spangher, Jonathan May, Sz-rung Shiang, Lingjia Deng
Abstract: Small class-imbalanced datasets, common in many high-level semantic tasks like discourse analysis, present a particular challenge to current deep-learning architectures. In this work, we perform an extensive analysis on sentence-level classification approaches for the News Discourse dataset, one of the largest high-level semantic discourse datasets recently published. We show that a multitask approach can improve 7% Micro F1-score upon current state-of-the-art benchmarks, due in part to label corrections across tasks, which improve performance for underrepresented classes. We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP, and show that none of these approaches can improve classification accuracy in such a setting.
摘要:类别不平衡的小规模数据集在话语分析等许多高层语义任务中很常见,这给当前的深度学习架构带来了特殊的挑战。在这项工作中,我们对News Discourse数据集(最近发布的最大的高层语义话语数据集之一)上的句子级分类方法进行了广泛分析。我们表明,多任务方法可以在当前最先进的基准之上将Micro F1分数提高7%,这部分归功于跨任务的标签校正,从而提升了代表性不足类别的表现。我们还对为解决NLP中资源匮乏问题而提出的其他技术进行了比较评述,并表明这些方法在这种设置下都无法提高分类准确率。
30. A Robust and Domain-Adaptive Approach for Low-Resource Named Entity Recognition [PDF] 返回目录
Houjin Yu, Xian-Ling Mao, Zewen Chi, Wei Wei, Heyan Huang
Abstract: Recently, building reliable named entity recognition (NER) systems using limited annotated data has attracted much attention. Nearly all existing works heavily rely on domain-specific resources, such as external lexicons and knowledge bases. However, such domain-specific resources are often not available, while constructing them is difficult and expensive, which has become a key obstacle to wider adoption. To tackle this problem, in this work we propose RDANER, a novel robust and domain-adaptive approach for low-resource NER which only uses cheap and easily obtainable resources. Extensive experiments on three benchmark datasets demonstrate that our approach achieves the best performance when only using cheap and easily obtainable resources, and delivers competitive results against state-of-the-art methods that use difficult-to-obtain domain-specific resources. All our code and corpora can be found at this https URL.
摘要:最近,使用有限的注释数据构建可靠的命名实体识别(NER)系统引起了广泛的关注。 几乎所有现有作品都严重依赖特定领域的资源,例如外部词典和知识库。 但是,此类特定于域的资源通常不可用,同时,构造资源非常困难且昂贵,这已成为广泛采用的主要障碍。 为了解决该问题,在这项工作中,我们提出了一种针对低资源NER的新颖的,健壮且具有领域自适应性的方法RDANER,该方法仅使用便宜且易于获得的资源。 在三个基准数据集上进行的广泛实验表明,仅使用便宜且易于获得的资源时,我们的方法即可达到最佳性能,并且与使用难以获得的特定领域资源的最新方法相比,可以提供有竞争力的结果。 我们所有的代码和语料库都可以在此https URL上找到。
31. What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure [PDF] 返回目录
Jui Shah, Yaman Kumar Singla, Changyou Chen, Rajiv Ratn Shah
Abstract: In recent times, BERT-based transformer models have become an inseparable part of the 'tech stack' of text processing models. Similar progress is being observed in the speech domain, with a multitude of models achieving state-of-the-art results by using audio transformer models to encode speech. This begs the question of what these audio transformer models are learning. Moreover, although the standard methodology is to choose the last-layer embedding for any downstream task, is it the optimal choice? We try to answer these questions for the two recent audio transformer models, Mockingjay and wave2vec2.0. We compare them on a comprehensive set of language delivery and structure features including audio, fluency and pronunciation features. Additionally, we probe the audio models' understanding of textual surface, syntax, and semantic features and compare them to BERT. We do this over exhaustive settings for native, non-native, synthetic, read and spontaneous speech datasets.
摘要:近来,基于BERT的Transformer模型已成为文本处理模型"技术栈"中不可分割的一部分。语音领域也出现了类似的进展:大量模型通过使用音频Transformer模型对语音进行编码而取得了最先进的结果。这引出了一个问题:这些音频Transformer模型到底学到了什么?此外,尽管标准做法是为任何下游任务选择最后一层的嵌入,但这是否是最优选择?我们尝试针对最近的两个音频Transformer模型Mockingjay和wave2vec2.0回答这些问题。我们在一整套语言表达与结构特征(包括音频,流利度和发音特征)上对它们进行比较。此外,我们还探究了这些音频模型对文本表层,句法和语义特征的理解,并与BERT进行比较。我们在本族语,非本族语,合成,朗读和自发语音数据集的详尽设置上完成了上述工作。
32. The Truth is Out There: Investigating Conspiracy Theories in Text Generation [PDF] 返回目录
Sharon Levy, Michael Saxon, William Yang Wang
Abstract: With the growing adoption of text generation models in today's society, users are increasingly exposed to machine-generated text. This in turn can leave users vulnerable to the generation of harmful information such as conspiracy theories. While the propagation of conspiracy theories through social media has been studied, previous work has not evaluated their diffusion through text generation. In this work, we investigate the propensity for language models to generate conspiracy theory text. Our study focuses on testing these models for the elicitation of conspiracy theories and comparing these generations to human-written theories from Reddit. We also introduce a new dataset consisting of conspiracy theory topics, machine-generated conspiracy theories, and human-written conspiracy theories. Our experiments show that many well-known conspiracy theory topics are deeply rooted in the pre-trained language models, and can become more prevalent through different model settings.
摘要:随着文本生成模型在当今社会的日益普及,用户越来越多地接触到机器生成的文本。这反过来可能使用户容易接触到阴谋论等有害信息。虽然阴谋论在社交媒体上的传播已有研究,但先前的工作尚未评估其通过文本生成的扩散。在这项工作中,我们调查了语言模型生成阴谋论文本的倾向。我们的研究重点是测试这些模型对阴谋论的诱发,并将这些生成结果与Reddit上由人类撰写的阴谋论进行比较。我们还引入了一个新的数据集,其中包含阴谋论主题,机器生成的阴谋论和人类撰写的阴谋论。我们的实验表明,许多著名的阴谋论主题深深植根于预训练语言模型中,并且在不同的模型设置下可能变得更加普遍。
33. RiddleSense: Answering Riddle Questions as Commonsense Reasoning [PDF] 返回目录
Bill Yuchen Lin, Ziyi Wu, Yichi Yang, Dong-Ho Lee, Xiang Ren
Abstract: A riddle is a mystifying, puzzling question about everyday concepts. For example, the riddle "I have five fingers but I am not alive. What am I?" asks about the concept of a glove. Solving riddles is a challenging cognitive process for humans, in that it requires complex commonsense reasoning abilities and an understanding of figurative language. However, there are currently no commonsense reasoning datasets that test these abilities. We propose RiddleSense, a novel multiple-choice question answering challenge for benchmarking higher-order commonsense reasoning models, which is the first large dataset for riddle-style commonsense question answering, where the distractors are crowdsourced from human annotators. We systematically evaluate a wide range of reasoning models over it and point out that there is a large gap between the best-supervised model and human performance -- pointing to interesting future research for higher-order commonsense reasoning and computational creativity.
摘要:谜语是关于日常概念的神秘而令人困惑的问题。例如,谜语"我有五根手指,但我没有生命。我是什么?"问的是手套这一概念。解谜对人类来说是一个具有挑战性的认知过程,因为它需要复杂的常识推理能力和对比喻语言的理解。然而,目前还没有测试这些能力的常识推理数据集。我们提出了RiddleSense,这是一个用于对高阶常识推理模型进行基准测试的新颖多项选择问答挑战,也是第一个谜语式常识问答的大型数据集,其中的干扰项由人类标注者众包而来。我们在其上系统地评估了各种推理模型,并指出受监督的最佳模型与人类表现之间仍存在很大差距,这为高阶常识推理和计算创造力方面的未来研究指明了有趣的方向。
34. On-the-Fly Attention Modularization for Neural Generation [PDF] 返回目录
Yue Dong, Chandra Bhagavatula, Ximing Lu, Jena D. Hwang, Antoine Bosselut, Jackie Chi Kit Cheung, Yejin Choi
Abstract: Despite considerable advancements with deep neural language models (LMs), neural text generation still suffers from degeneration: generated text is repetitive, generic, self-inconsistent, and lacking commonsense. The empirical analyses on sentence-level attention patterns reveal that neural text degeneration may be associated with insufficient learning of inductive biases by the attention mechanism. Our findings motivate on-the-fly attention modularization, a simple but effective method for injecting inductive biases into attention computation during inference. The resulting text produced by the language model with attention modularization can yield enhanced diversity and commonsense reasoning while maintaining fluency and coherence.
摘要:尽管深度神经语言模型(LM)有了长足的进步,但神经文本生成仍然遭受退化:生成的文本是重复的,通用的,自相矛盾的并且缺乏常识。 对句子级注意力模式的实证分析表明,神经文本退化可能与注意力机制对归纳性偏向的学习不足有关。 我们的发现激发了即时注意力模块化,这是一种在推理过程中将归纳偏差注入注意力计算的简单但有效的方法。 语言模型产生的带有注意模块化的结果文本可以增强多样性和常识性推理,同时保持流利性和连贯性。
35. Modeling Fine-Grained Entity Types with Box Embeddings [PDF] 返回目录
Yasumasa Onoe, Michael Boratko, Greg Durrett
Abstract: Neural entity typing models typically represent entity types as vectors in a high-dimensional space, but such spaces are not well-suited to modeling these types' complex interdependencies. We study the ability of box embeddings, which represent entity types as d-dimensional hyperrectangles, to represent hierarchies of fine-grained entity type labels even when these relationships are not defined explicitly in the ontology. Our model represents both types and entity mentions as boxes. Each mention and its context are fed into a BERT-based model to embed that mention in our box space; essentially, this model leverages typological clues present in the surface text to hypothesize a type representation for the mention. Soft box containment can then be used to derive probabilities, both the posterior probability of a mention exhibiting a given type and the conditional probability relations between types themselves. We compare our approach with a strong vector-based typing model, and observe state-of-the-art performance on several entity typing benchmarks. In addition to competitive typing performance, our box-based model shows better performance in prediction consistency (predicting a supertype and a subtype together) and confidence (i.e., calibration), implying that the box-based model captures the latent type hierarchies better than the vector-based model does.
摘要:神经实体类型模型通常将实体类型表示为高维空间中的向量,但此类空间并不适合对这些类型之间复杂的相互依赖关系进行建模。我们研究了将实体类型表示为d维超矩形的盒嵌入(box embeddings)表示细粒度实体类型标签层次结构的能力,即使这些关系在本体中没有被显式定义。我们的模型将类型和实体提及都表示为盒。每个提及及其上下文被输入到一个基于BERT的模型中,以便将该提及嵌入到我们的盒空间中;本质上,该模型利用表层文本中的类型线索来假设该提及的类型表示。随后可以利用软性盒包含来推导概率,包括提及呈现某一给定类型的后验概率,以及类型之间的条件概率关系。我们将该方法与一个强大的基于向量的类型预测模型进行比较,并在多个实体类型预测基准上观察到最先进的性能。除了有竞争力的类型预测性能外,我们的基于盒的模型在预测一致性(同时预测超类型和子类型)和置信度(即校准)方面表现更好,这表明基于盒的模型比基于向量的模型更好地捕获了潜在的类型层次结构。
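The containment probability mentioned above can be illustrated with hard min/max box volumes: the probability that a mention exhibits a type is approximated by the volume of the intersection of the two boxes divided by the volume of the mention box. The numbers below are illustrative, and real soft-box models replace the hard clamp with a smoothed ("soft") volume.

```python
# Simplified box-containment probability P(type | mention) ~ Vol(type ∩ mention) / Vol(mention).
# Real soft-box models use a smooth volume instead of the hard clamp shown here.
import numpy as np

def box_volume(lower, upper):
    return float(np.prod(np.clip(upper - lower, 0.0, None)))

def containment_prob(inner_lower, inner_upper, outer_lower, outer_upper):
    # Volume of the intersection divided by the volume of the "inner" (mention) box.
    inter_lower = np.maximum(inner_lower, outer_lower)
    inter_upper = np.minimum(inner_upper, outer_upper)
    vol_inner = box_volume(inner_lower, inner_upper)
    return box_volume(inter_lower, inter_upper) / vol_inner if vol_inner > 0 else 0.0

# Illustrative 2-d boxes: a broad type box and a mention box mostly inside it.
type_lo, type_hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
mention_lo, mention_hi = np.array([0.2, 0.3]), np.array([0.6, 1.2])
print(containment_prob(mention_lo, mention_hi, type_lo, type_hi))  # ~0.78
```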
36. Understanding Few-Shot Commonsense Knowledge Models [PDF] 返回目录
Jeff Da, Ronan Le Bras, Ximing Lu, Yejin Choi, Antoine Bosselut
Abstract: Providing natural language processing systems with commonsense knowledge is a critical challenge for achieving language understanding. Recently, commonsense knowledge models have emerged as a suitable approach for hypothesizing situation-relevant commonsense knowledge on-demand in natural language applications. However, these systems are limited by the fixed set of relations captured by schemas of the knowledge bases on which they're trained. To address this limitation, we investigate training commonsense knowledge models in a few-shot setting with limited tuples per commonsense relation in the graph. We perform five separate studies on different dimensions of few-shot commonsense knowledge learning, providing a roadmap on best practices for training these systems efficiently. Importantly, we find that human quality ratings for knowledge produced from a few-shot trained system can achieve performance within 6% of knowledge produced from fully supervised systems. This few-shot performance enables coverage of a wide breadth of relations in future commonsense systems.
摘要:为自然语言处理系统提供常识知识是实现语言理解的关键挑战。最近,常识知识模型已成为在自然语言应用中按需假设与情境相关的常识知识的一种合适方法。然而,这些系统受限于其训练所用知识库模式所刻画的固定关系集合。为了解决这一限制,我们研究在小样本设置下训练常识知识模型,即图中每种常识关系仅有有限的元组。我们针对小样本常识知识学习的不同维度进行了五项独立研究,为高效训练此类系统的最佳实践提供了路线图。重要的是,我们发现对小样本训练系统所产生知识的人工质量评分,可以达到完全监督系统所产生知识的6%以内。这种小样本性能使未来的常识系统能够覆盖范围广泛的关系。
37. Reader-Guided Passage Reranking for Open-Domain Question Answering [PDF] 返回目录
Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, Weizhu Chen
Abstract: Current open-domain question answering (QA) systems often follow a Retriever-Reader (R2) architecture, where the retriever first retrieves relevant passages and the reader then reads the retrieved passages to form an answer. In this paper, we propose a simple and effective passage reranking method, Reader-guIDEd Reranker (Rider), which does not involve any training and reranks the retrieved passages solely based on the top predictions of the reader before reranking. We show that Rider, despite its simplicity, achieves 10 to 20 absolute gains in top-1 retrieval accuracy and 1 to 4 Exact Match (EM) score gains without refining the retriever or reader. In particular, Rider achieves 48.3 EM on the Natural Questions dataset and 66.4 on the TriviaQA dataset when only 1,024 tokens (7.8 passages on average) are used as the reader input.
摘要:当前的开放域问答(QA)系统通常采用检索器-阅读器(Retriever-Reader, R2)架构:检索器首先检索相关段落,阅读器随后阅读检索到的段落以形成答案。在本文中,我们提出了一种简单有效的段落重排序方法,读者引导的重排序器(Rider),它不需要任何训练,仅根据阅读器在重排序之前的最高预测对检索到的段落重新排序。我们表明,尽管方法简单,Rider在不改进检索器或阅读器的情况下,可以在前1名检索准确率上取得10到20个百分点的绝对提升,并带来1到4个精确匹配(EM)分数的提升。特别地,当仅使用1024个词元(平均7.8个段落)作为阅读器输入时,Rider在Natural Questions数据集上达到48.3 EM,在TriviaQA数据集上达到66.4 EM。
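A minimal sketch of reader-guided reranking: passages containing any of the reader's top predicted answer strings are promoted to the front, with the original retrieval order preserved within each group. The matching heuristic below is simplified and the example data is made up; the paper's exact rules may differ.

```python
# Illustrative reader-guided reranking (simplified; the actual Rider heuristics may differ):
# promote passages containing any of the reader's top predicted answers, keeping the
# original retrieval order within the promoted and non-promoted groups.
def rider_rerank(passages, top_predictions):
    preds = [p.lower() for p in top_predictions]
    hits = [p for p in passages if any(a in p.lower() for a in preds)]
    misses = [p for p in passages if p not in hits]
    return hits + misses

passages = [
    "The Great Fire of London occurred in 1666.",
    "London is the capital of the United Kingdom.",
    "The fire started in a bakery on Pudding Lane.",
]
top_predictions = ["Pudding Lane", "1667"]
for p in rider_rerank(passages, top_predictions):
    print(p)
```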
38. Polyjuice: Automated, General-purpose Counterfactual Generation [PDF] 返回目录
Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, Daniel S. Weld
Abstract: Counterfactual examples have been shown to be useful for many applications, including calibrating, evaluating, and explaining model decision boundaries. However, previous methods for generating such counterfactual examples have been tightly tailored to a specific application, used a limited range of linguistic patterns, or are hard to scale. We propose to disentangle counterfactual generation from its use cases, i.e., gather general-purpose counterfactuals first, and then select them for specific applications. We frame the automated counterfactual generation as text generation, and finetune GPT-2 into a generator, Polyjuice, which produces fluent and diverse counterfactuals. Our method also allows control over where perturbations happen and what they do. We show Polyjuice supports multiple use cases: by generating diverse counterfactuals for humans to label, Polyjuice helps produce high-quality datasets for model training and evaluation, requiring 40% less human effort. When used to generate explanations, Polyjuice helps augment feature attribution methods to reveal models' erroneous behaviors.
摘要:反事实示例已被证明在许多应用中非常有用,包括校准,评估和解释模型决策边界。然而,以往生成此类反事实示例的方法要么针对特定应用高度定制,要么只使用有限的语言模式,要么难以扩展。我们建议将反事实生成与其用例解耦,即先收集通用的反事实,再针对特定应用进行挑选。我们将自动反事实生成视为文本生成任务,并将GPT-2微调为一个生成器Polyjuice,它能够生成流畅且多样的反事实。我们的方法还允许控制扰动发生的位置及其作用。我们展示了Polyjuice支持多种用例:通过生成多样的反事实供人类标注,Polyjuice帮助生成用于模型训练和评估的高质量数据集,所需的人力减少40%。当用于生成解释时,Polyjuice有助于增强特征归因方法,以揭示模型的错误行为。
39. Semantic Parsing with Less Prior and More Monolingual Data [PDF] 返回目录
Sajad Norouzi, Yanshuai Cao
Abstract: Semantic parsing is the task of converting natural language utterances to machine-understandable meaning representations, such as logic forms or programming languages. Training datasets for semantic parsing are typically small due to the higher expertise required for annotation than most other NLP tasks. As a result, models for this application usually require additional prior knowledge to be built into the architecture or algorithm. The increased dependency on human experts hinders automation and raises the development and maintenance costs in practice. This work investigates whether a generic transformer-based seq2seq model can achieve competitive performance with minimal semantic-parsing specific inductive bias design. By exploiting a relatively large monolingual corpus of the target programming language, which is cheap to mine from the web, unlike a parallel corpus, we achieved 80.75% exact match accuracy on Django and 32.57 BLEU score on CoNaLa, both are SOTA to the best of our knowledge. This positive evidence highlights a potentially easier path toward building accurate semantic parsers in the wild.
摘要:语义解析是将自然语言话语转换为机器可理解的语义表示(例如逻辑形式或编程语言)的任务。与大多数其他NLP任务相比,语义解析的标注需要更高的专业知识,因此其训练数据集通常很小。结果,用于该应用的模型通常需要在架构或算法中引入额外的先验知识。对人类专家依赖的增加阻碍了自动化,并在实践中提高了开发和维护成本。这项工作研究了一个通用的基于Transformer的seq2seq模型能否在最少的语义解析特定归纳偏置设计下取得有竞争力的性能。与并行语料库不同,目标编程语言的单语语料库可以从网络上低成本地挖掘;通过利用一个相对较大的此类单语语料库,我们在Django上达到了80.75%的精确匹配准确率,在CoNaLa上达到了32.57的BLEU分数,据我们所知两者均为最先进水平。这一积极的证据表明,在真实环境中构建准确的语义解析器可能存在一条更容易的路径。
40. Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers [PDF] 返回目录
Machel Reid, Edison Marrese-Taylor, Yutaka Matsuo
Abstract: The advent of the Transformer can arguably be described as a driving force behind many of the recent advances in natural language processing. However, despite their sizeable performance improvements, as recently shown, the model is severely over-parameterized, being parameter inefficient and computationally expensive to train. Inspired by the success of parameter-sharing in pretrained deep contextualized word representation encoders, we explore parameter-sharing methods in Transformers, with a specific focus on encoder-decoder models for sequence-to-sequence tasks such as neural machine translation. We perform an analysis of different parameter sharing/reduction methods and develop the Subformer, a parameter efficient Transformer-based model which combines the newly proposed Sandwich-style parameter sharing technique - designed to overcome the deficiencies in naive cross-layer parameter sharing for generative models - and self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
摘要:Transformer的出现可以说是自然语言处理许多最新进展背后的推动力。然而,尽管性能提升显著,最近的研究表明该模型严重过度参数化,参数效率低且训练计算开销大。受预训练深度上下文化词表示编码器中参数共享成功的启发,我们探索了Transformer中的参数共享方法,特别关注用于序列到序列任务(例如神经机器翻译)的编码器-解码器模型。我们分析了不同的参数共享/缩减方法,并开发了Subformer,这是一种参数高效的基于Transformer的模型,它结合了新提出的三明治式(Sandwich-style)参数共享技术(旨在克服生成模型中朴素跨层参数共享的不足)以及自注意力嵌入分解(SAFE)。在机器翻译,抽象摘要和语言建模上的实验表明,即使使用显著更少的参数,Subformer的性能也能优于Transformer。
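As a rough illustration of why embedding factorization saves parameters, the sketch below compares a full vocabulary embedding with a factorized one (a small embedding followed by an up-projection). The sizes are assumed for illustration, this is a plain linear factorization rather than the paper's self-attentive SAFE variant, and the sandwich-style layer sharing is not shown at all.

import torch.nn as nn

vocab, d_model, d_small = 32000, 512, 128   # assumed sizes for illustration

full = nn.Embedding(vocab, d_model)                       # V x d_model
factored = nn.Sequential(                                 # V x d_small plus d_small x d_model
    nn.Embedding(vocab, d_small),
    nn.Linear(d_small, d_model, bias=False),
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(full), n_params(factored))   # 16,384,000 vs 4,161,536 parameters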
41. BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding [PDF] 返回目录
Abhik Bhattacharjee, Tahmid Hasan, Kazi Samin, M. Sohel Rahman, Anindya Iqbal, Rifat Shahriyar
Abstract: Pre-training language models on large volume of data with self-supervised objectives has become a standard practice in natural language processing. However, most such state-of-the-art models are available in only English and other resource-rich languages. Even in multilingual models, which are trained on hundreds of languages, low-resource ones still remain underrepresented. Bangla, the seventh most widely spoken language in the world, is still low in terms of resources. Few downstream task datasets for language understanding in Bangla are publicly available, and there is a clear shortage of good quality data for pre-training. In this work, we build a Bangla natural language understanding model pre-trained on 18.6 GB data we crawled from top Bangla sites on the internet. We introduce a new downstream task dataset and benchmark on four tasks on sentence classification, document classification, natural language understanding, and sequence tagging. Our model outperforms multilingual baselines and previous state-of-the-art results by 1-6%. In the process, we identify a major shortcoming of multilingual models that hurt performance for low-resource languages that don't share writing scripts with any high resource one, which we name the `Embedding Barrier'. We perform extensive experiments to study this barrier. We release all our datasets and pre-trained models to aid future NLP research on Bangla and other low-resource languages. Our code and data are available at this https URL.
摘要:在大规模数据上以自监督目标预训练语言模型已成为自然语言处理中的标准做法。但是,大多数此类最先进的模型仅适用于英语和其他资源丰富的语言。即使在以数百种语言训练的多语言模型中,低资源语言的代表性仍然不足。孟加拉语是世界上使用人数第七多的语言,但其资源仍然匮乏。孟加拉语中可公开获得的语言理解下游任务数据集很少,并且明显缺乏可用于预训练的高质量数据。在这项工作中,我们构建了一个孟加拉语自然语言理解模型,它在我们从互联网上孟加拉语主要站点抓取的18.6 GB数据上进行了预训练。我们引入了一个新的下游任务数据集,并在句子分类,文档分类,自然语言理解和序列标注这四项任务上建立了基准。我们的模型比多语言基线和以前的最新结果高出1-6%。在此过程中,我们发现了多语言模型的一个主要缺陷:对于不与任何高资源语言共享书写文字的低资源语言,其性能会受到损害,我们将其称为"嵌入障碍"(Embedding Barrier)。我们进行了大量实验来研究这一障碍。我们发布了所有数据集和预训练模型,以帮助未来针对孟加拉语和其他低资源语言的NLP研究。我们的代码和数据可从此https URL获得。
42. On Explaining Your Explanations of BERT: An Empirical Study with Sequence Classification [PDF] 返回目录
Zhengxuan Wu, Desmond C. Ong
Abstract: BERT, as one of the pretrained language models, has attracted the most attention in recent years for creating new benchmarks across GLUE tasks via fine-tuning. One pressing issue is to open up the black box and explain the decision making of BERT. A number of attribution techniques have been proposed to explain BERT models, but they are often limited to sequence-to-sequence tasks. In this paper, we adapt existing attribution methods to explaining the decision making of BERT in sequence classification tasks. We conduct extensive analyses of four existing attribution methods by applying them to four different datasets in sentiment analysis. We compare the reliability and robustness of each method via various ablation studies. Furthermore, we test whether attribution methods explain generalized semantics across semantically similar tasks. Our work provides solid guidance for using attribution methods to explain the decision making of BERT for downstream classification tasks.
摘要:BERT作为预训练语言模型之一,近年来因通过微调在GLUE各项任务上创造新基准而受到最多关注。一个紧迫的问题是打开这一黑盒并解释BERT的决策。已有多种归因技术被提出用于解释BERT模型,但它们通常局限于序列到序列任务。在本文中,我们将现有的归因方法用于解释BERT在序列分类任务中的决策。我们将四种现有的归因方法应用于情感分析中的四个不同数据集,进行了广泛的分析。我们通过各种消融研究比较每种方法的可靠性和鲁棒性。此外,我们测试了归因方法能否在语义相似的任务之间解释泛化的语义。我们的工作为使用归因方法解释BERT在下游分类任务中的决策提供了坚实的指导。
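One common attribution family in such comparisons is gradient-based saliency. The sketch below computes input-times-gradient token attributions on a toy embedding classifier standing in for BERT; the model, token ids, and mean-pooling are placeholders, not the paper's setup or its specific methods.

import torch
import torch.nn as nn

torch.manual_seed(0)
emb, head = nn.Embedding(100, 16), nn.Linear(16, 2)   # toy stand-in for a sequence classifier
tokens = torch.tensor([[5, 17, 42, 8]])

e = emb(tokens)                    # (1, T, d) token embeddings
e.retain_grad()                    # keep gradients for this non-leaf tensor
logits = head(e.mean(dim=1))       # mean-pool then classify
pred = logits[0].argmax()
logits[0, pred].backward()         # gradient of the predicted-class score

attribution = (e.grad * e).sum(dim=-1)   # input x gradient, one score per token
print(attribution)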
43. Prefix-Tuning: Optimizing Continuous Prompts for Generation [PDF] 返回目录
Xiang Lisa Li, Percy Liang
Abstract: Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were "virtual tokens". We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We find that by learning only 0.1\% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.
摘要:微调是利用大型预训练语言模型执行下游任务的事实标准方法。但是,它会修改所有语言模型参数,因此需要为每个任务存储完整副本。在本文中,我们提出前缀调优(prefix-tuning),这是一种针对自然语言生成任务的轻量级微调替代方案:它保持语言模型参数冻结,只优化一个小的,连续的,任务特定的向量(称为前缀)。前缀调优从提示(prompting)中汲取灵感,允许后续词元像对待"虚拟词元"一样关注该前缀。我们将前缀调优应用于GPT-2以进行表到文本生成,并应用于BART以进行摘要。我们发现,仅学习0.1%的参数,前缀调优在完整数据设置下可获得可比的性能,在低数据设置下优于微调,并且能更好地外推到训练期间未见主题的示例。
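A minimal sketch of the frozen-LM / trainable-prefix split, using a toy recurrent LM as a stand-in for GPT-2. The paper's method conditions every attention layer on the prefix rather than only the input embeddings, so treat this as a simplified illustration under assumed sizes.

import torch
import torch.nn as nn

class TinyLM(nn.Module):            # stand-in for a pretrained LM that accepts input embeddings
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)
    def forward(self, inputs_embeds):
        h, _ = self.rnn(inputs_embeds)
        return self.head(h)

lm = TinyLM()
for p in lm.parameters():
    p.requires_grad_(False)                                   # the language model stays frozen

prefix_len, dim, vocab = 5, 64, 1000
prefix = nn.Parameter(torch.randn(1, prefix_len, dim) * 0.02)  # the only trainable weights
optimizer = torch.optim.Adam([prefix], lr=1e-3)

tokens = torch.randint(0, vocab, (1, 12))                 # toy training sequence
inputs = torch.cat([prefix, lm.embed(tokens)], dim=1)     # prepend the continuous prefix
logits = lm(inputs)[:, prefix_len:-1]                     # next-token predictions for the text part
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()                                          # only the prefix is updated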
44. Transformer based Automatic COVID-19 Fake News Detection System [PDF] 返回目录
Sunil Gundapu, Radhika Mamidi
Abstract: Recent rapid technological advancements in online social networks such as Twitter have led to a steep rise in the spread of false information and fake news. Misinformation is especially prevalent in the ongoing coronavirus disease (COVID-19) pandemic, leading to individuals accepting bogus and potentially deleterious claims and articles. Quick detection of fake news can reduce the spread of panic and confusion among the public. For our analysis in this paper, we report a methodology to analyze the reliability of information shared on social media pertaining to the COVID-19 pandemic. Our best approach is based on an ensemble of three transformer models (BERT, ALBERT, and XLNet) for detecting fake news. This model was trained and evaluated in the context of the ConstraintAI 2021 shared task COVID19 Fake News Detection in English. Our system obtained a 0.9855 F1-score on the test set and ranked 5th among 110 teams.
摘要:Twitter等在线社交网络近来的快速技术进步,导致虚假信息和假新闻的传播急剧增加。错误信息在持续的冠状病毒病(COVID-19)大流行期间尤其普遍,导致人们接受虚假且可能有害的说法和文章。快速检测假新闻可以减少恐慌和困惑在公众中的蔓延。在本文的分析中,我们报告了一种方法,用于分析社交媒体上与COVID-19大流行相关的共享信息的可靠性。我们的最佳方法基于三种Transformer模型(BERT,ALBERT和XLNet)的集成来检测假新闻。该模型是在ConstraintAI 2021共享任务"英语COVID19假新闻检测"中进行训练和评估的。我们的系统在测试集上获得0.9855的F1分数,在110个团队中排名第5。
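A minimal sketch of the soft-voting idea behind such an ensemble: average class probabilities from the three fine-tuned classifiers and take the argmax. The probability values are made up, and the paper may combine models differently (e.g., weighted voting).

import numpy as np

# Hypothetical P(fake), P(real) for one tweet from the three fine-tuned classifiers.
probs = np.array([
    [0.92, 0.08],   # BERT
    [0.85, 0.15],   # ALBERT
    [0.60, 0.40],   # XLNet
])
avg = probs.mean(axis=0)                          # soft voting
print(avg, ["fake", "real"][int(avg.argmax())])   # -> [0.79 0.21] fake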
45. UnitedQA: A Hybrid Approach for Open Domain Question Answering [PDF] 返回目录
Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao
Abstract: To date, most of recent work under the retrieval-reader framework for open-domain QA focuses on either extractive or generative reader exclusively. In this paper, we study a hybrid approach for leveraging the strengths of both models. We apply novel techniques to enhance both extractive and generative readers built upon recent pretrained neural language models, and find that proper training methods can provide large improvement over previous state-of-the-art models. We demonstrate that a simple hybrid approach by combining answers from both readers can efficiently take advantages of extractive and generative answer inference strategies and outperforms single models as well as homogeneous ensembles. Our approach outperforms previous state-of-the-art models by 3.3 and 2.7 points in exact match on NaturalQuestions and TriviaQA respectively.
摘要:迄今为止,开放域问答(QA)中检索器-阅读器框架下的大多数最新工作都只专注于抽取式或生成式阅读器。在本文中,我们研究了一种利用两种模型优势的混合方法。我们应用新颖的技术来增强基于最近预训练神经语言模型构建的抽取式和生成式阅读器,并发现适当的训练方法可以带来比以前最先进模型更大的改进。我们证明,通过组合两种阅读器的答案,一种简单的混合方法可以有效利用抽取式和生成式答案推理策略的优势,并优于单个模型以及同质集成。我们的方法在NaturalQuestions和TriviaQA的精确匹配上分别比以前的最先进模型高3.3和2.7分。
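The combination step can be pictured as weighted voting over normalized answer strings from the two readers; the weights, scores, and normalization below are illustrative assumptions rather than the paper's exact scheme.

from collections import defaultdict

def combine(extractive, generative, w_ext=0.5, w_gen=0.5):
    # extractive / generative: lists of (answer string, score) from each reader
    norm = lambda s: " ".join(s.lower().strip().split())
    scores = defaultdict(float)
    for ans, s in extractive:
        scores[norm(ans)] += w_ext * s
    for ans, s in generative:
        scores[norm(ans)] += w_gen * s
    return max(scores, key=scores.get)

print(combine([("Paris", 0.9), ("Lyon", 0.4)],
              [("paris", 0.8), ("Marseille", 0.3)]))   # -> paris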
46. Unifying Discourse Resources with Dependency Framework [PDF] 返回目录
Yi Cheng, Sujian Li, Yueyuan Li
Abstract: For text-level discourse analysis, there are various discourse schemes but relatively few labeled data, because discourse research is still immature and it is labor-intensive to annotate the inner logic of a text. In this paper, we attempt to unify multiple Chinese discourse corpora under different annotation schemes with discourse dependency framework by designing semi-automatic methods to convert them into dependency structures. We also implement several benchmark dependency parsers and research on how they can leverage the unified data to improve performance.
摘要:在篇章级的话语分析中,存在多种话语标注体系,但带标注的数据相对较少,这是因为话语研究仍不成熟,并且标注文本的内在逻辑十分费力。在本文中,我们尝试通过设计半自动方法,在话语依存框架下统一采用不同标注体系的多个中文话语语料库,将它们转换为依存结构。我们还实现了若干基准依存分析器,并研究它们如何利用统一后的数据来提高性能。
47. How Do Your Biomedical Named Entity Models Generalize to Novel Entities? [PDF] 返回目录
Hyunjae Kim, Jaewoo Kang
Abstract: The number of biomedical literature on new biomedical concepts is rapidly increasing, which necessitates a reliable biomedical named entity recognition (BioNER) model for identifying new and unseen entity mentions. However, it is questionable whether existing BioNER models can effectively handle them. In this work, we systematically analyze the three types of recognition abilities of BioNER models: memorization, synonym generalization, and concept generalization. We find that (1) BioNER models are overestimated in terms of their generalization ability, and (2) they tend to exploit dataset biases, which hinders the models' abilities to generalize. To enhance the generalizability, we present a simple debiasing method based on the data statistics. Our method consistently improves the generalizability of the state-of-the-art (SOTA) models on five benchmark datasets, allowing them to better perform on unseen entity mentions.
摘要:关于新生物医学概念的生物医学文献数量正在迅速增加,这需要可靠的生物医学命名实体识别(BioNER)模型来识别新的和未见的实体提及。 但是,现有的BioNER模型是否可以有效地处理它们值得怀疑。 在这项工作中,我们系统地分析了BioNER模型的三种识别能力:记忆,同义词泛化和概念泛化。 我们发现(1)BioNER模型的泛化能力被高估了;(2)它们倾向于利用数据集偏差,从而阻碍了模型的泛化能力。 为了增强通用性,我们提出了一种基于数据统计的简单去偏方法。 我们的方法不断提高五个基准数据集上最新技术(SOTA)模型的可推广性,从而使它们在看不见的实体提及方面表现更好。
48. DISCOS: Bridging the Gap between Discourse Knowledge and Commonsense Knowledge [PDF] 返回目录
Tianqing Fang, Hongming Zhang, Weiqi Wang, Yangqiu Song, Bin He
Abstract: Commonsense knowledge is crucial for artificial intelligence systems to understand natural language. Previous commonsense knowledge acquisition approaches typically rely on human annotations (e.g., ATOMIC) or text generation models (e.g., COMET). Human annotation could provide high-quality commonsense knowledge, yet its high cost often results in relatively small scale and low coverage. On the other hand, generation models have the potential to automatically generate more knowledge. Nonetheless, machine learning models often fit the training data too well to generate novel knowledge in high quality, thus still suffering from coverage problems. To address the limitations of previous approaches, in this paper, we propose an alternative commonsense knowledge acquisition framework DISCOS (from DIScourse to COmmonSense), which automatically mines expensive complex commonsense knowledge from more affordable linguistic knowledge resources. Experiments demonstrate that we can successfully convert discourse knowledge over eventualities from ASER, a large-scale discourse knowledge graph, into inferential if-then commonsense knowledge defined in ATOMIC without any additional annotation effort. Further study suggests that DISCOS significantly outperforms previous supervised approaches in terms of novelty and diversity with comparable quality. In total, we can acquire 3.4M ATOMIC-like inferential commonsense knowledge by populating ATOMIC on the core part of ASER. Codes and data are available at this https URL.
摘要:常识知识对于人工智能系统理解自然语言至关重要。先前的常识知识获取方法通常依赖于人类注释(例如ATOMIC)或文本生成模型(例如COMET)。人工注释可以提供高质量的常识知识,但是其高昂的成本通常会导致规模相对较小且覆盖率较低。另一方面,生成模型具有自动生成更多知识的潜力。然而,机器学习模型通常过于适合训练数据而无法生成高质量的新知识,因此仍然存在覆盖问题。为了解决先前方法的局限性,在本文中,我们提出了另一种常识知识获取框架DISCOS(从DIScourse到COmmonSense),该框架自动从更实惠的语言知识资源中挖掘昂贵的复杂常识知识。实验表明,我们可以成功地将事件的话语知识从大型话语知识图ASER转换为ATOMIC中定义的推论if-then常识知识,而无需进行任何其他注解工作。进一步的研究表明,DISCOS在新颖性和多样性方面具有相当的质量,大大优于以前的监督方法。通过在ASER的核心部分填充ATOMIC,我们总共可以获得340万个类似ATOMIC的推理常识。代码和数据可从此https URL获得。
49. A Graph Total Variation Regularized Softmax for Text Generation [PDF] 返回目录
Liu Bin, Wang Liang, Yin Guosheng
Abstract: The softmax operator is one of the most important functions in machine learning models. When applying neural networks to multi-category classification, the correlations among different categories are often ignored. For example, in text generation, a language model makes a choice of each new word based only on the former selection of its context. In this scenario, the link statistics information of concurrent words based on a corpus (an analogy of the natural way of expression) is also valuable in choosing the next word, which can help to improve the sentence's fluency and smoothness. To fully explore such important information, we propose a graph softmax function for text generation. It is expected that the final classification result would be dominated by both the language model and graphical text relationships among words. We use a graph total variation term to regularize softmax so as to incorporate the concurrent relationship into the language model. The total variation of the generated words should be small locally. We apply the proposed graph softmax to GPT2 for the text generation task. Experimental results demonstrate that the proposed graph softmax achieves better BLEU and perplexity than softmax. Human testers can also easily distinguish the text generated by the graph softmax or softmax.
摘要:softmax算子是机器学习模型中最重要的函数之一。将神经网络应用于多类别分类时,不同类别之间的相关性通常会被忽略。例如,在文本生成中,语言模型仅根据其上下文中先前的选择来选择每个新单词。在这种情况下,基于语料库的共现单词链接统计信息(类比于自然的表达方式)对于选择下一个单词也很有价值,这有助于提高句子的流畅性和通顺度。为了充分利用这些重要信息,我们提出了用于文本生成的图softmax函数。我们期望最终的分类结果同时由语言模型和单词之间的图结构文本关系共同决定。我们使用图总变差项对softmax进行正则化,以便将共现关系融入语言模型。生成单词的总变差在局部应当很小。我们将所提出的图softmax应用于GPT2的文本生成任务。实验结果表明,所提出的图softmax比softmax获得了更好的BLEU和困惑度。人工测试者也可以轻松区分由图softmax或softmax生成的文本。
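A worked toy version of the regularized objective: cross-entropy plus a graph total variation penalty over the softmax output, summed over edges of a word graph. The edge set, the weight 0.1, and the choice of penalizing output probabilities directly are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

vocab = 6
logits = torch.randn(vocab, requires_grad=True)
target = torch.tensor([2])
edges = [(0, 1), (1, 2), (2, 3), (4, 5)]         # hypothetical word co-occurrence graph

p = F.softmax(logits, dim=-1)
tv = sum((p[i] - p[j]).abs() for i, j in edges)  # graph total variation of the prediction
loss = F.cross_entropy(logits.unsqueeze(0), target) + 0.1 * tv
loss.backward()
print(loss.item())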
50. Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment [PDF] 返回目录
Haoyue Shi, Luke Zettlemoyer, Sida I. Wang
Abstract: Bilingual lexicons map words in one language to their translations in another, and are typically induced by learning linear projections to align monolingual word embedding spaces. In this paper, we show it is possible to produce much higher quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment. Directly applying a pipeline that uses recent algorithms for both subproblems significantly improves induced lexicon quality and further gains are possible by learning to filter the resulting lexical entries, with both unsupervised and semi-supervised schemes. Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 $F_1$ points averaged over 12 language pairs, while also providing a more interpretable approach that allows for rich reasoning of word meaning in context.
摘要:双语词典将一种语言中的单词映射到另一种语言中的翻译,通常通过学习线性投影来对齐单语词嵌入空间来归纳得到。在本文中,我们展示了结合(1)无监督双语文本挖掘和(2)无监督词对齐的方法可以产生质量高得多的词典。直接应用一个在这两个子问题上均使用最新算法的流水线可以显著提高归纳词典的质量;通过在无监督和半监督两种方案下学习过滤所得词条,还可以获得进一步的提升。我们的最终模型在BUCC 2020共享任务上比此前的最新技术平均高出14个$F_1$点(在12个语言对上平均),同时还提供了一种更具可解释性的方法,可以对上下文中的词义进行丰富的推理。
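The last step of such a pipeline can be pictured as counting which target words are aligned to each source word in the mined bitext and keeping the most frequent one; the sentence pair and alignment links below are a made-up toy example, and the real system additionally filters entries with learned models.

from collections import Counter, defaultdict

# (source tokens, target tokens, alignment links as (source index, target index))
bitext = [
    (["the", "white", "house"], ["la", "maison", "blanche"], [(0, 0), (1, 2), (2, 1)]),
]

counts = defaultdict(Counter)
for src, tgt, links in bitext:
    for i, j in links:
        counts[src[i]][tgt[j]] += 1

lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
print(lexicon)   # {'the': 'la', 'white': 'blanche', 'house': 'maison'}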
51. De-identifying Hospital Discharge Summaries: An End-to-End Framework using Ensemble of De-Identifiers [PDF] 返回目录
Leibo Liu, Oscar Perez-Concha, Anthony Nguyen, Vicki Bennett, Louisa Jorm
Abstract: Objective:Electronic Medical Records (EMRs) contain clinical narrative text that is of great potential value to medical researchers. However, this information is mixed with Protected Health Information (PHI) that presents risks to patient and clinician confidentiality. This paper presents an end-to-end de-identification framework to automatically remove PHI from hospital discharge summaries. Materials and Methods:Our corpus included 600 hospital discharge summaries which were extracted from the EMRs of two principal referral hospitals in Sydney, Australia. Our end-to-end de-identification framework consists of three components: 1) Annotation: labelling of PHI in the 600 hospital discharge summaries using five pre-defined categories: person, address, date of birth, individual identification number, phone/fax number; 2) Modelling: training and evaluating ensembles of named entity recognition (NER) models through the use of three natural language processing (NLP) toolkits (Stanza, FLAIR and spaCy) and both balanced and imbalanced datasets; and 3) De-identification: removing PHI from the hospital discharge summaries. Results:The final model in our framework was an ensemble which combined six single models using both balanced and imbalanced datasets for training majority voting. It achieved 0.9866 precision, 0.9862 recall and 0.9864 F1 scores. The majority of false positives and false negatives were related to the person category. Discussion:Our study showed that the ensemble of different models which were trained using three different NLP toolkits upon balanced and imbalanced datasets can achieve good results even with a relatively small corpus. Conclusion:Our end-to-end framework provides a robust solution to de-identifying clinical narrative corpuses safely. It can be easily applied to any kind of clinical narrative documents.
摘要:目的:电子病历(EMR)包含对医学研究人员具有巨大潜在价值的临床叙事文本。但是,这些信息中混杂着受保护健康信息(PHI),会给患者和临床医生的机密性带来风险。本文提出了一种端到端去标识框架,可自动从医院出院摘要中删除PHI。材料和方法:我们的语料库包括600份出院摘要,摘录自澳大利亚悉尼两家主要转诊医院的EMR。我们的端到端去标识框架由三个部分组成:1)标注:使用五个预定义类别(人名,地址,出生日期,个人识别号,电话/传真号码)对600份出院摘要中的PHI进行标注;2)建模:使用三个自然语言处理(NLP)工具包(Stanza,FLAIR和spaCy)以及平衡和不平衡数据集,训练并评估命名实体识别(NER)模型的集成;3)去标识:从医院出院摘要中删除PHI。结果:我们框架中的最终模型是一个集成,它通过多数投票组合了分别在平衡和不平衡数据集上训练的六个单一模型,达到了0.9866的精确率,0.9862的召回率和0.9864的F1分数。大多数误报和漏报与人名类别有关。讨论:我们的研究表明,在平衡和不平衡数据集上使用三种不同NLP工具包训练的不同模型的集成,即使语料库相对较小,也能取得良好的结果。结论:我们的端到端框架为安全地对临床叙事语料库进行去标识提供了可靠的解决方案,并且可以轻松地应用于任何类型的临床叙事文档。
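The ensemble step can be sketched as a token-level majority vote over the tag sequences predicted by the individual de-identifiers; the tag names and the three-model example below are illustrative (the paper's final ensemble combines six models).

from collections import Counter

def majority_vote(per_model_tags):
    # per_model_tags: one BIO tag sequence per de-identifier, all over the same tokens
    return [Counter(column).most_common(1)[0][0] for column in zip(*per_model_tags)]

print(majority_vote([
    ["O", "B-PERSON", "I-PERSON", "O"],   # model 1
    ["O", "B-PERSON", "O",        "O"],   # model 2
    ["O", "B-PERSON", "I-PERSON", "O"],   # model 3
]))                                       # -> ['O', 'B-PERSON', 'I-PERSON', 'O']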
52. NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned [PDF] 返回目录
Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Edouard Grave, Ikuya Yamada, Sonse Shimaoka, Masatoshi Suzuki, Shumpei Miyawaki, Shun Sato, Ryo Takahashi, Jun Suzuki, Martin Fajcik, Martin Docekal, Karel Ondrej, Pavel Smrz, Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, Wen-tau Yih
Abstract: We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage contestants to explore the trade-off between storing large, redundant, retrieval corpora or the parameters of large learned models. In this report, we describe the motivation and organization of the competition, review the best submissions, and analyze system predictions to inform a discussion of evaluation for open-domain QA.
摘要:我们回顾了NeurIPS 2020的EfficientQA竞赛。竞赛的重点是开放域问答(QA),即系统以自然语言问题作为输入并返回自然语言答案。竞赛的目标是构建既能预测正确答案,又能满足严格磁盘内存预算的系统。这些内存预算旨在鼓励参赛者探索存储大型冗余检索语料库与存储大型学习模型参数之间的权衡。在本报告中,我们描述了比赛的动机和组织,回顾了最佳提交,并分析系统预测,从而为开放域问答评估的讨论提供参考。
53. Sensei: Self-Supervised Sensor Name Segmentation [PDF] 返回目录
Jiaman Wu, Dezhi Hong, Rajesh Gupta, Jingbo Shang
Abstract: A sensor name, typically an alphanumeric string, encodes the key context (e.g., function and location) of a sensor needed for deploying smart building applications. Sensor names, however, are curated in a building vendor-specific manner using different structures and vocabularies that are often esoteric. They thus require tremendous manual effort to annotate on a per-building basis; even to just segment these sensor names into meaningful chunks. In this paper, we propose a fully automated self-supervised framework, Sensei, which can learn to segment sensor names without any human annotation. Specifically, we employ a neural language model to capture the underlying sensor naming structure and then induce self-supervision based on information from the language model to build the segmentation model. Extensive experiments on five real-world buildings comprising thousands of sensors demonstrate the superiority of Sensei over baseline methods.
摘要:传感器名称(通常是字母数字字符串)编码了部署智能建筑应用所需的传感器关键上下文(例如功能和位置)。但是,传感器名称是以特定于建筑供应商的方式,使用往往深奥难懂的不同结构和词汇来编制的。因此,即使只是将这些传感器名称切分成有意义的片段,也需要在每栋建筑的基础上投入大量人工进行标注。在本文中,我们提出了一个完全自动化的自监督框架Sensei,它可以在没有任何人工标注的情况下学习对传感器名称进行切分。具体来说,我们采用神经语言模型来捕获底层的传感器命名结构,然后基于语言模型提供的信息引入自监督来构建切分模型。在包含数千个传感器的五栋真实建筑上进行的大量实验证明,Sensei优于基线方法。
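To picture how a language model over sensor names could drive segmentation without labels, the toy below splits a name wherever a count-based character bigram model (fitted on a tiny made-up corpus) has never seen the transition. Sensei itself uses a neural LM and a learned segmentation model, so this is only an illustration of the boundary-from-LM idea, with invented sensor names.

from collections import Counter

corpus = ["AHU1.SupplyAirTemp", "AHU2.ReturnAirTemp", "VAV3.ZoneTemp"]   # made-up sensor names
bigrams = Counter(a + b for name in corpus for a, b in zip(name, name[1:]))

def segment(name):
    chunks, cur = [], name[0]
    for a, b in zip(name, name[1:]):
        if bigrams[a + b] == 0:      # unseen character transition -> likely a boundary
            chunks.append(cur)
            cur = b
        else:
            cur += b
    chunks.append(cur)
    return chunks

print(segment("AHU4.SupplyAirTemp"))   # -> ['AHU', '4', '.SupplyAirTemp']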
54. MrGCN: Mirror Graph Convolution Network for Relation Extraction with Long-Term Dependencies [PDF] 返回目录
Xiao Guo, I-Hung Hsu, Wael AbdAlmageed, Premkumar Natarajan, Nanyun Peng
Abstract: The ability to capture complex linguistic structures and long-term dependencies among words in the passage is essential for many natural language understanding tasks. In relation extraction, dependency trees that contain rich syntactic clues have been widely used to help capture long-term dependencies in text. Graph neural networks (GNNs), one of the means to encode dependency graphs, has been shown effective in several prior works. However, relatively little attention has been paid to the receptive fields of GNNs, which can be crucial in tasks with extremely long text that go beyond single sentences and require discourse analysis. In this work, we leverage the idea of graph pooling and propose the Mirror Graph Convolution Network (MrGCN), a GNN model with pooling-unpooling structures tailored to relation extraction. The pooling branch reduces the graph size and enables the GCN to obtain larger receptive fields within less layers; the unpooling branch restores the pooled graph to its original resolution such that token-level relation extraction can be performed. Experiments on two datasets demonstrate the effectiveness of our method, showing significant improvements over previous results.
摘要:捕获段落中单词之间复杂的语言结构和长距离依赖关系的能力对于许多自然语言理解任务至关重要。在关系抽取中,包含丰富句法线索的依存树已被广泛用于帮助捕获文本中的长距离依赖。图神经网络(GNN)是对依存图进行编码的一种手段,在多项先前工作中已被证明有效。但是,GNN的感受野受到的关注相对较少,而这在文本极长,超出单个句子,需要进行篇章分析的任务中可能至关重要。在这项工作中,我们利用图池化的思想,提出了镜像图卷积网络(MrGCN),这是一种具有为关系抽取量身定制的池化-反池化结构的GNN模型。池化分支减小图的规模,使GCN能够在更少的层数内获得更大的感受野;反池化分支将池化后的图恢复到原始分辨率,以便执行词元级的关系抽取。在两个数据集上的实验证明了我们方法的有效性,相比以前的结果有显著提升。
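The pool-then-unpool ("mirror") pattern can be illustrated with a hard cluster assignment: pooling averages node features per cluster and coarsens the adjacency, and unpooling copies cluster features back to the original nodes. MrGCN's actual pooling operator and how it interleaves with graph convolutions are not specified in the abstract, so the sketch below only shows the general pattern under an assumed assignment matrix.

import numpy as np

X = np.arange(8, dtype=float).reshape(4, 2)        # node features, shape (N, d)
A = np.array([[0, 1, 0, 0],                        # adjacency of a 4-node chain
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)   # hard clusters: {0,1}, {2,3}

X_pool = (S.T @ X) / S.sum(axis=0, keepdims=True).T   # mean-pool features per cluster
A_pool = S.T @ A @ S                                  # coarsened adjacency
X_unpool = S @ X_pool                                 # unpooling: broadcast back to original nodes

print(X_pool)
print(A_pool)
print(X_unpool)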
55. Intent Classification and Slot Filling for Privacy Policies [PDF] 返回目录
Wasi Uddin Ahmad, Jianfeng Chi, Tu Le, Thomas Norton, Yuan Tian, Kai-Wei Chang
Abstract: Understanding privacy policies is crucial for users as it empowers them to learn about the information that matters to them. Sentences written in a privacy policy document explain privacy practices, and the constituent text spans convey further specific information about that practice. We refer to predicting the privacy practice explained in a sentence as intent classification and identifying the text spans sharing specific information as slot filling. In this work, we propose PolicyIE, a corpus consisting of 5,250 intent and 11,788 slot annotations spanning 31 privacy policies of websites and mobile applications. PolicyIE corpus is a challenging benchmark with limited labeled examples reflecting the cost of collecting large-scale annotations. We present two alternative neural approaches as baselines: (1) formulating intent classification and slot filling as a joint sequence tagging and (2) modeling them as a sequence-to-sequence (Seq2Seq) learning task. Experiment results show that both approaches perform comparably in intent classification, while the Seq2Seq method outperforms the sequence tagging approach in slot filling by a large margin. Error analysis reveals the deficiency of the baseline approaches, suggesting room for improvement in future works. We hope the PolicyIE corpus will stimulate future research in this domain.
摘要:理解隐私政策对用户至关重要,因为这使他们能够了解与自己相关的信息。隐私政策文档中的句子说明了隐私实践,而其中的文本片段则传达了有关该实践的更具体的信息。我们将预测句子所说明的隐私实践称为意图分类,将识别承载具体信息的文本片段称为槽填充。在这项工作中,我们提出了PolicyIE,这是一个包含5,250个意图标注和11,788个槽标注的语料库,涵盖网站和移动应用程序的31份隐私政策。PolicyIE语料库是一个具有挑战性的基准,其带标注的示例有限,反映了收集大规模标注的成本。我们提出两种可供选择的神经方法作为基线:(1)将意图分类和槽填充表述为联合序列标注;(2)将它们建模为序列到序列(Seq2Seq)学习任务。实验结果表明,两种方法在意图分类上的表现相当,而Seq2Seq方法在槽填充上大幅优于序列标注方法。错误分析揭示了基线方法的不足,表明未来的工作仍有改进空间。我们希望PolicyIE语料库能够推动该领域的未来研究。
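Under the joint sequence-tagging formulation, one training instance pairs a sentence-level intent label with token-aligned BIO slot tags. The sentence, label names, and tags below are invented purely to show the data shape, not actual PolicyIE categories.

example = {
    "tokens": ["We", "share", "your", "email", "address", "with", "advertisers", "."],
    "intent": "third-party-sharing",   # hypothetical sentence-level practice label
    "slots":  ["O", "O", "O", "B-DATA", "I-DATA", "O", "B-RECIPIENT", "O"],
}
assert len(example["tokens"]) == len(example["slots"])   # tags must align with tokens
print(example["intent"], list(zip(example["tokens"], example["slots"])))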
56. WARP: Word-level Adversarial ReProgramming [PDF] 返回目录
Karen Hambardzumyan, Hrant Khachatrian, Jonathan May
Abstract: Transfer learning from pretrained language models recently became the dominant approach for solving many NLP tasks. While fine-tuning large language models usually gives the best performance, in many applications it is preferable to tune much smaller sets of parameters, so that the majority of parameters can be shared across multiple tasks. The main approach is to train one or more task-specific layers on top of the language model. In this paper we present an alternative approach based on adversarial reprogramming, which extends earlier work on automatic prompt generation. It attempts to learn task-specific word embeddings that, when concatenated to the input text, instruct the language model to solve the specified task. We show that this approach outperforms other methods with a similar number of trainable parameters on SST-2 and MNLI datasets. On SST-2, the performance of our model is comparable to the fully fine-tuned baseline, while on MNLI it is the best among the methods that do not modify the parameters of the body of the language model.
摘要:最近,从预训练的语言模型进行迁移学习成为解决许多NLP任务的主要方法。虽然微调大型语言模型通常可以提供最佳性能,但在许多应用程序中,最好调整小得多的参数集,以便大多数参数可以在多个任务之间共享。主要方法是在语言模型之上训练一个或多个特定于任务的层。在本文中,我们提出了一种基于对抗性重新编程的替代方法,该方法扩展了有关自动提示生成的早期工作。它尝试学习特定于任务的单词嵌入,这些单词嵌入与输入文本连接后,可指示语言模型解决指定的任务。我们表明,这种方法在SST-2和MNLI数据集上具有相似数量的可训练参数,其性能优于其他方法。在SST-2上,我们的模型的性能与完全微调的基线相当,而在MNLI上,它是不修改语言模型主体参数的方法中最好的。
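A minimal sketch of the word-level reprogramming idea: trainable "virtual token" embeddings are prepended to the input, the pretrained model stays frozen, and classes are scored against frozen embeddings of verbalizer words. The toy encoder, mean-pooling readout, and verbalizer token ids are placeholders rather than the paper's setup.

import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim, n_prompt = 1000, 32, 4

embed = nn.Embedding(vocab, dim)          # stand-ins for a frozen pretrained encoder
body = nn.Linear(dim, dim)
for p in list(embed.parameters()) + list(body.parameters()):
    p.requires_grad_(False)

prompt = nn.Parameter(torch.randn(1, n_prompt, dim) * 0.02)   # trainable prompt embeddings
verbalizer = embed(torch.tensor([37, 512])).detach()          # frozen class-word embeddings, (2, dim)

tokens = torch.randint(0, vocab, (1, 10))
x = torch.cat([prompt, embed(tokens)], dim=1)    # prepend the learned "virtual tokens"
h = body(x).mean(dim=1)                          # (1, dim) pooled sentence representation
logits = h @ verbalizer.t()                      # score each class against its verbalizer word
loss = nn.functional.cross_entropy(logits, torch.tensor([1]))
loss.backward()                                  # gradients reach only the prompt embeddings
print(prompt.grad.abs().sum().item() > 0)        # -> True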
57. Multi-task Retrieval for Knowledge-Intensive Tasks [PDF] 返回目录
Jean Maillard, Vladimir Karpukhin, Fabio Petroni, Wen-tau Yih, Barlas Oğuz, Veselin Stoyanov, Gargi Ghosh
Abstract: Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be universal and perform robustly on a wide variety of problems, we propose a multi-task trained model. Our approach not only outperforms previous methods in the few-shot setting, but also rivals specialised neural retrievers, even when in-domain training data is abundant. With the help of our retriever, we improve existing models for downstream tasks and closely match or improve the state of the art on multiple benchmarks.
摘要:从大型语料库中检索相关上下文是开放域问答和事实核查等任务的关键步骤。尽管神经检索的性能优于tf-idf和BM25等传统方法,但将其应用于域外数据时,其性能会大大下降。受"神经检索模型能否做到通用,并在各种问题上都表现稳健"这一问题的驱动,我们提出了一种多任务训练的模型。我们的方法不仅在少样本设置下优于以前的方法,而且即使在域内训练数据丰富时,也可以与专门的神经检索器相媲美。在我们的检索器的帮助下,我们改进了下游任务的现有模型,并在多个基准上接近或超过了现有最佳水平。
58. Controlled Analyses of Social Biases in Wikipedia Bios [PDF] 返回目录
Anjalie Field, Chan Young Park, Yulia Tsvetkov
Abstract: Social biases on Wikipedia, a widely-read global platform, could greatly influence public opinion. While prior research has examined man/woman gender bias in biography articles, possible influences of confounding variables limit conclusions. In this work, we present a methodology for reducing the effects of confounding variables in analyses of Wikipedia biography pages. Given a target corpus for analysis (e.g. biography pages about women), we present a method for constructing a comparison corpus that matches the target corpus in as many attributes as possible, except the target attribute (e.g. the gender of the subject). We evaluate our methodology by developing metrics to measure how well the comparison corpus aligns with the target corpus. We then examine how articles about gender and racial minorities (cisgender women, non-binary people, transgender women, and transgender men; African American, Asian American, and Hispanic/Latinx American people) differ from other articles, including analyses driven by social theories like intersectionality. In addition to identifying suspect social biases, our results show that failing to control for confounding variables can result in different conclusions and mask biases. Our contributions include methodology that facilitates further analyses of bias in Wikipedia articles, findings that can aid Wikipedia editors in reducing biases, and framework and evaluation metrics to guide future work in this area.
摘要:维基百科是一个被广泛阅读的全球平台,其上的社会偏见可能会极大地影响公众舆论。尽管先前的研究考察了传记文章中的男女性别偏见,但混杂变量的潜在影响限制了所能得出的结论。在这项工作中,我们提出了一种在分析维基百科传记页面时减少混杂变量影响的方法。给定一个待分析的目标语料库(例如关于女性的传记页面),我们提出一种构建比较语料库的方法,使其在除目标属性(例如传主的性别)之外的尽可能多的属性上与目标语料库匹配。我们通过设计度量指标来衡量比较语料库与目标语料库的匹配程度,以评估我们的方法。然后,我们考察了关于性别和种族少数群体的文章(顺性别女性,非二元性别者,跨性别女性和跨性别男性;非裔美国人,亚裔美国人和拉美裔/拉丁裔美国人)与其他文章有何不同,其中包括由交叉性等社会理论驱动的分析。除了识别可疑的社会偏见外,我们的结果还表明,未能控制混杂变量可能导致不同的结论并掩盖偏见。我们的贡献包括:有助于进一步分析维基百科文章中偏见的方法,可帮助维基百科编辑者减少偏见的发现,以及指导该领域未来工作的框架和评估指标。
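A concrete way to picture the matched-comparison-corpus construction is greedy nearest-neighbour matching on covariates other than the target attribute. The attributes, data, and greedy strategy below are illustrative assumptions, not the paper's actual matching procedure.

```python
# Toy sketch: for each target article, pick the most similar unused candidate
# according to covariates (length, number of sections), ignoring the target attribute.
target = [
    {"title": "A", "length": 1200, "sections": 6},
    {"title": "B", "length": 400, "sections": 3},
]
candidates = [
    {"title": "X", "length": 1150, "sections": 5},
    {"title": "Y", "length": 2500, "sections": 10},
    {"title": "Z", "length": 420, "sections": 3},
]

def distance(a: dict, b: dict) -> float:
    # Normalised absolute differences over the covariates we control for.
    return abs(a["length"] - b["length"]) / 1000 + abs(a["sections"] - b["sections"])

used, matches = set(), {}
for art in target:
    best = min(
        (c for c in candidates if c["title"] not in used),
        key=lambda c: distance(art, c),
    )
    used.add(best["title"])
    matches[art["title"]] = best["title"]

print(matches)  # {'A': 'X', 'B': 'Z'}
```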
59. EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [PDF] 返回目录
Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Zhangyang Wang, Jingjing Liu
Abstract: Deep, heavily overparameterized language models such as BERT, XLNet and T5 have achieved impressive success in many NLP tasks. However, their high model complexity requires enormous computation resources and extremely long training time for both pre-training and fine-tuning. Many works have studied model compression on large NLP models, but only focus on reducing inference cost/time, while still requiring expensive training process. Other works use extremely large batch sizes to shorten the pre-training time at the expense of high demand for computation resources. In this paper, inspired by the Early-Bird Lottery Tickets studied for computer vision tasks, we propose EarlyBERT, a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models. We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training. Comprehensive pre-training and fine-tuning experiments on GLUE and SQuAD downstream tasks show that EarlyBERT easily achieves comparable performance to standard BERT with 35~45% less training time.
摘要:BERT,XLNet和T5等深层,严重过参数化的语言模型在许多NLP任务中均取得了令人瞩目的成功。然而,它们的高模型复杂性要求大量的计算资源和极长的训练时间以进行预训练和微调。许多工作已经研究了大型NLP模型的模型压缩,但是只专注于减少推理成本/时间,同时仍然需要昂贵的训练过程。其他工作使用极大的批处理大小来缩短预训练时间,但以对计算资源的高需求为代价。在本文中,受计算机视觉任务中研究的Early-Bird彩票假设的启发,我们提出了EarlyBERT,这是一种通用的,计算高效的训练算法,适用于大规模语言模型的预训练和微调。我们首次在BERT训练的早期阶段识别结构化中奖彩票,并将其用于高效训练。在GLUE和SQuAD下游任务上进行的全面的预训练和微调实验表明,EarlyBERT可以轻松实现与标准BERT相当的性能,而训练时间却减少了35%至45%。
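The "structured early-bird ticket" idea can be sketched as: after only a few early training steps, rank structured units (e.g. attention heads) by an importance proxy, prune the rest, and continue training the slimmer model. Ranking heads by the L1 norm of their slice of the output projection is an illustrative proxy here, not EarlyBERT's exact criterion.

```python
# Toy sketch of structured head pruning from an early-training checkpoint.
import torch

n_heads, head_dim, d_model = 12, 64, 768
w_output = torch.randn(n_heads * head_dim, d_model)     # attention output projection

# Importance proxy: L1 norm of each head's rows in the projection matrix.
head_scores = w_output.abs().view(n_heads, head_dim, d_model).sum(dim=(1, 2))

keep = 8                                                # heads surviving the prune
kept_heads = torch.topk(head_scores, keep).indices.sort().values
mask = torch.zeros(n_heads, dtype=torch.bool)
mask[kept_heads] = True

print("kept heads:", kept_heads.tolist())
print("parameters kept in this projection: "
      f"{mask.sum().item() * head_dim * d_model} of {w_output.numel()}")
```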
60. Towards Modelling Coherence in Spoken Discourse [PDF] 返回目录
Rajaswa Patil, Yaman Kumar Singla, Rajiv Ratn Shah, Mika Hama, Roger Zimmermann
Abstract: While there has been significant progress towards modelling coherence in written discourse, the work in modelling spoken discourse coherence has been quite limited. Unlike the coherence in text, coherence in spoken discourse is also dependent on the prosodic and acoustic patterns in speech. In this paper, we model coherence in spoken discourse with audio-based coherence models. We perform experiments with four coherence-related tasks with spoken discourses. In our experiments, we evaluate machine-generated speech against the speech delivered by expert human speakers. We also compare the spoken discourses generated by human language learners of varying language proficiency levels. Our results show that incorporating the audio modality along with the text benefits the coherence models in performing downstream coherence related tasks with spoken discourses.
摘要:虽然在书面话语连贯性建模方面取得了重大进展,但在口语话语连贯性建模方面的工作却十分有限。与文本中的连贯性不同,口语话语中的连贯性还取决于语音中的韵律和声学模式。在本文中,我们使用基于音频的连贯性模型对口语话语中的连贯性进行建模。我们在口语话语上对四个与连贯性相关的任务进行了实验。在实验中,我们将机器生成的语音与人类专家演讲者的语音进行对比评估。我们还比较了不同语言水平的人类语言学习者所产生的口语话语。我们的结果表明,在文本之外引入音频模态,有助于连贯性模型完成与口语话语相关的下游连贯性任务。
61. KART: Privacy Leakage Framework of Language Models Pre-trained with Clinical Records [PDF] 返回目录
Yuta Nakamura, Shouhei Hanaoka, Yukihiro Nomura, Naoto Hayashi, Osamu Abe, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki
Abstract: Nowadays, mainstream natural language pro-cessing (NLP) is empowered by pre-trained language models. In the biomedical domain, only models pre-trained with anonymized data have been published. This policy is acceptable, but there are two questions: Can the privacy policy of language models be different from that of data? What happens if private language models are accidentally made public? We empirically evaluated the privacy risk of language models, using several BERT models pre-trained with MIMIC-III corpus in different data anonymity and corpus sizes. We simulated model inversion attacks to obtain the clinical information of target individuals, whose full names are already known to attackers. The BERT models were probably low-risk because the Top-100 accuracy of each attack was far below expected by chance. Moreover, most privacy leakage situations have several common primary factors; therefore, we formalized various privacy leakage scenarios under a universal novel framework named Knowledge, Anonymization, Resource, and Target (KART) framework. The KART framework helps parameterize complex privacy leakage scenarios and simplifies the comprehensive evaluation. Since the concept of the KART framework is domain agnostic, it can contribute to the establishment of privacy guidelines of language models beyond the biomedical domain.
摘要:如今,主流自然语言处理(NLP)受预训练的语言模型的支持。在生物医学领域,仅公开了经过匿名数据预训练的模型。该策略是可以接受的,但是有两个问题:语言模型的隐私策略可以不同于数据的隐私策略吗?如果不慎将私有语言模型公开,会发生什么?我们使用在不同数据匿名程度和语料库规模下以MIMIC-III语料库预训练的几种BERT模型,对语言模型的隐私风险进行了实证评估。我们模拟了模型反演攻击,以获取攻击者已知其全名的目标个人的临床信息。这些BERT模型的风险可能较低,因为每次攻击的Top-100准确率都远低于随机猜测的期望值。此外,大多数隐私泄露情况都有几个共同的主要因素;因此,我们在名为知识,匿名化,资源和目标(KART)的通用新框架下,将各种隐私泄露场景形式化。KART框架有助于参数化复杂的隐私泄露场景并简化综合评估。由于KART框架的概念与领域无关,因此它可以帮助建立生物医学领域以外的语言模型的隐私准则。
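A model-inversion-style probe in the spirit of the Top-100 evaluation above can be sketched with a fill-mask query: given a target's full name, check whether the true attribute appears among the model's Top-K predictions. The model name, prompt template, and attribute are illustrative assumptions; a real audit would query the clinically pre-trained model under test.

```python
# Hedged sketch of a Top-100 leakage probe via masked-token prediction.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased", top_k=100)

target_name = "John Smith"            # hypothetical target whose name the attacker knows
true_attribute = "diabetes"           # hypothetical ground-truth clinical attribute

predictions = unmasker(f"{target_name} was diagnosed with [MASK].")
top_tokens = [p["token_str"].strip().lower() for p in predictions]

hit = true_attribute in top_tokens
print(f"Top-100 hit: {hit}")          # hit rates far below chance suggest lower leakage risk
```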
62. The Pile: An 800GB Dataset of Diverse Text for Language Modeling [PDF] 返回目录
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
Abstract: Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present \textit{the Pile}: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
摘要:最近的工作表明,增加训练数据集的多样性可以改善大规模语言模型的通用跨领域知识和下游泛化能力。考虑到这一点,我们提出了the Pile:一个825 GiB的英语文本语料库,旨在用于训练大规模语言模型。Pile由22个多样化的高质量子集构建而成(既有现有的,也有新构建的),其中许多来自学术或专业来源。我们对GPT-2和GPT-3在Pile上未经调优的性能进行评估,结果表明这些模型在其许多组成部分(例如学术写作)上表现不佳。相反,在Pile上训练的模型在Pile的所有组成部分上都显著优于Raw CC和CC-100,同时也改善了下游评估的性能。通过深入的探索性分析,我们为潜在用户记录了数据中可能值得关注的方面。我们公开了构建该语料库所使用的代码。
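Training on a corpus assembled from many subsets usually means sampling a document stream according to fixed mixture weights. The subset names and weights below are illustrative assumptions, not the Pile's actual composition.

```python
# Toy sketch of weighted mixture sampling over multiple corpus subsets.
import random

subsets = {
    "web_text": (["doc w1", "doc w2"], 0.5),
    "academic": (["doc a1", "doc a2"], 0.3),
    "code":     (["doc c1", "doc c2"], 0.2),
}
names = list(subsets)
weights = [subsets[n][1] for n in names]

def sample_documents(n: int, seed: int = 0):
    rng = random.Random(seed)
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]   # pick a subset by weight
        yield name, rng.choice(subsets[name][0])             # then a document from it

for source, doc in sample_documents(5):
    print(source, "->", doc)
```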
63. Unnatural Language Inference [PDF] 返回目录
Koustuv Sinha, Prasanna Parthasarathi, Joelle Pineau, Adina Williams
Abstract: Natural Language Understanding has witnessed a watershed moment with the introduction of large pre-trained Transformer networks. These models achieve state-of-the-art on various tasks, notably including Natural Language Inference (NLI). Many studies have shown that the large representation space imbibed by the models encodes some syntactic and semantic information. However, to really "know syntax", a model must recognize when its input violates syntactic rules and calculate inferences accordingly. In this work, we find that state-of-the-art NLI models, such as RoBERTa and BART are invariant to, and sometimes even perform better on, examples with randomly reordered words. With iterative search, we are able to construct randomized versions of NLI test sets, which contain permuted hypothesis-premise pairs with the same words as the original, yet are classified with perfect accuracy by large pre-trained models, as well as pre-Transformer state-of-the-art encoders. We find the issue to be language and model invariant, and hence investigate the root cause. To partially alleviate this effect, we propose a simple training methodology. Our findings call into question the idea that our natural language understanding models, and the tasks used for measuring their progress, genuinely require a human-like understanding of syntax.
摘要:随着大型预训练Transformer网络的引入,自然语言理解迎来了一个分水岭时刻。这些模型在各种任务上达到了最新水平,尤其是自然语言推理(NLI)。许多研究表明,这些模型所学习的庞大表示空间编码了一些句法和语义信息。但是,要真正"懂句法",模型必须能识别其输入何时违反句法规则,并据此计算推论。在这项工作中,我们发现最先进的NLI模型(例如RoBERTa和BART)对于单词被随机重排的示例是不变的,有时甚至表现得更好。通过迭代搜索,我们能够构造NLI测试集的随机化版本,其中包含与原句词汇相同但词序被打乱的假设-前提对,然而大型预训练模型以及Transformer之前的最先进编码器仍能以近乎完美的准确率对其进行分类。我们发现该问题与语言和模型无关,并进而调查其根本原因。为了部分缓解这种影响,我们提出了一种简单的训练方法。我们的发现对这样一种观点提出了质疑:我们的自然语言理解模型以及用于衡量其进展的任务真正需要类似人类的句法理解。
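The permutation probe described above is simple to reproduce in outline: shuffle the words of one side of an NLI pair and check whether the model's prediction changes. In the sketch below, `classify` is a placeholder for a real NLI model's predict call, not an actual implementation.

```python
# Sketch of a word-order invariance probe for NLI models.
import random

def permute_words(sentence: str, seed: int = 0) -> str:
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def classify(premise: str, hypothesis: str) -> str:
    # Placeholder: a real probe would call an NLI model (e.g. RoBERTa/BART
    # fine-tuned on MNLI) and return "entailment"/"neutral"/"contradiction".
    return "entailment"

premise = "A man is playing a guitar on stage."
hypothesis = "A man is performing music."

original = classify(premise, hypothesis)
shuffled = classify(premise, permute_words(hypothesis))
print("prediction invariant under word reordering:", original == shuffled)
```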
64. Improving reference mining in patents with BERT [PDF] 返回目录
Ken Voskuil, Suzan Verberne
Abstract: References in patents to scientific literature provide relevant information for studying the relation between science and technological inventions. These references allow us to answer questions about the types of scientific work that leads to inventions. Most prior work analysing the citations between patents and scientific publications focussed on the front-page citations, which are well structured and provided in the metadata of patent archives such as Google Patents. In the 2019 paper by Verberne et al., the authors evaluate two sequence labelling methods for extracting references from patents: Conditional Random Fields (CRF) and Flair. In this paper we extend that work, by (1) improving the quality of the training data and (2) applying BERT-based models to the problem. We use error analysis throughout our work to find problems in the dataset, improve our models and reason about the types of errors different models are susceptible to. We first discuss the work by Verberne et al. and other related work in Section2. We describe the improvements we make in the dataset, and the new models proposed for this task. We compare the results of our new models with previous results, both on the labelled dataset and a larger unlabelled corpus. We end with a discussion on the characteristics of the results of our new models, followed by a conclusion. Our code and improved dataset are released under an open-source license on github.
摘要:专利中对科学文献的引用为研究科学与技术发明之间的关系提供了相关信息。这些引用使我们能够回答有关哪些类型的科学工作会导向发明的问题。以往分析专利与科学出版物之间引用关系的工作大多集中在首页引文上,这些引文结构良好,并在诸如Google Patents之类的专利档案的元数据中提供。在Verberne等人2019年的论文中,作者评估了两种从专利中提取引用的序列标注方法:条件随机场(CRF)和Flair。在本文中,我们通过(1)提高训练数据的质量和(2)将基于BERT的模型应用于该问题来扩展这项工作。我们在整个工作中使用错误分析来发现数据集中的问题,改进我们的模型,并推断不同模型容易出现哪些类型的错误。我们首先在第2节中讨论Verberne等人的工作及其他相关工作。随后我们描述了对数据集所做的改进,以及为此任务提出的新模型。我们在有标注的数据集和一个更大的无标注语料库上,将新模型的结果与以前的结果进行比较。最后,我们讨论新模型结果的特征,并给出结论。我们的代码和改进后的数据集以开源许可发布在github上。
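Treating reference extraction as token-level sequence labelling with a BERT encoder can be sketched as below. The BIO label set and model name are illustrative assumptions, and the fine-tuning loop is omitted, so the freshly initialised classification head produces essentially random labels.

```python
# Sketch of BERT-based token classification for reference-span extraction.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-REF", "I-REF"]          # assumed BIO scheme for reference spans
model_name = "bert-base-cased"            # a domain-specific checkpoint could be swapped in

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

text = "See Smith et al., Nature 2018, for the underlying measurement method."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # (1, seq_len, num_labels)

predicted = [labels[i] for i in logits.argmax(dim=-1)[0].tolist()]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, predicted)))        # untrained head -> labels are arbitrary here
```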
65. Learning Neural Networks on SVD Boosted Latent Spaces for Semantic Classification [PDF] 返回目录
Sahil Sidheekh
Abstract: The availability of large amounts of data and compelling computation power have made deep learning models much popular for text classification and sentiment analysis. Deep neural networks have achieved competitive performance on the above tasks when trained on naive text representations such as word count, term frequency, and binary matrix embeddings. However, many of the above representations result in the input space having a dimension of the order of the vocabulary size, which is enormous. This leads to a blow-up in the number of parameters to be learned, and the computational cost becomes infeasible when scaling to domains that require retaining a colossal vocabulary. This work proposes using singular value decomposition to transform the high dimensional input space to a lower-dimensional latent space. We show that neural networks trained on this lower-dimensional space are not only able to retain performance while savoring significant reduction in the computational complexity but, in many situations, also outperforms the classical neural networks trained on the native input space.
摘要:大量数据的可用性和强大的计算能力使深度学习模型在文本分类和情感分析中广受欢迎。当在朴素的文本表示形式(如字数,词频和二进制矩阵嵌入)上进行训练时,深度神经网络已在上述任务上取得了具有竞争力的性能。然而,许多上述表示导致输入空间的维度与词汇表大小同阶,非常庞大。这导致要学习的参数数量激增,并且在扩展到需要保留庞大词汇量的领域时,计算成本变得不可行。这项工作提出使用奇异值分解将高维输入空间转换为低维潜在空间。我们表明,在此低维空间上训练的神经网络不仅能够在显著降低计算复杂度的同时保持性能,而且在许多情况下,其性能也优于在原始输入空间上训练的经典神经网络。
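The pipeline described above (vocabulary-sized count vectors, projected to a low-dimensional latent space via truncated SVD, then fed to a small neural network) can be sketched with scikit-learn. The toy corpus and component count are illustrative assumptions.

```python
# Sketch of SVD-boosted latent features followed by a small neural classifier.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "great movie loved the acting",
    "wonderful plot and brilliant cast",
    "terrible film waste of time",
    "boring and badly written script",
]
labels = [1, 1, 0, 0]   # 1 = positive sentiment, 0 = negative

clf = make_pipeline(
    CountVectorizer(),               # vocabulary-sized sparse term counts
    TruncatedSVD(n_components=2),    # low-dimensional latent space
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
clf.fit(texts, labels)
print(clf.predict(["brilliant acting and a great script"]))
```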
66. VinVL: Making Visual Representations Matter in Vision-Language Models [PDF] 返回目录
Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao
Abstract: This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used \emph{bottom-up and top-down} model \cite{anderson2018bottom}, the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model \oscar \cite{li2020oscar}, and utilize an improved approach \short\ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to public.
摘要:本文对改进视觉语言(VL)任务的视觉表示进行了详细研究,并开发了一种改进的目标检测模型以提供以对象为中心的图像表示。与最广泛使用的\emph{自下而上和自上而下}模型\cite{anderson2018bottom}相比,新模型更大,针对VL任务进行了更好的设计,并在合并了多个公开标注目标检测数据集的更大训练语料上进行了预训练。因此,它可以生成更丰富的视觉对象和概念集合的表示。尽管先前的VL研究主要集中在改进视觉语言融合模型上,而未触及目标检测模型的改进,但我们表明视觉特征在VL模型中至关重要。在我们的实验中,我们将新目标检测模型生成的视觉特征输入到基于Transformer的VL融合模型\oscar \cite{li2020oscar}中,并利用改进的方法\short\对VL模型进行预训练,再在广泛的下游VL任务上进行微调。我们的结果表明,新的视觉特征显著改善了所有VL任务的性能,并在七个公开基准上取得了新的最先进结果。我们将向公众发布新的目标检测模型。
67. VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [PDF] 返回目录
Xiaopeng Lu, Tiancheng Zhao, Kyusong Lee
Abstract: Text-to-image retrieval is an essential task in multi-modal information retrieval, i.e. retrieving relevant images from a large and unlabelled image dataset given textual queries. In this paper, we propose VisualSparta, a novel text-to-image retrieval model that shows substantial improvement over existing models on both accuracy and efficiency. We show that VisualSparta is capable of outperforming all previous scalable methods in MSCOCO and Flickr30K. It also shows substantial retrieving speed advantages, i.e. for an index with 1 million images, VisualSparta gets over 391x speed up compared to standard vector search. Experiments show that this speed advantage even gets bigger for larger datasets because VisualSparta can be efficiently implemented as an inverted index. To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that can achieve real-time searching for very large dataset, with significant accuracy improvement compared to previous state-of-the-art methods.
摘要:文本到图像检索是多模态信息检索中的一项基本任务,即在给定文本查询的情况下,从大型且未标注的图像数据集中检索相关图像。在本文中,我们提出了VisualSparta,这是一种新颖的文本到图像检索模型,该模型在准确性和效率上都比现有模型有显著提升。我们证明VisualSparta能够胜过MSCOCO和Flickr30K上所有以前的可扩展方法。它还显示出显著的检索速度优势,即对于具有100万张图像的索引,与标准向量搜索相比,VisualSparta的速度提高了391倍以上。实验表明,对于更大的数据集,这种速度优势会进一步扩大,因为VisualSparta可以高效地实现为倒排索引。据我们所知,VisualSparta是第一个基于Transformer的文本到图像检索模型,可以实现对非常大的数据集的实时搜索,与以前的最新方法相比,其准确性显著提高。
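The property that makes fragment-level matching fast at scale is that query scoring reduces to lookups in an inverted index over precomputed per-token image scores. The vocabulary, weights, and images below are illustrative assumptions, not the model's actual index.

```python
# Toy sketch of answering a text query with an inverted index of per-token image scores.
from collections import defaultdict

# Offline: for every vocabulary token, store (image_id, relevance_weight) pairs
# produced by the image-side encoder.
inverted_index = {
    "dog":   [("img1", 2.1), ("img3", 0.7)],
    "beach": [("img2", 1.8), ("img3", 1.5)],
    "ball":  [("img1", 1.2)],
}

def search(query: str, k: int = 2) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for token in query.lower().split():
        for image_id, weight in inverted_index.get(token, []):
            scores[image_id] += weight          # accumulate per-token evidence
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(search("dog playing with a ball on the beach"))
# img1 ranks first (dog + ball), then img3 (dog + beach)
```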
68. CIZSL++: Creativity Inspired Generative Zero-Shot Learning [PDF] 返回目录
Mohamed Elhoseiny, Kai Yi, Mohamed Elfeki
Abstract: Zero-shot learning (ZSL) aims at understanding unseen categories with no training examples from class-level descriptions. To improve the discriminative power of ZSL, we model the visual learning process of unseen categories with inspiration from the psychology of human creativity for producing novel art. First, we propose CIZSL-v1 as a creativity inspired model for generative ZSL. We relate ZSL to human creativity by observing that ZSL is about recognizing the unseen, and creativity is about creating a likable unseen. We introduce a learning signal inspired by creativity literature that explores the unseen space with hallucinated class-descriptions and encourages careful deviation of their visual feature generations from seen classes while allowing knowledge transfer from seen to unseen classes. Second, CIZSL-v2 is proposed as an improved version of CIZSL-v1 for generative zero-shot learning. CIZSL-v2 consists of an investigation of additional inductive losses for unseen classes along with a semantic guided discriminator. Empirically, we show consistently that CIZSL losses can improve generative ZSL models on the challenging task of generalized ZSL from a noisy text on CUB and NABirds datasets. We also show the advantage of our approach to Attribute-based ZSL on AwA2, aPY, and SUN datasets. We also show that CIZSL-v2 has improved performance compared to CIZSL-v1.
摘要:零样本学习(ZSL)旨在在没有训练样本,仅凭类别级描述的情况下理解未见类别。为了提高ZSL的判别力,我们从人类创作新颖艺术的创造力心理学中汲取灵感,对未见类别的视觉学习过程进行建模。首先,我们提出CIZSL-v1,作为受创造力启发的生成式ZSL模型。我们观察到ZSL是关于识别未见事物,而创造力是关于创造令人喜爱的未见事物,由此将ZSL与人类创造力联系起来。我们引入了一个受创造力文献启发的学习信号,它利用幻想出的类别描述探索未见空间,鼓励所生成的视觉特征与已见类别保持谨慎的偏离,同时允许知识从已见类别迁移到未见类别。其次,我们提出CIZSL-v2,作为CIZSL-v1在生成式零样本学习上的改进版本。CIZSL-v2研究了针对未见类别的额外归纳损失,并引入了语义引导的判别器。实验上,我们一致地表明,在基于CUB和NABirds数据集噪声文本的广义ZSL这一艰巨任务上,CIZSL损失可以改进生成式ZSL模型。我们还展示了我们的方法在AwA2,aPY和SUN数据集上进行基于属性的ZSL的优势。我们还表明,与CIZSL-v1相比,CIZSL-v2具有更好的性能。
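One ingredient of a creativity-inspired signal of this kind can be sketched as an entropy term: features generated from hallucinated (unseen) class descriptions are pushed towards a high-entropy output of the seen-class classifier, so that generations deviate from seen classes. The generator/classifier shapes below are illustrative assumptions and this is only one term, not the paper's full objective.

```python
# Sketch of an entropy-maximising deviation loss for hallucinated-class generations.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_seen, feat_dim, text_dim = 10, 64, 32
generator = nn.Linear(text_dim + feat_dim, feat_dim)   # (description, noise) -> visual feature
seen_classifier = nn.Linear(feat_dim, n_seen)          # classifier over seen classes
for p in seen_classifier.parameters():                 # keep the seen-class head fixed
    p.requires_grad = False

hallucinated_text = torch.randn(8, text_dim)           # embeddings of hallucinated descriptions
noise = torch.randn(8, feat_dim)
fake_features = generator(torch.cat([hallucinated_text, noise], dim=1))

probs = F.softmax(seen_classifier(fake_features), dim=-1)
entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

creativity_loss = -entropy        # maximising entropy pushes generations away from seen classes
creativity_loss.backward()
print(f"mean entropy over seen classes: {entropy.item():.3f}")
```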
69. DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue [PDF] 返回目录
Hung Le, Chinnadhurai Sankar, Seungwhan Moon, Ahmad Beirami, Alborz Geramifard, Satwik Kottur
Abstract: A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem involving complex multimodal and temporal inputs, and studying them independently is hard with existing datasets. Existing benchmarks do not have enough annotations to help analyze dialogue systems and understand their linguistic and visual reasoning capability and limitations in isolation. These benchmarks are also not explicitly designed to minimize biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present a diagnostic dataset that can test a range of reasoning abilities on videos and dialogues. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning each question requires, including cross-turn video interval tracking and dialogue object tracking. We use our dataset to analyze several dialogue system approaches, providing interesting insights into their abilities and limitations. In total, the dataset contains $10$ instances of $10$-round dialogues for each of $\sim11k$ synthetic videos, resulting in more than $100k$ dialogues and $1M$ question-answer pairs. Our code and dataset will be made public.
摘要:视频对话系统需要同时理解对话(其中包含轮次之间的语义依赖)和视频(其中包含空间与时间场景变化的视觉线索)。构建这样的对话系统是一个具有挑战性的问题,涉及复杂的多模态和时序输入,而利用现有数据集很难对这些能力进行独立研究。现有的基准没有足够的标注来帮助分析对话系统,也难以单独理解其语言和视觉推理能力及局限。这些基准也没有被明确设计为最小化模型无需真正推理即可利用的偏差。为了解决这些限制,在本文中,我们提出了一个诊断数据集,可以测试视频和对话上的一系列推理能力。该数据集被设计为包含最小的偏差,并针对每个问题所需的不同类型推理提供了详细标注,包括跨轮次视频区间跟踪和对话对象跟踪。我们使用该数据集分析了几种对话系统方法,对其能力和局限提供了有趣的见解。总计,该数据集为约11k个合成视频各包含10个10轮对话实例,共产生超过10万段对话和100万个问答对。我们的代码和数据集将会公开。
注:中文为机器翻译结果!封面为论文标题词云图!