
[arXiv Papers] Computation and Language 2020-12-25

Contents

1. I like fish, especially dolphins: Addressing Contradictions in Dialogue Modelling [PDF] Abstract
2. To what extent do human explanations of model behavior align with actual model behavior? [PDF] Abstract
3. A Context Aware Approach for Generating Natural Language Attacks [PDF] Abstract
4. Co-GAT: A Co-Interactive Graph Attention Network for Joint Dialog Act Recognition and Sentiment Classification [PDF] Abstract
5. QUACKIE: A NLP Classification Task With Ground Truth Explanations [PDF] Abstract
6. Sentence-Based Model Agnostic NLP Interpretability [PDF] Abstract
7. REM-Net: Recursive Erasure Memory Network for Commonsense Evidence Refinement [PDF] Abstract
8. Gender Bias in Multilingual Neural Machine Translation: The Architecture Matters [PDF] Abstract
9. Cross-lingual Dependency Parsing as Domain Adaptation [PDF] Abstract
10. SubICap: Towards Subword-informed Image Captioning [PDF] Abstract
11. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language [PDF] Abstract
12. Multi-modal Identification of State-Sponsored Propaganda on Social Media [PDF] Abstract
13. Disentangling semantics in language through VAEs and a certain architectural choice [PDF] Abstract
14. Speech Synthesis as Augmentation for Low-Resource ASR [PDF] Abstract
15. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning [PDF] Abstract
16. Detecting Hateful Memes Using a Multimodal Deep Ensemble [PDF] Abstract
17. Unsupervised neural adaptation model based on optimal transport for spoken language identification [PDF] Abstract
18. WEmbSim: A Simple yet Effective Metric for Image Captioning [PDF] Abstract
19. Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge Graphs [PDF] Abstract
20. Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge [PDF] Abstract

Abstracts

1. I like fish, especially dolphins: Addressing Contradictions in Dialogue Modelling [PDF] Back to contents
  Yixin Nie, Mary Williamson, Mohit Bansal, Douwe Kiela, Jason Weston
Abstract: To quantify how well natural language understanding models can capture consistency in a general conversation, we introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues. We then compare a structured utterance-based approach of using pre-trained Transformer models for contradiction detection with the typical unstructured approach. Results reveal that: (i) our newly collected dataset is notably more effective at providing supervision for the dialogue contradiction detection task than existing NLI data including those aimed to cover the dialogue domain; (ii) the structured utterance-based approach is more robust and transferable on both analysis and out-of-distribution dialogues than its unstructured counterpart. We also show that our best contradiction detection model correlates well with human judgments and further provide evidence for its usage in both automatically evaluating and improving the consistency of state-of-the-art generative chatbots.
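
The structured utterance-based approach described here can be approximated by scoring the last utterance against each earlier turn with an off-the-shelf NLI model and max-aggregating the contradiction probabilities. A minimal sketch, assuming the public roberta-large-mnli checkpoint rather than the authors' trained model:

```python
# Sketch of a structured, utterance-pair contradiction check (not the
# authors' trained model): score the last utterance against each earlier
# one with an off-the-shelf NLI model and take the max.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli").eval()
contra_idx = {v.lower(): k for k, v in nli.config.id2label.items()}["contradiction"]

def contradiction_score(history, last_utterance):
    """Max contradiction probability of `last_utterance` vs. each past turn."""
    scores = []
    for past in history:
        enc = tok(past, last_utterance, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = nli(**enc).logits.softmax(dim=-1)[0]
        scores.append(probs[contra_idx].item())
    return max(scores)

print(contradiction_score(["I like fish, especially dolphins."],
                          "I would never eat fish."))
```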

2. To what extent do human explanations of model behavior align with actual model behavior? [PDF] Back to contents
  Grusha Prasad, Yixin Nie, Mohit Bansal, Robin Jia, Douwe Kiela, Adina Williams
Abstract: Given the increasingly prominent role NLP models (will) play in our lives, it is important to evaluate models on their alignment with human expectations of how models behave. Using Natural Language Inference (NLI) as a case study, we investigated the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions. More specifically, we defined two alignment metrics that quantify how well natural language human explanations align with model sensitivity to input words, as measured by integrated gradients. Then, we evaluated six different transformer models (the base and large versions of BERT, RoBERTa and ELECTRA), and found that the BERT-base model has the highest alignment with human-generated explanations, for both alignment metrics. Additionally, the base versions of the models we surveyed tended to have higher alignment with human-generated explanations than their larger counterparts, suggesting that increasing the number of model parameters could result in worse alignment with human explanations. Finally, we find that a model's alignment with human explanations is not predicted by the model's accuracy on NLI, suggesting that accuracy and alignment are orthogonal, and both are important ways to evaluate models.
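
The abstract does not spell out the two metric definitions; as a hedged, illustrative stand-in, one plausible alignment score is the fraction of total integrated-gradients attribution mass that falls on the words a human highlighted:

```python
# Hypothetical alignment score (the paper defines two metrics; this is an
# illustrative stand-in): share of total attribution mass, as produced by
# integrated gradients, that lands on human-highlighted words.
def alignment(tokens, attributions, human_words):
    assert len(tokens) == len(attributions)
    total = sum(abs(a) for a in attributions) or 1e-12
    on_human = sum(abs(a) for t, a in zip(tokens, attributions)
                   if t.lower() in human_words)
    return on_human / total

tokens = ["the", "dog", "is", "not", "sleeping"]
attrs  = [0.02, 0.40, 0.05, 0.90, 0.35]               # e.g., from integrated gradients
print(alignment(tokens, attrs, {"not", "sleeping"}))  # ~0.73
```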

3. A Context Aware Approach for Generating Natural Language Attacks [PDF] Back to contents
  Rishabh Maheshwary, Saket Maheshwary, Vikram Pudi
Abstract: We study an important task of attacking natural language processing models in a black box setting. We propose an attack strategy that crafts semantically similar adversarial examples on text classification and entailment tasks. Our proposed attack finds candidate words by considering the information of both the original word and its surrounding context. It jointly leverages masked language modelling and next sentence prediction for context understanding. In comparison to attacks proposed in prior literature, we are able to generate high quality adversarial examples that do significantly better both in terms of success rate and word perturbation percentage.
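
The masked-language-modelling half of the candidate-word search can be sketched directly; the paper's additional next-sentence-prediction filtering is omitted, and bert-base-uncased is an assumed stand-in for whatever model the authors use:

```python
# Sketch of context-aware candidate generation (masked-LM half only; the
# paper additionally filters candidates with next-sentence prediction).
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def candidates(sentence, target, top_k=5):
    """Propose context-aware replacements for `target` in `sentence`."""
    masked = sentence.replace(target, fill.tokenizer.mask_token, 1)
    return [p["token_str"] for p in fill(masked, top_k=top_k)]

print(candidates("the movie was a brilliant piece of cinema", "brilliant"))
```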

4. Co-GAT: A Co-Interactive Graph Attention Network for Joint Dialog Act Recognition and Sentiment Classification [PDF] Back to contents
  Libo Qin, Zhouyang Li, Wanxiang Che, Minheng Ni, Ting Liu
Abstract: In a dialog system, dialog act recognition and sentiment classification are two correlative tasks for capturing speakers' intentions, where dialog act and sentiment can indicate the explicit and the implicit intentions separately. The dialog context information (contextual information) and the mutual interaction information are two key factors that contribute to the two related tasks. Unfortunately, none of the existing approaches considers the two important sources of information simultaneously. In this paper, we propose a Co-Interactive Graph Attention Network (Co-GAT) to jointly perform the two tasks. The core module is a proposed co-interactive graph interaction layer in which a cross-utterance connection and a cross-task connection are constructed and iteratively updated with each other, so that the two types of information are considered simultaneously. Experimental results on two public datasets show that our model successfully captures the two sources of information and achieves state-of-the-art performance. In addition, we find that the contributions from the contextual and mutual interaction information do not fully overlap with contextualized word representations (BERT, RoBERTa, XLNet).
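
A schematic of the co-interactive idea, not the paper's exact equations: two task-specific streams of utterance representations each attend across utterances, then attend to each other across tasks, and are refreshed with the other stream's summary:

```python
# Schematic co-interactive update (assumed simplification of the paper's
# layer): dialog-act and sentiment streams attend across utterances and
# across tasks, then each stream absorbs the other's summary.
import torch
import torch.nn as nn

class CoInteractiveLayer(nn.Module):
    def __init__(self, d=128, heads=4):
        super().__init__()
        self.utt_act = nn.MultiheadAttention(d, heads, batch_first=True)
        self.utt_sent = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_task = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, h_act, h_sent):                     # (batch, n_utts, d) each
        h_act, _ = self.utt_act(h_act, h_act, h_act)      # cross-utterance
        h_sent, _ = self.utt_sent(h_sent, h_sent, h_sent)
        a2s, _ = self.cross_task(h_act, h_sent, h_sent)   # cross-task
        s2a, _ = self.cross_task(h_sent, h_act, h_act)
        return h_act + a2s, h_sent + s2a                  # residual update

layer = CoInteractiveLayer()
act, sent = layer(torch.randn(2, 6, 128), torch.randn(2, 6, 128))
```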

5. QUACKIE: A NLP Classification Task With Ground Truth Explanations [PDF] Back to contents
  Yves Rychener, Xavier Renard, Djamé Seddah, Pascal Frossard, Marcin Detyniecki
Abstract: NLP interpretability aims to increase trust in model predictions. This makes evaluating interpretability approaches a pressing issue. There are multiple datasets for evaluating NLP interpretability, but their dependence on human-provided ground truths raises questions about their unbiasedness. In this work, we take a different approach and formulate a specific classification task by diverting question-answering datasets. For this custom classification task, the interpretability ground truth arises directly from the definition of the classification problem. We use this method to propose a benchmark and lay the groundwork for future research in NLP interpretability by evaluating a wide range of current state-of-the-art methods.
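
One plausible reading of the construction (hedged; field names and the exact task definition are assumptions, not taken from the paper): turn a SQuAD-style QA example into a classification instance whose explanation ground truth is the sentence containing the answer span:

```python
# Sketch of deriving a classification task with built-in explanation ground
# truth from a SQuAD-style QA example (field names and task framing assumed).
def to_classification_instance(example):
    sentences = example["context"].split(". ")   # naive sentence splitting
    gold = [i for i, s in enumerate(sentences) if example["answer_text"] in s]
    # Classification target: is the question answerable from this context?
    # Explanation ground truth: the sentence(s) containing the answer.
    return {"question": example["question"],
            "sentences": sentences,
            "label": bool(gold),
            "explanation_sentences": gold}

ex = {"question": "Where was Ada Lovelace born?",
      "context": "Ada Lovelace was born in London. She worked with Babbage.",
      "answer_text": "London"}
print(to_classification_instance(ex)["explanation_sentences"])  # [0]
```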

6. Sentence-Based Model Agnostic NLP Interpretability [PDF] Back to contents
  Yves Rychener, Xavier Renard, Djamé Seddah, Pascal Frossard, Marcin Detyniecki
Abstract: Today, interpretability of black-box Natural Language Processing (NLP) models based on surrogates, like LIME or SHAP, uses word-based sampling to build the explanations. In this paper we explore the use of sentences to tackle NLP interpretability. While this choice may seem straightforward, we show that, when using complex classifiers like BERT, the word-based approach raises issues not only of computational complexity, but also of out-of-distribution sampling, eventually leading to unfounded explanations. By using sentences, the altered text remains in-distribution and the dimensionality of the problem is reduced for better fidelity to the black box at comparable computational complexity.
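
A minimal LIME-style surrogate at sentence granularity, assuming an arbitrary black-box `predict_proba` over raw text (this is a sketch of the idea, not the authors' implementation): sample sentence subsets, query the black box, and fit a linear model whose coefficients are per-sentence importances:

```python
# Minimal LIME-style surrogate at sentence granularity (black box assumed):
# sample sentence subsets, fit a linear model, read off importances.
import numpy as np
from sklearn.linear_model import Ridge

def sentence_importances(sentences, predict_proba, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    masks = rng.integers(0, 2, size=(n_samples, len(sentences)))
    masks[0] = 1                                   # keep the full text once
    texts = [" ".join(s for s, m in zip(sentences, row) if m) for row in masks]
    y = np.array([predict_proba(t) for t in texts])
    return Ridge(alpha=1.0).fit(masks, y).coef_    # one weight per sentence

fake_proba = lambda t: float("terrible" in t)      # stand-in black box
print(sentence_importances(["The plot was terrible.", "Great acting though."],
                           fake_proba))
```

Because each perturbed text is a subset of real sentences, it stays grammatical and in-distribution, which is exactly the argument the abstract makes against word-level deletion.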

7. REM-Net: Recursive Erasure Memory Network for Commonsense Evidence Refinement [PDF] Back to contents
  Yinya Huang, Meng Fang, Xunlin Zhan, Qingxing Cao, Xiaodan Liang, Liang Lin
Abstract: When answering a question, people often draw upon their rich world knowledge in addition to the particular context. While recent works retrieve supporting facts/evidence from commonsense knowledge bases to supply additional information to each question, there is still ample opportunity to improve the quality of that evidence. This is crucial since the quality of the evidence is the key to answering commonsense questions, and even determines the upper bound on the QA system's performance. In this paper, we propose a recursive erasure memory network (REM-Net) to improve the quality of the evidence. To this end, REM-Net is equipped with a module that refines the evidence by recursively erasing low-quality evidence that does not explain the question answering. Besides, instead of retrieving evidence from existing knowledge bases, REM-Net leverages a pre-trained generative model to generate candidate evidence customized for the question. We conduct experiments on two commonsense question answering datasets, WIQA and CosmosQA. The results demonstrate the performance of REM-Net and show that the refined evidence is explainable.
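
The recursive erasure idea can be sketched as a loop that repeatedly drops the evidence item whose removal hurts the answer score the least; `score` below is a hypothetical stand-in for the model's answer confidence, not REM-Net's actual memory module:

```python
# Sketch of recursive evidence erasure (`score` is a hypothetical stand-in
# for the model's answer confidence given question + evidence set).
def refine_evidence(question, evidence, score, rounds=1):
    evidence = list(evidence)
    for _ in range(rounds):
        if len(evidence) <= 1:
            break
        base = score(question, evidence)
        # Erase the item whose removal degrades the score the least,
        # i.e. the evidence that contributes least to answering.
        drop = min(range(len(evidence)),
                   key=lambda i: base - score(question, evidence[:i] + evidence[i+1:]))
        evidence.pop(drop)
    return evidence

ev = ["Leaves fall in autumn.", "Cats like boxes.",
      "Trees shed leaves to save water."]
print(refine_evidence("why do leaves fall?", ev,
                      score=lambda q, e: sum("leav" in s.lower() for s in e)))
```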

8. Gender Bias in Multilingual Neural Machine Translation: The Architecture Matters [PDF] Back to contents
  Marta R. Costa-jussà, Carlos Escolano, Christine Basta, Javier Ferrando, Roser Batlle, Ksenia Kharitonova
Abstract: Multilingual Neural Machine Translation architectures mainly differ in the amount of sharing modules and parameters among languages. In this paper, and from an algorithmic perspective, we explore if the chosen architecture, when trained with the same data, influences the gender bias accuracy. Experiments in four language pairs show that Language-Specific encoders-decoders exhibit less bias than the Shared encoder-decoder architecture. Further interpretability analysis of source embeddings and the attention shows that, in the Language-Specific case, the embeddings encode more gender information, and its attention is more diverted. Both behaviors help in mitigating gender bias.

9. Cross-lingual Dependency Parsing as Domain Adaptation [PDF] Back to contents
  Kailai Sun, Zuchao Li, Hai Zhao
Abstract: In natural language processing (NLP), cross-lingual transfer learning is as essential as in-domain learning due to the unavailability of annotated resources for low-resource languages. In this paper, we exploit pre-training tasks that extract universal features without supervision. We add two pre-training tasks as auxiliary tasks to dependency parsing in a multi-task setup, which improves the performance of the model in both in-domain and cross-lingual settings. Moreover, inspired by the usefulness of self-training in cross-domain learning, we combine traditional self-training with the two pre-training tasks. In this way, we can continuously extract universal features not only from the training corpus but also from extra unannotated data, and gain further improvement.

10. SubICap: Towards Subword-informed Image Captioning [PDF] Back to contents
  Naeha Sharif, Mohammed Bennamoun, Wei Liu, Syed Afaq Ali Shah
Abstract: Existing Image Captioning (IC) systems model words as atomic units in captions and are unable to exploit the structural information in the words. This makes representation of rare words very difficult and out-of-vocabulary words impossible. Moreover, to avoid computational complexity, existing IC models operate over a modest sized vocabulary of frequent words, such that the identity of rare words is lost. In this work we address this common limitation of IC systems in dealing with rare words in the corpora. We decompose words into smaller constituent units 'subwords' and represent captions as a sequence of subwords instead of words. This helps represent all words in the corpora using a significantly lower subword vocabulary, leading to better parameter learning. Using subword language modeling, our captioning system improves various metric scores, with a training vocabulary size approximately 90% less than the baseline and various state-of-the-art word-level models. Our quantitative and qualitative results and analysis signify the efficacy of our proposed approach.
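The core representational change is visible with any subword tokenizer; the sketch below uses an off-the-shelf WordPiece vocabulary as an assumed illustration, whereas the paper learns its own subword segmentation on caption data:

```python
# Word units vs. subword units for a caption (illustration with an
# off-the-shelf WordPiece vocabulary; the paper trains its own segmentation).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
caption = "a snowboarder performing an aerial trick"

print(caption.split())       # word units: rare words stay opaque atoms
print(tok.tokenize(caption)) # subword units: rare words decompose into
                             # frequent pieces (roughly 'snow ##board ##er'),
                             # so no caption word is out-of-vocabulary
```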

11. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language [PDF] Back to contents
  Oyvind Tafjord, Bhavana Dalvi Mishra, Peter Clark
Abstract: Transformers have been shown to emulate logical deduction over natural language theories (logical rules expressed in natural language), reliably assigning true/false labels to candidate implications. However, their ability to generate implications of a theory has not yet been demonstrated, and methods for reconstructing proofs of answers are imperfect. In this work we show that a generative model, called ProofWriter, can reliably generate both implications of a theory and the natural language proof(s) that support them. In particular, iterating a 1-step implication generator results in proofs that are highly reliable, and represent actual model decisions (rather than post-hoc rationalizations). On the RuleTaker dataset, the accuracy of ProofWriter's proofs exceed previous methods by +9% absolute, and in a way that generalizes to proof depths unseen in training and on out-of-domain problems. We also show that generative techniques can perform a type of abduction with high precision: Given a theory and an unprovable conclusion, identify a missing fact that allows the conclusion to be proved, along with a proof. These results significantly improve the viability of neural methods for systematically reasoning over natural language.
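The iterated 1-step generation can be sketched as saturation to a fixpoint; `generate_one_step` below is a hypothetical stand-in for the paper's trained generator, assumed to return (implication, proof) pairs derivable from the current theory in one step:

```python
# Sketch of iterating a 1-step implication generator to a fixpoint
# (`generate_one_step` is a hypothetical stand-in for the trained model).
def saturate(theory, generate_one_step, max_rounds=10):
    proofs = {}
    for _ in range(max_rounds):
        new = [(impl, prf) for impl, prf in generate_one_step(theory)
               if impl not in theory]
        if not new:
            break                      # fixpoint: no fresh implications
        for impl, prf in new:
            theory = theory | {impl}   # theory is a set of NL statements
            proofs[impl] = prf         # proof depth grows round by round
    return theory, proofs
```

Because each round only asserts 1-step consequences of statements already in the theory, the accumulated proofs reflect actual model decisions rather than post-hoc rationalizations, which is the reliability argument the abstract makes.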

12. Multi-modal Identification of State-Sponsored Propaganda on Social Media [PDF] Back to contents
  Xiaobo Guo, Soroush Vosoughi
Abstract: The prevalence of state-sponsored propaganda on the Internet has become a cause for concern in recent years. While much effort has been made to identify state-sponsored Internet propaganda, the problem remains far from being solved because the ambiguous definition of propaganda leads to unreliable data labelling, and the huge amount of potential predictive features causes the models to be inexplicable. This paper is the first attempt to build a balanced dataset for this task. The dataset is comprised of propaganda by three different organizations across two time periods. A multi-model framework for detecting propaganda messages solely based on the visual and textual content is proposed, which achieves promising performance on detecting propaganda by the three organizations both for the same time period (training and testing on data from the same time period) (F1=0.869) and for different time periods (training on past data, testing on future data) (F1=0.697). To reduce the influence of false positive predictions, we change the threshold to test the relationship between the false positive and true positive rates, and provide explanations for the predictions made by our models with visualization tools to enhance the interpretability of our framework. Our new dataset and general framework provide a strong benchmark for the task of identifying state-sponsored Internet propaganda and point out a potential path for future work on this task.
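
The threshold adjustment described can be sketched with standard ROC machinery; the exact procedure and field names below are assumptions, not the paper's code:

```python
# Sketch of trading true positives against false positives by moving the
# decision threshold (assumed procedure, using sklearn's ROC utilities).
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_fpr(y_true, y_score, max_fpr=0.05):
    """Pick the threshold with the best TPR while keeping FPR <= max_fpr."""
    fpr, tpr, thr = roc_curve(y_true, y_score)
    ok = fpr <= max_fpr
    best = np.argmax(tpr[ok])
    return thr[ok][best], tpr[ok][best], fpr[ok][best]

y = np.array([0, 0, 1, 1, 0, 1])
p = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
print(threshold_for_fpr(y, p, max_fpr=0.34))
```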

13. Disentangling semantics in language through VAEs and a certain architectural choice [PDF] Back to contents
  Ghazi Felhi, Joseph Le Roux, Djamé Seddah
Abstract: We present an unsupervised method to obtain disentangled representations of sentences that single out semantic content. Using modified Transformers as building blocks, we train a Variational Autoencoder to "translate" the sentence to a fixed number of hierarchically structured latent variables. We study the influence of each latent variable in generation on the dependency structure of sentences, and on the predicate structure it yields when passed through an Open Information Extraction model. Our model could separate verbs, subjects, direct objects, and prepositional objects into latent variables we identified. We show that varying the corresponding latent variables results in varying these elements in sentences, and that swapping them between couples of sentences leads to the expected partial semantic swap.

14. Speech Synthesis as Augmentation for Low-Resource ASR [PDF] Back to contents
  Deblin Bagchi, Shannon Wotherspoon, Zhuolin Jiang, Prasanna Muthukumar
Abstract: Speech synthesis might hold the key to low-resource speech recognition. Data augmentation techniques have become an essential part of modern speech recognition training. Yet, they are simple, naive, and rarely reflect real-world conditions. Meanwhile, speech synthesis techniques have been rapidly getting closer to the goal of achieving human-like speech. In this paper, we investigate the possibility of using synthesized speech as a form of data augmentation to lower the resources necessary to build a speech recognizer. We experiment with three different kinds of synthesizers: statistical parametric, neural, and adversarial. Our findings are interesting and point to new research directions for the future.

15. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning [PDF] Back to contents
  Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta
Abstract: Although pretrained language models can be fine-tuned to produce state-of-the-art results for a very wide range of language understanding tasks, the dynamics of this process are not well understood, especially in the low data regime. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples? In this paper, we argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions to explain this remarkable phenomenon. We empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space. For example, by optimizing only 200 trainable parameters randomly projected back into the full space, we can tune a RoBERTa model to achieve 90% of the full parameter performance levels on MRPC. Furthermore, we empirically show that pre-training implicitly minimizes intrinsic dimension and, perhaps surprisingly, larger models tend to have lower intrinsic dimension after a fixed number of pre-training updates, at least in part explaining their extreme effectiveness. Lastly, we connect intrinsic dimensionality with low dimensional task representations and compression based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count.
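
The reparameterization is θ = θ₀ + P·d: only the low-dimensional vector d is trained, with a fixed random projection P back into the full parameter space. A minimal PyTorch sketch for one linear layer, using a plain dense P for clarity (the paper uses memory-efficient structured projections such as Fastfood):

```python
# Sketch of intrinsic-dimension reparameterization: theta = theta0 + P @ d,
# with only the low-dimensional `d` trainable (dense random P for clarity;
# the paper uses structured projections like Fastfood).
import torch
import torch.nn as nn

class LowDimWrapper(nn.Module):
    def __init__(self, layer: nn.Linear, intrinsic_dim=200):
        super().__init__()
        self.theta0 = layer.weight.detach().clone()   # frozen initialization
        n = self.theta0.numel()
        self.P = torch.randn(n, intrinsic_dim) / intrinsic_dim ** 0.5  # fixed
        self.d = nn.Parameter(torch.zeros(intrinsic_dim))              # trained
        self.bias = layer.bias.detach()                                # frozen

    def forward(self, x):
        w = self.theta0 + (self.P @ self.d).view_as(self.theta0)
        return nn.functional.linear(x, w, self.bias)

wrapped = LowDimWrapper(nn.Linear(768, 768))
out = wrapped(torch.randn(4, 768))   # gradients flow only into `d`
```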

16. Detecting Hateful Memes Using a Multimodal Deep Ensemble [PDF] Back to contents
  Vlad Sandulescu
Abstract: While significant progress has been made using machine learning algorithms to detect hate speech, important technical challenges still remain to be solved in order to bring their performance closer to human accuracy. We investigate several of the most recent visual-linguistic Transformer architectures and propose improvements to increase their performance for this task. The proposed model outperforms the baselines by a large margin and ranks 5th on the leaderboard out of 3,100+ participants.

17. Unsupervised neural adaptation model based on optimal transport for spoken language identification [PDF] Back to contents
  Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai
Abstract: Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded. In this paper, we propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID. In our model, we explicitly formulate the adaptation as to reduce the distribution discrepancy on both feature and classifier for training and testing data sets. Moreover, inspired by the strong power of the optimal transport (OT) to measure distribution discrepancy, a Wasserstein distance metric is designed in the adaptation loss. By minimizing the classification loss on the training data set with the adaptation loss on both training and testing data sets, the statistical distribution difference between training and testing domains is reduced. We carried out SLID experiments on the oriental language recognition (OLR) challenge data corpus where the training and testing data sets were collected from different conditions. Our results showed that significant improvements were achieved on the cross domain test tasks.
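The Wasserstein term in the adaptation loss can be illustrated in one dimension, where for equal-size samples the distance reduces to comparing sorted values; the paper's actual loss operates on network features during training, so this is only an illustration:

```python
# 1-D Wasserstein-1 distance between two sample sets (sorted-sample form);
# the paper applies a Wasserstein metric to network features as a loss term.
import numpy as np

def wasserstein_1d(x, y):
    x, y = np.sort(x), np.sort(y)
    assert len(x) == len(y)            # equal sample counts for simplicity
    return np.mean(np.abs(x - y))

train_feats = np.random.normal(0.0, 1.0, 1000)
test_feats  = np.random.normal(0.8, 1.2, 1000)   # simulated domain shift
print(wasserstein_1d(train_feats, test_feats))   # shrinks as domains align
```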

18. WEmbSim: A Simple yet Effective Metric for Image Captioning [PDF] Back to contents
  Naeha Sharif, Lyndon White, Mohammed Bennamoun, Wei Liu, Syed Afaq Ali Shah
Abstract: The area of automatic image caption evaluation is still undergoing intensive research to address the needs of generating captions which can meet adequacy and fluency requirements. Based on our past attempts at developing highly sophisticated learning-based metrics, we have discovered that a simple cosine similarity measure using the Mean of Word Embeddings (MOWE) of captions can actually achieve a surprisingly high performance on unsupervised caption evaluation. This inspires our proposed work on an effective metric WEmbSim, which beats complex measures such as SPICE, CIDEr and WMD at system-level correlation with human judgments. Moreover, it also achieves the best accuracy at matching human consensus scores for caption pairs, against commonly used unsupervised methods. Therefore, we believe that WEmbSim sets a new baseline for any complex metric to be justified.
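
The metric itself is just cosine similarity between the mean word embeddings of the candidate and reference captions; a minimal sketch, with a toy random embedding table standing in for whatever pretrained embeddings are used:

```python
# WEmbSim in essence: cosine similarity between the Mean of Word Embeddings
# (MOWE) of candidate and reference captions (toy embedding table assumed).
import numpy as np

def wembsim(candidate, reference, emb):
    mowe = lambda words: np.mean([emb[w] for w in words if w in emb], axis=0)
    u, v = mowe(candidate.lower().split()), mowe(reference.lower().split())
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in
       "a dog runs on the grass puppy running across lawn".split()}
print(wembsim("a dog runs on the grass", "a puppy running across the lawn", emb))
```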

19. Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge Graphs [PDF] Back to contents
  Nurendra Choudhary, Nikhil Rao, Sumeet Katariya, Karthik Subbian, Chandan K. Reddy
Abstract: Knowledge Graphs (KGs) are ubiquitous structures for information storage in several real-world applications such as web search, e-commerce, social networks, and biology. Querying KGs remains a foundational and challenging problem due to their size and complexity. Promising approaches to tackle this problem include embedding the KG units (e.g., entities and relations) in a Euclidean space such that the query embedding contains the information relevant to its results. These approaches, however, fail to capture the hierarchical nature and semantic information of the entities present in the graph. Additionally, most of these approaches only utilize multi-hop queries (that can be modeled by simple translation operations) to learn embeddings and ignore more complex operations such as intersection and union of simpler queries. To tackle such complex operations, in this paper, we formulate KG representation learning as a self-supervised logical query reasoning problem that utilizes translation, intersection and union queries over KGs. We propose Hyperboloid Embeddings (HypE), a novel self-supervised dynamic reasoning framework, that utilizes positive first-order existential queries on a KG to learn representations of its entities and relations as hyperboloids in a Poincaré ball. HypE models the positive first-order queries as geometrical translation, intersection, and union. For the problem of KG reasoning in real-world datasets, the proposed HypE model significantly outperforms the state-of-the-art results. We also apply HypE to an anomaly detection task on a popular e-commerce website product taxonomy as well as hierarchically organized web articles and demonstrate significant performance improvements compared to existing baseline methods. Finally, we also visualize the learned HypE embeddings in a Poincaré ball to clearly interpret and comprehend the representation space.
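
The hyperbolic geometry behind this is the Poincaré ball, whose standard distance function is shown below as a sketch (this is the textbook formula, not the paper's full query operators); distances grow rapidly toward the boundary, which is what makes the space suit hierarchies:

```python
# Distance in the Poincaré ball (standard formula; the paper builds query
# operators, i.e. translation/intersection/union, on top of such geometry):
#   d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
import numpy as np

def poincare_distance(u, v):
    sq = lambda x: np.dot(x, x)
    return np.arccosh(1 + 2 * sq(u - v) / ((1 - sq(u)) * (1 - sq(v))))

root, leaf = np.array([0.0, 0.0]), np.array([0.0, 0.9])
print(poincare_distance(root, leaf))   # distances blow up near the boundary
```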

20. Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge [PDF] Back to contents
  Riza Velioglu, Jewgeni Rose
Abstract: Memes on the Internet are often harmless and sometimes amusing. However, by using certain types of images, text, or combinations of both, the seemingly harmless meme becomes a multimodal type of hate speech, a hateful meme. The Hateful Memes Challenge is a first-of-its-kind competition which focuses on detecting hate speech in multimodal memes, and it proposes a new data set containing 10,000+ new examples of multimodal content. We utilize VisualBERT, which is meant to be the BERT of vision and language, trained multimodally on images and captions, and apply ensemble learning. Our approach achieves 0.811 AUROC with an accuracy of 0.765 on the challenge test set and placed third out of 3,173 participants in the Hateful Memes Challenge.
