Table of Contents
1. Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! [PDF] Abstract
3. Demographic Representation and Collective Storytelling in the Me Too Twitter Hashtag Activism Movement [PDF] Abstract
7. Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension [PDF] Abstract
10. The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [PDF] Abstract
15. F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering [PDF] Abstract
17. BRUMS at SemEval-2020 Task 12: Transformer based Multilingual Offensive Language Identification in Social Media [PDF] Abstract
18. BRUMS at SemEval-2020 Task 3: Contextualised Embeddings for Predicting the (Graded) Effect of Context in Word Similarity [PDF] Abstract
25. The workweek is the best time to start a family -- A Study of GPT-2 Based Claim Generation [PDF] Abstract
26. Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance [PDF] Abstract
28. Corruption Is Not All Bad: Incorporating Discourse Structure into Pre-training via Corruption for Essay Scoring [PDF] Abstract
33. Supertagging Combinatory Categorial Grammar with Attentive Graph Convolutional Networks [PDF] Abstract
42. End-to-End Synthetic Data Generation for Domain Adaptation of Question Answering Systems [PDF] Abstract
47. Chatbot Interaction with Artificial Intelligence: Human Data Augmentation with T5 and Language Transformer Ensemble for Text Classification [PDF] Abstract
49. NEMO: Frequentist Inference Approach to Constrained Linguistic Typology Feature Prediction in SIGTYP 2020 Shared Task [PDF] Abstract
55. Fantastic Features and Where to Find Them: Detecting Cognitive Impairment with a Subsequence Classification Guided Approach [PDF] Abstract
56. Controlling the Interaction Between Generation and Inference in Semi-Supervised Variational Autoencoders Using Importance Weighting [PDF] Abstract
60. Artificial Intelligence, speech and language processing approaches to monitoring Alzheimer's Disease: a systematic review [PDF] Abstract
Abstracts
1. Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! [PDF] Back to Table of Contents
Jack Hessel, Lillian Lee
Abstract: Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task. This function projection modifies model predictions so that cross-modal interactions are eliminated, isolating the additive, unimodal structure. For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation. Surprisingly, this holds even when expressive models, with capacity to consider interactions, otherwise outperform less expressive models; thus, performance improvements, even when present, often cannot be attributed to consideration of cross-modal feature interactions. We hence recommend that researchers in multimodal machine learning report the performance not only of unimodal baselines, but also the EMAP of their best-performing model.
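To make the projection concrete: roughly, EMAP replaces each prediction f(text_i, image_i) with the additive surrogate mean_j f(text_i, image_j) + mean_j f(text_j, image_i) - mean_{j,k} f(text_j, image_k), computed empirically over the evaluation set. Below is a minimal numpy sketch of that idea, assuming we already have the model's logits for every text/image pairing; the array shapes and names are illustrative, not the authors' code.

```python
import numpy as np

def emap_projection(logits):
    """Empirical multimodally-additive projection (sketch).

    logits[i, j] holds the model's per-class logits for text i paired with
    image j; the diagonal corresponds to the real (text_i, image_i) examples.
    Shape: (N, N, num_classes).
    """
    text_mean = logits.mean(axis=1, keepdims=True)    # average over images
    image_mean = logits.mean(axis=0, keepdims=True)   # average over texts
    grand_mean = logits.mean(axis=(0, 1), keepdims=True)
    additive = text_mean + image_mean - grand_mean    # additive surrogate, (N, N, C)
    # EMAP predictions for the original pairs are the diagonal entries.
    n = logits.shape[0]
    return additive[np.arange(n), np.arange(n)]       # (N, C)

# Toy usage with random "logits" standing in for a real multimodal model.
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(8, 8, 3))
emap_preds = emap_projection(fake_logits).argmax(axis=-1)
print(emap_preds)
```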
2. XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [PDF] Back to Table of Contents
Alessandro Raganato, Tommaso Pasini, Jose Camacho-Collados, Mohammad Taher Pilehvar
Abstract: The ability to correctly model distinct meanings of a word is crucial for the effectiveness of semantic representation techniques. However, most existing evaluation benchmarks for assessing this criterion are tied to sense inventories (usually WordNet), restricting their usage to a small subset of knowledge-based representation techniques. The Word-in-Context dataset (WiC) addresses the dependence on sense inventories by reformulating the standard disambiguation task as a binary classification problem; but, it is limited to the English language. We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages from varied language families and with different degrees of resource availability, opening room for evaluation scenarios such as zero-shot cross-lingual transfer. We perform a series of experiments to determine the reliability of the datasets and to set performance baselines for several recent contextualized multilingual models. Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance in the task of distinguishing different meanings of a word, even for distant languages. XL-WiC is available at this https URL.
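To illustrate the binary-classification framing mentioned above, a WiC-style instance pairs one target word with two contexts and a yes/no label. The sketch below shows that format together with a naive thresholded cosine-similarity baseline, where `embed_in_context` is a hypothetical stand-in for a contextualised encoder.

```python
import numpy as np

# One WiC-style instance: a target word, two contexts, a binary label.
instance = {
    "word": "bank",
    "context_1": "She sat on the bank of the river.",
    "context_2": "He deposited the cheque at the bank.",
    "label": 0,  # 0 = different senses, 1 = same sense
}

def embed_in_context(word, sentence):
    """Hypothetical: return a contextual vector for `word` in `sentence`.

    A real system would take the target token's hidden state from a
    contextualised encoder; here it is a deterministic random stand-in.
    """
    rng = np.random.default_rng(abs(hash((word, sentence))) % (2**32))
    return rng.normal(size=128)

def predict_same_sense(inst, threshold=0.5):
    a = embed_in_context(inst["word"], inst["context_1"])
    b = embed_in_context(inst["word"], inst["context_2"])
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return int(cosine >= threshold)

print(predict_same_sense(instance))
```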
3. Demographic Representation and Collective Storytelling in the Me Too Twitter Hashtag Activism Movement [PDF] Back to Table of Contents
Aaron Mueller, Zach Wood-Doughty, Silvio Amir, Mark Dredze, Alicia L. Nobles
Abstract: The #MeToo movement on Twitter has drawn attention to the pervasive nature of sexual harassment and violence. While #MeToo has been praised for providing support for self-disclosures of harassment or violence and shifting societal response, it has also been criticized for exemplifying how women of color have been discounted for their historical contributions to and excluded from feminist movements. Through an analysis of over 600,000 tweets from over 256,000 unique users, we examine online #MeToo conversations across gender and racial/ethnic identities and the topics that each demographic emphasized. We found that tweets authored by white women were overrepresented in the movement compared to other demographics, aligning with criticism of unequal representation. We found that intersected identities contributed differing narratives to frame the movement, co-opted the movement to raise visibility in parallel ongoing movements, employed the same hashtags both critically and supportively, and revived and created new hashtags in response to pivotal moments. Notably, tweets authored by black women often expressed emotional support and were critical about differential treatment in the justice system and by police. In comparison, tweets authored by white women and men often highlighted sexual harassment and violence by public figures and weaved in more general political discussions. We discuss the implications of work for digital activism research and design including suggestions to raise visibility by those who were under-represented in this hashtag activism movement. Content warning: this article discusses issues of sexual harassment and violence.
4. Pagsusuri ng RNN-based Transfer Learning Technique sa Low-Resource Language [PDF] Back to Table of Contents
Dan John Velasco
Abstract: Low-resource languages such as Filipino suffer from data scarcity which makes it challenging to develop NLP applications for Filipino language. The use of Transfer Learning (TL) techniques alleviates this problem in low-resource setting. In recent years, transformer-based models are proven to be effective in low-resource tasks but faces challenges in accessibility due to its high compute and memory requirements. There's a need for a cheaper but effective alternative. This paper has three contributions. First, release a pre-trained AWD LSTM language model for Filipino language. Second, benchmark AWD LSTM in the Hate Speech classification task and show that it performs on par with transformer-based models. Third, analyze the degradation rate of AWD-LSTM to smaller data using degradation test and compare it with transformer-based models. ---- Ang mga low-resource languages tulad ng Filipino ay gipit sa accessible na datos kaya't mahirap gumawa ng mga applications sa wikang ito. Ang mga Transfer Learning (TL) techniques ay malaking tulong para sa mga pagkakataong gipit tayo sa datos. Sa mga nagdaang taon, nanaig ang mga transformer-based TL techniques pagdating sa low-resource tasks ngunit ito ay magastos sa resources. Kaya nangangailangan ng mas mura pero epektibong alternatibo. Ang papel na ito ay may tatlong kontribusyon. Una, maglabas ng pre-trained AWD LSTM language model sa wikang Filipino upang maging tuntungan sa pagbuo ng mga NLP applications sa wikang Filipino. Pangalawa, mag benchmark ng AWD LSTM sa Hate Speech classification task at ipakita na kayang nitong makipagsabayan sa mga transformer-based models. Pangatlo, suriin ang degradation rate ng AWD-LSTM sa mas maliit na data gamit ang degradation test at ikumpara ito sa mga transformer-based models.
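One plausible reading of the degradation test described above is a data-ablation curve: refit the classifier on progressively smaller slices of the training set and track how the score falls. A small sketch under that assumption, with `train_and_evaluate` as a hypothetical placeholder for the actual training and scoring step.

```python
import random

def train_and_evaluate(train_subset, test_set):
    """Hypothetical: fine-tune the classifier on `train_subset` and return
    accuracy on `test_set`. Replaced here by a dummy score for illustration."""
    return min(1.0, 0.5 + 0.05 * len(train_subset) ** 0.5)

def degradation_curve(train_set, test_set, fractions=(1.0, 0.75, 0.5, 0.25, 0.1)):
    random.seed(0)
    shuffled = random.sample(train_set, k=len(train_set))
    curve = []
    for frac in fractions:
        subset = shuffled[: max(1, int(frac * len(shuffled)))]
        curve.append((frac, train_and_evaluate(subset, test_set)))
    return curve

train = [("example text", 0)] * 200  # placeholder labelled data
test = [("example text", 1)] * 50
for frac, acc in degradation_curve(train, test):
    print(f"train fraction {frac:.2f} -> accuracy {acc:.3f}")
```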
5. RuSemShift: a dataset of historical lexical semantic change in Russian [PDF] Back to Table of Contents
Julia Rodina, Andrey Kutuzov
Abstract: We present RuSemShift, a large-scale manually annotated test set for the task of semantic change modeling in Russian for two long-term time period pairs: from the pre-Soviet through the Soviet times and from the Soviet through the post-Soviet times. Target words were annotated by multiple crowd-source workers. The annotation process was organized following the DURel framework and was based on sentence contexts extracted from the Russian National Corpus. Additionally, we report the performance of several distributional approaches on RuSemShift, achieving promising results, which at the same time leave room for other researchers to improve.
6. Multilingual Argument Mining: Datasets and Analysis [PDF] Back to Table of Contents
Orith Toledo-Ronen, Matan Orbach, Yonatan Bilu, Artem Spector, Noam Slonim
Abstract: The growing interest in argument mining and computational argumentation brings with it a plethora of Natural Language Understanding (NLU) tasks and corresponding datasets. However, as with many other NLU tasks, the dominant language is English, with resources in other languages being few and far between. In this work, we explore the potential of transfer learning using the multilingual BERT model to address argument mining tasks in non-English languages, based on English datasets and the use of machine translation. We show that such methods are well suited for classifying the stance of arguments and detecting evidence, but less so for assessing the quality of arguments, presumably because quality is harder to preserve under translation. In addition, focusing on the translate-train approach, we show how the choice of languages for translation, and the relations among them, affect the accuracy of the resultant model. Finally, to facilitate evaluation of transfer learning on argument mining tasks, we provide a human-generated dataset with more than 10k arguments in multiple languages, as well as machine translation of the English datasets.
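The translate-train setup mentioned above can be sketched as: machine-translate the English training data into each target language, pool the translations with the original data, and fine-tune a multilingual encoder on the pooled set. In the sketch below, `machine_translate` and `fine_tune_multilingual_bert` are hypothetical placeholders, not the authors' pipeline.

```python
def machine_translate(text, target_lang):
    """Hypothetical MT call; a real pipeline would use an actual MT system."""
    return f"[{target_lang}] {text}"

def fine_tune_multilingual_bert(examples):
    """Hypothetical fine-tuning step; returns a trivial 'model' here."""
    return {"num_training_examples": len(examples)}

english_train = [("This claim is supported by strong evidence.", "evidence"),
                 ("Taxes should be lowered.", "claim")]
target_langs = ["de", "fr", "es"]

# Pool the original English data with its translations into each target language.
pooled = list(english_train)
for text, label in english_train:
    for lang in target_langs:
        pooled.append((machine_translate(text, lang), label))

model = fine_tune_multilingual_bert(pooled)
print(model)
```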
7. Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension [PDF] Back to Table of Contents
Ekta Sood, Simon Tannert, Diego Frassinelli, Andreas Bulling, Ngoc Thang Vu
Abstract: While neural networks with attention mechanisms have achieved superior performance on many natural language processing tasks, it remains unclear to which extent learned attention resembles human visual attention. In this paper, we propose a new method that leverages eye-tracking data to investigate the relationship between human visual attention and neural attention in machine reading comprehension. To this end, we introduce a novel 23 participant eye tracking dataset - MQA-RC, in which participants read movie plots and answered pre-defined questions. We compare state of the art networks based on long short-term memory (LSTM), convolutional neural models (CNN) and XLNet Transformer architectures. We find that higher similarity to human attention and performance significantly correlates to the LSTM and CNN models. However, we show this relationship does not hold true for the XLNet models -- despite the fact that the XLNet performs best on this challenging task. Our results suggest that different architectures seem to learn rather different neural attention strategies and similarity of neural to human attention does not guarantee best performance.
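One simple way to quantify the kind of model/human comparison described above is to treat the model's attention weights and the human fixation durations as two distributions over the same tokens and correlate them; the paper's exact similarity measure may differ. A toy sketch using Spearman rank correlation:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation between two equal-length vectors (no tie correction)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx = (rx - rx.mean()) / rx.std()
    ry = (ry - ry.mean()) / ry.std()
    return float((rx * ry).mean())

tokens = ["the", "detective", "questioned", "the", "witness", "yesterday"]
# Normalised model attention over the tokens (toy values).
model_attention = np.array([0.05, 0.30, 0.25, 0.05, 0.30, 0.05])
# Normalised human fixation durations over the same tokens (toy values).
human_fixations = np.array([0.02, 0.35, 0.20, 0.03, 0.33, 0.07])

print(f"attention/fixation similarity: {spearman(model_attention, human_fixations):.3f}")
```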
8. Aspect-based Document Similarity for Research Papers [PDF] Back to Table of Contents
Malte Ostendorff, Terry Ruas, Till Blume, Bela Gipp, Georg Rehm
Abstract: Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider in what aspects two documents are similar. This limits the granularity of applications like recommender systems that rely on document similarity. In this paper, we extend similarity with aspect information by performing a pairwise document classification task. We evaluate our aspect-based document similarity for research papers. Paper citations indicate the aspect-based similarity, i.e., the section title in which a citation occurs acts as a label for the pair of citing and cited paper. We apply a series of Transformer models such as RoBERTa, ELECTRA, XLNet, and BERT variations and compare them to an LSTM baseline. We perform our experiments on two newly constructed datasets of 172,073 research paper pairs from the ACL Anthology and CORD-19 corpus. Our results show SciBERT as the best performing system. A qualitative examination validates our quantitative results. Our findings motivate future research of aspect-based document similarity and the development of a recommender system based on the evaluated techniques. We make our datasets, code, and trained models publicly available.
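The labelling scheme above can be made concrete: each example is a (citing paper, cited paper) pair whose label is the section in which the citation occurs. Below is a small sketch of assembling such pairs from citation records; the field names and record format are assumptions for illustration, not the released dataset's schema.

```python
# Hypothetical citation records: which paper cites which, and in which section.
citations = [
    {"citing": "paper_A", "cited": "paper_B", "section": "related work"},
    {"citing": "paper_A", "cited": "paper_C", "section": "methods"},
    {"citing": "paper_D", "cited": "paper_B", "section": "results"},
]

# Hypothetical lookup from paper id to its title + abstract text.
texts = {
    "paper_A": "A survey of document similarity ...",
    "paper_B": "Contextual embeddings for scientific text ...",
    "paper_C": "Graph methods for citation analysis ...",
    "paper_D": "Benchmarking recommender systems ...",
}

# Each example is a pair of documents plus the section label, ready for a
# pairwise (cross-encoder style) classifier.
examples = [
    {"text_pair": (texts[c["citing"]], texts[c["cited"]]), "label": c["section"]}
    for c in citations
]
for ex in examples:
    print(ex["label"], "->", ex["text_pair"][0][:30], "|", ex["text_pair"][1][:30])
```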
9. Fine-grained linguistic evaluation for state-of-the-art Machine Translation [PDF] Back to Table of Contents
Eleftherios Avramidis, Vivien Macketanz, Ursula Strohriegel, Aljoscha Burchardt, Sebastian Möller
Abstract: This paper describes a test suite submission providing detailed statistics of linguistic performance for the state-of-the-art German-English systems of the Fifth Conference of Machine Translation (WMT20). The analysis covers 107 phenomena organized in 14 categories based on about 5,500 test items, including a manual annotation effort of 45 person hours. Two systems (Tohoku and Huoshan) appear to have significantly better test suite accuracy than the others, although the best system of WMT20 is not significantly better than the one from WMT19 in a macro-average. Additionally, we identify some linguistic phenomena where all systems suffer (such as idioms, resultative predicates and pluperfect), but we are also able to identify particular weaknesses for individual systems (such as quotation marks, lexical ambiguity and sluicing). Most of the systems of WMT19 which submitted new versions this year show improvements.
10. The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [PDF] Back to Table of Contents
Jörg Tiedemann
Abstract: This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages and tools for creating state-of-the-art translation models from that collection. The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World's languages. Using the package it is possible to work on realistic low-resource scenarios avoiding artificially reduced setups that are common when demonstrating zero-shot or few-shot learning. For the first time, this package provides a comprehensive collection of diverse data sets in hundreds of languages with systematic language and script annotation and data splits to extend the narrow coverage of existing benchmarks. Together with the data release, we also provide a growing number of pre-trained baseline models for individual language pairs and selected language groups.
11. CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations [PDF] Back to Table of Contents
Fuli Luo, Pengcheng Yang, Shicheng Li, Xuancheng Ren, Xu Sun
Abstract: Pre-trained self-supervised models such as BERT have achieved striking success in learning sequence representations, especially for natural language processing. These models typically corrupt the given sequences with certain types of noise, such as masking, shuffling, or substitution, and then try to recover the original input. However, such pre-training approaches are prone to learning representations that are covariant with the noise, leading to the discrepancy between the pre-training and fine-tuning stage. To remedy this, we present ContrAstive Pre-Training (CAPT) to learn noise invariant sequence representations. The proposed CAPT encourages the consistency between representations of the original sequence and its corrupted version via unsupervised instance-wise training signals. In this way, it not only alleviates the pretrain-finetune discrepancy induced by the noise of pre-training, but also aids the pre-trained model in better capturing global semantics of the input via more effective sentence-level supervision. Different from most prior work that focuses on a particular modality, comprehensive empirical evidence on 11 natural language understanding and cross-modal tasks illustrates that CAPT is applicable for both language and vision-language tasks, and obtains surprisingly consistent improvement, including 0.6% absolute gain on GLUE benchmarks and 0.8% absolute increment on NLVR.
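The consistency objective described above can be illustrated with a generic InfoNCE-style contrastive loss that pulls each sequence's representation toward its corrupted view and away from the other sequences in the batch; this is a sketch of the general idea, not necessarily the exact CAPT formulation.

```python
import numpy as np

def info_nce(orig, corrupted, temperature=0.1):
    """Contrastive consistency loss between original and corrupted views.

    orig, corrupted: (batch, dim) sequence representations. The positive for
    row i in `orig` is row i in `corrupted`; all other rows act as negatives.
    """
    orig = orig / np.linalg.norm(orig, axis=1, keepdims=True)
    corrupted = corrupted / np.linalg.norm(corrupted, axis=1, keepdims=True)
    logits = orig @ corrupted.T / temperature          # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))         # -log p(positive)

rng = np.random.default_rng(0)
reps = rng.normal(size=(16, 64))
noisy_reps = reps + 0.1 * rng.normal(size=(16, 64))    # stand-in for corrupted views
print(f"contrastive loss: {info_nce(reps, noisy_reps):.4f}")
```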
12. Modeling the Music Genre Perception across Language-Bound Cultures [PDF] Back to Table of Contents
Elena V. Epure, Guillaume Salha, Manuel Moussallam, Romain Hennequin
Abstract: The music genre perception expressed through human annotations of artists or albums varies significantly across language-bound cultures. These variations cannot be modeled as mere translations since we also need to account for cultural differences in the music genre perception. In this work, we study the feasibility of obtaining relevant cross-lingual, culture-specific music genre annotations based only on language-specific semantic representations, namely distributed concept embeddings and ontologies. Our study, focused on six languages, shows that unsupervised cross-lingual music genre annotation is feasible with high accuracy, especially when combining both types of representations. This approach of studying music genres is the most extensive to date and has many implications in musicology and music information retrieval. Besides, we introduce a new, domain-dependent cross-lingual corpus to benchmark state of the art multilingual pre-trained embedding models.
13. Cross-Supervised Joint-Event-Extraction with Heterogeneous Information Networks [PDF] Back to Table of Contents
Yue Wang, Zhuo Xu, Lu Bai, Yao Wan, Lixin Cui, Qian Zhao, Edwin R. Hancock, Philip S. Yu
Abstract: Joint-event-extraction, which extracts structural information (i.e., entities or triggers of events) from unstructured real-world corpora, has attracted more and more research attention in natural language processing. Most existing works do not fully address the sparse co-occurrence relationships between entities and triggers, which loses this important information and thus deteriorates the extraction performance. To mitigate this issue, we first define the joint-event-extraction as a sequence-to-sequence labeling task with a tag set composed of tags of triggers and entities. Then, to incorporate the missing information in the aforementioned co-occurrence relationships, we propose a Cross-Supervised Mechanism (CSM) to alternately supervise the extraction of either triggers or entities based on the type distribution of each other. Moreover, since the connected entities and triggers naturally form a heterogeneous information network (HIN), we leverage the latent pattern along meta-paths for a given corpus to further improve the performance of our proposed method. To verify the effectiveness of our proposed method, we conduct extensive experiments on three real-world datasets as well as compare our method with state-of-the-art methods. Empirical results and analysis show that our approach outperforms the state-of-the-art methods in both entity and trigger extraction.
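The sequence-labelling formulation above can be pictured with a merged tag inventory in which entity types and trigger types share one BIO-style scheme, so a single tagger emits both kinds of labels. A toy tagged sentence follows; the tag names are illustrative, not the paper's inventory.

```python
# One sentence labelled with a merged tag set: entity tags and trigger tags
# live in the same BIO-style inventory, so one sequence tagger covers both.
tokens = ["The", "company", "hired", "three", "engineers", "in", "Berlin", "."]
tags   = ["O", "B-ORG", "B-TRIGGER-HIRE", "O", "B-PER", "O", "B-LOC", "O"]

for tok, tag in zip(tokens, tags):
    print(f"{tok:10s} {tag}")
```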
14. Extending Implicit Discourse Relation Recognition to the PDTB-3 [PDF] Back to Table of Contents
Li Liang, Zheng Zhao, Bonnie Webber
Abstract: The PDTB-3 contains many more Implicit discourse relations than the previous PDTB-2. This is in part because implicit relations have now been annotated within sentences as well as between them. In addition, some now co-occur with explicit discourse relations, instead of standing on their own. Here we show that while this can complicate the problem of identifying the location of implicit discourse relations, it can in turn simplify the problem of identifying their senses. We present data to support this claim, as well as methods that can serve as a non-trivial baseline for future state-of-the-art recognizers for implicit discourse relations.
15. F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering [PDF] Back to Table of Contents
Hendrik Schuff, Heike Adel, Ngoc Thang Vu
Abstract: Explainable question answering systems predict an answer together with an explanation showing why the answer has been selected. The goal is to enable users to assess the correctness of the system and understand its reasoning process. However, we show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation which might cause serious issues in user experience. As a remedy, we propose a hierarchical model and a new regularization term to strengthen the answer-explanation coupling as well as two evaluation scores to quantify the coupling. We conduct experiments on the HOTPOTQA benchmark data set and perform a user study. The user study shows that our models increase the ability of the users to judge the correctness of the system and that scores like F1 are not enough to estimate the usefulness of a model in a practical setting with human users. Our scores are better aligned with user experience, making them promising candidates for model selection.
16. RGCL at SemEval-2020 Task 6: Neural Approaches to Definition Extraction [PDF] Back to Table of Contents
Tharindu Ranasinghe, Alistair Plum, Constantin Orasan, Ruslan Mitkov
Abstract: This paper presents the RGCL team submission to SemEval 2020 Task 6: DeftEval, subtasks 1 and 2. The system classifies definitions at the sentence and token levels. It utilises state-of-the-art neural network architectures, which have some task-specific adaptations, including an automatically extended training set. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility in architecture selection.
17. BRUMS at SemEval-2020 Task 12: Transformer based Multilingual Offensive Language Identification in Social Media [PDF] Back to Table of Contents
Tharindu Ranasinghe, Hansi Hettiarachchi
Abstract: In this paper, we describe the team \textit{BRUMS} entry to OffensEval 2: Multilingual Offensive Language Identification in Social Media in SemEval-2020. The OffensEval organizers provided participants with annotated datasets containing posts from social media in Arabic, Danish, English, Greek and Turkish. We present a multilingual deep learning model to identify offensive language in social media. Overall, the approach achieves acceptable evaluation scores, while maintaining flexibility between languages.
18. BRUMS at SemEval-2020 Task 3: Contextualised Embeddings for Predicting the (Graded) Effect of Context in Word Similarity [PDF] Back to Table of Contents
Hansi Hettiarachchi, Tharindu Ranasinghe
Abstract: This paper presents the team BRUMS submission to SemEval-2020 Task 3: Graded Word Similarity in Context. The system utilises state-of-the-art contextualised word embeddings, which have some task-specific adaptations, including stacked embeddings and average embeddings. Overall, the approach achieves good evaluation scores across all the languages, while maintaining simplicity. Following the final rankings, our approach is ranked within the top 5 solutions of each language while preserving the 1st position of Finnish subtask 2.
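The two pooling choices mentioned above can be shown directly: stacked embeddings concatenate a token's vectors from several encoders, while average embeddings take their element-wise mean (which requires matching dimensions). A minimal numpy sketch with random vectors standing in for real contextualised embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for the same token's vectors from two different encoders.
vec_model_a = rng.normal(size=768)
vec_model_b = rng.normal(size=768)

# Stacked embeddings: concatenate the per-model vectors (dimension grows).
stacked = np.concatenate([vec_model_a, vec_model_b])   # shape (1536,)

# Average embeddings: element-wise mean (requires matching dimensions).
averaged = (vec_model_a + vec_model_b) / 2              # shape (768,)

print(stacked.shape, averaged.shape)
```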
19. Enhancing Extractive Text Summarization with Topic-Aware Graph Neural Networks [PDF] Back to Table of Contents
Peng Cui, Le Hu, Yuanchao Liu
Abstract: Text summarization aims to compress a textual document to a short summary while keeping salient information. Extractive approaches are widely used in text summarization because of their fluency and efficiency. However, most of existing extractive models hardly capture inter-sentence relationships, particularly in long documents. They also often ignore the effect of topical information on capturing important contents. To address these issues, this paper proposes a graph neural network (GNN)-based extractive summarization model, enabling to capture inter-sentence relationships efficiently via graph-structured document representation. Moreover, our model integrates a joint neural topic model (NTM) to discover latent topics, which can provide document-level features for sentence selection. The experimental results demonstrate that our model not only substantially achieves state-of-the-art results on CNN/DM and NYT datasets but also considerably outperforms existing approaches on scientific paper datasets consisting of much longer documents, indicating its better robustness in document genres and lengths. Further discussions show that topical information can help the model preselect salient contents from an entire document, which interprets its effectiveness in long document summarization.
摘要:文本摘要的目的,同时保持着的信息压缩的文本文件,以一个简短的摘要。采掘方法被广泛应用于文本摘要,因为他们的流畅性和效率。然而,大多数现有的采掘模式难以捕捉句间关系,特别是在长文档。他们还常常忽略局部信息获取重要内容的效果。为了解决这些问题,本文提出了一种图形神经网络(GNN)为主的采掘总结模型,从而能够有效地捕捉句间关系通过图形结构的文档表示。此外,我们的模型集成了一个神经联合主题模型(NTM)来发现潜在主题,可为句子的选择提供文件级功能。实验结果表明,我们的模型不仅大大实现了在CNN / DM和纽约时报的数据集的国家的最先进的成果,但也大大优于科学包括更长的文档文件的数据集现有的方法,说明文档类型的鲁棒性好,长度。进一步的讨论表明,局部的信息可以从一个完整的文件,该文件在解释长期文档文摘其有效性有助于模型预选突出的内容。
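A toy sketch of the core idea behind graph-based extractive summarization: treat sentences as nodes, connect them by similarity, refine their representations with one graph-convolution step (H' = ReLU(Â H W)), and score them for extraction. The neural topic model component of the paper is omitted, and all shapes, thresholds and names here are illustrative.

```python
import torch
import torch.nn as nn

n_sent, dim = 5, 32
H = torch.randn(n_sent, dim)                      # toy sentence embeddings

sim = torch.cosine_similarity(H.unsqueeze(1), H.unsqueeze(0), dim=-1)
A = (sim > 0.1).float()                           # thresholded similarity graph
D_inv_sqrt = torch.diag(A.sum(dim=1).pow(-0.5))   # (self-similarity gives self-loops)
A_hat = D_inv_sqrt @ A @ D_inv_sqrt               # symmetrically normalised adjacency

W = nn.Linear(dim, dim, bias=False)
scorer = nn.Linear(dim, 1)

H1 = torch.relu(A_hat @ W(H))                     # one graph-convolution step
scores = scorer(H1).squeeze(-1)                   # per-sentence salience
print(scores.topk(2).indices)                     # indices of the extracted sentences
```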
20. Annotationsaurus: A Searchable Directory of Annotation Tools [PDF] 返回目录
Mariana Neves, Jurica Seva
Abstract: Manual annotation of textual documents is a necessary task when constructing benchmark corpora for training and evaluating machine learning algorithms. We created a comprehensive directory of annotation tools that currently includes 93 tools. We analyzed the tools over a set of 31 features and implemented simple scripts and a Web application that filters the tools based on chosen criteria. We present two use cases using the directory and propose ideas for its maintenance. The directory, source codes for scripts, and link to the Web application are available at: this https URL
摘要:文本文件的手动注释是一项必要的工作构建基准语料训练和评估机器学习算法时。我们创建的注释工具,目前包括93个工具的综合目录。我们分析的工具,对一组31个功能并实现简单的脚本和过滤基于选择的标准工具的Web应用程序。我们目前使用目录中的两个用例并提出其维修的想法。该目录下,脚本的源代码,并链接到Web应用程序,请访问:此HTTPS URL
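The directory's filtering scripts boil down to selecting tools whose feature values match chosen criteria. A minimal sketch is below; the tool records and feature names are hypothetical, not the actual Annotationsaurus data.

```python
tools = [   # hypothetical records, one boolean per feature
    {"name": "ToolA", "web_based": True,  "free": True,  "active_learning": False},
    {"name": "ToolB", "web_based": False, "free": True,  "active_learning": True},
    {"name": "ToolC", "web_based": True,  "free": False, "active_learning": True},
]

def filter_tools(tools, **criteria):
    """Keep the tools whose features match every requested criterion."""
    return [t for t in tools
            if all(t.get(feature) == wanted for feature, wanted in criteria.items())]

print(filter_tools(tools, web_based=True, free=True))   # -> only ToolA
```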
21. KLearn: Background Knowledge Inference from Summarization Data [PDF] 返回目录
Maxime Peyrard, Robert West
Abstract: The goal of text summarization is to compress documents to the relevant information while excluding background information already known to the receiver. So far, summarization researchers have given considerably more attention to relevance than to background knowledge. In contrast, this work puts background knowledge in the foreground. Building on the realization that the choices made by human summarizers and annotators contain implicit information about their background knowledge, we develop and compare techniques for inferring background knowledge from summarization data. Based on this framework, we define summary scoring functions that explicitly model background knowledge, and show that these scoring functions fit human judgments significantly better than baselines. We illustrate some of the many potential applications of our framework. First, we provide insights into human information importance priors. Second, we demonstrate that averaging the background knowledge of multiple, potentially biased annotators or corpora greatly improves summary-scoring performance. Finally, we discuss potential applications of our framework beyond summarization.
摘要:文本摘要的目的是要压缩的文件的相关信息,同时排除已经知道接收器的背景信息。到目前为止,研究人员总结给出了相当多的关注意义,而不是背景知识。相比之下,这项工作提出背景知识在前台。这样的认识:人类summarizers和注释作出的选择包含有关其背景知识的隐含信息的基础上,我们开发和从汇总数据推断背景知识比较技术。基于这个框架,我们定义汇总计分函数,明确模型的背景知识,并显示这些计分函数显著优于基准适合人类的判断。我们说明了一些我们的框架的许多潜在的应用。首先,我们可以深入了解人类信息的重要性前科。其次,我们证明了平均多个的背景知识,可能偏向注释或语料库大大提高了总结得分的表现。最后,我们讨论了我们超越总结框架的潜在应用。
22. Mitigating Gender Bias in Machine Translation with Target Gender Annotations [PDF] 返回目录
Toms Bergmanis, Artūrs Stafanovičs, Mārcis Pinnis
Abstract: When translating "The secretary asked for details." to a language with grammatical gender, it might be necessary to determine the gender of the subject "secretary". If the sentence does not contain the necessary information, it is not always possible to disambiguate. In such cases, machine translation systems select the most common translation option, which often corresponds to the stereotypical translations, thus potentially exacerbating prejudice and marginalisation of certain groups and people. We argue that the information necessary for an adequate translation can not always be deduced from the sentence being translated or even might depend on external knowledge. Therefore, in this work, we propose to decouple the task of acquiring the necessary information from the task of learning to translate correctly when such information is available. To that end, we present a method for training machine translation systems to use word-level annotations containing information about subject's gender. To prepare training data, we annotate regular source language words with grammatical gender information of the corresponding target language words. Using such data to train machine translation systems reduces their reliance on gender stereotypes when information about the subject's gender is available. Our experiments on five language pairs show that this allows improving accuracy on the WinoMT test set by up to 25.8 percentage points.
摘要:当翻译“秘书问细节。”与语法性别一种语言,它可能需要确定的主题为“秘书”的性别。如果这句话不包含必要的信息,它并不总是可能的歧义。在这种情况下,机器翻译系统选择最常见的翻译选项,这往往对应于刻板的翻译,从而可能加剧偏见和某些群体和人民的边缘化。我们认为,对于一个足够的翻译所需的信息不能总是从句子推断被翻译,甚至可能依赖于外部知识。因此,在这项工作中,我们提出脱钩获取从学习到正确转换时可用的此类信息的任务的必要信息的任务。为此,我们提出了一个训练机器翻译系统包含关于对象的性别信息使用字级注释的方法。要准备训练数据,我们标注与相应的目标语言单词的语法性别信息定期源语言单词。使用这样的数据来训练机器翻译系统降低了他们对性别刻板印象的依赖时,在受试者的性别信息是可用的。我们对5种语言对实验表明,这允许高达25.8个百分点,提高对WinoMT测试集的精度。
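A small sketch of what word-level target-gender annotation of source text could look like: each source token is tagged with the grammatical gender of its aligned target word, with an "unknown" tag otherwise. The factored token|TAG format, the tag inventory, the function name and the toy alignment are assumptions for illustration.

```python
def annotate_source(source_tokens, alignment, target_genders):
    """Tag each source token with the gender of its aligned target word.
    alignment: dict mapping source index -> target index; 'U' means unknown."""
    annotated = []
    for i, tok in enumerate(source_tokens):
        tgt = alignment.get(i)
        gender = target_genders[tgt] if tgt is not None else "U"
        annotated.append(f"{tok}|{gender}")
    return annotated

src = ["The", "secretary", "asked", "for", "details", "."]
align = {1: 1}                       # "secretary" aligns to target word 1
tgt_gender = ["U", "F", "U", "U"]    # target word 1 is grammatically feminine
print(" ".join(annotate_source(src, align, tgt_gender)))
# The|U secretary|F asked|U for|U details|U .|U
```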
23. Mathematical Word Problem Generation from Commonsense Knowledge Graph and Equations [PDF] 返回目录
Tianqiao Liu, Qian Fang, Wenbiao Ding, Zhongqin Wu, Zitao Liu
Abstract: There is an increasing interest in the use of automatic mathematical word problem (MWP) generation in educational assessment. Different from standard natural question generation, MWP generation needs to maintain the underlying mathematical operations between quantities and variables, while at the same time ensuring the relevance between the output and the given topic. To address above problem we develop an end-to-end neural model to generate personalized and diverse MWPs in real-world scenarios from commonsense knowledge graph and equations. The proposed model (1) learns both representations from edge-enhanced Levi graphs of symbolic equations and commonsense knowledge; (2) automatically fuses equation and commonsense knowledge information via a self-planning module when generating the MWPs. Experiments on an educational gold-standard set and a large-scale generated MWP set show that our approach is superior on the MWP generation task, and it outperforms the state-of-the-art models in terms of both automatic evaluation metrics, i.e., BLEU-4, ROUGE-L, Self-BLEU, and human evaluation metrics, i.e, equation relevance, topic relevance, and language coherence.
摘要:目前,在教育评估使用自动数学应用题(MWP)产生的越来越大的兴趣。从标准自然的问题产生不同的,MWP一代的需求,以保持数量和变量之间的基本数学运算,而在同一时间,确保输出和特定主题之间的相关性。以上问题的地址,我们开发一个终端到终端的神经模型来生成从常识性知识图和方程真实场景的个性化和多样化的MWPS。所提出的模型(1)获知从符号方程和常识知识库的边缘增强列维图表都表示; (2)自动地经由自计划模块生成所述MWPS当熔断器方程和常识知识库的信息。在教育金标准设置和大规模产生MWP集显示,我们的做法是对MWP生成任务优越,它优于国家的最先进的车型中都自动评价指标方面,即实验, BLEU-4,ROUGE-L,自BLEU,和人的评价指标,即方程的相关性,相关的话题和语言的连贯性。
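A Levi-graph transformation turns every labelled edge into a node of its own, so that edge labels (here, operators and relations) can be encoded like ordinary nodes by a graph encoder. The sketch below shows the basic construction on toy triples; the reverse connections added for the edge-enhanced variant and the triple format are assumptions.

```python
def to_levi_graph(triples):
    """Turn (head, label, tail) triples into a Levi graph: one node per edge."""
    nodes, edges = set(), []
    for i, (head, label, tail) in enumerate(triples):
        edge_node = f"{label}#{i}"                 # the edge becomes its own node
        nodes.update([head, tail, edge_node])
        edges += [(head, edge_node), (edge_node, tail)]
        edges += [(edge_node, head), (tail, edge_node)]   # assumed reverse edges
    return nodes, edges

triples = [("x", "plus", "y"), ("x+y", "equals", "10")]   # toy equation triples
nodes, edges = to_levi_graph(triples)
print(sorted(nodes))
print(edges)
```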
24. Multilingual Factual Knowledge Retrieval from Pretrained Language Models [PDF] 返回目录
Zhengbao Jiang, Antonios Anastasopoulos, Jun Araki, Haibo Ding, Graham Neubig
Abstract: Language models (LMs) have proven surprisingly successful at capturing factual knowledge by completing cloze-style fill-in-the-blank questions such as "Punta Cana is located in _." However, while knowledge is both written and queried in many languages, studies on LMs' factual representation ability have almost invariably been performed on English. To assess factual knowledge retrieval in LMs in different languages, we create a multilingual benchmark of cloze-style probes for \langnum typologically diverse languages. To properly handle language variations, we expand probing methods from single- to multi-word entities, and develop several decoding algorithms to generate multi-token predictions. Extensive experimental results provide insights about how well (or poorly) current state-of-the-art LMs perform at this task in languages with more or fewer available resources. We further propose a code-switching-based method to improve the ability of multilingual LMs to access knowledge, and verify its effectiveness on several benchmark languages. Benchmark data and code have been released at this https URL.
摘要:语言模型(LMS)已通过完成填空式的填充式的空白问题,如被证明出奇成功捕捉事实性知识“蓬塔卡纳位于_”。然而,尽管知识是书面的,并询问在许多语言中,在LMS公司的实际表现能力的研究几乎无一例外地被英语进行。为了评估在不同语言的LM事实性知识检索,我们创建完形填空式探头的多语种基准\ langnum类型学的不同语言。要妥善处理好语言的变化,我们扩大从单到多字实体探测方法,并开发多种解码算法来产生多令牌预测。大量的实验结果提供了有关的见解国家的最先进的LM在与更多或更少的可用资源语言执行此任务有多好(或不好)电流。我们进一步提出了一个码转换为基础的方法,以提高多语种的LMS访问知识的能力,并验证一些基准语言的有效性。基准数据和代码都在这个HTTPS URL被释放。
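A sketch of single-token cloze-style probing of a masked language model, the basic building block behind such benchmarks; the paper's multi-token decoding algorithms are more involved and omitted here, and the model choice is an assumption.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

prompt = f"Punta Cana is located in {tok.mask_token}."
enc = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = mlm(**enc).logits

mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
top5 = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tok.convert_ids_to_tokens(top5))   # top-5 single-token candidates
```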
25. The workweek is the best time to start a family -- A Study of GPT-2 Based Claim Generation [PDF] 返回目录
Shai Gretz, Yonatan Bilu, Edo Cohen-Karlik, Noam Slonim
Abstract: Argument generation is a challenging task whose research is timely considering its potential impact on social media and the dissemination of information. Here we suggest a pipeline based on GPT-2 for generating coherent claims, and explore the types of claims that it produces, and their veracity, using an array of manual and automatic assessments. In addition, we explore the interplay between this task and the task of Claim Retrieval, showing how they can complement one another.
摘要:参数生成是一个具有挑战性的任务,其研究适时考虑其对社会化媒体的潜在影响和传播信息。在这里,我们建议基于GPT-2,用于产生相干权利要求中的管道,并探索类型权利要求的,它产生,和它们的真实性,利用手动和自动评估的阵列。此外,我们将探讨这一任务,并要求检索的任务,展示它们如何相互补充之间的相互作用。
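The generation step of such a pipeline can be sketched as sampling from GPT-2 given a topic prompt; in the paper the model is first fine-tuned on argumentation data and its outputs are assessed and filtered, which is omitted here. The prompt format and sampling parameters are assumptions.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Topic: flexible work schedules. Claim:"       # assumed prompt format
ids = tok(prompt, return_tensors="pt").input_ids
out = lm.generate(ids, do_sample=True, top_p=0.9, max_new_tokens=30,
                  pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```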
26. Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance [PDF] 返回目录
Xi Chen, Nan Ding, Tomer Levinboim, Radu Soricut
Abstract: Recent advances in automatic evaluation metrics for text have shown that deep contextualized word representations, such as those generated by BERT encoders, are helpful for designing metrics that correlate well with human judgements. At the same time, it has been argued that contextualized word representations exhibit sub-optimal statistical properties for encoding the true similarity between words or sentences. In this paper, we present two techniques for improving encoding representations for similarity metrics: a batch-mean centering strategy that improves statistical properties; and a computationally efficient tempered Word Mover Distance, for better fusion of the information in the contextualized word representations. We conduct numerical experiments that demonstrate the robustness of our techniques, reporting results over various BERT-backbone learned metrics and achieving state of the art correlation with human ratings on several benchmarks.
摘要:在自动评价标准文本的最新进展表明,深情境字表示,如BERT编码器生成的,是设计与人的判断密切相关的指标有帮助。与此同时,也有人认为,语境词的表示表现出编码的词或句子之间的真实相似度次优的统计特性。在本文中,我们提出了两种技术来提高编码表示的相似性指标:改善统计特性的批处理均值中心战略;和计算效率回火字捷运距离,更好地融合在情境字表示的信息。我们进行了数值实验,展示我们的技术的健壮性,报告了各种BERT骨干学到度量结果和几个基准实现与人的评级艺术相关的状态。
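Batch-mean centering is simple to state precisely: subtract the mean contextualized vector of the evaluation batch from every token representation before computing a similarity-based metric. A minimal sketch follows; the tempered Word Mover Distance part is omitted, and in practice padding positions should be excluded from the mean.

```python
import torch

def batch_center(reps):
    """reps: (batch, seq_len, dim) contextualised representations.
    Padding positions should be excluded from the mean in practice."""
    mean = reps.mean(dim=(0, 1), keepdim=True)   # one mean vector for the batch
    return reps - mean

reps = torch.randn(8, 12, 768)                   # toy BERT-like representations
centered = batch_center(reps)
print(centered.mean(dim=(0, 1)).abs().max())     # ~0: batch mean removed
```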
27. Incorporating BERT into Parallel Sequence Decoding with Adapters [PDF] 返回目录
Junliang Guo, Zhirui Zhang, Linli Xu, Hao-Ran Wei, Boxing Chen, Enhong Chen
Abstract: While large scale pre-trained language models such as BERT have achieved great success on various natural language understanding tasks, how to efficiently and effectively incorporate them into sequence-to-sequence models and the corresponding text generation tasks remains a non-trivial problem. In this paper, we propose to address this problem by taking two different BERT models as the encoder and decoder respectively, and fine-tuning them by introducing simple and lightweight adapter modules, which are inserted between BERT layers and tuned on the task-specific dataset. In this way, we obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models, while bypassing the catastrophic forgetting problem. Each component in the framework can be considered as a plug-in unit, making the framework flexible and task agnostic. Our framework is based on a parallel sequence decoding algorithm named Mask-Predict considering the bi-directional and conditional independent nature of BERT, and can be adapted to traditional autoregressive decoding easily. We conduct extensive experiments on neural machine translation tasks where the proposed method consistently outperforms autoregressive baselines while reducing the inference latency by half, and achieves $36.49$/$33.57$ BLEU scores on IWSLT14 German-English/WMT14 German-English translation. When adapted to autoregressive decoding, the proposed method achieves $30.60$/$43.56$ BLEU scores on WMT14 English-German/English-French translation, on par with the state-of-the-art baseline models.
摘要:虽然大型预训练的语言模型,如BERT对各种自然语言理解任务,取得了巨大的成功,如何切实有效地将它们纳入序列到序列模型和相应的文本生成任务仍然是一个不平凡的问题。在本文中,我们提出了通过利用两种不同的BERT模型作为编码器和分别解码器,和微调它们通过引入简单,重量轻的适配器模块,其被插入BERT层之间并调谐在特定任务的数据集来解决这个问题。通过这种方式,我们得到了一个灵活,高效的模型,该模型能够共同利用包含在源端和目标端BERT型号的相关信息,而绕过灾难性遗忘的问题。在框架的每个组件可被认为是一个插入单元,使得框架灵活和任务无关。我们的框架是基于命名面膜预测考虑BERT的双向和条件独立性,并且可以适用于传统的自回归容易解码并行序列译码算法。我们进行神经机器翻译任务了广泛的实验,其中提出的方法始终优于自回归基线,同时减少一半推断潜伏期,达到$ 36.49 $ /上IWSLT14德英/ WMT14德语英语翻译$ 33.57 $ BLEU分数。当用于自回归解码,所提出的方法实现$ 30.60 $ /上WMT14英语,德语/英语,法语翻译$ 43.56 $ BLEU分数看齐,与国家的最先进的基本模式。
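The adapter modules referred to above are typically small bottleneck feed-forward blocks with a residual connection, inserted between the layers of a frozen pre-trained encoder and trained on the downstream task. The sketch below shows such a block; the hidden and bottleneck sizes are illustrative, and the Mask-Predict decoding side of the framework is not shown.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck feed-forward block with a residual connection."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

x = torch.randn(2, 10, 768)        # (batch, seq, hidden) output of a BERT layer
print(Adapter()(x).shape)          # torch.Size([2, 10, 768]): shape-preserving insert
```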
28. Corruption Is Not All Bad: Incorporating Discourse Structure into Pre-training via Corruption for Essay Scoring [PDF] 返回目录
Farjana Sultana Mim, Naoya Inoue, Paul Reisert, Hiroki Ouchi, Kentaro Inui
Abstract: Existing approaches for automated essay scoring and document representation learning typically rely on discourse parsers to incorporate discourse structure into text representation. However, the performance of parsers is not always adequate, especially when they are used on noisy texts, such as student essays. In this paper, we propose an unsupervised pre-training approach to capture the discourse structure of essays in terms of coherence and cohesion that does not require any discourse parser or annotation. We introduce several types of token-, sentence- and paragraph-level corruption techniques for our proposed pre-training approach and augment masked language modeling pre-training with our pre-training method to leverage both contextualized and discourse information. Our proposed unsupervised approach achieves a new state-of-the-art result on the essay Organization scoring task.
摘要:自动作文得分和文档表示通常学习现有方案依赖于话语解析器纳入篇章结构到文本表示。然而,解析器的性能并不总是足够的,尤其是当他们在嘈杂的文本,如学生论文中使用。在本文中,我们提出了一种无监督前培训的方法来捕捉文章的篇章结构的连贯性和凝聚力方面,不需要任何话语分析器或注释。我们介绍了几种类型的令牌,句子和段落层次的腐败技术我们建议前期培训方法和扩充掩盖语言建模前培训与我们前期的训练方法,同时利用语境和语篇信息。我们提出的无监督的办法实现国家的最先进的新的作文得分组织任务结果。
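One of the corruption types can be illustrated directly: shuffle the sentences of an essay and keep the original order as the supervision signal for a coherence-oriented pre-training objective. The sketch below shows only this sentence-level corruption; the token- and paragraph-level variants and the exact objective are omitted, and the function name is hypothetical.

```python
import random

def corrupt_sentence_order(sentences, seed=0):
    """Shuffle sentence order; the original order is the reconstruction target."""
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    rng.shuffle(order)
    return [sentences[i] for i in order], order

essay = ["First, state the claim.", "Then, give evidence.", "Finally, conclude."]
corrupted, order = corrupt_sentence_order(essay)
print(corrupted)   # shuffled sentences fed to the model
print(order)       # supervision for restoring coherence
```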
29. BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance [PDF] 返回目录
Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang, Yaohong Jin
Abstract: Pre-trained language models (e.g., BERT) have achieved significant success in various natural language processing (NLP) tasks. However, high storage and computational costs obstruct pre-trained language models to be effectively deployed on resource-constrained devices. In this paper, we propose a novel BERT distillation method based on many-to-many layer mapping, which allows each intermediate student layer to learn from any intermediate teacher layers. In this way, our model can learn from different teacher layers adaptively for various NLP tasks. %motivated by the intuition that different NLP tasks require different levels of linguistic knowledge contained in the intermediate layers of BERT. In addition, we leverage Earth Mover's Distance (EMD) to compute the minimum cumulative cost that must be paid to transform knowledge from teacher network to student network. EMD enables the effective matching for many-to-many layer mapping. %EMD can be applied to network layers with different sizes and effectively measures semantic distance between the teacher network and student network. Furthermore, we propose a cost attention mechanism to learn the layer weights used in EMD automatically, which is supposed to further improve the model's performance and accelerate convergence time. Extensive experiments on GLUE benchmark demonstrate that our model achieves competitive performance compared to strong competitors in terms of both accuracy and model compression.
摘要:预先训练的语言模型(例如,BERT)都实现了各种自然语言处理(NLP)任务显著的成功。然而,高存储和计算成本阻碍预先训练语言模型能够有效地部署在资源受限的设备。在本文中,提出了一种基于多到许多层映射,它允许每个中间学生层从任何中间层老师学习的新颖BERT蒸馏法。这样一来,我们的模型可以适应不同的老师学习层的各种NLP任务。 %的,不同的NLP任务,需要不同层次的包含在BERT的中间层语言知识的直觉驱使。此外,我们利用堆土机距离(EMD)来计算,必须支付给教师网络将知识与学生网络最低累计成本。 EMD为许多一对多层映射的有效匹配。 %EMD可以适用于具有不同尺寸的网络层和有效地测量教师网络和学生网络之间的语义距离。此外,我们提出了一个成本注意机制,以学习EMD自动使用的层权,这是为了进一步提高模型的性能和加速收敛时间。胶水基准大量的实验证明,相比于准确性和模型压缩方面强大的竞争对手我们的模型实现了有竞争力的表现。
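The many-to-many layer mapping cost can be sketched as an optimal-transport problem: build a pairwise cost matrix between teacher and student layer representations and solve for the minimum transport cost. Uniform layer weights and an MSE layer cost are simplifying assumptions (the paper learns the weights with a cost attention mechanism), and the POT library used for the exact EMD solver is an added dependency, not something the paper prescribes.

```python
import numpy as np
import ot   # POT (pip install pot), an assumed extra dependency

teacher = [np.random.randn(128) for _ in range(12)]   # 12 teacher layer summaries
student = [np.random.randn(128) for _ in range(4)]    # 4 student layer summaries

# pairwise layer-to-layer cost (mean squared error, a simplifying choice)
M = np.array([[np.mean((t - s) ** 2) for s in student] for t in teacher])

a = np.full(len(teacher), 1.0 / len(teacher))          # uniform teacher layer weights
b = np.full(len(student), 1.0 / len(student))          # uniform student layer weights

emd_loss = ot.emd2(a, b, M)                            # minimum cumulative transport cost
print(emd_loss)
```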
30. Model Selection for Cross-Lingual Transfer using a Learned Scoring Function [PDF] 返回目录
Yang Chen, Alan Ritter
Abstract: Transformers that are pre-trained on multilingual text corpora, such as mBERT and XLM-RoBERTa, have achieved impressive cross-lingual transfer learning results. In the zero-shot cross-lingual transfer setting, only English training data is assumed, and the fine-tuned model is evaluated on another target language. No target-language validation data is assumed in this setting; however, substantial variance has been observed in target-language performance between different fine-tuning runs. Prior work has relied on English validation/development data to select among models that are fine-tuned with different learning rates, numbers of steps and other hyperparameters, often resulting in suboptimal choices. To address this challenge, we propose a meta-learning approach to model selection that uses the fine-tuned model's own internal representations to predict its cross-lingual capabilities. In extensive experiments we find that our approach consistently selects better models than English validation data across five languages and five well-studied NLP tasks, achieving results that are comparable to small amounts of target language development data.
摘要:变压器上的多语言文本语料库被预先训练,例如,mBERT和XLM - 罗伯塔,取得了骄人的跨语言转移的学习效果。在零次跨语言传递的设置,只设英语培训数据,以及微调模型上的另一个目标语言评估。没有目标语言的验证数据,假设在此设置,但实质性的变化已经在不同的微调运行之间的目标语言的表现观察。以前的工作一直依赖英语验证/开发数据模型是微调,有不同的学习率,步骤等的超参数号中进行选择,往往导致次优的选择。为了应对这一挑战,我们提出了元学习方法,它采用了微调模型自身的内部表示,预测其跨语言能力模型的选择。在大量的实验,我们发现,我们的做法一致选择不是跨五种语言和五个充分研究NLP任务英语验证数据更好的模型,实现效果都相当少量的目标语言的发展数据。
31. Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options [PDF] 返回目录
Clara Vania, Ruijie Chen, Samuel R. Bowman
Abstract: Large-scale natural language inference (NLI) datasets such as SNLI or MNLI have been created by asking crowdworkers to read a premise and write three new hypotheses, one for each possible semantic relationship (entailment, contradiction, and neutral). While this protocol has been used to create useful benchmark data, it remains unclear whether the writing-based annotation protocol is optimal for any purpose, since it has not been evaluated directly. Furthermore, there is ample evidence that crowdworker writing can introduce artifacts in the data. We investigate two alternative protocols which automatically create candidate (premise, hypothesis) pairs for annotators to label. Using these protocols and a writing-based baseline, we collect several new English NLI datasets of over 3k examples each, each using a fixed amount of annotator time, but a varying number of examples to fit that time budget. Our experiments on NLI and transfer learning show negative results: none of the alternative protocols outperforms the baseline in evaluations of generalization within NLI or on transfer to outside target tasks. We conclude that crowdworker writing is still the best known option for entailment data, highlighting the need for further data collection work to focus on improving writing-based annotation processes.
摘要:大型自然语言推理(NLI)的数据集,如SNLI或MNLI已被要求crowdworkers读取一个前提,写了三个新的假设,每一个可能的语义关系(蕴涵,矛盾,和中性)创建的。虽然该协议已被用来创建有用的基准数据,它在写入基于注解协议是否是最佳的用于任何目的,因为它没有直接评价仍不清楚。此外,有充足的证据表明crowdworker写作可以引入数据假象。我们调查两种可供选择的协议,该协议自动创建候选人(前提下,假设)对用于注释者的标签。使用这些协议和写作为基础的基线,我们收集的各个,分别使用的标注时间固定金额超过3K的例子几个新的英语NLI数据集,但不同数量的例子,以适应当时的预算。我们对NLI和转移学习表演阴性结果的实验:的替代方案无性能优于内NLI或转移到外面的目标任务,概括的评价基准。我们的结论是crowdworker写仍然蕴涵数据最著名的选择,突出了进一步的数据收集工作,把重点放在提高写作为基础的注解过程的需要。
32. ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis [PDF] 返回目录
Qingyun Wang, Qi Zeng, Lifu Huang, Kevin Knight, Heng Ji, Nazneen Fatema Rajani
Abstract: To assist human review process, we build a novel ReviewRobot to automatically assign a review score and write comments for multiple categories. A good review needs to be knowledgeable, namely that the comments should be constructive and informative to help improve the paper; and explainable by providing detailed evidence. ReviewRobot achieves these goals via three steps: (1) We perform domain-specific Information Extraction to construct a knowledge graph (KG) from the target paper under review, a related work KG from the papers cited by the target paper, and a background KG from a large collection of previous papers in the domain. (2) By comparing these three KGs we predict a review score and detailed structured knowledge as evidence for each review category. (3) We carefully select and generalize human review sentences into templates, and apply these templates to transform the review scores and evidence into natural language comments. Experimental results show that our review score predictor reaches 71.4-100% accuracy. Human assessment by domain experts shows that 41.7%-70.5% of the comments generated by ReviewRobot are valid and constructive, and better than human-written ones 20% of the time. Thus, ReviewRobot can serve as an assistant for paper reviewers, program chairs and authors.
摘要:为了帮助人工审核过程中,我们建立了一个新的ReviewRobot自动分配一个评价得分和写评论的多个类别。一个良好的复习需要有知识,即注释应该是建设性的和信息,以帮助改善纸张;并通过提供详细的证据可以解释的。 ReviewRobot通过三个步骤实现这些目标:(1)我们进行特定领域的信息提取从所审查的目标纸张构建一个知识图谱(KG),从这些文件中相关工作KG引述目标纸,背景KG从一个大的集合中的域名的论文。 (2)通过比较这三个幼儿园,我们预测评价得分和结构的详细知识,每个评审类别的证据。 (3)我们精心选择和人工审核的句子概括为模板,并应用这些模板来的评分和证据转化为自然语言注释。实验结果表明,我们的评价得分达到预测71.4-100%的准确率。由领域专家表示人的评估,即41.7%-70.5由ReviewRobot产生的评论%是有效的,建设性的,而且比人类写的要好的20%的时间。因此,ReviewRobot可以作为纸评审,计划椅子和作家的助手。
33. Supertagging Combinatory Categorial Grammar with Attentive Graph Convolutional Networks [PDF] 返回目录
Yuanhe Tian, Yan Song, Fei Xia
Abstract: Supertagging is conventionally regarded as an important task for combinatory categorial grammar (CCG) parsing, where effective modeling of contextual information is highly important to this task. However, existing studies have made limited efforts to leverage contextual features except for applying powerful encoders (e.g., bi-LSTM). In this paper, we propose attentive graph convolutional networks to enhance neural CCG supertagging through a novel solution of leveraging contextual information. Specifically, we build the graph from chunks (n-grams) extracted from a lexicon and apply attention over the graph, so that different word pairs from the contexts within and across chunks are weighted in the model and facilitate the supertagging accordingly. The experiments performed on the CCGbank demonstrate that our approach outperforms all previous studies in terms of both supertagging and parsing. Further analyses illustrate the effectiveness of each component in our approach to discriminatively learn from word pairs to enhance CCG supertagging.
摘要:Supertagging通常视为组合子范畴语法(CCG)分析,这里的语境信息的有效建模是这项任务非常重要的一项重要任务。然而,现有的研究也在努力利用上下文功能有限,除了应用功能强大的编码器(例如双向LSTM)。在本文中,我们提出了周到的图形卷积网络通过利用上下文信息的一个新的解决方案,以提高神经CCG supertagging。具体来说,我们建立从一个词典提取的组块(的n-gram)的曲线,并应用注意在该图中,以使从内和跨块上下文不同字对在模型中进行加权和促进相应地supertagging。在CCGbank进行的实验表明,我们的方法优于所有以前的研究中都supertagging和解析方面。进一步的分析说明我们的做法有区别的词对学习,以提高CCG supertagging每个组件的有效性。
34. Are Some Words Worth More than Others? [PDF] 返回目录
Shiran Dudy, Steven Bedrick
Abstract: Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy only captures one aspect of a language model's behavior, and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics directly tied to prediction accuracy (including perplexity) may be confounded by the Zipfian nature of written language, as the majority of the prediction attempts will occur with frequently-occurring types. A model's performance may vary greatly between high- and low-frequency words, which in practice could lead to failure modes such as repetitive and dull generated text being produced by a downstream consumer of a language model. To address this, we propose two new intrinsic evaluation measures within the framework of a simple word prediction task that are designed to give a more holistic picture of a language model's performance. We evaluate several commonly-used large English language models using our proposed metrics, and demonstrate that our approach reveals functional differences in performance between the models that are obscured by more traditional metrics.
摘要:语言建模和发电电流评价标准在很大程度上依赖于预测的(或生成)字的精度相比于参考基础事实。虽然重要,令牌级的精度只捕获语言模型的行为的一个方面,而忽略的话,可能允许一些错误预测令牌在实践中有用的语言特性。此外,统计数据直接连接到预测精度(包括困惑)可以通过书面语言的性质Zipfian混淆,因为大部分的预测尝试将频繁出现的类型发生的。模型的性能可能高,低频率的话,这在实践中可能导致的故障模式,如重复和平淡生成的文本被一个语言模型的下游消费者产生之间极大地变化。为了解决这个问题,我们建议的目的是给的语言模型的性能更全面的图片简单的单词预测的任务框架内的两个新的内部评估措施。我们评估使用我们所提出的指标数常用的大型英语语言模型,并证明我们的方法揭示了被更传统的指标掩盖了模型之间的性能功能上的差异。
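One concrete way to look past aggregate token accuracy, in the spirit of the argument above, is to stratify accuracy by the corpus frequency of the gold word, so that performance on rare words is not hidden by the Zipfian head. The sketch below is an illustrative diagnostic of this kind, not the paper's exact proposed measures.

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
freq = Counter(corpus)                      # reference unigram frequencies

gold = ["the", "cat", "rug", "dog", "the"]
pred = ["the", "cat", "mat", "cat", "the"]

def stratified_accuracy(gold, pred, freq, threshold=2):
    buckets = {"frequent": [], "rare": []}
    for g, p in zip(gold, pred):
        key = "frequent" if freq[g] >= threshold else "rare"
        buckets[key].append(g == p)
    return {k: sum(v) / len(v) if v else None for k, v in buckets.items()}

print(stratified_accuracy(gold, pred, freq))
# {'frequent': 1.0, 'rare': 0.333...}: rare-word errors are no longer hidden
```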
35. Zero-shot Entity Linking with Efficient Long Range Sequence Modeling [PDF] 返回目录
Zonghai Yao, Liangliang Cao, Huapu Pan
Abstract: This paper considers the problem of zero-shot entity linking, in which entities linked at test time may not have been seen during training. Following the prevailing BERT-based research efforts, we find that a simple yet effective way is to expand long-range sequence modeling. Unlike many previous methods, our method does not require expensive pre-training of BERT with long position embeddings. Instead, we propose an efficient position-embedding initialization method called Embedding-repeat, which initializes larger position embeddings based on BERT-Base. On Wikia's zero-shot EL dataset, our method improves the SOTA from 76.06% to 79.08%, and for its long data, the corresponding improvement is from 74.57% to 82.14%. Our experiments suggest the effectiveness of long-range sequence modeling without retraining the BERT model.
摘要:本文认为,零射门实体链接的问题,目前在训练中,在测试时间的链接可能不会。继当时的基于BERT的研究工作,我们发现了一个简单而有效的方法是扩大远程序列建模。不像许多以前的方法,我们的方法不需要BERT,用长长的位置嵌入昂贵的岗前培训。相反,我们提出初始化方法称为嵌入重复,初始化基于BERT-基地更大的嵌入位置的有效位置的嵌入。在维基的零射门EL数据集,我们的方法提高了从76.06%的SOTA至79.08%,而其长期的数据,相应的改进是从74.57%至82.14%。我们的实验表明远射序列模型的有效性,而不再培训BERT模式。
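Reading "Embedding-repeat" literally as tiling the learned position-embedding table, a sketch of the initialization could look like the following. The target length, and the detail of also extending the cached position_ids buffer, are assumptions that depend on the transformers version; the paper's exact procedure may differ.

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
old = model.embeddings.position_embeddings.weight.data       # (512, 768)

new_len = 2048                                                # assumed target length
new_table = old.repeat(new_len // old.size(0), 1)             # tile 4 copies -> (2048, 768)

new_emb = torch.nn.Embedding(new_len, old.size(1))
new_emb.weight.data.copy_(new_table)
model.embeddings.position_embeddings = new_emb
model.config.max_position_embeddings = new_len
# Depending on the transformers version, the cached position_ids buffer also
# needs to cover the longer range:
model.embeddings.register_buffer(
    "position_ids", torch.arange(new_len).unsqueeze(0), persistent=False)

print(model.embeddings.position_embeddings.weight.shape)      # torch.Size([2048, 768])
```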
36. BioMegatron: Larger Biomedical Domain Language Model [PDF] 返回目录
Hoo-Chang Shin, Yang Zhang, Evelina Bakhturina, Raul Puri, Mostofa Patwary, Mohammad Shoeybi, Raghav Mani
Abstract: There has been an influx of biomedical domain-specific language models, showing language models pre-trained on biomedical text perform better on biomedical domain benchmarks than those trained on general domain text corpora such as Wikipedia and Books. Yet, most works do not study the factors affecting each domain language application deeply. Additionally, the study of model size on domain-specific models has been mostly missing. We empirically study and evaluate several factors that can affect performance on domain language applications, such as the sub-word vocabulary set, model size, pre-training corpus, and domain transfer. We show consistent improvements on benchmarks with our larger BioMegatron model trained on a larger domain corpus, contributing to our understanding of domain language model applications. We demonstrate noticeable improvements over the previous state-of-the-art (SOTA) on standard biomedical NLP benchmarks of named entity recognition, relation extraction, and question answering. Model checkpoints and code are available at [this http URL] and [this http URL].
摘要:已经有生物医学领域特定语言模型的涌入,显示出生物医学文本语言模型预先训练上生物医学领域的基准比对训练的一般领域语料库,如维基百科和书籍更好地履行。然而,大多数的作品没有研究影响的每个域的语言应用深深的因素。此外,模型的大小对特定领域的模型研究大多已经失踪。我们实证研究和评估几种因素会影响到域中语言的应用,如分词词汇集,模型的大小,预先训练语料和域传输性能。我们发现与我们的大BioMegatron模型基准训练了较大的领域的语料库持续改善,促进我们的域语言模型的应用的理解。我们证明上命名实体识别,关系抽取,并答疑的标准生物医学NLP基准比上国家的最先进的(SOTA)显着改善。型号检查站和代码可在[这个HTTP URL]和[这个HTTP URL]。
37. TextHide: Tackling Data Privacy in Language Understanding Tasks [PDF] 返回目录
Yangsibo Huang, Zhao Song, Danqi Chen, Kai Li, Sanjeev Arora
Abstract: An unsolved challenge in distributed or federated learning is to effectively mitigate privacy risks without slowing down training or reducing accuracy. In this paper, we propose TextHide aiming at addressing this challenge for natural language understanding tasks. It requires all participants to add a simple encryption step to prevent an eavesdropping attacker from recovering private text data. Such an encryption step is efficient and only affects the task performance slightly. In addition, TextHide fits well with the popular framework of fine-tuning pre-trained language models (e.g., BERT) for any sentence or sentence-pair task. We evaluate TextHide on the GLUE benchmark, and our experiments show that TextHide can effectively defend attacks on shared gradients or representations and the averaged accuracy reduction is only $1.9\%$. We also present an analysis of the security of TextHide using a conjecture about the computational intractability of a mathematical problem. Our code is available at this https URL
摘要:在分布式或联合学习的一个未解决的挑战是有效地减轻隐私风险,而不会减慢培训或减少精度。在本文中,我们提出TextHide瞄准寻址自然语言理解任务这一挑战。它要求所有参与者都添加一个简单的加密步骤来防止窃听攻击者恢复私人文本数据。这样的加密步骤是有效的,不仅影响工作表现略。此外,TextHide与预先训练微调语言模型(例如,BERT)对任何句子或句对任务的流行的框架非常适合。我们评估上胶基准TextHide,我们的实验表明,TextHide可以有效地捍卫共享梯度或陈述和平均精度的下降仅是$ 1.9 \%$攻击。我们还提出使用有关的数学问题的计算难解猜想TextHide的安全性进行了分析。我们的代码可在此HTTPS URL
38. Towards Machine Translation for the Kurdish Language [PDF] 返回目录
Sina Ahmadi, Mariam Masoud
Abstract: Machine translation is the task of translating texts from one language to another using computers. It has been one of the major tasks in natural language processing and computational linguistics and has been motivating to facilitate human communication. Kurdish, an Indo-European language, has received little attention in this realm due to the language being less-resourced. Therefore, in this paper, we are addressing the main issues in creating a machine translation system for the Kurdish language, with a focus on the Sorani dialect. We describe the available scarce parallel data suitable for training a neural machine translation model for Sorani Kurdish-English translation. We also discuss some of the major challenges in Kurdish language translation and demonstrate how fundamental text processing tasks, such as tokenization, can improve translation performance.
摘要:机器翻译是从一种语言翻译文本,以另一种使用电脑的任务。这一直是自然语言处理和计算语言学的主要任务之一,并已激励,以促进人际交往。库尔德人,一个印欧语,很少受到关注在这个领域,由于语言是不太资源。因此,在本文中,我们正在解决库尔德语言创建一个机器翻译系统,重点对索拉尼方言的主要问题。我们描述适用于训练神经机器翻译模型索拉尼库尔德英语翻译可用的稀缺并行数据。我们还讨论了一些主要的挑战在库尔德人的语言翻译并演示如何根本性的文字处理任务,如标记化,可以提高翻译的性能。
39. Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [PDF] 返回目录
Mingzhi Zheng, Dinghan Shen, Yelong Shen, Weizhu Chen, Lin Xiao
Abstract: Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training. In this paper, we argue that randomly sampled masks in MLM would lead to undesirably large gradient variance. Thus, we theoretically quantify the gradient variance via correlating the gradient covariance with the Hamming distance between two different masks (given a certain text sequence). To reduce the variance due to the sampling of masks, we propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments. Thereafter, the tokens within one segment are masked for training. We prove, from a theoretical perspective, that the gradients derived from this new masking schema have a smaller variance and can lead to more efficient self-supervised training. We conduct extensive experiments on both continual pre-training and general pre-training from scratch. Empirical results confirm that this new masking strategy can consistently outperform standard random masking. Detailed efficiency analysis and ablation studies further validate the advantages of our fully-explored masking strategy under the MLM framework.
摘要:蒙面语言模型(MLM)框架进行自我监督的语言训练前被广泛采用。在本文中,我们认为传销是随机抽样的面具会导致不期望的较大梯度方。因此,我们通过理论上具有两个不同的掩模之间的汉明距离的梯度协方差(给予一定的文本序列)相关联量化梯度方差。为了减少方差由于掩模的采样,我们提出一个完全探讨掩蔽策略,其中文本序列被分为一定数目的非重叠区段。此后,一个段内的令牌被掩蔽进行训练。我们证明,从理论的角度,从这个新的屏蔽模式派生的梯度具有更小的变化,并可能导致更有效的自我指导训练。我们在两个连续的岗前培训,并从头一般训练前进行了广泛的实验。实证结果证实,这种新的屏蔽策略能够持续超越标准的随机屏蔽。详细的分析效率和消融研究进一步验证了传销框架下,我们的完全掩盖探索战略的优势。
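A minimal sketch of the fully-explored masking scheme described above, in plain Python: the token sequence is split into non-overlapping segments and each training instance masks exactly one segment. The segment count and mask token id are placeholder assumptions.
```python
import random

MASK_ID = 103  # placeholder mask token id (e.g. BERT-style [MASK]); an assumption

def fully_explored_masks(token_ids, num_segments=4, seed=0):
    """Split the sequence into `num_segments` non-overlapping segments and
    yield one masked copy per segment, masking only that segment's tokens."""
    rng = random.Random(seed)
    n = len(token_ids)
    bounds = [round(i * n / num_segments) for i in range(num_segments + 1)]
    order = list(range(num_segments))
    rng.shuffle(order)                      # visit segments in random order
    for s in order:
        lo, hi = bounds[s], bounds[s + 1]
        masked = list(token_ids)
        masked[lo:hi] = [MASK_ID] * (hi - lo)
        yield masked, (lo, hi)

for masked, span in fully_explored_masks(list(range(16)), num_segments=4):
    print(span, masked)
```
Because the segments do not overlap, every token is masked exactly once across the instances derived from one sequence, which is what removes the extra variance that random mask sampling introduces.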
40. Measuring and Reducing Gendered Correlations in Pre-trained Models [PDF] 返回目录
Kellie Webster, Xuezhi Wang, Ian Tenney, Alex Beutel, Emily Pitler, Ellie Pavlick, Jilin Chen, Slav Petrov
Abstract: Pre-trained models have revolutionized natural language understanding. However, researchers have found they can encode artifacts undesired in many applications, such as professions correlating with one gender more than another. We explore such gendered correlations as a case study for how to address unintended correlations in pre-trained models. We define metrics and reveal that it is possible for models with similar accuracy to encode correlations at very different rates. We show how measured correlations can be reduced with general-purpose techniques, and highlight the trade offs different strategies have. With these results, we make recommendations for training robust models: (1) carefully evaluate unintended correlations, (2) be mindful of seemingly innocuous configuration differences, and (3) focus on general mitigations.
摘要:预先训练模式已经彻底改变了自然语言理解。然而,研究人员发现,他们可以编码在许多应用中不期望的工件,如职业与性别一个比另一个更相关。我们探索这种性别的相关性为如何在预先训练模型解决意外相关的案例研究。我们定义的指标,揭示有可能具有相似的精度进行编码的相关性在非常不同的费率模式。我们将展示如何测量的相关度可以用通用的方法来减少,并强调权衡不同的策略有。有了这些结果,我们训练可靠的模型提出建议:(1)仔细评估意外的相关性;(2)心系看似无害的配置差异,以及(3)注重一般的缓解。
41. Universal ASR: Unify and Improve Streaming ASR with Full-context Modeling [PDF] 返回目录
Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Wu, Ruoming Pang
Abstract: Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Universal ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR, especially with inplace knowledge distillation. The Universal ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets, a widely used public dataset LibriSpeech and an internal large-scale dataset MultiDomain. Experiments and ablation studies demonstrate that Universal ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both emission latency and recognition accuracy of streaming ASR. With Universal ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency.
摘要:流自动语音识别(ASR)的目标发射的每个虚拟字尽可能快速且准确地,而一个完整的语音话语的发射完成之前的假设完成全上下文ASR等待。在这项工作中,我们提出了一个统一的框架,通用ASR,训练单端至端ASR模型流式和全方面的语音识别共享权重。我们发现,流ASR的潜伏期和准确性由重量共享和全方面ASR的联合训练显著受益,尤其是就地知识升华。通用ASR框架可应用于国家的最先进的最近的基于变压器卷积基于与ASR网络。我们提出了广泛的实验与国家的最先进的2 ASR网络,ContextNet和构象,在两个数据集,一种广泛使用的公共数据集LibriSpeech和内部的大规模数据集多畴。实验和消融的研究表明,通用ASR不仅简化了培训和部署流和全方面ASR模型的工作流程,但也显著改善了发射延迟和流ASR的识别精度。随着通用ASR,我们实现了新的国家的最先进的流媒体都LibriSpeech和多域ASR结果在精度和延迟方面。
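As a rough sketch of the kind of joint objective described above (not the paper's exact recipe), the snippet below sums a full-context loss and a streaming loss from the same shared model and adds an in-place distillation term pushing the streaming output distribution towards the detached full-context distribution; the scalar loss stand-ins, the KL form, and the weight are assumptions.
```python
import torch
import torch.nn.functional as F

def joint_asr_loss(full_logits, stream_logits, loss_full, loss_stream, w=1.0):
    """Combine the two ASR losses with an in-place distillation term: the
    streaming distribution is matched to the (detached) full-context
    distribution produced by the same shared weights."""
    teacher = F.softmax(full_logits.detach(), dim=-1)
    student = F.log_softmax(stream_logits, dim=-1)
    distill = F.kl_div(student, teacher, reduction="batchmean")
    return loss_full + loss_stream + w * distill

# toy stand-ins: in practice the logits come from one shared encoder run in
# full-context and streaming modes, and the losses are RNN-T/CTC-style losses
torch.manual_seed(0)
full = torch.randn(2, 50, 128)
stream = torch.randn(2, 50, 128)
loss = joint_asr_loss(full, stream, torch.tensor(1.0), torch.tensor(1.2))
```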
42. End-to-End Synthetic Data Generation for Domain Adaptation of Question Answering Systems [PDF] 返回目录
Siamak Shakeri, Cicero Nogueira dos Santos, Henry Zhu, Patrick Ng, Feng Nan, Zhiguo Wang, Ramesh Nallapati, Bing Xiang
Abstract: We propose an end-to-end approach for synthetic QA data generation. Our model comprises a single transformer-based encoder-decoder network that is trained end-to-end to generate both answers and questions. In a nutshell, we feed a passage to the encoder and ask the decoder to generate a question and an answer token-by-token. The likelihood produced in the generation process is used as a filtering score, which avoids the need for a separate filtering model. Our generator is trained by fine-tuning a pretrained LM using maximum likelihood estimation. The experimental results indicate significant improvements in the domain adaptation of QA models outperforming current state-of-the-art methods.
摘要:我们提出了合成QA数据生成一个终端到终端的方法。我们的模型包括训练端至端产生两个答案和问题一个基于变压器的编码解码器网络。简而言之,我们喂的通道编码器,并要求解码器产生一个问题,令牌通过令牌答案。在生成过程中产生的似然度被用作过滤得分,这避免了对单独的过滤模型的需要。我们的发电机是通过微调预训练LM采用最大似然估计的培训。实验结果表明:在QA车型超越国家的最先进的现有方法的领域适应性显著的改善。
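A minimal sketch of the generate-then-filter loop described above, with a hypothetical generate_question_answer stub standing in for the trained encoder-decoder; the stub outputs and the likelihood threshold are assumptions.
```python
import random

def generate_question_answer(passage, rng):
    """Hypothetical stand-in for the trained seq2seq generator: in the real
    system the decoder emits a question and an answer token-by-token and the
    sum of token log-probs is returned. Here we fabricate a triple."""
    question = f"What is discussed in: {passage[:30]}...?"
    answer = passage.split()[0]
    log_likelihood = -rng.uniform(0.5, 5.0)
    return question, answer, log_likelihood

def synthesize_qa(passages, threshold=-2.0, seed=0):
    """Keep only generated (question, answer) pairs whose generation
    likelihood clears a threshold -- the likelihood itself is the filter,
    so no separate filtering model is needed."""
    rng = random.Random(seed)
    kept = []
    for p in passages:
        q, a, ll = generate_question_answer(p, rng)
        if ll >= threshold:
            kept.append({"passage": p, "question": q, "answer": a, "score": ll})
    return kept

print(synthesize_qa(["The encoder reads a passage and the decoder emits a question and answer."]))
```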
43. Gender Coreference and Bias Evaluation at WMT 2020 [PDF] 返回目录
Tom Kocmi, Tomasz Limisiewicz, Gabriel Stanovsky
Abstract: Gender bias in machine translation can manifest when choosing gender inflections based on spurious gender correlations. For example, always translating doctors as men and nurses as women. This can be particularly harmful as models become more popular and deployed within commercial systems. Our work presents the largest evidence for the phenomenon in more than 19 systems submitted to the WMT over four diverse target languages: Czech, German, Polish, and Russian. To achieve this, we use WinoMT, a recent automatic test suite which examines gender coreference and bias when translating from English to languages with grammatical gender. We extend WinoMT to handle two new languages tested in WMT: Polish and Czech. We find that all systems consistently use spurious correlations in the data rather than meaningful contextual information.
摘要:基于虚假性别相关选择性别拐点时,在机器翻译中的性别偏见才能体现。例如,总是翻译医生为男子和护士是女性。这可以是特别有害的模型变得更普及和商业系统中部署。我们的工作提出在超过四种不同的目标语言提交WMT超过19个系统中的现象,最大的证据:捷克语,德语,波兰语和俄语。为了实现这一目标,我们使用WinoMT,最近的自动测试套件,从英语翻译与文法性语言时检查性别的共参照和偏见。我们扩大WinoMT来处理WMT测试了两种新语言:波兰和捷克。我们发现,所有系统均数据,而不是有意义的上下文信息使用虚假相关。
44. Look It Up: Bilingual and Monolingual Dictionaries Improve Neural Machine Translation [PDF] 返回目录
Xing Jie Zhong, David Chiang
Abstract: Despite advances in neural machine translation (NMT) quality, rare words continue to be problematic. For humans, the solution to the rare-word problem has long been dictionaries, but dictionaries cannot be straightforwardly incorporated into NMT. In this paper, we describe a new method for "attaching" dictionary definitions to rare words so that the network can learn the best way to use them. We demonstrate improvements of up to 3.1 BLEU using bilingual dictionaries and up to 0.7 BLEU using monolingual source-language dictionaries.
摘要:尽管在神经机器翻译(NMT)质量的进步,生僻字仍然存在问题。对于人类来说,解决了稀土字的问题一直是字典,词典,但不能被直接纳入NMT。在本文中,我们描述了“连接”到生僻字字典定义,使网络能够学会利用它们的最佳方式的新方法。我们演示如何使用双语词典和高达0.7 BLEU使用单一语言的源语言词典高达3.1 BLEU的改进。
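A minimal sketch of attaching dictionary definitions to rare source words before translation; the rarity threshold, separator markers, and toy dictionary are assumptions rather than the paper's exact attachment format.
```python
from collections import Counter

def attach_definitions(sentence, freq, dictionary, rare_threshold=2, sep="<def>"):
    """Append a dictionary definition after each rare source word so the NMT
    model can learn how to use it (sketch; the attachment markers are an
    assumption, not the paper's exact format)."""
    out = []
    for word in sentence.split():
        out.append(word)
        if freq[word] <= rare_threshold and word in dictionary:
            out.extend([sep, dictionary[word], "</def>"])
    return " ".join(out)

corpus = ["the cat sat", "the dog ran", "the axolotl swam"]
freq = Counter(w for s in corpus for w in s.split())
bilingual_dict = {"axolotl": "a Mexican salamander that keeps larval features"}
print(attach_definitions("the axolotl swam", freq, bilingual_dict))
```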
45. Improving Text Generation with Student-Forcing Optimal Transport [PDF] 返回目录
Guoyin Wang, Chunyuan Li, Jianqiao Li, Hao Fu, Yuh-Chen Lin, Liqun Chen, Yizhe Zhang, Chenyang Tao, Ruiyi Zhang, Wenlin Wang, Dinghan Shen, Qian Yang, Lawrence Carin
Abstract: Neural language models are often trained with maximum likelihood estimation (MLE), where the next word is generated conditioned on the ground-truth word tokens. During testing, however, the model is instead conditioned on previously generated tokens, resulting in what is termed exposure bias. To reduce this gap between training and testing, we propose using optimal transport (OT) to match the sequences generated in these two modes. An extension is further proposed to improve the OT learning, based on the structural and contextual information of the text sequences. The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
摘要:神经语言模型经常一起训练最大似然估计(MLE),其中产生的下一个单词条件的地面实况字令牌。在测试过程中,然而,该模型是代替空调之前生成的令牌,从而导致所谓的曝光偏差。为了减少训练和测试之间的差距,我们提出使用最佳传输(OT)来匹配在这两种模式中生成的序列。一个扩展,进一步提出改善OT学习,基于文本序列的结构和上下文信息。该方法的有效性验证的机器翻译,文本摘要,以及文本生成任务。
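As a rough illustration of matching teacher-forced (ground-truth-conditioned) and free-running (self-conditioned) sequences with optimal transport, the sketch below computes an entropic (Sinkhorn) OT cost between two embedding sequences; the cosine cost, regularisation strength, and iteration count are generic assumptions, not necessarily the paper's choices.
```python
import torch

def sinkhorn_ot_cost(x, y, eps=0.1, iters=50):
    """Entropic OT cost between two embedding sequences x (n,d) and y (m,d),
    using a cosine-distance cost matrix and uniform marginals (sketch)."""
    x = torch.nn.functional.normalize(x, dim=-1)
    y = torch.nn.functional.normalize(y, dim=-1)
    cost = 1.0 - x @ y.t()                     # (n, m) cosine distances
    k = torch.exp(-cost / eps)                 # Gibbs kernel
    a = torch.full((x.size(0),), 1.0 / x.size(0))
    b = torch.full((y.size(0),), 1.0 / y.size(0))
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(iters):                     # Sinkhorn iterations
        u = a / (k @ v)
        v = b / (k.t() @ u)
    plan = u.unsqueeze(1) * k * v.unsqueeze(0)
    return (plan * cost).sum()

teacher_forced = torch.randn(12, 64)   # decoder states under ground-truth prefixes
free_running = torch.randn(12, 64)     # decoder states under the model's own predictions
ot_loss = sinkhorn_ot_cost(teacher_forced, free_running)
```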
46. Vulgaris: Analysis of a Corpus for Middle-Age Varieties of Italian Language [PDF] 返回目录
Andrea Zugarini, Matteo Tiezzi, Marco Maggini
Abstract: Italian is a Romance language that has its roots in Vulgar Latin. The birth of the modern Italian started in Tuscany around the 14th century, and it is mainly attributed to the works of Dante Alighieri, Francesco Petrarca and Giovanni Boccaccio, who are among the most acclaimed authors of the medieval age in Tuscany. However, Italy has been characterized by a high variety of dialects, which are often loosely related to each other, due to the past fragmentation of the territory. Italian has absorbed influences from many of these dialects, as also from other languages due to dominion of portions of the country by other nations, such as Spain and France. In this work we present Vulgaris, a project aimed at studying a corpus of Italian textual resources from authors of different regions, ranging in a time period between 1200 and 1600. Each composition is associated to its author, and authors are also grouped in families, i.e. sharing similar stylistic/chronological characteristics. Hence, the dataset is not only a valuable resource for studying the diachronic evolution of Italian and the differences between its dialects, but it is also useful to investigate stylistic aspects between single authors. We provide a detailed statistical analysis of the data, and a corpus-driven study in dialectology and diachronic varieties.
摘要:意大利是有它的通俗拉丁语根罗曼语。现代意大利的出生在托斯卡纳开始在14世纪左右,这主要归因于但丁,弗朗切斯科彼得拉卡和乔瓦尼薄伽丘的作品,谁是在托斯卡纳中世纪时代最知名的作家之一。然而,意大利的特点是高的多种方言,这往往是松散的相互关系,因境内的过去的碎片。意大利人由于其他国家,如西班牙和法国国部分的主权吸收了许多方言,因为也从其他语言的影响。在这项工作中,我们提出寻常,一个项目,旨在从不同地区的作家研究的意大利文字资源语料库不等,1200和1600年各组成关联到它的作者和作者在家庭亦自之间的时间段,即,共有相似的风格/时间特性。因此,该数据集,不仅是学习意大利语的历时演变及其方言之间的差异宝贵的资源,但它也是研究单个作者之间的文体方面非常有用。我们所提供的数据进行详细的统计分析,并在方言和历时品种语料库驱动的研究。
47. Chatbot Interaction with Artificial Intelligence: Human Data Augmentation with T5 and Language Transformer Ensemble for Text Classification [PDF] 返回目录
Jordan J. Bird, Anikó Ekárt, Diego R. Faria
Abstract: In this work, we present the Chatbot Interaction with Artificial Intelligence (CI-AI) framework as an approach to the training of deep learning chatbots for task classification. The intelligent system augments human-sourced data via artificial paraphrasing in order to generate a large set of training data for further classical, attention, and language transformation-based learning approaches for Natural Language Processing. Human beings are asked to paraphrase commands and questions for task identification for further execution of a machine. The commands and questions are split into training and validation sets. A total of 483 responses were recorded. Secondly, the training set is paraphrased by the T5 model in order to augment it with further data. Seven state-of-the-art transformer-based text classification algorithms (BERT, DistilBERT, RoBERTa, DistilRoBERTa, XLM, XLM-RoBERTa, and XLNet) are benchmarked for both sets after fine-tuning on the training data for two epochs. We find that all models are improved when training data is augmented by the T5 model, with an average increase of classification accuracy by 4.01%. The best result was the RoBERTa model trained on T5 augmented data which achieved 98.96% classification accuracy. Finally, we found that an ensemble of the five best-performing transformer models via Logistic Regression of output label predictions led to an accuracy of 99.59% on the dataset of human responses. A highly-performing model allows the intelligent system to interpret human commands at the social-interaction level through a chatbot-like interface (e.g. "Robot, can we have a conversation?") and allows for better accessibility to AI by non-technical users.
摘要:在这项工作中,我们提出与人工智能(CI-AI)框架作为一种方法来深学习聊天机器人的训练任务分类聊天机器人互动。以产生一大组训练数据进行进一步的古典的,基于转型的关注,和语言学习通过人工意译智能系统增强部人体来源的数据自然语言处理方法。人类被要求复述命令和任务标识为机器的进一步执行问题。命令和问题被分为培训和验证集。共有483个反应记录。其次,训练集由T5模型,以便进一步的数据来增强它转述。七状态的最先进的基于变压器的文本分类算法(BERT,DistilBERT,罗伯塔,DistilRoBERTa,XLM,XLM-罗伯塔,和XLNet)的基准用于在两个时期的训练数据的微调后两组。我们发现,当训练数据由T5型号增强,与4.01%的平均增幅分类准确性的所有车型都有所提高。最好的结果是培训了T5的罗伯塔模型增强其达到98.96%分类准确数据。最后,我们发现,五个最好的表现变压器模型通过Logistic回归的输出标号预测的合奏导致99.59%的人响应的数据集的精度。高度表现模型允许智能系统以通过一个聊天机器人一样的界面的社交互动层面解读人的命令(例如“机器人,我们可以好好聊聊?”),并允许更容易暴露AI由非技术用户。
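A minimal sketch of the final ensembling step described above: per-class probabilities from several fine-tuned transformers are concatenated and fed to a logistic regression; the random arrays below are stand-ins for the real transformer predictions.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_examples, n_classes, n_models = 200, 7, 5              # toy sizes (assumptions)
y = rng.integers(0, n_classes, size=n_examples)

# stand-ins for the per-class probabilities predicted by each fine-tuned
# transformer (BERT, RoBERTa, ...); in practice these come from the models
model_probs = [rng.dirichlet(np.ones(n_classes), size=n_examples) for _ in range(n_models)]
X = np.hstack(model_probs)                               # (n_examples, n_models * n_classes)

ensemble = LogisticRegression(max_iter=1000)
ensemble.fit(X[:150], y[:150])
print("toy held-out accuracy:", ensemble.score(X[150:], y[150:]))
```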
48. SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search [PDF] 返回目录
Sean MacAvaney, Arman Cohan, Nazli Goharian
Abstract: With worldwide concerns surrounding the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), there is a rapidly growing body of scientific literature on the virus. Clinicians, researchers, and policy-makers need to be able to search these articles effectively. In this work, we present a zero-shot ranking algorithm that adapts to COVID-related scientific literature. Our approach filters training data from another collection down to medical-related queries, uses a neural re-ranking model pre-trained on scientific text (SciBERT), and filters the target document collection. This approach ranks top among zero-shot methods on the TREC COVID Round 1 leaderboard, and exhibits a P@5 of 0.80 and an nDCG@10 of 0.68 when evaluated on both Round 1 and 2 judgments. Despite not relying on TREC-COVID data, our method outperforms models that do. As one of the first search methods to thoroughly evaluate COVID-19 search, we hope that this serves as a strong baseline and helps in the global crisis.
摘要:随着全球环境的担忧严重急性呼吸系统综合症冠状病毒2(SARS-COV-2),对病毒的科学文献的迅速增长体。临床医生,研究人员和政策制定者需要能够有效地搜索到这些文章。在这项工作中,我们提出了一个零次排名算法,能够适应COVID相关的科学文献。我们的方法过滤器从另一个集合训练数据下降到医疗相关的查询,使用神经重新排序模型科学文本(SciBERT)预先训练,并过滤目标文档集合。这种方法行列顶部之间的TREC COVID回合1排行榜零的一步法,并表现出2P @ 0.80 5和0.68的NDCG @ 10时在两个回合1和2点的判断评价。尽管不是依靠TREC-COVID数据,我们的方法优于模型做。作为第一搜索方法彻底评估一个COVID-19搜索,我们希望这成为一个强大的基础,并在全球金融危机有所帮助。
49. NEMO: Frequentist Inference Approach to Constrained Linguistic Typology Feature Prediction in SIGTYP 2020 Shared Task [PDF] 返回目录
Alexander Gutkin, Richard Sproat
Abstract: This paper describes the NEMO submission to SIGTYP 2020 shared task which deals with prediction of linguistic typological features for multiple languages using the data derived from World Atlas of Language Structures (WALS). We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features. We describe two submitted ridge regression-based configurations which ranked second and third overall in the constrained task. Our best configuration achieved the micro-averaged accuracy score of 0.66 on 149 test languages.
摘要:本文介绍了NEMO提交SIGTYP 2020共享任务,与利用从语言结构的世界地图集(WALS)得到的数据多国语言的语言类型学的功能预测的交易。我们使用频率论推理代表类型学特征之间的相关性,并使用这种表示训练来预测个别功能简单的多级估计。我们描述两个提交岭基于回归的配置,其中排名第二和第三顺位受约束的任务。我们的最佳配置实现了0.66微平均准确度得分在149种测试语言。
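A minimal sketch of one ridge-regression-based multi-class estimator of the kind described above, predicting a single typological feature from other binarised features; the randomly generated data and encoding are stand-ins for the WALS-derived features actually used.
```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)
n_langs, n_input_feats, n_values = 300, 40, 5    # toy sizes (assumptions)

# stand-in for binarised, correlated typological features of each language
X = rng.integers(0, 2, size=(n_langs, n_input_feats))
# stand-in for the value of the single target feature we want to predict
y = rng.integers(0, n_values, size=n_langs)

clf = RidgeClassifier(alpha=1.0)                 # one simple estimator per target feature
clf.fit(X[:250], y[:250])
print("toy accuracy:", clf.score(X[250:], y[250:]))
```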
50. The Extraordinary Failure of Complement Coercion Crowdsourcing [PDF] 返回目录
Yanai Elazar, Victoria Basmov, Shauli Ravfogel, Yoav Goldberg, Reut Tsarfaty
Abstract: Crowdsourcing has eased and scaled up the collection of linguistic annotation in recent years. In this work, we follow known methodologies of collecting labeled data for the complement coercion phenomenon. These are constructions with an implied action -- e.g., "I started a new book I bought last week", where the implied action is reading. We aim to collect annotated data for this phenomenon by reducing it to either of two known tasks: Explicit Completion and Natural Language Inference. However, in both cases, crowdsourcing resulted in low agreement scores, even though we followed the same methodologies as in previous work. Why does the same process fail to yield high agreement scores? We specify our modeling schemes, highlight the differences with previous work and provide some insights about the task and possible explanations for the failure. We conclude that specific phenomena require tailored solutions, not only in specialized algorithms, but also in data collection methods.
摘要:众包已经趋缓,扩大语言注释的集合在最近几年。在这项工作中,我们按照收集标签数据进行补强迫现象的已知的方法。这些都是有一个隐含的动作结构 - 例如,“我开始一本新书我上周买”,其中隐含的作用是阅读。我们的目标是把它简化成以下两种已知的任务来收集这种现象注释的数据:明确完成和自然语言推理。然而,在这两种情况下,众包导致低得分一致,即使我们按照相同的方法在以前的工作。为什么同样的过程不能产生高分数的协议?我们指定的建模方案,突出与以前的工作的差异,并提供了有关失败的任务,并可能解释一些见解。我们的结论是具体的现象需要定制的解决方案,不仅在专门的算法,而且在数据收集方法。
51. The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units [PDF] 返回目录
Ewan Dunbar, Julien Karadayi, Mathieu Bernard, Xuan-Nga Cao, Robin Algayres, Lucas Ondel, Laurent Besacier, Sakriani Sakti, Emmanuel Dupoux
Abstract: We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speech synthesis; the second one is to discover word-like units from unsegmented raw speech. We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.
摘要:我们在学习从原材料的音频信号中的语音表示,而没有任何标签呈现零资源言语挑战2020年,其目的。它结合从两个先前的基准(2017和2019)的数据集和度量,并设有两个任务,其抽头成语音表示的两个层次。第一个任务是发现,优化语音合成的质量低比特率的子字表达式;第二个是发现字像非分割原始语音单元。我们提出二十提交模型的结果,并讨论无监督语音学习的主要结论的影响。
52. Perceptimatic: A human speech perception benchmark for unsupervised subword modelling [PDF] 返回目录
Juliette Millet, Ewan Dunbar
Abstract: In this paper, we present a data set and methods to compare speech processing models and human behaviour on a phone discrimination task. We provide Perceptimatic, an open data set which consists of French and English speech stimuli, as well as the results of 91 English- and 93 French-speaking listeners. The stimuli test a wide range of French and English contrasts, and are extracted directly from corpora of natural running read speech, used for the 2017 Zero Resource Speech Challenge. We provide a method to compare humans' perceptual space with models' representational space, and we apply it to models previously submitted to the Challenge. We show that, unlike unsupervised models and supervised multilingual models, a standard supervised monolingual HMM-GMM phone recognition system, while good at discriminating phones, yields a representational space very different from that of human native listeners.
摘要:在本文中,我们提出了一个数据集和方法上的电话歧视任务比较语音处理模型和人类行为。我们提供Perceptimatic,其中包括法语和英语的讲话刺激,以及91英语的结果和93法语听众的开放的数据集。刺激测试范围广泛的法语和英语的对比,并直接从自然的跑步读取语音的语料库中提取,用于2017年的零资源言语挑战。我们提供了一个高配车型感性空间表示空间比较人类的方法,我们把它应用到先前提交的挑战模式。我们证明了,不像无人监管模式和监管模式多语种,一个标准的监督和英语HMM-GMM电话识别系统,同时善于识别手机,产生一个表示空间从人天然的听众非常不同。
53. Towards Induction of Structured Phoneme Inventories [PDF] 返回目录
Alexander Gutkin, Martin Jansche, Lucy Skidmore
Abstract: This extended abstract surveying the work on phonological typology was prepared for "SIGTYP 2020: The Second Workshop on Computational Research in Linguistic Typology" to be held at EMNLP 2020.
摘要:扩展摘要调查对音韵类型学的工作是为“SIGTYP 2020:第二次研讨会上计算研究语言类型学”准备在EMNLP 2020年举行。
54. COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs [PDF] 返回目录
Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, Yejin Choi
Abstract: Recent years have brought about a renewed interest in commonsense representation and reasoning in the field of natural language understanding. The development of new commonsense knowledge graphs (CSKG) has been central to these advances as their diverse facts can be used and referenced by machine learning models for tackling new and challenging tasks. At the same time, there remain questions about the quality and coverage of these resources due to the massive scale required to comprehensively encompass general commonsense knowledge. In this work, we posit that manually constructed CSKGs will never achieve the coverage necessary to be applicable in all situations encountered by NLP agents. Therefore, we propose a new evaluation framework for testing the utility of KGs based on how effectively implicit knowledge representations can be learned from them. With this new goal, we propose ATOMIC 2020, a new CSKG of general-purpose commonsense knowledge containing knowledge that is not readily available in pretrained language models. We evaluate its properties in comparison with other leading CSKGs, performing the first large-scale pairwise study of commonsense knowledge resources. Next, we show that ATOMIC 2020 is better suited for training knowledge models that can generate accurate, representative knowledge for new, unseen entities and events. Finally, through human evaluation, we show that the few-shot performance of GPT-3 (175B parameters), while impressive, remains ~12 absolute points lower than a BART-based knowledge model trained on ATOMIC 2020 despite using over 430x fewer parameters.
摘要:近年来,带来了自然语言理解领域的常识表示和推理重新产生了兴趣。新的常识性知识图(CSKG)的发展一直是中央对这些进步作为自己多元化的事实,可以使用和机器学习模型引用为应对新的挑战性任务。与此同时,仍然存在着对这些资源的质量和覆盖面问题,由于需要全面涵盖一般常识性知识的大规模。在这项工作中,我们断定,手动构造CSKGs将永远不会实现必要的覆盖面,适用于通过NLP代理商遇到的所有情况。因此,我们提出了一个新的评估框架基础上如何有效隐性知识表示可以从他们身上学到测试幼儿园的效用。有了这个新的目标,我们提出ATOMIC 2020年,含知识通用的常识性知识的新CSKG,是不是在预训练的语言模型一应俱全。我们与其他领先的CSKGs比较评估其性能,进行常识性知识资源的第一次大规模的成对研究。接下来,我们表明,ATOMIC 2020更适合于知识培训模式,可以产生新的,未知的实体和事件的准确,代表性的知识。最后,通过人的评价,我们表明,GPT-3(175B参数)的几拍的表现,而令人印象深刻,但仍有〜12个绝对百分点,比上训练ATOMIC 2020年基于BART知识模型降低,尽管使用了430x较少的参数。
55. Fantastic Features and Where to Find Them: Detecting Cognitive Impairment with a Subsequence Classification Guided Approach [PDF] 返回目录
Benjamin Eyre, Aparna Balagopalan, Jekaterina Novikova
Abstract: Despite the widely reported success of embedding-based machine learning methods on natural language processing tasks, the use of more easily interpreted engineered features remains common in fields such as cognitive impairment (CI) detection. Manually engineering features from noisy text is time and resource consuming, and can potentially result in features that do not enhance model performance. To combat this, we describe a new approach to feature engineering that leverages sequential machine learning models and domain knowledge to predict which features help enhance performance. We provide a concrete example of this method on a standard data set of CI speech and demonstrate that CI classification accuracy improves by 2.3% over a strong baseline when using features produced by this method. This demonstration provides an example of how this method can be used to assist classification in fields where interpretability is important, such as health care.
摘要:尽管嵌入基于机器学习方法的自然语言处理任务的广泛报道的成功,使用更容易理解的设计特征依然领域常见的,如认知障碍(CI)检测。手动工程特征从嘈杂的文字是耗费时间和资源,并有可能导致功能不提高模型的性能。为了解决这个问题,我们描述了一种新的方法来,充分利用连续的机器学习模型和领域知识来预测哪些功能有助于提高性能的功能设计。我们提供CI语音的标准数据集该方法的一个具体的例子,并证明CI分类精度使用由这种方法生产的特点当在强基线2.3%提高。该演示提供了一个前充足的这个方法如何被用来协助领域分类,其中解释性是很重要的,如医疗保健。
56. Controlling the Interaction Between Generation and Inference in Semi-Supervised Variational Autoencoders Using Importance Weighting [PDF] 返回目录
Ghazi Felhi, Joseph Leroux, Djamé Seddah
Abstract: Even though Variational Autoencoders (VAEs) are widely used for semi-supervised learning, the reason why they work remains unclear. In fact, the addition of the unsupervised objective is most often vaguely described as a regularization. The strength of this regularization is controlled by down-weighting the objective on the unlabeled part of the training set. Through an analysis of the objective of semi-supervised VAEs, we observe that they use the posterior of the learned generative model to guide the inference model in learning the partially observed latent variable. We show that given this observation, it is possible to gain finer control on the effect of the unsupervised objective on the training procedure. Using importance weighting, we derive two novel objectives that prioritize either one of the partially observed latent variable, or the unobserved latent variable. Experiments on the IMDB english sentiment analysis dataset and on the AG News topic classification dataset show the improvements brought by our prioritization mechanism and and exhibit a behavior that is inline with our description of the inner working of Semi-Supervised VAEs.
摘要:虽然变自动编码(VAES)被广泛用于半监督学习,为什么他们的工作仍不清楚的原因。实际上,在加入非监督目标的最经常隐约描述为正则化。这种正则化的强度由下加权目标的训练集的未标记部分进行控制。通过客观的半监督VAES的分析,我们看到,他们所使用的了解到生成模型的后引导推理模型在学习观察到部分潜在变量。我们表明,这给观测,有可能在无人监管的目标在训练过程中的作用得到更好的控制。使用重要性加权,我们推导出优先要么部分观察到的潜变量,或者未观察到的潜变量之一两种新的目标。在IMDB英文情感分析数据集,并在AG新闻主题分类实验数据集上我们的优先机制和带来的改进和表现出的行为是内嵌我们的半监督VAES的内部工作的描述。
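The abstract does not state the objectives themselves; for orientation only, derivations of this kind typically start from a generic importance-weighted bound, here written with the label y treated as the partially observed variable and z as the unobserved one:
\[
\log p_\theta(x,y)\;\ge\;\mathbb{E}_{z_{1:K}\sim q_\phi(z\mid x,y)}\!\left[\log\frac{1}{K}\sum_{k=1}^{K}\frac{p_\theta(x,y,z_k)}{q_\phi(z_k\mid x,y)}\right].
\]
How the paper's two variants reweight the importance samples so as to prioritise either y or z is specific to the paper and not reproduced here.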
57. Autotuning Search Space for Loop Transformations [PDF] 返回目录
Michael Kruse, Hal Finkel, Xingfu Wu
Abstract: One of the challenges for optimizing compilers is to predict whether applying an optimization will improve its execution speed. Programmers may override the compiler's profitability heuristic using optimization directives such as pragmas in the source code. Machine learning in the form of autotuning can assist users in finding the best optimizations for each platform. In this paper we propose a loop transformation search space that takes the form of a tree, in contrast to previous approaches that usually use vector spaces to represent loop optimization configurations. We implemented a simple autotuner exploring the search space and applied it to a selected set of PolyBench kernels. While the autotuner is capable of representing every possible sequence of loop transformations and their relations, the results motivate the use of better search strategies such as Monte Carlo tree search to find sophisticated loop transformations such as multilevel tiling.
摘要:一个用于优化编译器所面临的挑战是预测是否应用的优化将提高其执行速度。程序员可以使用覆盖优化指令编译器的盈利启发,如在源代码编译指示。在自动调节的形式,机器学习可以帮助用户找到最佳的优化,为每个平台。在本文中,我们提出了一个循环变换搜索空间,需要一个树的形式,相反,通常使用向量空间表示循环优化配置,以前的方法。我们实现了一个简单的自动调谐探索搜索空间,并将其应用到选定的一组PolyBench内核。而自动调节器是能够代表循环变换及其相互关系的一切可能的序列,结果激发使用更好的搜索策略,如蒙特卡洛树搜索找到成熟的循环转换,如多层次拼接。
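A minimal sketch of a tree-shaped search space over loop-transformation sequences: each node is a partial sequence and its children extend it by one more transformation; the candidate transformations and the exhaustive enumeration are illustrative assumptions, not the paper's exact encoding.
```python
CANDIDATES = ["tile(4)", "tile(32)", "interchange", "unroll(2)", "vectorize"]  # illustrative

class Node:
    """One node of the tree-shaped search space: an ordered (partial)
    sequence of loop transformations applied to the kernel."""
    def __init__(self, sequence=()):
        self.sequence = tuple(sequence)

    def children(self):
        """Extend the current sequence by one more transformation."""
        return [Node(self.sequence + (t,)) for t in CANDIDATES]

def enumerate_tree(node, max_depth):
    """Depth-limited enumeration; an autotuner (random search, MCTS, ...)
    would instead sample and time promising nodes rather than visit all."""
    yield node.sequence
    if max_depth > 0:
        for child in node.children():
            yield from enumerate_tree(child, max_depth - 1)

print(sum(1 for _ in enumerate_tree(Node(), max_depth=2)))  # 1 + 5 + 25 = 31 nodes
```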
58. Pretrained Transformers for Text Ranking: BERT and Beyond [PDF] 返回目录
Jimmy Lin, Rodrigo Nogueira, Andrew Yates
Abstract: The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has, without exaggeration, revolutionized the fields of natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage ranking architectures and learned dense representations that attempt to perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond the typical sentence-by-sentence processing approaches used in NLP, and techniques for addressing the tradeoff between effectiveness (result quality) and efficiency (query latency). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.
摘要:文排名的目的是产生在响应查询从语料库检索文本中的有序列表。虽然文字居最常见的制剂是搜索,任务的情况下,也可以在许多自然语言处理应用中。此调查提供文本与已知的变压器,其中BERT是最有名的例子神经网络结构排名的概述。变压器和自我监督预训练的组合具有,毫不夸张地说,彻底改变了自然语言处理(NLP),信息检索(IR),并且超出的字段。在本次调查中,我们提供现有工作的综合作为入门的谁希望更好地理解如何变压器适用于谁愿意在这个领域从事的工作文字排名的问题和研究人员从业单点。我们覆盖广泛的现代技术,分为两个高级别类:变压器模型,进行多级再排序试图直接进行排名排名架构和教训密集交涉。有两个主题,渗透了我们的调查:处理长文档,超越了典型的句子通过句子处理技术在NLP使用办法和技术解决效率(结果质量)和效率(查询延迟)之间的权衡。虽然变压器架构和训练前技术的不断创新,它们是如何应用到文本的排名很多方面都比较充分的了解,并表示成熟的技术。但是,仍有许多开放研究问题,因此除了铺设预训练变压器的基础文本排序,本次调查还试图预言在字段标题。
59. Automatic Extraction of Urban Outdoor Perception from Geolocated Free-Texts [PDF] 返回目录
Frances Santos, Thiago H Silva, Antonio A F Loureiro, Leandro Villas
Abstract: The automatic extraction of urban perception shared by people on location-based social networks (LBSNs) is an important multidisciplinary research goal. One of the reasons is because it facilitates the understanding of the intrinsic characteristics of urban areas in a scalable way, helping to leverage new services. However, content shared on LBSNs is diverse, encompassing several topics, such as politics, sports, culture, religion, and urban perceptions, making the task of content extraction regarding a particular topic very challenging. Considering free-text messages shared on LBSNs, we propose an automatic and generic approach to extract people's perceptions. For that, our approach explores opinions that are spatial-temporal and semantically similar. We exemplify our approach in the context of urban outdoor areas in Chicago, New York City and London. Studying those areas, we found evidence that LBSN data brings valuable information about urban regions. To analyze and validate our outcomes, we conducted a temporal analysis to measure the results' robustness over time. We show that our approach can be helpful to better understand urban areas considering different perspectives. We also conducted a comparative analysis based on a public dataset, which contains volunteers' perceptions regarding urban areas expressed in a controlled experiment. We observe that both results yield a very similar level of agreement.
摘要:自动提取人们在基于位置的社交网络(LBSN)上分享的城市感知,是一个重要的跨学科研究目标。原因之一在于,它有助于以可扩展的方式理解城市区域的内在特征,从而促进新服务的开发。然而,LBSN 上分享的内容多种多样,涵盖政治、体育、文化、宗教以及城市感知等多个主题,使得针对特定主题的内容提取任务极具挑战性。针对 LBSN 上分享的自由文本消息,我们提出了一种自动且通用的方法来提取人们的感知。为此,我们的方法挖掘在时空上接近且语义相似的观点。我们以芝加哥、纽约和伦敦的城市户外区域为例说明该方法。通过研究这些区域,我们发现 LBSN 数据能够提供关于城市区域的有价值信息。为了分析和验证结果,我们进行了时间维度的分析,以衡量结果随时间的稳健性。我们表明,该方法有助于从不同视角更好地理解城市区域。我们还基于一个公开数据集进行了对比分析,该数据集包含志愿者在受控实验中表达的对城市区域的感知。我们观察到两种结果的一致性水平非常接近。
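The idea of grouping opinions that are spatially, temporally, and semantically close can be illustrated with a small sketch. The thresholds, the haversine distance, and the Jaccard token overlap below are illustrative assumptions, not the authors' actual similarity measures.

```python
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt

# Illustrative geolocated posts: (text, latitude, longitude, timestamp).
posts = [
    ("lovely quiet park this morning", 41.880, -87.630, datetime(2020, 6, 1, 9)),
    ("this park is so peaceful today", 41.881, -87.631, datetime(2020, 6, 1, 10)),
    ("terrible traffic on the bridge", 40.710, -74.000, datetime(2020, 6, 3, 18)),
]

def km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two coordinates."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def jaccard(a: str, b: str) -> float:
    """Crude semantic proxy via token overlap; a richer semantic similarity
    would be used in practice."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def similar(p, q, max_km=1.0, max_dt=timedelta(hours=6), min_sim=0.2) -> bool:
    """Two posts are grouped when they are spatially, temporally and
    semantically close; the thresholds here are purely illustrative."""
    return (km(p[1], p[2], q[1], q[2]) <= max_km
            and abs(p[3] - q[3]) <= max_dt
            and jaccard(p[0], q[0]) >= min_sim)

if __name__ == "__main__":
    print(similar(posts[0], posts[1]), similar(posts[0], posts[2]))  # True False
```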
60. Artificial Intelligence, speech and language processing approaches to monitoring Alzheimer's Disease: a systematic review [PDF] 返回目录
Sofia de la Fuente Garcia, Craig Ritchie, Saturnino Luz
Abstract: Language is a valuable source of clinical information in Alzheimer's Disease, as it declines concurrently with neurodegeneration. Consequently, speech and language data have been extensively studied in connection with its diagnosis. This paper summarises current findings on the use of artificial intelligence, speech and language processing to predict cognitive decline in the context of Alzheimer's Disease, detailing current research procedures, highlighting their limitations and suggesting strategies to address them. We conducted a systematic review of original research between 2000 and 2019, registered in PROSPERO (reference CRD42018116606). An interdisciplinary search covered six databases on engineering (ACM and IEEE), psychology (PsycINFO), medicine (PubMed and Embase) and Web of Science. Bibliographies of relevant papers were screened until December 2019. From 3,654 search results 51 articles were selected against the eligibility criteria. Four tables summarise their findings: study details (aim, population, interventions, comparisons, methods and outcomes), data details (size, type, modalities, annotation, balance, availability and language of study), methodology (pre-processing, feature generation, machine learning, evaluation and results) and clinical applicability (research implications, clinical potential, risk of bias and strengths/limitations). While promising results are reported across nearly all 51 studies, very few have been implemented in clinical research or practice. We concluded that the main limitations of the field are poor standardisation, limited comparability of results, and a degree of disconnect between study aims and clinical applications. Attempts to close these gaps should support translation of future research into clinical practice.
摘要:语言是阿尔茨海默病临床信息的重要来源,因为语言能力会随神经退行性病变同步衰退。因此,语音和语言数据在该疾病的诊断方面已被广泛研究。本文总结了利用人工智能、语音和语言处理在阿尔茨海默病背景下预测认知衰退的最新研究结果,详细介绍了当前的研究流程,指出其局限性,并提出应对策略。我们对 2000 年至 2019 年间的原创研究进行了系统综述,并已在 PROSPERO 注册(编号 CRD42018116606)。跨学科检索覆盖了六个数据库:工程类(ACM 和 IEEE)、心理学类(PsycINFO)、医学类(PubMed 和 Embase)以及 Web of Science。相关论文的参考文献筛选工作持续至 2019 年 12 月。从 3,654 条检索结果中,依据纳入标准选出 51 篇文章。我们用四张表格总结其研究发现:研究细节(目标、人群、干预、比较、方法和结果)、数据细节(规模、类型、模态、标注、平衡性、可获得性和研究语言)、方法学(预处理、特征生成、机器学习、评估和结果)以及临床适用性(研究意义、临床潜力、偏倚风险和优势/局限)。虽然几乎所有 51 项研究都报告了有希望的结果,但很少有研究被应用于临床研究或实践。我们的结论是,该领域的主要局限在于标准化程度差、结果可比性有限,以及研究目标与临床应用之间存在一定脱节。弥合这些差距的努力将有助于把未来的研究成果转化为临床实践。
61. MedICaT: A Dataset of Medical Images, Captions, and Textual References [PDF] 返回目录
Sanjay Subramanian, Lucy Lu Wang, Sachin Mehta, Ben Bogin, Madeleine van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi
Abstract: Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at this https URL.
摘要:理解图与文本之间的关系是科学文献理解的关键。医学图尤其复杂,通常由多个子图组成(在我们的数据集中占 75%),并配有描述其内容的详细文字。以往研究科学论文中图的工作侧重于对图的内容进行分类,而不是理解图像与文本之间的关联。为了解决图检索和图文对齐方面的挑战,我们构建了 MedICaT,一个带有上下文的医学图像数据集。MedICaT 包含来自 131K 篇开放获取生物医学论文的 217K 张图像,并提供图注、74% 的图的正文内联引用,以及针对部分图人工标注的子图和子图注。基于 MedICaT,我们提出了复合图中子图与子图注对齐的任务,并展示了内联引用在图文匹配中的作用。我们的数据和代码可通过此 https URL 获取。
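A figure record with a caption, inline references, and subcaptioned subfigures can be modelled roughly as follows. The field names and the term-overlap `retrieve` function are hypothetical, chosen only to illustrate the kind of structure and figure-retrieval task described above; they do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubFigure:
    label: str        # e.g. "(a)"
    subcaption: str

@dataclass
class FigureRecord:
    """Hypothetical record for one compound figure: caption, the sentences
    in the paper that cite it, and its manually annotated subfigures."""
    figure_id: str
    caption: str
    inline_references: List[str] = field(default_factory=list)
    subfigures: List[SubFigure] = field(default_factory=list)

def retrieve(query: str, records: List[FigureRecord]) -> FigureRecord:
    """Toy figure retrieval: rank records by term overlap between the query
    and the caption plus inline references. A real system would use learned
    image-text matching instead."""
    q_terms = set(query.lower().split())
    def score(r: FigureRecord) -> int:
        text = " ".join([r.caption] + r.inline_references).lower()
        return len(q_terms & set(text.split()))
    return max(records, key=score)

if __name__ == "__main__":
    rec = FigureRecord(
        figure_id="fig1",
        caption="Chest CT showing bilateral opacities",
        inline_references=["Figure 1 illustrates typical bilateral opacities."],
        subfigures=[SubFigure("(a)", "axial view"), SubFigure("(b)", "coronal view")],
    )
    print(retrieve("bilateral opacities on CT", [rec]).figure_id)
```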
Note: The Chinese abstracts are machine translations. The cover image is a word cloud of the paper titles.