
[arXiv Papers] Computation and Language 2020-03-24

Contents

1. Generating Natural Language Adversarial Examples on a Large Scale with Generative Models [PDF] Abstract
2. Adaptive Name Entity Recognition under Highly Unbalanced Data [PDF] Abstract
3. PathVQA: 30000+ Questions for Medical Visual Question Answering [PDF] Abstract
4. Fast Cross-domain Data Augmentation through Neural Sentence Editing [PDF] Abstract
5. Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings [PDF] Abstract
6. E2EET: From Pipeline to End-to-end Entity Typing via Transformer-Based Embeddings [PDF] Abstract
7. Caption Generation of Robot Behaviors based on Unsupervised Learning of Action Segments [PDF] Abstract
8. SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection [PDF] Abstract
9. Prior Knowledge Driven Label Embedding for Slot Filling in Natural Language Understanding [PDF] Abstract
10. A Joint Approach to Compound Splitting and Idiomatic Compound Detection [PDF] Abstract
11. Analyzing Word Translation of Transformer Layers [PDF] Abstract
12. A Framework for Generating Explanations from Temporal Personal Health Data [PDF] Abstract
13. TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus [PDF] Abstract
14. A Better Variant of Self-Critical Sequence Training [PDF] Abstract
15. Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [PDF] Abstract
16. Invariant Rationalization [PDF] Abstract

Abstracts

1. Generating Natural Language Adversarial Examples on a Large Scale with Generative Models [PDF] Back to Contents
  Yankun Ren, Jianbin Lin, Siliang Tang, Jun Zhou, Shuang Yang, Yuan Qi, Xiang Ren
Abstract: Today text classification models have been widely used. However, these classifiers are found to be easily fooled by adversarial examples. Fortunately, standard attacking methods generate adversarial texts in a pair-wise way, that is, an adversarial text can only be created from a real-world text by replacing a few words. In many applications, these texts are limited in number, so their corresponding adversarial examples are often not diverse enough and sometimes hard to read; they can thus be easily detected by humans and cannot create chaos at a large scale. In this paper, we propose an end-to-end solution to efficiently generate adversarial texts from scratch using generative models, which are not restricted to perturbing the given texts. We call it unrestricted adversarial text generation. Specifically, we train a conditional variational autoencoder (VAE) with an additional adversarial loss to guide the generation of adversarial examples. Moreover, to improve the validity of adversarial texts, we utilize discriminators and the training framework of generative adversarial networks (GANs) to make adversarial texts consistent with real data. Experimental results on sentiment analysis demonstrate the scalability and efficiency of our method. It can attack text classification models with a higher success rate than existing methods, while providing acceptable quality for human readers.
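
To make the described objective concrete, here is a minimal sketch of a conditional-VAE loss augmented with an adversarial term that pushes a victim classifier toward a wrong label. The function name, weighting, and tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cvae_adversarial_loss(recon_logits, target_ids, mu, logvar,
                          victim_logits, wrong_label, adv_weight=1.0):
    """Hypothetical training objective for a conditional VAE that is additionally
    pushed to fool a victim classifier (not the paper's exact loss)."""
    # Standard conditional-VAE terms: token reconstruction + KL regularizer.
    recon = F.cross_entropy(recon_logits.view(-1, recon_logits.size(-1)),
                            target_ids.view(-1))
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Adversarial term: encourage the victim classifier to predict the wrong label.
    adv = F.cross_entropy(victim_logits, wrong_label)
    return recon + kl + adv_weight * adv
```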

2. Adaptive Name Entity Recognition under Highly Unbalanced Data [PDF] Back to Contents
  Thong Nguyen, Duy Nguyen, Pramod Rao
Abstract: For several purposes in Natural Language Processing (NLP), such as Information Extraction, Sentiment Analysis or Chatbots, Named Entity Recognition (NER) plays an important role, as it helps to determine and categorize entities in text into predefined groups such as the names of persons, locations, quantities, organizations or percentages, etc. In this report, we present our experiments on a neural architecture composed of a Conditional Random Field (CRF) layer stacked on top of a Bi-directional LSTM (BI-LSTM) layer for solving NER tasks. Besides, we also employ a fusion input of embedding vectors (GloVe, BERT), which are pre-trained on huge corpora, to boost the generalization capacity of the model. Unfortunately, due to the heavily unbalanced distribution across the training data, both approaches attained poor performance on classes with fewer training samples. To overcome this challenge, we introduce an add-on classification model to split sentences into two different sets, Weak and Strong classes, and then design a pair of Bi-LSTM-CRF models to optimize performance on each set. We evaluated our models on the test set and found that our method can significantly improve performance for the Weak classes by using a very small data set (approximately 0.45%) compared to the rest of the classes.
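
A minimal sketch of the embedding-fusion and Bi-LSTM portion of such a tagger is shown below; the CRF layer, the Weak/Strong sentence splitter, and the actual GloVe/BERT feature extraction are omitted, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Sketch of the Bi-LSTM part of a Bi-LSTM-CRF tagger consuming a fusion
    (concatenation) of pre-trained GloVe and BERT vectors."""
    def __init__(self, glove_dim=300, bert_dim=768, hidden=256, num_tags=9):
        super().__init__()
        self.lstm = nn.LSTM(glove_dim + bert_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)

    def forward(self, glove_vecs, bert_vecs):
        fused = torch.cat([glove_vecs, bert_vecs], dim=-1)  # (batch, seq, glove+bert)
        out, _ = self.lstm(fused)
        return self.emissions(out)  # per-token tag scores, fed to a CRF in the paper
```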

3. PathVQA: 30000+ Questions for Medical Visual Question Answering [PDF] Back to Contents
  Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, Pengtao Xie
Abstract: Is it possible to develop an "AI Pathologist" to pass the board-certified examination of the American Board of Pathology? To achieve this goal, the first step is to create a visual question answering (VQA) dataset where the AI agent is presented with a pathology image together with a question and is asked to give the correct answer. Our work makes the first attempt to build such a dataset. Different from creating general-domain VQA datasets where the images are widely accessible and there are many crowdsourcing workers available and capable of generating question-answer pairs, developing a medical VQA dataset is much more challenging. First, due to privacy concerns, pathology images are usually not publicly available. Second, only well-trained pathologists can understand pathology images, but they barely have time to help create datasets for AI research. To address these challenges, we resort to pathology textbooks and online digital libraries. We develop a semi-automated pipeline to extract pathology images and captions from textbooks and generate question-answer pairs from captions using natural language processing. We collect 32,799 open-ended questions from 4,998 pathology images where each question is manually checked to ensure correctness. To our best knowledge, this is the first dataset for pathology VQA. Our dataset will be released publicly to promote research in medical VQA.
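
As a toy illustration of turning a caption into an open-ended question-answer pair, a single hand-written rule might look like the sketch below; the paper's pipeline uses a fuller NLP toolchain plus manual checking, and the pattern here is purely hypothetical.

```python
import re

def caption_to_qa(caption):
    """Toy rule for converting a declarative caption into a (question, answer) pair,
    e.g. 'This image shows chronic inflammation.' ->
    ('What does this image show?', 'chronic inflammation')."""
    m = re.match(r"This image shows (.+)\.", caption, flags=re.IGNORECASE)
    if m:
        return "What does this image show?", m.group(1)
    return None  # no rule matched; a real pipeline would try further patterns
```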

4. Fast Cross-domain Data Augmentation through Neural Sentence Editing [PDF] Back to Contents
  Guillaume Raille, Sandra Djambazovska, Claudiu Musat
Abstract: Data augmentation promises to alleviate data scarcity. This is most important in cases where the initial data is in short supply. This is, for existing methods, also where augmenting is the most difficult, as learning the full data distribution is impossible. For natural language, sentence editing offers a solution - relying on small but meaningful changes to the original ones. Learning which changes are meaningful also requires large amounts of training data. We thus aim to learn this in a source domain where data is abundant and apply it in a different, target domain, where data is scarce - cross-domain augmentation. We create the Edit-transformer, a Transformer-based sentence editor that is significantly faster than the state of the art and also works cross-domain. We argue that, due to its structure, the Edit-transformer is better suited for cross-domain environments than its edit-based predecessors. We show this performance gap on the Yelp-Wikipedia domain pairs. Finally, we show that due to this cross-domain performance advantage, the Edit-transformer leads to meaningful performance gains in several downstream tasks.

5. Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings [PDF] Back to Contents
  Christos Xypolopoulos, Antoine J.-P. Tixier, Michalis Vazirgiannis
Abstract: The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy, based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. We show through rigorous experiments that our rankings are well correlated (with strong statistical significance) with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, Wikipedia etc., for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our method makes it applicable to any language. Code and data are publicly available at this https URL.
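
The geometric idea can be illustrated with a toy score that counts occupied cells of successively finer grids laid over a word's contextual embeddings (reduced here to two dimensions); the resolutions, normalization, and aggregation below are assumptions, not the paper's exact metric.

```python
import numpy as np

def polysemy_score(contextual_vecs, levels=(2, 4, 8)):
    """Toy illustration of counting occupied cells of multiresolution grids over a
    word's contextual embeddings (not the paper's exact pipeline or metric).
    contextual_vecs: (n_occurrences, 2) array, e.g. after reducing BERT vectors to 2-D."""
    v = np.asarray(contextual_vecs, dtype=float)
    # Rescale each dimension to [0, 1) so grid cells are comparable across words.
    v = (v - v.min(axis=0)) / (np.ptp(v, axis=0) + 1e-9)
    score = 0.0
    for k in levels:
        occupied = set(map(tuple, np.floor(v * k).astype(int).tolist()))
        score += len(occupied) / (k ** 2)  # fraction of cells occupied at this resolution
    return score / len(levels)
```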

6. E2EET: From Pipeline to End-to-end Entity Typing via Transformer-Based Embeddings [PDF] Back to Contents
  Michael Stewart, Wei Liu
Abstract: Entity Typing (ET) is the process of identifying the semantic types of every entity within a corpus. In contrast to Named Entity Recognition, where each token in a sentence is labelled with zero or one class label, ET involves labelling each entity mention with one or more class labels. Existing entity typing models, which operate at the mention level, are limited by two key factors: they do not make use of recently-proposed context-dependent embeddings, and are trained on fixed context windows. They are therefore sensitive to window size selection and are unable to incorporate the context of the entire document. In light of these drawbacks we propose to incorporate context using transformer-based embeddings for a mention-level model, and an end-to-end model using a Bi-GRU to remove the dependency on window size. An extensive ablative study demonstrates the effectiveness of contextualised embeddings for mention-level models and the competitiveness of our end-to-end model for entity typing.

7. Caption Generation of Robot Behaviors based on Unsupervised Learning of Action Segments [PDF] Back to Contents
  Koichiro Yoshino, Kohei Wakimoto, Yuta Nishimura, Satoshi Nakamura
Abstract: Bridging robot action sequences and their natural language captions is an important task for increasing the explainability of human-assisting robots in their recently evolving field. In this paper, we propose a system for generating natural language captions that describe the behaviors of human-assisting robots. The system describes robot actions by using robot observations (histories from actuator systems and cameras), toward end-to-end bridging between robot actions and natural language captions. Two reasons make it challenging to apply existing sequence-to-sequence models to this mapping: 1) it is hard to prepare a large-scale dataset for any kind of robot and its environment, and 2) there is a gap between the number of samples obtained from robot action observations and the generated word sequences of captions. We introduced unsupervised segmentation based on K-means clustering to unify typical robot observation patterns into classes. This method makes it possible for the network to learn the relationship from a small amount of data. Moreover, we utilized a chunking method based on byte-pair encoding (BPE) to fill in the gap between the number of samples of robot action observations and words in a caption. We also applied an attention mechanism to the segmentation task. Experimental results show that the proposed model based on unsupervised learning can generate better descriptions than other methods. We also show that the attention mechanism did not work well in our low-resource setting.
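
A minimal sketch of the unsupervised segmentation step is given below, assuming the observations have already been flattened into per-timestep feature vectors; the number of clusters and the collapsing of repeats are illustrative choices, not the paper's exact settings.

```python
from sklearn.cluster import KMeans

def discretize_observations(obs_frames, n_classes=16):
    """Sketch: cluster per-timestep robot observation vectors (actuator states +
    image features) with K-means so each frame becomes a discrete class id."""
    labels = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(obs_frames)
    # Collapse consecutive repeats into segments: [3, 3, 3, 7, 7, 1] -> [3, 7, 1]
    segments = [labels[0]] + [b for a, b in zip(labels, labels[1:]) if b != a]
    return segments
```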

8. SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection [PDF] Back to Contents
  Xiaoya Li, Yuxian Meng, Qinghong Han, Fei Wu, Jiwei Li
Abstract: While the self-attention mechanism has been widely used in a wide variety of tasks, it has the unfortunate property of a quadratic cost with respect to the input length, which makes it difficult to deal with long inputs. In this paper, we present a method for accelerating and structuring self-attentions: Sparse Adaptive Connection (SAC). In SAC, we regard the input sequence as a graph and attention operations are performed between linked nodes. In contrast with previous self-attention models with pre-defined structures (edges), the model learns to construct attention edges to improve task-specific performances. In this way, the model is able to select the most salient nodes and reduce the quadratic complexity regardless of the sequence length. Based on SAC, we show that previous variants of self-attention models are its special cases. Through extensive experiments on neural machine translation, language modeling, graph representation learning and image classification, we demonstrate SAC is competitive with state-of-the-art models while significantly reducing memory cost.
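
The core operation, attention restricted to linked nodes, can be sketched as a masked scaled dot-product. Note that SAC's contribution is learning which edges to construct, whereas the adjacency matrix below is simply given, and the single-head, no-projection form is a simplification.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(x, adjacency):
    """Self-attention restricted to a given set of edges.
    x: (seq, dim); adjacency: (seq, seq) boolean, True where attention is allowed."""
    scores = x @ x.t() / (x.size(-1) ** 0.5)                 # scaled dot-product scores
    scores = scores.masked_fill(~adjacency, float('-inf'))   # drop non-linked pairs
    weights = F.softmax(scores, dim=-1)
    return weights @ x
```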

9. Prior Knowledge Driven Label Embedding for Slot Filling in Natural Language Understanding [PDF] Back to Contents
  Su Zhu, Zijian Zhao, Rao Ma, Kai Yu
Abstract: Traditional slot filling in natural language understanding (NLU) predicts a one-hot vector for each word. This form of label representation lacks semantic correlation modelling, which leads to a severe data sparsity problem, especially when adapting an NLU model to a new domain. To address this issue, a novel label-embedding-based slot filling framework is proposed in this paper. Here, a distributed label embedding is constructed for each slot using prior knowledge. Three encoding methods are investigated to incorporate different kinds of prior knowledge about slots: atomic concepts, slot descriptions, and slot exemplars. The proposed label embeddings tend to share text patterns and reuse data across different slot labels. This makes it useful for adaptive NLU with limited data. Also, since the label embedding is independent of the NLU model, it is compatible with almost all deep-learning-based slot filling models. The proposed approaches are evaluated on three datasets. Experiments on single-domain and domain adaptation tasks show that label embedding achieves significant performance improvement over the traditional one-hot label representation as well as advanced zero-shot approaches.
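
A rough sketch of the label-embedding idea follows: token encodings are scored against a fixed, prior-knowledge-derived label matrix instead of a plain softmax classifier head. The encoder, dimensions, and the way the label matrix is built (from atomic concepts, descriptions, or exemplars) are all stubbed out or assumed here.

```python
import torch
import torch.nn as nn

class LabelEmbeddingSlotFiller(nn.Module):
    """Sketch: slot scores come from dot products between token encodings and a
    prior-knowledge label embedding matrix rather than an ordinary softmax head."""
    def __init__(self, encoder, label_matrix):
        super().__init__()
        self.encoder = encoder                         # any token encoder, e.g. a BiLSTM
        self.register_buffer('labels', label_matrix)   # (num_slots, hidden), fixed prior

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)               # (batch, seq, hidden)
        return hidden @ self.labels.t()                # (batch, seq, num_slots) slot scores
```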

10. A Joint Approach to Compound Splitting and Idiomatic Compound Detection [PDF] Back to Contents
  Irina Krotova, Sergey Aksenov, Ekaterina Artemova
Abstract: Applications such as machine translation, speech recognition, and information retrieval require efficient handling of noun compounds, as they are one of the possible sources of out-of-vocabulary (OOV) words. In-depth processing of noun compounds requires not only splitting them into smaller components (or even roots) but also identifying instances that should remain unsplit because they are of an idiomatic nature. We develop a two-fold deep-learning-based approach to noun compound splitting and idiomatic compound detection for the German language, which we train using a newly collected corpus of annotated German compounds. Our neural noun compound splitter operates on a sub-word level and outperforms the current state of the art by about 5%.

11. Analyzing Word Translation of Transformer Layers [PDF] Back to Contents
  Hongfei Xu, Josef van Genabith, Deyi Xiong, Qiuhui Liu
Abstract: The Transformer translation model is popular for its effective parallelization and performance. Though a wide range of analyses of the Transformer has been conducted recently, the role of each Transformer layer in translation has not, to our knowledge, been studied. In this paper, we propose approaches to analyze the translation performed in the encoder / decoder layers of the Transformer. In general, our approaches project the representations of an analyzed layer to a pre-trained classifier and measure word translation accuracy. For the analysis of encoder layers, our approach additionally learns a weight vector to merge multiple attention matrices into one, and transforms the source encoding to the target side with the merged alignment matrix to align source tokens with target translations while bridging different input and output lengths. While analyzing decoder layers, we additionally study the effects of the source context and the decoding history on word prediction by bypassing the corresponding self-attention or cross-attention sub-layers. Our analysis reveals that the translation starts at the very beginning of the "encoding" (specifically at the source word embedding layer), and shows how translation evolves during the forward computation of layers. Based on observations gained in our analysis, we propose that increasing encoder depth while removing the same number of decoder layers can simply but significantly boost decoding speed. Furthermore, simply inserting a linear projection layer before the decoder classifier, which shares the weight matrix with the embedding layer, can effectively provide small but consistent and significant improvements in our experiments on the WMT 14 English-German, English-French and WMT 15 Czech-English translation tasks (+0.42, +0.37 and +0.47 respectively).
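
The probing procedure can be sketched as follows: a frozen, pre-trained target-vocabulary classifier is applied to each layer's hidden states and word-level accuracy is measured against the aligned target tokens. The learned merging of encoder attention matrices and the handling of length mismatch are omitted, and the interface below is an assumption.

```python
import torch

@torch.no_grad()
def layerwise_translation_accuracy(layer_states, classifier, target_ids):
    """Sketch of layer-wise probing: how often is the aligned target word already
    the top prediction when each layer's states are fed to a frozen classifier?
    layer_states: list of (seq, hidden) tensors, one per layer."""
    accs = []
    for states in layer_states:
        pred = classifier(states).argmax(dim=-1)        # (seq,) predicted target words
        accs.append((pred == target_ids).float().mean().item())
    return accs
```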

12. A Framework for Generating Explanations from Temporal Personal Health Data [PDF] Back to Contents
  Jonathan J. Harris, Ching-Hua Chen, Mohammed J. Zaki
Abstract: Whereas it has become easier for individuals to track their personal health data (e.g., heart rate, step count, food log), there is still a wide chasm between the collection of data and the generation of meaningful explanations to help users better understand what their data means to them. With an increased comprehension of their data, users will be able to act upon the newfound information and work towards striving closer to their health goals. We aim to bridge the gap between data collection and explanation generation by mining the data for interesting behavioral findings that may provide hints about a user's tendencies. Our focus is on improving the explainability of temporal personal health data via a set of informative summary templates, or "protoforms." These protoforms span both evaluation-based summaries that help users evaluate their health goals and pattern-based summaries that explain their implicit behaviors. In addition to individual users, the protoforms we use are also designed for population-level summaries. We apply our approach to generate summaries (both univariate and multivariate) from real user data and show that our system can generate interesting and useful explanations.
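
As a toy example of an evaluation-style protoform, the sketch below fills a template of the form "On <quantifier> of the days you tracked, your <attribute> met <goal>"; the wording and quantifier thresholds are illustrative, not taken from the paper.

```python
def evaluation_summary(days_met, days_total, attribute, goal):
    """Toy protoform-style summary built from a hand-written template
    (template wording and thresholds are hypothetical)."""
    ratio = days_met / days_total
    quantifier = ("most" if ratio >= 0.7 else
                  "about half" if ratio >= 0.4 else
                  "few")
    return f"On {quantifier} of the days you tracked, your {attribute} met {goal}."
```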

13. TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus [PDF] Back to Contents
  Elisa Gugliotta, Marco Dinarelli
Abstract: This article describes the constitution process of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking users of social media in order to facilitate the writing in the Computer-Mediated Communication (CMC) and text messaging informal frameworks. There is variety in the realization of Arabish amongst dialects, and each Arabish code-system is under-resourced, in the same way as most of the Arabic dialects. In the last few years, the focus on Arabic dialects in the NLP field has considerably increased. Taking this into consideration, TArC will be a useful support for different types of analyses, computational and linguistic, as well as for NLP tools training. In this article we will describe preliminary work on the TArC semi-automatic construction process and some of the first analyses we developed on TArC. In addition, in order to provide a complete overview of the challenges faced during the building process, we will present the main Tunisian dialect characteristics and their encoding in Tunisian Arabish.

14. A Better Variant of Self-Critical Sequence Training [PDF] Back to Contents
  Ruotian Luo
Abstract: In this work, we present a simple yet better variant of Self-Critical Sequence Training. We make a simple change in the choice of baseline function in the REINFORCE algorithm. The new baseline can bring better performance at no extra cost, compared to the greedy decoding baseline.
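
For context, the generic self-critical / REINFORCE-with-baseline loss looks like the sketch below; standard SCST uses the greedy decode's reward as the baseline, and the paper swaps in a different baseline choice that is not reproduced here.

```python
def scst_style_loss(sample_logprobs, sample_rewards, baseline):
    """Generic REINFORCE-with-baseline loss for sequence training.
    sample_logprobs, sample_rewards, baseline: (batch,) tensors."""
    advantage = sample_rewards - baseline                 # reward minus baseline
    return -(advantage.detach() * sample_logprobs).mean()
```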

15. Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [PDF] Back to Contents
  Malte Ostendorff, Terry Ruas, Moritz Schubotz, Georg Rehm, Bela Gipp
Abstract: Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.
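
A minimal sketch of a Siamese setup for pairwise document classification is given below; the shared encoder, the [u, v, |u-v|] concatenation scheme, and the number of relation classes are assumptions chosen for illustration (the paper compares several configurations).

```python
import torch
import torch.nn as nn

class SiameseDocPairClassifier(nn.Module):
    """Sketch: the same encoder embeds both documents; a small head classifies
    the relation from a concatenation of the two encodings and their difference."""
    def __init__(self, encoder, hidden=768, num_relations=8):
        super().__init__()
        self.encoder = encoder                       # shared, e.g. a BERT-style model
        self.head = nn.Linear(3 * hidden, num_relations)

    def forward(self, doc_a, doc_b):
        u, v = self.encoder(doc_a), self.encoder(doc_b)       # (batch, hidden) each
        features = torch.cat([u, v, (u - v).abs()], dim=-1)   # [u, v, |u - v|]
        return self.head(features)
```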

16. Invariant Rationalization [PDF] Back to Contents
  Shiyu Chang, Yang Zhang, Mo Yu, Tommi S. Jaakkola
Abstract: Selective rationalization improves neural network interpretability by identifying a small subset of input features -- the rationale -- that best explains or supports the prediction. A typical rationalization criterion, i.e. maximum mutual information (MMI), finds the rationale that maximizes the prediction performance based only on the rationale. However, MMI can be problematic because it picks up spurious correlations between the input features and the output. Instead, we introduce a game-theoretic invariant rationalization criterion where the rationales are constrained to enable the same predictor to be optimal across different environments. We show both theoretically and empirically that the proposed rationales can rule out spurious correlations, generalize better to different test scenarios, and align better with human judgments. Our data and code are available.
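
Very loosely, the invariance idea can be illustrated by penalizing how much the predictor's loss varies across environments once it is restricted to the selected rationale; the sketch below is not the paper's game-theoretic formulation, just a simple proxy for the same intuition.

```python
import torch

def invariance_penalty(env_losses):
    """Illustrative proxy only: prefer rationales whose predictor performs
    (near-)equally well in every environment by penalizing the spread of the
    per-environment losses. env_losses: list of scalar loss tensors, one per environment."""
    return torch.var(torch.stack(env_losses))
```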
